Loops, stuck states, and invisible failures are LangGraph's hardest debugging problems. Here's a toolkit to solve them.
The Debugging Problem
LangGraph's cyclic graphs are powerful -- agents can loop, retry, reflect, and route dynamically. They're also much harder to debug than linear pipelines. When something goes wrong in a cycle, the symptoms are often indirect: the agent gets stuck, the output is wrong, or the workflow runs forever and hits a token limit.
The core problem is visibility. Without the right tools, you can't tell which node executed, in what order, what state looked like at each step, or why the conditional edge took a particular path.
This article covers five debugging techniques that turn LangGraph from a black box into a traceable system.
Technique 1: Add Recursion Limits
The fastest win. LangGraph ships with a default recursion limit of 25 node executions per invocation, but relying on the default means loops die with an unhandled exception. Set the limit explicitly and catch the resulting error -- this turns infinite loops into recoverable, observable failures.
```python
from langgraph.errors import GraphRecursionError

# Set a recursion limit to prevent infinite loops
config = {
    "configurable": {"thread_id": "session-1"},
    "recursion_limit": 25,  # max node executions before raising an error
}

try:
    result = graph.invoke(input_state, config)
except GraphRecursionError as e:
    print(f"Graph hit recursion limit: {e}")
    # Handle gracefully -- log, notify, return a partial result
```

A good starting limit for most agents is 25–50. If a legitimate task requires more than 50 node executions, your graph design probably needs review -- break it into sub-graphs.

Technique 2: Stream Node Execution in Real Time
Instead of waiting for the final output, stream each node's execution as it happens. This shows you exactly which nodes ran, in what order, and what state they produced -- without any external tooling.
# stream_mode="updates" emits state changes after each node
for chunk in graph.stream(input_state, config, stream_mode="updates"):
for node_name, state_update in chunk.items():
print(f"Node: {node_name}")
print(f"State update: {state_update}")
print("---")
# stream_mode="values" emits the full state after each node (more verbose)
for state in graph.stream(input_state, config, stream_mode="values"):
print(f"Full state: {state}")This is the single most useful debugging tool for LangGraph. Run it locally whenever an agent behaves unexpectedly -- you'll see exactly where the problem is within seconds.
Technique 3: Inspect State at Any Point
With a checkpointer configured, you can fetch the current state of a thread at any time -- even after the graph has finished running. This lets you inspect what the agent decided, what tools it called, and what the final state was.
```python
# Get the current state of a thread
state = graph.get_state(config)
print("Current state:", state.values)
print("Next nodes to execute:", state.next)
print("Metadata:", state.metadata)

# Get the full history of checkpoints for a thread
history = list(graph.get_state_history(config))
for checkpoint in history:
    print(f"Step {checkpoint.metadata.get('step')}: {checkpoint.values}")

# Rewind to a specific checkpoint and re-run from there
# (useful for testing different paths without re-running from scratch)
target_config = history[2].config
result = graph.invoke(None, target_config)  # resume from checkpoint
```

Technique 4: Add Explicit Debug Nodes
For persistent issues, add dedicated debug nodes to your graph that log state details to stdout or a file. These are temporary -- remove them before production -- but they're invaluable when streaming alone doesn't give you enough context.
```python
from datetime import datetime

def debug_node(state: dict) -> dict:
    """Drop this node anywhere in your graph to inspect state."""
    print(f"\n{'='*50}")
    print(f"DEBUG NODE @ {datetime.now().isoformat()}")
    print(f"Messages count: {len(state.get('messages', []))}")
    print(f"Last message: {state.get('messages', [{}])[-1]}")
    print(f"Tool calls pending: {state.get('tool_calls', [])}")
    print(f"{'='*50}\n")
    return state  # pass state through unchanged

# Add to your graph wherever you need visibility
builder.add_node("debug", debug_node)
builder.add_edge("suspicious_node", "debug")
builder.add_edge("debug", "next_node")
```

Technique 5: LangSmith Tracing
For production observability, LangSmith gives you a web UI showing every graph execution as a full trace: each node, its inputs and outputs, latency, token usage, and the path taken through the graph. It's the closest thing LangGraph has to a debugger.
```python
# Enable LangSmith tracing -- just set environment variables
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your_langsmith_key"
os.environ["LANGSMITH_PROJECT"] = "my-agent-project"

# That's it -- all graph invocations now appear in the LangSmith UI
result = graph.invoke(input_state, config)
```

LangSmith has a free tier. Even if you don't use it in production, running it locally during development cuts debugging time significantly.
Diagnosing Common Failure Patterns
| Symptom | Likely cause | Fix |
|---|---|---|
| Agent loops forever | Conditional edge condition never returns END | Check edge logic; add recursion limit |
| Agent stops after 1 step | Edge condition returns END prematurely | Inspect state after step 1 with get_state() |
| Wrong tool called | Tool descriptions are ambiguous or overlapping | Rewrite tool descriptions; use stream to trace |
| State appears empty | Node returning None instead of updated state | Ensure every node returns a dict or TypedDict |
| Works locally, fails in prod | InMemorySaver used in prod -- state wipes on restart | Switch to SqliteSaver or PostgresSaver |
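The first two rows usually come down to the routing function on a conditional edge. Here's a minimal sketch of the pattern as plain Python -- the node names, state keys, and turn cap are hypothetical, so adapt them to your own state schema:

```python
from typing import Literal

MAX_AGENT_TURNS = 10  # hypothetical cap, a second line of defense after recursion_limit

def route_after_agent(state: dict) -> Literal["tools", "end"]:
    """Routing function for a conditional edge. When wiring it up, map the
    "end" label to END, e.g.:
    builder.add_conditional_edges("agent", route_after_agent,
                                  {"tools": "tools", "end": END})
    """
    last = state["messages"][-1]
    # Terminate when the model produced no tool calls...
    if not last.get("tool_calls"):
        return "end"
    # ...or when the turn cap is hit, so the loop can never run forever.
    if state.get("turns", 0) >= MAX_AGENT_TURNS:
        return "end"
    return "tools"
```

If your agent loops forever, check that every branch of this function can actually reach "end"; if it stops after one step, check that the termination branch isn't firing on the first pass.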
Quick Reference
- Set recursion_limit in config to prevent infinite loops -- start at 25
- Use graph.stream(stream_mode='updates') to trace execution node by node
- Use graph.get_state() and get_state_history() to inspect checkpointed state
- Add temporary debug_node functions to inspect state at specific points
- Enable LangSmith tracing with two environment variables for production observability
- Every node must return a dict -- returning None silently corrupts state