How to Test a LangGraph Agent
A layered method for testing a LangGraph agent: unit-test nodes, verify routing and state, then run simulated users against the compiled graph.
Test a LangGraph agent in layers: unit-test each node and tool as a plain function with the LLM mocked, test the graph's routing and state transitions directly, integration-test full runs with a real model, then run simulated users against the compiled graph and evaluate whole trajectories — not just final answers.
LangGraph's structure is a testing gift
Most agent frameworks bury the agent's logic inside a loop you can't see into. LangGraph doesn't: a StateGraph is an explicit map of nodes (functions that take state and return updates), edges (what runs next), and a typed state object that flows through it. That explicitness means an unusually large share of your agent is ordinary code — and ordinary code can be tested with ordinary tests.
The layers below are ordered by cost and determinism: cheap and deterministic at the bottom, expensive and probabilistic at the top. A common mistake is to live entirely at one end: either unit tests that never catch conversational failures, or vibe-checking full conversations with no cheap tests underneath.
| Layer | What it checks | LLM involved? | Deterministic? |
|---|---|---|---|
| 1. Nodes & tools | Your functions do what they claim | Mocked | Yes |
| 2. Control flow | Routing, state transitions, resumption | Mocked | Yes |
| 3. Integration | Prompt + model + graph work end to end | Real | No |
| 4. Behavioral | Realistic users get good outcomes | Real | No (pin it — see layer 5) |
Layer 1: unit-test nodes and tools
Every node is a function from state to a partial state update. That signature is trivially testable — no graph, no framework harness, no model:
```python
from typing import TypedDict

class OrderState(TypedDict):
    customer_tier: str
    total: float

def apply_discount(state: OrderState) -> dict:
    if state["customer_tier"] == "pro":
        return {"total": state["total"] * 0.9}  # 10% off for the pro tier
    return {}  # no update for other tiers

def test_pro_discount():
    assert apply_discount({"customer_tier": "pro", "total": 100})["total"] == 90

def test_no_discount_for_free_tier():
    assert apply_discount({"customer_tier": "free", "total": 100}) == {}
```
For nodes that call an LLM, split the node: keep prompt construction and response parsing as pure functions you test exhaustively, and mock the model call itself. Malformed model output — the tool call with a missing argument, the JSON with a trailing comment — is one of the most common production failures, and it's fully testable at this layer by feeding your parser bad strings.
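As a minimal sketch of the parsing half of that split, the hypothetical `parse_tool_call` below turns raw model text into either a validated tool call or an error update. The function name, the `REQUIRED_ARGS` table, the `lookup_order` tool, and the `{"tool": ..., "args": ...}` payload shape are all assumptions for illustration; none of them come from LangGraph.

```python
import json

# Hypothetical schema: the arguments each tool requires.
REQUIRED_ARGS = {"lookup_order": {"order_id"}}


def parse_tool_call(raw: str) -> dict:
    """Parse raw model output into a state update; never raise on bad input."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"parse_error": f"invalid JSON: {exc.msg}"}
    if not isinstance(payload, dict):
        return {"parse_error": "expected a JSON object"}
    name = payload.get("tool")
    args = payload.get("args")
    if not isinstance(name, str) or name not in REQUIRED_ARGS:
        return {"parse_error": f"unknown tool: {name!r}"}
    if not isinstance(args, dict):
        return {"parse_error": "args must be an object"}
    missing = REQUIRED_ARGS[name] - set(args)
    if missing:
        return {"parse_error": f"missing arguments: {sorted(missing)}"}
    return {"tool_call": {"name": name, "args": args}}


def test_valid_call_parses():
    raw = '{"tool": "lookup_order", "args": {"order_id": "4412"}}'
    assert parse_tool_call(raw) == {
        "tool_call": {"name": "lookup_order", "args": {"order_id": "4412"}}
    }


def test_missing_argument_is_reported():
    out = parse_tool_call('{"tool": "lookup_order", "args": {}}')
    assert out == {"parse_error": "missing arguments: ['order_id']"}


def test_trailing_comment_is_reported():
    # Text after the closing brace makes the whole string invalid JSON.
    out = parse_tool_call('{"tool": "lookup_order", "args": {}} // done')
    assert "parse_error" in out
```

Returning an error update instead of raising keeps the failure inside the state, where a router can decide whether to retry or escalate.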
Tools deserve the same treatment plus one more case: failure. What does the node return when the API times out or the lookup finds nothing? If a tool error becomes an unhandled exception, the graph dies mid-conversation; that's a unit test, not a discovery for launch week.
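One way to make the failure case testable is to inject the tool client into the node, so a test can swap in a fake that times out. This is a sketch under assumptions: `make_lookup_node`, `OrderNotFound`, and the `tool_error` state key are hypothetical names, not LangGraph APIs.

```python
from typing import Callable


class OrderNotFound(Exception):
    """Hypothetical error a real order client might raise for unknown IDs."""


def make_lookup_node(fetch: Callable[[str], str]) -> Callable[[dict], dict]:
    """Build a node that calls `fetch` and converts failures into state."""

    def lookup_order(state: dict) -> dict:
        order_id = state["order_id"]
        try:
            status = fetch(order_id)
        except TimeoutError:
            return {"tool_error": "timeout"}
        except OrderNotFound:
            return {"tool_error": "not_found"}
        return {"order_status": status}

    return lookup_order


def test_timeout_becomes_state():
    def timing_out(order_id: str) -> str:
        raise TimeoutError("upstream took too long")

    node = make_lookup_node(timing_out)
    assert node({"order_id": "4412"}) == {"tool_error": "timeout"}


def test_missing_order_becomes_state():
    def not_found(order_id: str) -> str:
        raise OrderNotFound(order_id)

    node = make_lookup_node(not_found)
    assert node({"order_id": "0000"}) == {"tool_error": "not_found"}


def test_success_returns_status():
    node = make_lookup_node(lambda order_id: "shipped")
    assert node({"order_id": "4412"}) == {"order_status": "shipped"}
```

The factory returns a plain one-argument function, so the same node can be registered on the graph with the real client and exercised in tests with fakes.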
Layer 2: test the graph's control flow
LangGraph's conditional edges are driven by router functions — plain functions that read state and return the name of the next node. Test them like the pure functions they are:
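```python
from typing import TypedDict

class SupportState(TypedDict):
    needs_human: bool

def route_after_triage(state: SupportState) -> str:
    if state["needs_human"]:
        return "escalate"  # hand the conversation to a human
    return "respond"

def test_frustrated_user_escalates():
    assert route_after_triage({"needs_human": True}) == "escalate"
```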
Beyond individual routers, verify the paths that matter end to end through the graph with the LLM mocked: compile the graph, stub the model nodes to return canned outputs, and assert the run visits the nodes you expect in the order you expect. This catches wiring bugs — an edge pointing to the wrong node, a branch that can never be reached, a loop with no exit — that no amount of node-level testing will find.
Two LangGraph-specific behaviors belong in this layer:
- State transitions. Assert that state accumulates correctly across nodes — especially reducer-managed keys like message lists, where a wrong annotation silently overwrites instead of appending.
- Checkpointing and resumption. If you compile with a checkpointer, a conversation's state persists per `thread_id`. Test the resume path: invoke once, invoke again on the same thread, and assert the second turn sees the first turn's state. If you use interrupts for human-in-the-loop approval, test that a paused run actually resumes where it stopped rather than restarting.
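```python
from langgraph.checkpoint.memory import InMemorySaver

# `graph` is the StateGraph builder for your agent, defined elsewhere.
app = graph.compile(checkpointer=InMemorySaver())
cfg = {"configurable": {"thread_id": "test-thread-1"}}
app.invoke({"messages": [{"role": "user", "content": "My order is late"}]}, cfg)
result = app.invoke({"messages": [{"role": "user", "content": "It's order 4412"}]}, cfg)
# The second turn's context should include the first turn. This check assumes
# `messages` uses the add_messages reducer, which stores message objects.
assert any(m.content == "My order is late" for m in result["messages"])
```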
Layer 3: integration-test whole runs
Layers 1 and 2 prove your code is right. They prove nothing about whether your prompts, the real model, and the graph produce good behavior together — mocked responses are, by construction, the responses you expected. So run the compiled graph with the real model on a small, fixed set of scenarios: the happy path, one tool-failure path, one out-of-scope request, one ambiguous request.
Because the model is non-deterministic, assert on properties, not exact strings: the run terminated, the right tool was called with plausible arguments, the final state contains a booking ID, the response is in the user's language. Keep this suite small — it's slow, costs real tokens, and flakes — and treat a flaky property test as information: if "the agent asks a clarifying question" passes only 70% of the time, that's not a bad test, that's a measurement.
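Treating a flaky property as a measurement can be as simple as running the scenario several times and reporting the pass rate. The helper below is a sketch: `pass_rate` and `asks_clarifying_question` are hypothetical names, and `run` stands for whatever callable invokes your compiled graph and returns its final state.

```python
from typing import Callable


def pass_rate(
    run: Callable[[], dict],
    check: Callable[[dict], bool],
    trials: int = 10,
) -> float:
    """Run a scenario `trials` times; return the fraction of runs passing `check`."""
    passed = sum(1 for _ in range(trials) if check(run()))
    return passed / trials


def asks_clarifying_question(final_state: dict) -> bool:
    # A property check: the reply ends with a question, whatever its wording.
    # Assumes the final state carries the agent's text under a "reply" key.
    return final_state["reply"].rstrip().endswith("?")
```

A suite can then assert a threshold, such as a pass rate of at least 0.9 over 20 trials, and log the measured rate so that drift shows up before the threshold is crossed.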
Layer 4: behavioral testing with simulated users
Everything so far tests the agent against inputs you wrote — which means inputs you thought of. The failures that hurt in production come from users you didn't think of: the person who gives information out of order, types in Spanish, changes their mind at turn five, or calls back tomorrow expecting to be remembered.
This layer runs realistic simulated users against your compiled graph, each with its own goal, phrasing, patience, and context, over full multi-turn conversations. (How user simulators work, and how they go wrong, is covered in User Simulation for AI Agents.) Your checkpointer earns its keep here: because LangGraph persists per-thread state, a simulated user can hold a genuine multi-session relationship with the agent — first contact today, follow-up tomorrow on the same thread — and you can test whether memory-dependent behavior actually holds up rather than assuming it does.
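The driver loop for such a run can be small even when the simulator behind it is not. The sketch below uses a deliberately simple scripted user in place of an LLM-driven one; `ScriptedUser`, `run_conversation`, and the stub agent are all hypothetical. In a real suite, `agent_turn` would wrap a call to the compiled graph with a fixed `thread_id`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScriptedUser:
    """A simple simulated user: fixed utterances and limited patience."""

    goal: str
    utterances: list[str]
    patience: int = 4  # maximum turns before the user gives up


def run_conversation(
    user: ScriptedUser,
    agent_turn: Callable[[str], str],
    is_done: Callable[[str], bool],
) -> dict:
    """Drive one multi-turn conversation and record the full transcript."""
    transcript = []
    for turn, text in enumerate(user.utterances[: user.patience], start=1):
        reply = agent_turn(text)
        transcript.append((text, reply))
        if is_done(reply):
            return {"goal": user.goal, "success": True, "turns": turn, "transcript": transcript}
    return {"goal": user.goal, "success": False, "turns": len(transcript), "transcript": transcript}


def make_stub_agent() -> Callable[[str], str]:
    """Stand-in agent that confirms once it has seen both facts, in any order."""
    seen: list[str] = []

    def agent_turn(text: str) -> str:
        seen.append(text)
        joined = " ".join(seen)
        if "4412" in joined and "late" in joined:
            return "CONFIRMED: order 4412 flagged as late"
        return "Could you tell me more?"

    return agent_turn
```

A user who gives the order number before describing the problem exercises the out-of-order case; lowering `patience` to 1 exercises the user who leaves before the agent has what it needs.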
Evaluate trajectories, not just final answers. A LangGraph run gives you the full path — which nodes fired, which tools were called, how state evolved — and a correct final answer reached by calling the refund tool three times is a failure that answer-only scoring will grade as a pass. Metrics and methods for trajectory-level scoring are in the AI Agent Evaluation guide.
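A trajectory check for the refund case might be sketched as follows. The `score_trajectory` function and the per-tool call limits are assumptions for illustration; how you extract the list of tool calls from a run depends on how your graph records them.

```python
from collections import Counter


def score_trajectory(
    tool_calls: list[str],
    final_ok: bool,
    max_calls_per_tool: dict[str, int],
) -> dict:
    """Pass only if the final answer is right AND no tool exceeded its limit."""
    counts = Counter(tool_calls)
    violations = sorted(
        tool for tool, limit in max_calls_per_tool.items() if counts[tool] > limit
    )
    return {
        "final_ok": final_ok,
        "violations": violations,
        "passed": final_ok and not violations,
    }
```

With a limit of one `refund` call, a run that calls `refund` three times fails even when the final answer is correct, which is exactly the case answer-only scoring misses.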
Synthetic Signals connects at this layer: it is framework-agnostic (agents plug in over MCP or OpenTelemetry), so a LangGraph graph is tested the same way as any other agent, against a Census-grounded population of thousands of distinct synthetic users rather than a handful of hand-written test personas.
Layer 5: regression-test with pinned populations
A LangGraph agent changes constantly — prompts tuned, nodes added, model versions bumped — and each change can un-fix an old failure. The discipline that prevents this: pin your behavioral test inputs. Same simulated users, same scenarios, same seeds, re-run on every change; when a conversation fails and you fix it, that exact scenario joins the suite permanently. With reproducible populations, yesterday's one-off failure becomes tomorrow's automated gate. The general technique — separating input determinism, which you can control, from output determinism, which you can't — is the subject of Regression Testing Non-Deterministic Agents.
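Pinning inputs mostly comes down to deriving the population from an explicit seed. The sketch below is one way to do that with the standard library; the `Persona` fields and the `LANGUAGES` and `GOALS` lists are invented for illustration and far simpler than a grounded population.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    user_id: int
    language: str
    patience: int
    goal: str


# Hypothetical attribute pools for the simulated users.
LANGUAGES = ["en", "es", "de"]
GOALS = ["track_order", "request_refund", "change_address"]


def build_population(seed: int, size: int) -> list[Persona]:
    """Build the same personas for the same seed, independent of global RNG state."""
    rng = random.Random(seed)  # local generator, so other code cannot perturb it
    return [
        Persona(
            user_id=i,
            language=rng.choice(LANGUAGES),
            patience=rng.randint(2, 8),
            goal=rng.choice(GOALS),
        )
        for i in range(size)
    ]
```

Storing the seed and size next to the test results is enough to rebuild the exact population later; a failing persona can also be copied out and added to the suite as a fixed scenario.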
Honest limits
Layered testing narrows the gap between "compiles" and "works"; it doesn't close it. Mocked-LLM tests verify your code, not your prompts. Integration tests sample the model's behavior a handful of times. Simulated users approximate real ones and inherit an LLM's politeness unless deliberately grounded. And LangGraph's own moving parts — checkpointer backends, streaming, subgraphs — have version-specific behavior worth re-verifying against current docs when you upgrade. The stack's job is to make production surprises new surprises, not ones a cheap test could have caught.
Start at the bottom: an afternoon of node and router tests catches the embarrassing failures, and each layer above catches the class the previous one structurally can't.
FAQ
How do you unit test a LangGraph node?
A LangGraph node is a function that takes state and returns a state update, so you can call it directly in a plain test: construct an input state, invoke the function, and assert on the returned update. Mock the LLM call where the node's own logic is what you're testing.
How do you test conditional edges in LangGraph?
Conditional edges are driven by a router function that reads state and returns the name of the next node. Test that function directly: feed it states representing each branch and assert it returns the right destination, including the fallback path.
Should you mock the LLM when testing a LangGraph agent?
Mock it in unit and control-flow tests, where you're verifying your own code and determinism matters. Use the real model in integration and behavioral tests, because prompt-model interaction is precisely what those layers exist to check.
How do you regression test a LangGraph agent?
Pin the test inputs — the same simulated users, scenarios, and seeds — and re-run them against the graph on every change. When a run fails, keep that exact scenario in the suite permanently so the failure can't silently return.