Multi-agent systems moved from "research project" to "daily infrastructure" in 2026. Three frameworks are doing the most real-world work.
LangGraph — state machines for agents
LangChain's state-machine framework. Nodes are agents or tools; edges are state transitions. Best for:
- Deterministic workflows (this → that → maybe_this)
- Systems where you need to inspect agent state at every step
- Production workloads with observability requirements
The learning curve is steep because you have to think in graph terms from day one.
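The graph mental model is easy to show without the framework. Here is a minimal sketch in plain Python — not the actual LangGraph API, and all node/edge names are invented — of the core idea: nodes are functions over state, edges pick the next node, and state is inspectable at every step.

```python
# Conceptual sketch of the state-machine model LangGraph asks you to think in.
# Plain Python, NOT the real LangGraph API; node names are illustrative.

def classify(state: dict) -> dict:
    state["route"] = "refund" if "refund" in state["msg"].lower() else "general"
    return state

def handle_refund(state: dict) -> dict:
    state["answer"] = "Routing to refunds team"
    return state

def handle_general(state: dict) -> dict:
    state["answer"] = "General support reply"
    return state

# Nodes are functions over state; edges return the name of the next node (or None).
NODES = {"classify": classify, "refund": handle_refund, "general": handle_general}
EDGES = {
    "classify": lambda s: s["route"],
    "refund": lambda s: None,
    "general": lambda s: None,
}

def run(state: dict, start: str = "classify") -> dict:
    node = start
    while node is not None:
        state = NODES[node](state)   # every intermediate state is inspectable here
        node = EDGES[node](state)    # edge = state transition
    return state
```

The real framework adds persistence, streaming, and LangSmith tracing on top of this loop, but the mental shift — "think in nodes and transitions, not in call stacks" — is exactly this.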
DSPy — prompts as code
Stanford's DSPy treats prompts as optimizable code. You declare a signature ("summarize this → 100-word summary"), pick a strategy (ChainOfThought, ReAct), and DSPy compiles prompt variations it benchmarks on your test set. Best for:
- Teams with an evaluation loop
- Complex multi-hop reasoning
- Anyone tired of hand-tuning prompts
Trade-off: the optimization loop needs real test data, so greenfield projects with no eval set bootstrap more slowly.
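The optimization loop is simpler than it sounds. A hedged sketch of the idea in plain Python — not the DSPy API; `call_llm` and the candidate prompts are stand-ins for what DSPy generates and benchmarks for you:

```python
# What "compile prompt variations and benchmark on your test set" means,
# stripped to the bones. Plain Python, NOT the DSPy API.

def evaluate(prompt_template: str, test_set: list, call_llm) -> float:
    """Score a candidate prompt by exact-match accuracy on the test set."""
    hits = 0
    for example in test_set:
        output = call_llm(prompt_template.format(**example))
        hits += int(output.strip() == example["expected"])
    return hits / len(test_set)

def compile_best(candidates: list, test_set: list, call_llm) -> str:
    """Pick the highest-scoring candidate. DSPy automates generating the
    candidates and supports richer metrics; here they are hand-written."""
    return max(candidates, key=lambda p: evaluate(p, test_set, call_llm))
```

This is also why the trade-off above bites: with no test set, `evaluate` has nothing to score against, and the whole approach degrades to hand-tuning.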
n8n — visual workflows for non-developers
n8n is the Zapier/Make.com of agents. Drag-and-drop nodes, visual triggers, 400+ integrations. Best for:
- Glue code between SaaS tools (Slack → GPT → Notion)
- Teams with mixed technical skill
- Quick automation you'd otherwise leave as a cron + Python script
Less good when your workflow involves complex LLM chaining or stateful agent loops.
When to roll your own
All three of the above are overkill for simple cases. Sometimes "a Python script that calls OpenAI in a loop and writes to a database" is the right answer. The NSC Dashboard is a real-world example of this — no orchestration framework; just cron + Python + SQLite as a message bus.
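The "SQLite as a message bus" pattern fits in a page. A minimal sketch — table and column names are assumptions for illustration, not the actual NSC Dashboard schema:

```python
# SQLite-as-message-bus sketch: one process enqueues tasks, a cron-driven
# worker claims and processes them. Schema is illustrative.
import json
import sqlite3
import time

def init(db: sqlite3.Connection) -> None:
    db.execute("""CREATE TABLE IF NOT EXISTS tasks (
        id INTEGER PRIMARY KEY, kind TEXT, payload TEXT,
        status TEXT DEFAULT 'pending', created REAL)""")

def enqueue(db: sqlite3.Connection, kind: str, payload: dict) -> int:
    cur = db.execute(
        "INSERT INTO tasks (kind, payload, created) VALUES (?, ?, ?)",
        (kind, json.dumps(payload), time.time()))
    db.commit()
    return cur.lastrowid

def claim_next(db: sqlite3.Connection):
    """Claim the oldest pending task. SQLite's single-writer lock is what
    makes this safe without a real queue — and why it stops scaling past
    a handful of workers."""
    row = db.execute("SELECT id, kind, payload FROM tasks "
                     "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    db.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
    db.commit()
    return {"id": row[0], "kind": row[1], "payload": json.loads(row[2])}
```

Cron fires the worker, the worker calls `claim_next` until it returns `None`, and "orchestration" is just rows changing status. When that stops being enough, you've earned a real framework.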
Decision matrix
| Your situation | Pick |
|---|---|
| SaaS integrations | n8n |
| Production LLM chains with observability | LangGraph |
| Optimizing prompts against test data | DSPy |
| Simple script-and-DB orchestration | Roll your own |
| Multi-agent research project | LangGraph |
| Moving to production fast | n8n or roll your own |
Detailed framework comparison
| Dimension | LangGraph | DSPy | n8n | Roll-your-own |
|---|---|---|---|---|
| Programming model | State machine (graph) | Prompt-as-code | Visual drag-drop | Whatever you like |
| Learning curve | Steep | Steep | Gentle | Depends on you |
| Best for | Deterministic flows | Optimisable prompts | SaaS glue | Known requirements |
| Observability out of the box | LangSmith integration | Partial | Execution log | Build it yourself |
| LLM provider support | Many via LangChain | Many | Many | You pick |
| Testing story | Unit-test each node | Built-in eval loop | Clunky | Your test framework |
| Production maturity | High | Medium | High (for SaaS) | Depends |
| Multi-agent / handoffs | Native | Awkward | Manual | Whatever you build |
When each wins
LangGraph — deterministic flows you need to observe
Best when you know the workflow shape (classify → route → handle → verify) and you need to trace every node's state. The integration with LangSmith gives you the observability most teams want. Use it when:
- You have compliance requirements (financial, healthcare).
- Workflow shape is stable — you're not iterating on "what are the nodes?" weekly.
- Team has prior LangChain experience.
DSPy — prompts you want to keep improving
DSPy's pitch is: stop hand-tuning prompts; define the signature, pick a strategy, let the framework find the best prompt variant against your test set. Use it when:
- You have a reliable eval set (or can build one quickly).
- Prompts are the bottleneck (not tool use, not multi-agent coordination).
- You're willing to buy into the DSPy DSL.
n8n — glue between SaaS
Not really an "agent orchestrator" but often used as one. Drag-drop nodes for Slack / Notion / HubSpot / OpenAI / Anthropic. Use it when:
- The workflow is mostly SaaS-to-SaaS with light LLM sprinkled in.
- Team includes non-developers.
- You want to prototype agent-like workflows without writing Python.
Roll your own — when the frameworks get in your way
For systems where the agent pattern is well-understood and stable, a FastAPI + Postgres + BullMQ (or Celery) stack is less code than learning LangGraph. Use it when:
- You know exactly what the agents need to do.
- Observability requirements are specific (you want OpenTelemetry, not LangSmith).
- Team prefers "boring" Python infrastructure.
How we tested and compared
The SpecPicks multi-agent pipeline (market-research → blueprint → roadmap → build → test-fix loop) runs a roll-your-own stack — Python + Postgres + openclaw cron jobs. We evaluated LangGraph and DSPy during the initial design and walked away from both for different reasons: LangGraph added observability overhead we didn't need; DSPy's eval loop didn't fit a pipeline with no pre-existing test set.
For framework-specific patterns we cross-reference the LangGraph documentation, DSPy documentation, and n8n docs. The roll-your-own pattern shown here is a simplification of what NSC Dashboard actually runs.
Roll-your-own pattern — minimal working skeleton
```python
# agents/classify.py — decide what kind of request this is
def classify(req: dict) -> str:
    resp = llm.complete(
        f"Classify this request: {req}. "
        "Choose one of: simple_query, complex_task, human_needed."
    )
    return resp.strip()

# agents/complex_handler.py — multi-turn tool-using agent
def handle_complex(req: dict) -> dict:
    ctx = {"req": req, "tools": TOOLS, "history": []}
    for step in range(MAX_STEPS):
        action = llm.tool_use(prompt=ctx)
        ctx["history"].append(action)
        if action.type == "final_answer":
            return {"answer": action.content, "steps": step + 1}
    return {"error": "max_steps_exceeded"}

# orchestrator.py — wire them together
def orchestrate(req: dict):
    route = classify(req)
    if route == "simple_query":
        return llm.complete(req["question"])
    elif route == "complex_task":
        return handle_complex(req)
    elif route == "human_needed":
        return {"needs_human": True, "request_id": queue_for_human(req)}
    # Classifiers are LLMs: assume they will eventually return something unexpected.
    return {"error": f"unknown_route:{route}"}
```

That's about 30 lines and it handles classification, routing, and multi-turn agent loops. (`llm`, `TOOLS`, `MAX_STEPS`, and `queue_for_human` are left to you.) Add Postgres-backed state, replace `queue_for_human` with a real queue, and you have a production orchestrator.
Common failure modes across all four approaches
1. Infinite loops. Every agentic system eventually hits "agent decides to call the same tool forever." Cap steps; log when you hit the cap; review logs for patterns.
2. Stale context. Long-running agents accumulate irrelevant history. Periodically compact (summarise early turns) or restart with fresh context.
3. Cost blow-up. Multi-turn agents burn tokens fast. A LiteLLM proxy with per-agent spend caps is the single best guardrail.
4. Tool failures cascade. If web_search tool 500s, does the agent retry, fall back, or surface the error? Most frameworks punt on this; you build it.
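Failure mode 4 is the one you almost always end up building yourself. A minimal sketch of bounded retries with backoff and an explicit fallback result — function and field names here are illustrative:

```python
# Stop a flaky tool from cascading: bounded retries with exponential backoff,
# then a failure surfaced as data the agent can reason about. Names illustrative.
import time

def call_with_retry(tool, args: dict, retries: int = 3, base_delay: float = 0.5):
    last_err = None
    for attempt in range(retries):
        try:
            return {"ok": True, "result": tool(**args)}
        except Exception as err:  # in production, catch the tool's specific errors
            last_err = err
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    # Surface the failure as data, not an exception — this lets the agent
    # decide whether to try another tool, answer without it, or escalate.
    return {"ok": False, "error": str(last_err)}
```

The key design choice is the last line: returning `{"ok": False, ...}` instead of raising keeps the failure inside the agent loop, where it can become a fallback or a human escalation rather than a stack trace.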
Frequently asked questions
Do I need an orchestrator framework at all?
Often no. A single LLM call with tool use is enough for 80% of "agent-shaped" tasks. Only reach for an orchestrator when you have clear multi-step state or multi-agent handoffs.
What about AutoGen / CrewAI / others?
- AutoGen (Microsoft): conversational multi-agent pattern; good for "agents talking to each other" workflows.
- CrewAI: role-based agents, more opinionated than LangGraph. Good for small teams ramping up fast.
- smolagents (Hugging Face): minimal Python agent library; good middle ground between "roll your own" and full frameworks.
How do I add observability to a roll-your-own stack?
OpenTelemetry + a trace visualiser (Jaeger, Tempo, Phoenix by Arize). Add a span per LLM call; attach request/response as attributes. This is what NSC Dashboard does internally.
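The span-per-LLM-call pattern is worth seeing in miniature. A stdlib sketch with an in-memory list standing in for the trace exporter — with OpenTelemetry you'd use `tracer.start_as_current_span` and a real exporter instead; this only shows the shape:

```python
# The shape of "one span per LLM call" without the OTel SDK: a context
# manager recording timing plus request/response attributes.
import time
from contextlib import contextmanager

SPANS = []  # stand-in for a trace exporter (Jaeger, Tempo, Phoenix)

@contextmanager
def llm_span(name: str, **attributes):
    span = {"name": name, "attrs": dict(attributes), "start": time.time()}
    try:
        yield span  # caller attaches the response: span["attrs"]["response"] = ...
    finally:
        span["duration_s"] = time.time() - span["start"]
        SPANS.append(span)
```

Wrap every LLM call in one of these and you already have the two things trace visualisers need: a name to group by and attributes to drill into. Swapping the list for the OTel SDK is then a mechanical change.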
Can I use Claude Code as an orchestrator?
Yes — Claude Code is effectively an agentic orchestrator pre-wired for software-engineering tasks. For domain-specific orchestration (non-coding), it's overkill; use it as a reference implementation and build your own.
What language are most production agent stacks in?
Python dominates (LangGraph, DSPy, LangChain, AutoGen are all Python-first). TypeScript / Node is second (n8n, Mastra). Go and Rust are niche but growing. Ruby / Java are rare.
Sources
- Anthropic — Claude Code best practices — reference for production agentic patterns.
- LangGraph documentation — canonical LangGraph reference.
- r/LocalLLaMA — community agent-framework discussions.
- LiteLLM documentation — proxy layer commonly wrapped around agent stacks.
- Aider GitHub repository — reference for an opinionated agent loop.
Related guides
- NSC Dashboard — multi-agent orchestration stack
- AI coding assistants ranked
- Self-hosted Claude proxy
- Self-hosting an OpenAI-compatible LLM gateway
— SpecPicks Editorial · Last verified 2026-04-21
