Agent Frameworks and Best-Practice Architectures

The agent framework ecosystem is churning. New libraries appear monthly; existing ones pivot their architecture every few months. Picking the wrong framework doesn't doom your project, but picking with clear eyes — knowing what each one optimizes for and what it trades away — saves you from expensive migrations.

This article covers the major frameworks in production use today. For each: what it's actually good at, when you should choose it, what the observability story looks like, and what you give up. Then we close with the framework-agnostic principles that apply regardless of what you're running.

The Frameworks

LangGraph

What it is: A graph-based orchestration framework from LangChain. Your agent logic is expressed as a directed (optionally cyclic) graph: nodes are functions or LLM calls, edges are transitions. State flows through the graph; each node reads and writes to a shared state dict.

Positioning: LangGraph is the most expressive framework for complex, stateful, multi-step workflows. If your agent needs cycles (loops, retries, reflections), conditional branching, or human-in-the-loop interruption points, LangGraph's graph model fits naturally. It's the framework LangChain built specifically to replace their AgentExecutor abstraction, which was string-based and didn't handle cycles or complex branching well.
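The graph model is easier to judge with a concrete shape in front of you. The following is a minimal plain-Python sketch of the idea, not LangGraph's API: `run_graph`, `draft`, `review`, `after_review`, and the `END` sentinel are hypothetical names chosen for illustration. It shows nodes reading and writing a shared state dict, and a conditional edge that forms a retry cycle.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
END = "__end__"  # hypothetical sentinel marking the terminal transition


def draft(state: State) -> State:
    # Node: produce (or re-produce) a draft; a real node would call an LLM.
    attempts = state.get("attempts", 0) + 1
    return {**state, "attempts": attempts, "draft": f"draft v{attempts}"}


def review(state: State) -> State:
    # Node: approve only from the second attempt on, to force one cycle.
    return {**state, "approved": state["attempts"] >= 2}


def after_review(state: State) -> str:
    # Conditional edge: loop back to "draft" until the review passes.
    return END if state["approved"] else "draft"


NODES: Dict[str, Callable[[State], State]] = {"draft": draft, "review": review}
EDGES: Dict[str, Callable[[State], str]] = {
    "draft": lambda state: "review",  # unconditional edge
    "review": after_review,           # conditional edge (creates the cycle)
}


def run_graph(entry: str, state: State, max_steps: int = 10) -> State:
    # Walk the graph, threading state through each node. The step cap keeps
    # a cycle that never exits from looping forever.
    node, path = entry, []
    for _ in range(max_steps):
        if node == END:
            break
        path.append(node)
        state = NODES[node](state)
        node = EDGES[node](state)
    return {**state, "path": path}


result = run_graph("draft", {})
```

Here `result["path"]` records the visited nodes, which is the same information you would want surfaced as graph metadata in a trace.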

When to pick it:

Your workflow is genuinely a graph, not a linear chain
You need human-in-the-loop interruption at specific nodes
You're already in the LangChain/LangSmith ecosystem
You need persistent state across sessions (LangGraph's checkpointing handles this)

Observability: First-class LangSmith integration. Every node execution is a traced span with inputs and outputs. The graph structure is visible in the LangSmith UI. For non-LangSmith backends, LangGraph emits OTEL spans when configured with an OTEL exporter — but the graph metadata (which node, which edge was taken) is not always captured in the OTEL attributes by default. Verify your OTEL config before assuming full trace fidelity.

Tradeoffs: LangGraph's expressiveness comes with complexity. The graph definition, state schema, conditional edges, and interrupt points add cognitive overhead. Simple linear tasks feel overengineered in LangGraph. The LangChain ecosystem dependency is also real — if LangChain changes (and it has, frequently), your LangGraph code may need updates.

AutoGen (Microsoft)

What it is: A multi-agent conversation framework from Microsoft Research. Agents are modeled as participants in a conversation — they have roles, they can be LLM-backed or human-backed, and they communicate via structured messages. The "groupchat" model lets N agents talk to each other, with a manager agent (or round-robin) controlling the floor.

Positioning: AutoGen is the strongest choice for genuinely multi-agent conversational workflows — debate, critique-and-revise, multi-expert panels. The conversation model is intuitive and the human-in-the-loop story is the best of any framework (a UserProxyAgent seamlessly represents a human participant in the conversation).

When to pick it:

You're building a reflection or debate pattern (generator + critic)
You want human-in-the-loop at the message level, not just at interrupt points
You're prototyping quickly and want readable code
Your team thinks in terms of roles and conversations rather than graphs or chains

Observability: AutoGen's observability story has improved significantly with AutoGen 0.4+ (the actor-based rewrite, sometimes called AutoGen-Core). The framework was in active architectural transition in 2025, so confirm the current version before relying on the details here. The newer version supports OTEL instrumentation. Older AutoGen (pre-0.4) relies on callback-based logging that doesn't produce OTEL-compatible spans. Check which version your code targets before assuming trace compatibility.

Tradeoffs: The groupchat model can be opaque — it's not always clear which agent said what, in which order, and why the conversation ended when it did. Non-determinism is high: the same task can produce different conversation flows on different runs. This makes debugging harder and snapshot testing unreliable unless you control the conversation topology precisely.
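One way to picture "controlling the conversation topology precisely" is a fixed speaker order with an explicit stop rule. The sketch below is plain Python, not AutoGen's API: `Agent`, `round_robin_chat`, the two reply functions, and the `APPROVED` stop word are all assumptions made for illustration, and the reply functions stand in for LLM or human participants.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (speaker name, content)


@dataclass
class Agent:
    name: str
    reply: Callable[[List[Message]], str]  # stand-in for an LLM or a human


def round_robin_chat(agents: List[Agent], task: str, max_turns: int,
                     stop_word: str = "APPROVED") -> List[Message]:
    # Each agent sees the full transcript and appends one message per turn.
    transcript: List[Message] = [("user", task)]
    for turn in range(max_turns):
        speaker = agents[turn % len(agents)]  # fixed, predictable floor order
        content = speaker.reply(transcript)
        transcript.append((speaker.name, content))
        if stop_word in content:  # explicit, inspectable termination rule
            break
    return transcript


def writer_reply(transcript: List[Message]) -> str:
    # Generator role: number each draft by how many the writer has produced.
    revisions = sum(1 for name, _ in transcript if name == "writer")
    return f"summary draft {revisions + 1}"


def critic_reply(transcript: List[Message]) -> str:
    # Critic role: accept the second draft, reject anything earlier.
    last = transcript[-1][1]
    return "APPROVED" if last.endswith("2") else "too vague, revise"


chat = round_robin_chat(
    [Agent("writer", writer_reply), Agent("critic", critic_reply)],
    task="Summarize the incident report",
    max_turns=6,
)
```

Because the transcript stores the speaker with every message and the loop has a single exit condition, "who said what, in which order, and why it ended" can be read straight off the returned list.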

CrewAI

What it is: A higher-level framework that models multi-agent systems as a "crew" — agents have roles, goals, and backstories; tasks have expected outputs; the crew executes tasks in a defined sequence or in parallel. It's built on LangChain under the hood (though the project has been moving toward framework independence).
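To make the role/task/crew metaphor concrete, here is a small plain-Python sketch of sequential execution. It does not use CrewAI's classes: `RoleAgent`, `Task`, and `run_crew` are hypothetical names, and the `work` callables stand in for LLM-backed agents.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoleAgent:
    role: str
    goal: str
    # (task description, output of the previous task) -> this task's output
    work: Callable[[str, str], str]


@dataclass
class Task:
    description: str
    expected_output: str
    agent: RoleAgent


def run_crew(tasks: List[Task]) -> List[str]:
    # Sequential process: each task receives the previous task's output.
    outputs: List[str] = []
    context = ""
    for task in tasks:
        context = task.agent.work(task.description, context)
        outputs.append(context)
    return outputs


researcher = RoleAgent("Researcher", "Collect facts",
                       lambda desc, ctx: f"notes on: {desc}")
writer = RoleAgent("Writer", "Draft the brief",
                   lambda desc, ctx: f"brief based on [{ctx}]")

outputs = run_crew([
    Task("agent frameworks", "bullet-point notes", researcher),
    Task("write a one-page brief", "a short brief", writer),
])
```

The loop has no branching or per-task error path, which is the same simplicity that makes the abstraction readable and, later, limiting.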

Positioning: CrewAI prioritizes developer experience and fast prototyping. The role/task/crew abstraction is legible to non-engineers, which makes it useful for cross-functional teams where the product manager can read and understand the agent definition. It's the fastest path from "I have an idea for a multi-agent workflow" to a running demo.

When to pick it:

Fast prototyping is the priority
Your workflow maps naturally to "a team of people with defined roles working on defined tasks"
You want readable agent definitions without deep framework knowledge

Observability: CrewAI logs task execution and agent interactions. OTEL support is community-contributed and varies by version, so check its status for the CrewAI release you are running. Out of the box, tracing fidelity is insufficient for span-level attribution; you'll likely need to add spans manually around key steps.

Tradeoffs: The high-level abstraction is also a ceiling. Complex control flow (conditional execution, error handling at the task level, partial retries) is awkward in CrewAI's task model. The framework makes simple things simple and complex things hard. When your requirements outgrow the role/task/crew metaphor, you'll feel it immediately.

OpenAI Agents SDK (formerly Swarm)

What it is: OpenAI's official framework for multi-agent systems. It was released as "Swarm" in 2024 (explicitly experimental and deprecated quickly), then superseded by the Agents SDK. The Agents SDK is built around OpenAI's function-calling and the concept of "handoffs" — structured transitions from one agent to another.

Positioning: If you're using OpenAI models exclusively and want a framework that's designed around OpenAI's API surface (function calling, structured outputs, the Responses API), the Agents SDK is the obvious choice. The handoff mechanism is clean: one agent can hand control to another agent with a structured context transfer, and the transition is explicit in the trace.
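The handoff pattern itself is simple enough to sketch without the SDK. The code below is a plain-Python illustration, not the Agents SDK's API: `Handoff`, `HandoffAgent`, `run_with_handoffs`, and the two step functions are hypothetical. It shows an agent returning either a final answer or a structured transfer to another agent, with each transition recorded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Handoff:
    target: str              # name of the agent taking over
    context: Dict[str, str]  # structured context passed along


@dataclass
class HandoffAgent:
    name: str
    # Returns a final answer (str) or a Handoff to another agent.
    step: Callable[[Dict[str, str]], Union[str, Handoff]]


def run_with_handoffs(agents: Dict[str, HandoffAgent], start: str,
                      context: Dict[str, str], max_hops: int = 5
                      ) -> Tuple[str, List[Tuple[str, str]]]:
    trace: List[Tuple[str, str]] = []  # one (from, to) pair per handoff
    current = start
    for _ in range(max_hops):
        result = agents[current].step(context)
        if isinstance(result, Handoff):
            trace.append((current, result.target))
            current, context = result.target, result.context
            continue
        return result, trace
    raise RuntimeError("handoff limit exceeded")


def triage_step(context: Dict[str, str]) -> Union[str, Handoff]:
    if "refund" in context["message"]:
        return Handoff("billing", {**context, "intent": "refund"})
    return "How can I help?"


def billing_step(context: Dict[str, str]) -> Union[str, Handoff]:
    return f"Refund started (intent={context['intent']})"


AGENTS = {
    "triage": HandoffAgent("triage", triage_step),
    "billing": HandoffAgent("billing", billing_step),
}
answer, handoffs = run_with_handoffs(AGENTS, "triage",
                                     {"message": "I want a refund"})
```

Each entry in `handoffs` corresponds to one agent-to-agent boundary, which is the point where a distinct span would start in a trace.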

When to pick it:

OpenAI-only stack
You want first-party support and maintenance from the model provider
The handoff pattern (explicit agent-to-agent transitions) fits your workflow

Observability: OpenAI provides tracing via the Agents SDK's built-in run API. Traces are available via the OpenAI platform dashboard. OTEL export is supported. Handoff spans are well-structured — each handoff creates a distinct span boundary, which makes attribution cleaner than frameworks that blur the agent transition.

Tradeoffs: OpenAI lock-in is real. The SDK is designed around OpenAI's API surface, so switching to Anthropic or open-source models typically means working against the framework or moving away from it. If you rely on structured outputs or function-calling features specific to OpenAI, porting to another framework means porting those patterns too.

LlamaIndex Agents

What it is: The agent layer of the LlamaIndex ecosystem. LlamaIndex's primary strength is retrieval-augmented generation (RAG) — its agent layer sits on top of that, letting you build agents that can query indexes, execute reasoning steps, and call tools, with native integration with LlamaIndex's data connectors and vector stores.

Positioning: LlamaIndex Agents is the strongest choice when your agent is fundamentally a RAG system with some agentic control flow. If the core of your agent is "retrieve relevant context, reason over it, produce an answer or action," LlamaIndex's native data layer gives you a significant productivity advantage over alternatives.

When to pick it:

Your agent is RAG-heavy
You're already using LlamaIndex for data ingestion
You want native integration with a wide range of data sources (PDFs, databases, APIs) without writing adapters

Observability: LlamaIndex has good tracing support via its CallbackManager and the newer instrumentation module. OTEL support is available. The RAG-specific spans (query, retrieve, synthesize) are well-instrumented, making it possible to trace exactly which chunk was retrieved and how it affected the final output. This is particularly valuable for debugging Poor Information Retrieval failures.

Tradeoffs: The LlamaIndex agent layer is more limited than LangGraph or AutoGen for complex multi-agent control flow. If your workflow has deep coordination requirements, you'll hit LlamaIndex's limits faster than you'd like.

Smolagents (Hugging Face)

What it is: Hugging Face's minimalist agent framework. The design philosophy is explicit: "small, readable, hackable." Agents are code-executing: they generate Python code and execute it, rather than selecting from a predefined tool list. This gives them significantly more flexibility than tool-based agents.

Positioning: Smolagents is the right choice for research teams and for tasks where the space of possible actions can't be enumerated in advance. A code-executing agent can compose ad-hoc solutions from Python libraries without requiring each action to be wrapped as a tool. This is powerful for exploratory tasks.

When to pick it:

Research, prototyping, or tasks where flexibility matters more than reliability
You want to understand what's happening in your framework (the codebase is intentionally small)
You're using open-source or Hugging Face-hosted models

Observability: Smolagents' observability story is nascent. The framework is young enough that OTEL support is not mature. Code-executing agents present a unique tracing challenge: the "tool call" is arbitrary Python, not a structured function call, so the span attributes don't cleanly map to standard gen-ai semconv. Custom instrumentation is required for production use.
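As a rough picture of what "custom instrumentation" means for a code-executing agent, the sketch below wraps one execution and records the generated code, its captured stdout, and the outcome in a span-like dict. It is not Smolagents' API: `traced_exec`, the span keys, and the convention that generated code assigns its answer to a variable named `result` are all assumptions. Note that `exec` is not a sandbox; this only illustrates what to record.

```python
import contextlib
import io
import time
from typing import Any, Dict


def traced_exec(code: str) -> Dict[str, Any]:
    # WARNING: exec() provides no isolation. Untrusted code belongs in a
    # separate process or container; this shows the tracing shape only.
    span: Dict[str, Any] = {"name": "code_execution", "code": code}
    namespace: Dict[str, Any] = {}
    stdout = io.StringIO()
    start = time.perf_counter()
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)
        span["status"] = "ok"
        # Assumed convention: the generated code stores its answer in `result`.
        span["result"] = repr(namespace.get("result"))
    except Exception as exc:
        span["status"] = "error"
        span["error_type"] = type(exc).__name__
        span["error_message"] = str(exc)
    span["stdout"] = stdout.getvalue()
    span["duration_s"] = time.perf_counter() - start
    return span


ok_span = traced_exec("result = sum(range(5))\nprint('done')")
bad_span = traced_exec("result = 1 / 0")
```

Storing the code string itself as a span field is what makes an unstructured "tool call" inspectable after the fact.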

Tradeoffs: Code-executing agents are harder to sandbox (arbitrary code execution is a security surface), harder to trace (unstructured execution), and harder to make deterministic (the same prompt can produce different code). Smolagents is better for research and exploration than for production systems with reliability requirements.

Custom-Built

What it is: No framework — just your own orchestration code on top of the model API.

When to pick it: When your requirements are specific enough that every framework is fighting you. High-throughput production systems, specialized execution semantics, or cases where the framework's abstraction layer is a performance ceiling.

Observability: Whatever you instrument. The advantage is full control over span shapes; the disadvantage is that you're writing all the instrumentation yourself. Instrument to the OTEL gen-ai semconv so your traces are compatible with any backend.
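For a sense of what hand-written instrumentation involves, here is a minimal span recorder using only the standard library. The `span` helper, the `SPANS` list (a stand-in for an exporter), and `fake_model` are hypothetical. The attribute keys are intended to follow the OTEL gen-ai semantic conventions, but those conventions are still evolving, so treat the exact names as an assumption and check them against the current specification.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List

SPANS: List[Dict[str, Any]] = []  # stand-in for a real exporter


@contextmanager
def span(name: str, attributes: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    record: Dict[str, Any] = {"name": name, "attributes": dict(attributes),
                              "status": "ok"}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        # Runs on success and failure, so every span is timed and exported.
        record["duration_s"] = time.perf_counter() - start
        SPANS.append(record)


def fake_model(prompt: str) -> Dict[str, Any]:
    # Hypothetical stand-in for a model API call.
    return {"text": prompt.upper(), "input_tokens": 3, "output_tokens": 3}


with span("chat demo-model", {"gen_ai.operation.name": "chat",
                              "gen_ai.request.model": "demo-model"}) as s:
    reply = fake_model("ping the service")
    s["attributes"]["gen_ai.usage.input_tokens"] = reply["input_tokens"]
    s["attributes"]["gen_ai.usage.output_tokens"] = reply["output_tokens"]
```

In a real system you would likely use the OpenTelemetry SDK's tracer rather than a list, but the span shape you control is the same.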

Tradeoffs: You own the bugs. The framework ecosystem moves fast; you'll be implementing features that frameworks already have. Only custom-build when you've genuinely exhausted the alternatives.

Framework Comparison

Framework	Best for	Observability	Control flow	Lock-in
LangGraph	Complex, stateful, cyclic workflows	Strong (LangSmith native, OTEL)	Graph-native	LangChain ecosystem
AutoGen	Conversational multi-agent, human-in-loop	Improving (0.4+), OTEL supported	Conversation-based	Low
CrewAI	Fast prototyping, role-based teams	Weak (custom instrumentation needed)	Role/task linear	Low
OpenAI Agents SDK	OpenAI-native, clean handoffs	Good (platform + OTEL)	Handoff-based	OpenAI models
LlamaIndex Agents	RAG-heavy workloads	Good (retrieve/synthesize spans)	Limited	LlamaIndex data layer
Smolagents	Research, code-executing agents	Nascent	Flexible (code)	Low
Custom	Unique requirements	Full control	Full control	None

Framework-Agnostic Principles

These apply regardless of what you're running. They're the difference between a multi-agent system that you can confidently maintain and one that becomes a black box you're afraid to touch.

1. Idempotency at Every Step

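Every step in your agent should be safe to run more than once: the same input produces the same output, and repeating the step leaves the system in the same state as running it once. This means: no side effects that aren't tracked, no state mutations that aren't logged, no randomness that isn't seeded.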

Why this matters: replay-based debugging is only possible if you can replay. If step 4 has a non-idempotent side effect that changes the state that step 5 reads, replaying step 4 in isolation produces a different result than replaying the full trace. You can't isolate the failure.

Practical implementation: write tools as pure functions where possible. When side effects are unavoidable (database writes, API calls), make them idempotent via request IDs or conditional writes. Record every state mutation as a span attribute so that none of them remains an untracked side effect.
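The request-ID approach can be sketched in a few lines. `LedgerStore` and `record_payment` are hypothetical names, and the in-memory dict stands in for whatever database or API you actually write to.

```python
from typing import Any, Dict, List


class LedgerStore:
    """In-memory stand-in for a database that accepts idempotent writes."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []
        self._seen: Dict[str, Dict[str, Any]] = {}

    def record_payment(self, request_id: str, account: str,
                       amount: int) -> Dict[str, Any]:
        # A repeated request_id returns the original result and performs
        # no second write, so replaying the step is harmless.
        if request_id in self._seen:
            return self._seen[request_id]
        entry = {"request_id": request_id, "account": account,
                 "amount": amount}
        self.entries.append(entry)
        self._seen[request_id] = entry
        return entry


store = LedgerStore()
first = store.record_payment("req-001", "acct-42", 500)
retry = store.record_payment("req-001", "acct-42", 500)  # replayed step
```

The request ID should be derived from the step's identity in the trace (for example, run ID plus step index) so that a replay reuses it instead of minting a new one.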

2. Partial-State Recovery

Your agent should be able to resume from any checkpoint, not just restart from scratch. This means: define what "state" is at each step, serialize it, and ensure the agent can be initialized from any prior state.

Why this matters: long-running agents fail in the middle. If you can only restart from the beginning, every failure means starting over. If you can resume from step 7 of a 20-step task, you keep the six completed steps (30% of the work) instead of redoing them.

Practical implementation: design your state schema before you design your agent logic. Use LangGraph's checkpointing, or implement your own state persistence with a simple key-value store. Test recovery by injecting failures at each step and verifying the agent resumes correctly.
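A minimal sketch of the "own state persistence" option, including the failure-injection test: `CheckpointStore`, `run`, and `make_step` are hypothetical names, the store is an in-memory dict standing in for a key-value service, and state is kept JSON-serializable on purpose.

```python
import json
from typing import Callable, Dict, List, Optional

StepState = Dict[str, int]
Step = Callable[[StepState], StepState]


class CheckpointStore:
    """Key-value store holding the serialized state after each step."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self, run_id: str, next_step: int, state: StepState) -> None:
        self._data[run_id] = json.dumps({"next_step": next_step,
                                         "state": state})

    def load(self, run_id: str) -> Optional[dict]:
        raw = self._data.get(run_id)
        return json.loads(raw) if raw is not None else None


def run(run_id: str, steps: List[Step], store: CheckpointStore,
        fail_at: Optional[int] = None) -> StepState:
    # Resume from the last checkpoint if one exists, else start fresh.
    checkpoint = store.load(run_id) or {"next_step": 0, "state": {"total": 0}}
    state, start = checkpoint["state"], checkpoint["next_step"]
    for index in range(start, len(steps)):
        if fail_at == index:
            raise RuntimeError(f"injected failure at step index {index}")
        state = steps[index](state)
        store.save(run_id, index + 1, state)  # checkpoint after each step
    return state


executed: List[int] = []  # records which steps actually ran


def make_step(n: int) -> Step:
    def step(state: StepState) -> StepState:
        executed.append(n)
        return {"total": state["total"] + n}
    return step


steps = [make_step(n) for n in range(1, 6)]  # five steps adding 1..5
checkpoints = CheckpointStore()
try:
    run("run-1", steps, checkpoints, fail_at=3)  # fails before the 4th step
except RuntimeError:
    pass
final = run("run-1", steps, checkpoints)  # resumes at the 4th step
```

The `executed` list is the useful assertion target: after recovery, each step should appear exactly once.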

3. Deterministic Replay

If you run the same trace twice with the same inputs, you should get the same output. This requires: fixed random seeds, deterministic tool call ordering, no environmental dependencies that aren't captured in the trace.

Why this matters: deterministic replay is the foundation of regression testing. If you can't replay a trace deterministically, you can't write a test that reliably catches the same failure twice. ProveAI Origin's snapshot and replay feature is built on this primitive — a snapshot stores the input state, and replay re-executes against that state. If your agent isn't deterministic, replay results are noise.

Temperature=0 is not sufficient for determinism. Tool call order, API response variance, and context window limits can all introduce non-determinism even at temperature=0. Audit each source of variance explicitly.
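One common way to remove API response variance from a replay is to record tool results on the live run and serve them back during replay. The sketch below assumes that approach; `ToolRecorder`, `flaky_price`, and `agent` are hypothetical, and it does not describe how any particular product implements replay.

```python
import json
import random
from typing import Any, Callable, Dict, List, Optional


class ToolRecorder:
    """Records tool results on a live run and serves them back on replay."""

    def __init__(self, recorded: Optional[List[Dict[str, Any]]] = None) -> None:
        self.replaying = recorded is not None
        self.log: List[Dict[str, Any]] = list(recorded or [])
        self._cursor = 0

    def call(self, name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        if self.replaying:
            entry = self.log[self._cursor]
            self._cursor += 1
            # Fail loudly if the call sequence diverged from the recording.
            if (entry["name"], entry["args"]) != (name, kwargs):
                raise AssertionError(f"replay diverged at call {self._cursor}")
            return entry["result"]
        result = fn(**kwargs)
        self.log.append({"name": name, "args": kwargs, "result": result})
        return result


def flaky_price(symbol: str) -> float:
    # Stand-in for an external API whose answer changes between calls.
    return round(random.uniform(90, 110), 2)


def agent(recorder: ToolRecorder, seed: int) -> str:
    rng = random.Random(seed)  # seeded, so the agent's own sampling repeats
    symbol = rng.choice(["AAA", "BBB", "CCC"])
    price = recorder.call("price", flaky_price, symbol=symbol)
    return f"{symbol} at {price}"


live = ToolRecorder()
original = agent(live, seed=7)
snapshot = json.dumps(live.log)  # what a stored trace would persist
replayed = agent(ToolRecorder(json.loads(snapshot)), seed=7)
```

The divergence check matters as much as the recording: if a code change alters tool call order, the replay should fail at that call instead of silently returning a mismatched result.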

4. Observability Above Instrumentation

There's a difference between having tracing turned on and having traces that are useful. The common mistake is to instrument at the framework level (which gives you timing and status codes) and call it done. Production debugging requires content: the full inputs and outputs of every LLM call, the full payload of every tool result, the reasoning steps in between.

The principle: capture the content at every span boundary, not just the metadata. An LLM span without the prompt content and response content is a timer, not a trace. A tool span without the input parameters and output payload is a heartbeat, not a diagnostic.

This has cost implications — storing full trace content at scale is expensive. The pragmatic approach: capture full content in development and staging; capture content selectively in production (for traces that fail, or for a sampled percentage of traces). Do not discard content before you know whether a trace failed.
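The "decide after the outcome is known" rule can be sketched as a retention step that runs when a trace finishes. `finalize_trace`, `CONTENT_KEYS`, and the span fields are hypothetical; the point is only that content is buffered first and dropped later.

```python
import random
from typing import Any, Dict, List

CONTENT_KEYS = ("input", "output")  # payload fields, as opposed to metadata


def finalize_trace(spans: List[Dict[str, Any]], failed: bool,
                   sample_rate: float,
                   rng: random.Random) -> List[Dict[str, Any]]:
    # Keep full content for every failed trace and for a sampled share of
    # healthy ones. The decision happens only once the outcome is known.
    if failed or rng.random() < sample_rate:
        return spans
    # Otherwise keep metadata (name, timing, status) and drop the payloads.
    return [{k: v for k, v in span.items() if k not in CONTENT_KEYS}
            for span in spans]


buffered = [
    {"name": "llm_call", "duration_s": 0.8, "status": "ok",
     "input": "Summarize ticket 123", "output": "Customer reports login loop"},
    {"name": "tool_call", "duration_s": 0.1, "status": "ok",
     "input": {"ticket": 123}, "output": {"state": "open"}},
]
rng = random.Random(0)
failed_trace = finalize_trace(buffered, failed=True, sample_rate=0.0, rng=rng)
healthy_trace = finalize_trace(buffered, failed=False, sample_rate=0.0, rng=rng)
```

With `sample_rate=0.0` healthy traces never keep content, which makes the two branches easy to test; a production value would be small but non-zero.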

5. Explicit Failure Handling at Every Tool Call

Every tool call can fail. Every tool call should have an explicit failure handling path, not a generic catch block.

The common failure: a tool call raises an exception, the exception is caught by a generic handler, the agent retries the entire step, and the failure is never surfaced in the trace with enough information to attribute it.

The better pattern: classify the failure at the point of failure. Is this a transient error (retry)? A schema error (fix the tool call)? A business logic error (surface to the user)? Attach the failure classification to the span attribute before retrying or propagating. This gives your attribution system something to work with.
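A sketch of classification at the point of failure, under some stated assumptions: the mapping from exception types to categories in `classify` is illustrative and would be specific to your tools, `UNKNOWN` is an added catch-all for anything unmapped, and `call_tool`, `BusinessRuleError`, and `lookup` are hypothetical names.

```python
from enum import Enum
from typing import Any, Callable, Dict


class FailureKind(Enum):
    TRANSIENT = "transient"      # retry
    SCHEMA = "schema"            # fix the tool call
    BUSINESS = "business_logic"  # surface to the user
    UNKNOWN = "unknown"          # assumption: catch-all for unmapped errors


class BusinessRuleError(Exception):
    """Hypothetical error a tool raises for a domain-level rejection."""


def classify(exc: Exception) -> FailureKind:
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return FailureKind.TRANSIENT
    if isinstance(exc, (TypeError, ValueError, KeyError)):
        return FailureKind.SCHEMA
    if isinstance(exc, BusinessRuleError):
        return FailureKind.BUSINESS
    return FailureKind.UNKNOWN


def call_tool(tool: Callable[..., Any], args: Dict[str, Any],
              span: Dict[str, Any], max_retries: int = 2) -> Any:
    attempt = 0
    while True:
        attempt += 1
        try:
            result = tool(**args)
            span["status"] = "ok"
            return result
        except Exception as exc:
            kind = classify(exc)
            # Attach the classification before retrying or propagating.
            span.setdefault("failures", []).append(
                {"attempt": attempt, "kind": kind.value, "error": repr(exc)})
            if kind is FailureKind.TRANSIENT and attempt <= max_retries:
                continue  # retry only this call, not the whole step
            span["status"] = "error"
            raise


calls = {"n": 0}


def lookup(order_id: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("upstream timed out")  # first call fails once
    return f"order {order_id}: shipped"


retry_span: Dict[str, Any] = {"name": "tool lookup"}
outcome = call_tool(lookup, {"order_id": "A1"}, retry_span)

schema_span: Dict[str, Any] = {"name": "tool lookup"}
try:
    call_tool(lookup, {"order": "A1"}, schema_span)  # wrong parameter name
except TypeError:
    pass
```

Even the successful call keeps its earlier transient failure on the span, so a retried-and-recovered error is still visible to whatever does attribution.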

6. CI as the Regression Gate

Every agent change should be verified against a suite of historical failures before merging. This is not a new principle — it's standard software engineering. What's new is that the "tests" for an agent system are not unit tests or integration tests — they're snapshot replays of production failures.

The workflow: a production failure happens → ProveAI Origin attributes it → the attribution is saved as a snapshot → the snapshot suite runs on every PR as a CI check → a change that regresses against a snapshot is blocked from merging.

This closes the loop between production failures and development. Without it, you're making changes blind, hoping that fixes don't introduce new failures in adjacent cases.
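As a generic illustration of the gate (not ProveAI Origin's snapshot format or API, which this sketch does not attempt to reproduce), a snapshot can be modeled as a stored input plus a check on the replayed output. `run_snapshot_suite`, `SNAPSHOTS`, and the two agent versions are hypothetical.

```python
from typing import Any, Callable, Dict, List

Snapshot = Dict[str, Any]  # keys: "id", "input", "check"


def run_snapshot_suite(agent: Callable[[str], str],
                       snapshots: List[Snapshot]) -> List[str]:
    # Replay every snapshot and return the IDs that no longer pass.
    regressions: List[str] = []
    for snap in snapshots:
        output = agent(snap["input"])
        if not snap["check"](output):
            regressions.append(snap["id"])
    return regressions


def exit_code(regressions: List[str]) -> int:
    # CI convention: a non-zero exit status blocks the merge.
    return 1 if regressions else 0


SNAPSHOTS: List[Snapshot] = [
    {"id": "inc-001", "input": "refund order 17",
     "check": lambda out: "refund" in out},
    {"id": "inc-002", "input": "",
     "check": lambda out: out == "need more detail"},
]


def agent_v1(text: str) -> str:
    return f"handled: {text}"  # mishandles empty input (inc-002)


def agent_v2(text: str) -> str:
    return f"handled: {text}" if text else "need more detail"


v1_regressions = run_snapshot_suite(agent_v1, SNAPSHOTS)
v2_regressions = run_snapshot_suite(agent_v2, SNAPSHOTS)
```

In a CI job the script would end with `sys.exit(exit_code(regressions))`, so the PR check fails whenever any historical failure reappears.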

Try It

Regardless of which framework you're running, ProveAI Origin ingests traces from all of them — OTEL gen-ai semconv, LangSmith exports, Langfuse exports, and custom JSON. Upload a trace at /app/new and see how it parses against the span schema. The /app/insights/heatmap is particularly useful for comparing failure distributions across different framework-generated traces.

Continue reading: Core Principles expands on the framework-agnostic principles above, from ProveAI Origin's perspective. Common Failure Modes maps specific failure patterns to the framework configurations that produce them most often.
