Smolagents: HuggingFace's 1,000-Line Agent Framework That Outperforms LangChain

Smolagents is HuggingFace's minimalist agent framework. 26K GitHub stars, ~1,000 lines of core code. Code agents write Python instead of JSON tool calls, using 30% fewer LLM steps and scoring 44.2% on GAIA (vs 7% for GPT-4-Turbo alone). Full technical breakdown with architecture, benchmarks, and sandbox options.

April 5, 2026 · 2 min read

What is Smolagents

Most agent frameworks solve complexity with more complexity. Smolagents goes the other direction. Built by HuggingFace, it's an open-source Python library where the entire agent loop fits in roughly 1,000 lines of code. No sprawling class hierarchies. No framework-specific DSLs. Just a loop that asks an LLM what to do next, executes the action, and feeds back the result.

The key design decision: agents write Python, not JSON. When a smolagents CodeAgent needs to call a tool, it generates a Python code snippet that invokes the tool as a function. The code gets executed (locally or in a sandbox), and the output becomes the next observation in the agent's memory. This sounds like a small detail. It turns out to be the most consequential architectural choice in the framework.

  • ~1K lines of core agent code
  • 26K+ GitHub stars (1 year)
  • 30% fewer LLM steps vs JSON tool calling

Why 'Smol' Matters

The name is the thesis. Agent logic should be small enough that you can read the entire implementation in one sitting. When something breaks, you debug it by reading the source, not by tracing through abstraction layers. Smolagents has 207 contributors and 1,035 commits, but the core remains deliberately compact.

Architecture

Every smolagents agent follows the same loop. The framework calls this the "multi-step agent" pattern:

The Agent Loop (Pseudocode)

memory = [system_prompt, user_task]

while llm_should_continue(memory):
    action = llm_generate_next_action(memory)
    observations = execute(action)  # Python code or JSON tool call
    memory += [action, observations]

return final_answer

The execute step is where CodeAgent and ToolCallingAgent diverge. CodeAgent runs the LLM's output through a Python interpreter. ToolCallingAgent parses it as structured JSON and dispatches to the matching tool function. Everything else is shared.
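The divergence can be sketched in plain Python. This is an illustrative simplification (not smolagents' actual implementation, and `get_weather` is a hypothetical tool): both agents share the same loop, and only the execute step differs.

```python
# Simplified sketch of the two execute strategies. A ToolCallingAgent
# parses JSON and dispatches one tool call; a CodeAgent runs the LLM's
# snippet with tools injected as ordinary functions.
import json

def get_weather(city: str) -> str:
    """Hypothetical tool available to both agents."""
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def execute_json_action(action: str) -> str:
    # ToolCallingAgent-style: structured JSON -> single tool dispatch.
    call = json.loads(action)
    return TOOLS[call["name"]](**call["arguments"])

def execute_code_action(action: str) -> str:
    # CodeAgent-style: run the snippet; tools are callable functions,
    # so the snippet can compose them, branch, and assign variables.
    namespace = dict(TOOLS)
    exec(action, namespace)
    return namespace["result"]

print(execute_json_action('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
print(execute_code_action('result = get_weather("Paris").upper()'))
```

The JSON path can only name a tool and its arguments; the code path can transform the result in the same step, which is the composability difference the rest of this article measures.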

Tools as Python Functions

Tools in smolagents are decorated Python functions with type hints and docstrings. The decorator extracts the function signature, argument types, and description, then injects them into the system prompt so the LLM knows what's available.

Defining a Tool

import requests

from smolagents import tool

@tool
def get_weather(city: str) -> str:
    """Gets current weather for a city.

    Args:
        city: City name, e.g. "San Francisco"
    """
    return requests.get(f"https://wttr.in/{city}?format=3").text
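The metadata extraction itself is standard Python introspection. Here is a simplified sketch of the idea (illustrative only, not smolagents' actual decorator): the signature, type hints, and docstring become a one-line description that can be injected into the system prompt.

```python
# Sketch: derive a tool description from a function's signature and
# docstring, the same raw material a @tool-style decorator uses.
import inspect

def describe_tool(fn):
    sig = inspect.signature(fn)
    doc = inspect.getdoc(fn) or ""
    args = ", ".join(
        f"{name}: {p.annotation.__name__}"
        for name, p in sig.parameters.items()
    )
    first_line = doc.splitlines()[0] if doc else ""
    return f"{fn.__name__}({args}) -> {sig.return_annotation.__name__}: {first_line}"

def get_weather(city: str) -> str:
    """Gets current weather for a city."""
    ...

print(describe_tool(get_weather))
# get_weather(city: str) -> str: Gets current weather for a city.
```

This is why the type hints and docstring are mandatory in practice: without them, the LLM has no way to know what the tool does or what arguments it expects.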

Tools can be shared on the HuggingFace Hub with tool.push_to_hub() and loaded by anyone with load_tool(). The Hub currently hosts hundreds of community-built tools for web search, image generation, database queries, API calls, and file operations.

Model Support

Smolagents is model-agnostic. Three integration paths:

HuggingFace Inference API

Use InferenceClientModel to connect to any model hosted on the HuggingFace Hub, including open models like Llama, Mistral, and DeepSeek.

Local Models

Run models locally via transformers or Ollama. Zero API cost. Good for development, testing, and air-gapped environments.

LiteLLM (100+ Providers)

Connect to OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, Azure OpenAI, Groq, and 100+ other providers through a unified interface.

CodeAgent vs ToolCallingAgent

This is the central design question in agent frameworks, and smolagents takes a clear position: code wins.

Most frameworks (LangChain, CrewAI, OpenAI's function calling) have the LLM output a JSON dictionary specifying which tool to call and with what arguments. Smolagents' CodeAgent writes Python instead. The LLM generates a code snippet that calls tools as functions, assigns results to variables, and can loop, branch, and compose logic arbitrarily.

| Dimension | CodeAgent (Python) | ToolCallingAgent (JSON) |
| --- | --- | --- |
| Output format | Executable Python snippets | Structured JSON tool calls |
| LLM steps on benchmarks | 30% fewer | Baseline |
| Composability | Arbitrary: loops, variables, nesting | One tool call per step |
| GAIA benchmark score | 44.2% | Not tested separately |
| Security risk | Higher (arbitrary code execution) | Lower (structured calls only) |
| Debugging | Read the generated Python | Inspect JSON payloads |
| Training alignment | Strong (LLMs trained on code) | Weaker (JSON is a subset) |

Why Code Actions Work Better

The research backing this comes from Wang et al. (2024), who showed that code actions outperform JSON tool calling across multiple benchmarks. The core reason: LLMs are trained on terabytes of code. Python is deeply embedded in their probability distributions. When you ask a model to express multi-step logic, it's more fluent in Python than in a custom JSON schema.

Concrete advantages of code over JSON:

  • Variable assignment. A CodeAgent can store intermediate results: results = search("query"), then filtered = [r for r in results if r.score > 0.8]. JSON tool calling has no native way to reference previous outputs.
  • Control flow. Loops, conditionals, try/except. A CodeAgent can retry a failed API call or iterate over a list of items without needing the LLM to generate multiple separate actions.
  • Composition. Chaining tool calls in a single code block: summary = summarize(search("topic")[0].text). In JSON mode, this requires two separate LLM calls with the orchestration layer passing data between them.
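The composition advantage can be made concrete with stub tools (the tool names and data here are hypothetical, for illustration only). One generated code action performs search, filter, and summarize in a single step; in JSON mode the same plan costs three LLM round trips, with the orchestration layer threading data between them.

```python
# Stub tools standing in for real search/summarize implementations.
def search(query):
    return [{"text": "Code agents beat JSON agents", "score": 0.9},
            {"text": "Unrelated result", "score": 0.4}]

def summarize(text):
    return text[:24]

# The entire multi-step plan fits in ONE generated snippet, so it
# costs one LLM call instead of three.
code_action = """
results = search("code agents")
filtered = [r for r in results if r["score"] > 0.8]
answer = summarize(filtered[0]["text"])
"""
namespace = {"search": search, "summarize": summarize}
exec(code_action, namespace)
print(namespace["answer"])
```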

The 30% reduction in LLM steps compounds. Fewer steps means fewer tokens generated, which means lower latency and lower cost. On complex tasks that require 10+ steps, a CodeAgent might complete in 7 steps what a ToolCallingAgent takes 10 to finish.

Benchmarks

Smolagents' performance claims rest on two data points: the GAIA benchmark and HuggingFace's internal agent benchmark.

GAIA Benchmark

GAIA is a benchmark for general AI assistants that tests multi-step reasoning, tool use, and real-world knowledge retrieval. The questions are intentionally difficult. GPT-4-Turbo alone scores roughly 7%.

  • 44.2%: smolagents on GAIA (validation)
  • 40%: previous best (AutoGen multi-agent)
  • ~7%: GPT-4-Turbo baseline

The smolagents submission used GPT-4o with no fine-tuning. It ranked #1 on the validation set (44.2%) and #2 on the test set (33.3%), with the best performance on Level 3 (hardest) questions. This was a single CodeAgent, not a multi-agent system, beating a complex AutoGen setup that required multiple coordinating agents.

Code vs JSON Benchmark

On HuggingFace's internal agent benchmark (agents_medium_benchmark_2), CodeAgent consistently used 30% fewer steps than ToolCallingAgent across multiple models. Open-source models running as CodeAgents competed with closed-source models running as ToolCallingAgents, suggesting that the code execution format itself is a multiplier on agent capability.

Smolagents vs LangChain

LangChain has 122K GitHub stars, hundreds of integrations, and an enterprise product (LangSmith) for observability. Smolagents has 26K stars and fits in your head. They target different problems.

| Dimension | Smolagents | LangChain / LangGraph |
| --- | --- | --- |
| Core philosophy | Minimal code, code-first agents | Comprehensive ecosystem, graph orchestration |
| Codebase size | ~1,000 lines core | Hundreds of classes, multiple packages |
| GitHub stars | 26K | 122K |
| Agent paradigm | CodeAgent (Python execution) | Graph-based state machines (LangGraph) |
| Learning curve | Read the source in an afternoon | Weeks to master the ecosystem |
| Enterprise features | Minimal (Hub for sharing) | LangSmith, LangServe, LangGraph Cloud |
| Multi-agent | ManagedAgent wrapper | LangGraph nodes and edges |
| Tool sharing | HuggingFace Hub | LangChain Hub |
| Observability | Basic logging | LangSmith traces, visualizations |
| Sandbox support | E2B, Docker, Modal, Pyodide | Depends on implementation |

When to Choose Smolagents Over LangChain

Use smolagents when you want to understand every line of your agent's execution path. When you're building a prototype that needs to work in hours, not days. When the task is naturally expressed as "write Python to solve this," not "traverse this graph of nodes." When you don't want to learn a framework's abstractions before you can build something.

Use LangChain/LangGraph when you need enterprise-grade observability, when your workflow genuinely requires cyclical graph execution with durable state, or when you're already invested in the LangSmith ecosystem for production monitoring.

Multi-Agent Orchestration

Smolagents supports multi-agent systems through ManagedAgent. You wrap a specialized agent as a ManagedAgent with a name and description, then pass it to a manager agent. The manager sees managed agents the same way it sees tools: as callable functions with descriptions.

Multi-Agent Setup

from smolagents import CodeAgent, ManagedAgent, DuckDuckGoSearchTool, InferenceClientModel

model = InferenceClientModel()

# Specialist: web search agent
search_agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=model)
managed_search = ManagedAgent(
    agent=search_agent,
    name="web_search",
    description="Searches the web and returns relevant results"
)

# Manager: orchestrates specialists
manager = CodeAgent(tools=[], model=model, managed_agents=[managed_search])
manager.run("Find the latest research on code agent benchmarks")

This design has a practical benefit: each agent maintains its own memory. The web search agent's context fills with page content and search results. The manager agent's context stays clean, containing only the search agent's final answers. Without this separation, a single agent doing everything would burn through its context window on intermediate web content, leaving less room for actual reasoning.

Tool Selection Scales Poorly

A single agent with 10 tools makes more selection errors than two agents with 5 tools each. The manager agent only needs to pick the right specialist. Each specialist only needs to pick from its own small tool set. This is the same principle behind Anthropic's finding that multi-agent architectures improve performance by up to 90% on complex tasks.

Sandbox and Security

CodeAgent executes arbitrary Python. If you run it without isolation, the LLM can read your filesystem, make network requests, install packages, and do anything your user account can do. This is the fundamental tradeoff of code agents: they're more capable because they can do anything, and more dangerous for the same reason.

Smolagents provides five sandbox options:

E2B (Cloud Sandbox)

Remote execution in isolated containers. Code never touches your machine. Managed service with per-execution billing. Install with pip install 'smolagents[e2b]' and set executor_type='e2b'.

Docker (Self-Hosted)

Run code in local Docker containers with memory limits (512MB), CPU quotas, process limits, no-new-privileges, and dropped capabilities. Full control over the execution environment.

Modal / Blaxel (Cloud)

Managed cloud sandboxes with serverless execution. Similar to E2B but with different pricing models and infrastructure. Good for teams already using these platforms.

Pyodide + Deno (WebAssembly)

Lightweight client-side sandbox using WebAssembly. Runs Python in a Deno runtime. No server infrastructure required. Limited to packages available in Pyodide.

LocalPythonExecutor Is Not a Security Boundary

The default executor restricts imports to a safe list (no os, subprocess, sys). HuggingFace's own documentation states this "provides best-effort mitigations only and is not a security boundary." For anything beyond local experimentation, use E2B or Docker.
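To see why an import allowlist is best-effort rather than a boundary, consider a minimal AST-based checker (a sketch of the general technique, not smolagents' actual executor): it catches direct `import` statements, but an indirection like `__import__` sails past it.

```python
# Sketch: allowlist imports by walking the AST of the generated code.
import ast

SAFE_MODULES = {"math", "json", "re"}

def check_imports(code: str) -> list:
    """Return the names of disallowed modules imported by a snippet."""
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            violations += [a.name for a in node.names if a.name not in SAFE_MODULES]
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module not in SAFE_MODULES:
                violations.append(node.module)
    return violations

print(check_imports("import math\nimport os"))        # ['os'] -- caught
print(check_imports("x = __import__('subprocess')"))  # [] -- slips through
```

Static checks like this raise the bar for accidental misuse, but determined code can route around them, which is exactly why the docs point you to E2B or Docker for real isolation.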

Getting Started

A working agent in 5 lines:

Install and Run

# pip install "smolagents[toolkit]"

from smolagents import CodeAgent, WebSearchTool, InferenceClientModel

model = InferenceClientModel()  # Uses HF Inference API
agent = CodeAgent(tools=[WebSearchTool()], model=model, stream_outputs=True)

agent.run("How many seconds would it take a leopard at full speed to cross the Pont des Arts?")

The agent will search the web for the bridge length and leopard speed, write Python to compute the answer, and return the result. Each step is visible in the streamed output: the code the LLM wrote, the execution result, and the next reasoning step.

Using with Claude or GPT

LiteLLM Integration

from smolagents import CodeAgent, LiteLLMModel

# Use any LiteLLM-supported model
model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514")
# or: LiteLLMModel(model_id="gpt-4o")
# or: LiteLLMModel(model_id="deepseek/deepseek-chat")

agent = CodeAgent(tools=[...], model=model)

Sharing Agents on the Hub

Smolagents integrates with the HuggingFace Hub for sharing. Push your configured agent (tools, model, system prompt) with agent.push_to_hub("username/my-agent"). Anyone can load it with agent = CodeAgent.from_hub("username/my-agent"). Tools are shared the same way.

When Smolagents Fits

Rapid Prototyping

From zero to working agent in minutes, not hours. No boilerplate, no configuration files, no class inheritance. If you can write a Python function, you can build a smolagents tool.

Research and Experimentation

The codebase is small enough to modify directly. Swap the execution strategy, change the prompt template, or modify the agent loop. No fighting a framework to test a hypothesis.

Tasks That Are Naturally Code

Data analysis, API orchestration, math, file processing. Any task where the solution is 'write a script to do X' plays to CodeAgent's strengths, since the LLM is doing exactly that.

HuggingFace Ecosystem Users

If you're already using HuggingFace models, datasets, and Spaces, smolagents slots in naturally. Hub integration for tool sharing, Inference API for model access, Spaces for deployment.

Limitations

Smolagents makes real tradeoffs for simplicity. These matter:

  • No built-in observability. LangSmith gives you trace visualization, cost tracking, and latency analysis out of the box. Smolagents has basic logging. For production monitoring, you need to build or integrate your own solution.
  • E2B sandbox limitations with multi-agent. The E2B integration doesn't yet work with complex multi-agent setups. To use multi-agents in E2B, the entire agent system needs to run inside the sandbox, not just individual tool executions.
  • No built-in state persistence. Agent memory lives in a Python list. If the process dies, the context is gone. LangGraph offers durable checkpointing. Smolagents leaves persistence to you.
  • Smaller ecosystem. 26K stars vs LangChain's 122K. Fewer integrations, fewer community examples, fewer StackOverflow answers. The HuggingFace forums and GitHub issues are active, but the surface area is smaller.
  • Security is opt-in. The default executor runs code locally with import restrictions that are explicitly not a security boundary. You have to actively choose and configure a sandbox. Teams that forget this step are running LLM-generated code with full permissions.
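If you need persistence today, you have to roll it yourself. A minimal sketch, assuming you mirror each step of the run into plain dicts (the field names here are illustrative, not a smolagents schema): checkpoint the list to disk after every step, and reload it on restart.

```python
# Do-it-yourself memory persistence: serialize the step history so a
# crashed process can resume instead of starting over.
import json
import os
import tempfile

memory = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "Find the bridge length."},
    {"role": "assistant", "content": "search('Pont des Arts length')"},
]

# Checkpoint after each agent step.
path = os.path.join(tempfile.gettempdir(), "agent_memory.json")
with open(path, "w") as f:
    json.dump(memory, f)

# On restart, reload and resume from where the agent left off.
with open(path) as f:
    restored = json.load(f)
print(len(restored))  # 3
```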

Frequently Asked Questions

What is smolagents?

Smolagents is HuggingFace's open-source Python framework for building AI agents. The core logic fits in ~1,000 lines. Agents write executable Python to interact with tools instead of outputting JSON, which results in 30% fewer LLM steps and higher benchmark scores. It supports any LLM, multi-agent orchestration, and sandboxed code execution.

How does smolagents compare to LangChain?

Smolagents is minimalist (~1,000 lines core) while LangChain is comprehensive (hundreds of classes, enterprise tooling with LangSmith). Smolagents prioritizes code-first agents and rapid prototyping. LangChain/LangGraph provides graph-based orchestration, durable state, and production observability. Choose smolagents for simplicity and code-native workflows. Choose LangChain for enterprise pipelines with existing observability requirements.

Is smolagents safe to use in production?

With proper sandboxing, yes. The default local executor is not secure for untrusted inputs. For production, use E2B (cloud), Docker (self-hosted), Modal, Blaxel, or the Pyodide+Deno WebAssembly sandbox. Each provides process-level isolation so LLM-generated code cannot affect your host environment.

What is a CodeAgent in smolagents?

CodeAgent is smolagents' primary agent type. It generates executable Python to interact with tools, assign variables, loop, and branch. This differs from ToolCallingAgent (and most other frameworks) where the LLM outputs JSON specifying tool name and arguments. CodeAgent scored 44.2% on the GAIA benchmark with GPT-4o, no fine-tuning, beating the previous best multi-agent system.

Can smolagents work with Claude, GPT-4, and other models?

Yes. Smolagents supports HuggingFace Inference API, local models via transformers/Ollama, and 100+ cloud providers via LiteLLM (OpenAI, Anthropic, Google, AWS Bedrock, Azure, Groq, and more). Set the model with LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514") and provide your API key.

Does smolagents support multi-agent systems?

Yes, through ManagedAgent. Wrap specialized agents with a name and description, then pass them to a manager agent. The manager treats managed agents as callable tools. Each agent maintains separate memory, preventing context pollution. See the official multi-agent guide for examples.

Related Reading

Give Smolagents Safe Code Execution with Morph

Smolagents code agents write Python that needs to run somewhere safe. Morph Sandbox SDK provides isolated execution environments with sub-second cold starts. Code agents that edit existing files need Fast Apply for reliable, token-efficient file modifications.