What Structured Output Is
Every production application that calls an LLM needs to do something with the response. Parse it into an object. Store it in a database. Pass it to another function. Route it to a downstream service. All of these require the response to have a predictable shape. Free text doesn't have a predictable shape.
Structured output is a contract: you provide a JSON schema, and the model returns a response that conforms to it. Not "usually conforms." Not "conforms if you prompt carefully." The response is mathematically guaranteed to be valid against your schema, because the token generation process itself is constrained to only produce valid output.
Before constrained decoding, developers used three approaches to get JSON from LLMs. Prompt engineering ("Return your answer as JSON with fields name, age, and email") worked 90-98% of the time, but even a 2% failure rate at scale is a production incident. JSON mode, introduced by OpenAI in late 2023, guaranteed syntactically valid JSON but not schema compliance: the model might return {"full_name": "John"} when you expected {"name": "John"}. Structured output is the third generation: both syntactically valid and schema-compliant.
| Approach | Guarantees | Failure Mode |
|---|---|---|
| Prompt engineering | None. Best-effort. | Malformed JSON, missing fields, wrong types, markdown wrapping |
| JSON mode | Valid JSON syntax | Wrong field names, missing required properties, unexpected types |
| Structured output | Full schema compliance | None. Output matches schema by construction. |
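To make those failure modes concrete, here is a minimal sketch of the defensive parsing the first two approaches force on the caller (plain TypeScript; the `Contact` shape and fence-stripping regex are illustrative). Structured output makes all of this code unnecessary:

```typescript
type Contact = { name: string; email: string };

// Without schema enforcement, every response must be treated as untrusted:
// strip markdown fences, parse, then check each field by hand.
function parseContact(raw: string): Contact | null {
  let data: unknown;
  try {
    data = JSON.parse(raw.replace(/^```(?:json)?\s*|\s*```$/g, ""));
  } catch {
    return null; // malformed JSON: the prompt-engineering failure mode
  }
  const obj = data as Record<string, unknown> | null;
  if (typeof obj?.name !== "string" || typeof obj?.email !== "string") {
    return null; // wrong field names or types: the JSON-mode failure mode
  }
  return { name: obj.name, email: obj.email };
}
```

Every one of these branches represents a retry, a fallback, or a dropped request in production.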
How Constrained Decoding Works
LLMs generate text one token at a time. At each step, the model produces a probability distribution over its entire vocabulary (typically 100,000+ tokens). Normally, the model samples from this distribution freely. Constrained decoding intervenes at this step: before sampling, it masks out every token that would make the output violate the target schema.
The process starts when you submit a JSON schema. The provider compiles that schema into a finite state machine (FSM) or context-free grammar (CFG) that represents all valid strings the schema accepts. At each generation step, the system checks which tokens are valid transitions from the current state. Invalid tokens get their probability set to zero. The model can only sample from tokens that keep the output on a valid path.
1. Schema compilation
Your JSON schema is compiled into a grammar or finite state machine. This happens once per unique schema and is cached. OpenAI reports slightly higher latency on the first request for a new schema, then cached performance afterward.
2. Token masking
At each generation step, the system computes which tokens are valid given the current generation state. If the grammar says the next valid tokens are digits (because we're inside an integer field), all non-digit tokens are masked to probability zero.
3. Constrained sampling
The model samples from the reduced token set. Because only valid tokens remain, the output is guaranteed to be schema-compliant. The model still controls the content (which string to put in a field, which number to assign), but the structure is locked.
4. Scaffolding bypass
Advanced implementations skip deterministic tokens entirely. If the grammar dictates the next characters must be a closing brace and comma, the system writes those directly without running the model. This reduces latency and token costs.
Performance impact
Constrained decoding does not slow generation in practice. The token masking computation is on the order of 50 microseconds per token for a 128K-token vocabulary (measured by llguidance). Scaffolding bypass actually speeds up generation by skipping deterministic tokens. SGLang reports an order-of-magnitude speedup for JSON generation compared to unconstrained generation, because structural tokens are written instantly rather than sampled.
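The masking step itself can be sketched against a toy vocabulary (a deliberate simplification for illustration; real implementations run compiled grammars over 100K+ token vocabularies):

```typescript
// Toy vocabulary standing in for the model's real 100K+ token vocabulary.
const vocab = ["25", "25.5", '"twenty-five"', "null", "}"];

// Grammar state "inside an integer field": only digit-run tokens are valid.
const isValidInteger = (token: string) => /^\d+$/.test(token);

// Zero out grammar-invalid tokens, then renormalize so the model
// samples only from tokens that keep the output schema-valid.
function maskProbs(probs: number[]): number[] {
  const masked = probs.map((p, i) => (isValidInteger(vocab[i]) ? p : 0));
  const total = masked.reduce((a, b) => a + b, 0);
  return masked.map((p) => (total > 0 ? p / total : 0));
}

// The model "prefers" the prose answer, but masking forces the integer.
const unconstrained = [0.1, 0.2, 0.5, 0.15, 0.05];
const constrained = maskProbs(unconstrained);
// Only "25" survives the mask, so it gets all the probability mass.
```

The model's preference among the surviving tokens is untouched; only the invalid paths are removed.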
What happens at each token position
// Schema: { type: "object", properties: { age: { type: "integer" } } }
// The model is generating: {"age": |
// ^ cursor here
// Without constrained decoding:
// All ~100K tokens are valid. Model might produce:
// "twenty-five" (string, not integer)
// 25.5 (float, not integer)
// null (null, not integer)
// With constrained decoding:
// Only digit tokens (0-9) and closing tokens are valid.
// Model MUST produce a valid integer.
// Token mask: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] + structural tokens
// Result: 25 (guaranteed integer)
Provider Implementations
Every major LLM provider now supports native structured output through constrained decoding. The APIs differ in parameter names and schema configuration, but the underlying mechanism is the same. Here is the current state of each provider as of April 2026.
OpenAI: response_format with json_schema
OpenAI shipped structured outputs in August 2024. There are two integration points: response_format for shaping the model's direct response, and strict: true on tool definitions for shaping function call arguments. Both use constrained decoding.
OpenAI structured output with response_format
import OpenAI from "openai";
const openai = new OpenAI();
const response = await openai.chat.completions.create({
model: "gpt-5",
messages: [
{ role: "user", content: "Extract: John Smith, john@acme.com, VP Engineering" }
],
response_format: {
type: "json_schema",
json_schema: {
name: "contact",
strict: true,
schema: {
type: "object",
properties: {
name: { type: "string" },
email: { type: "string" },
title: { type: "string" },
department: { type: ["string", "null"] }
},
required: ["name", "email", "title", "department"],
additionalProperties: false
}
}
}
});
const contact = JSON.parse(response.choices[0].message.content);
// { name: "John Smith", email: "john@acme.com",
// title: "VP Engineering", department: null }
// Guaranteed to match schema. No try/catch needed.
Schema constraints in strict mode
OpenAI's strict mode requires additionalProperties: false on every object and all properties listed in the required array. Optional fields use type unions with null (e.g., ["string", "null"]). Maximum 100 object properties total with up to 5 levels of nesting. Some JSON Schema features like pattern, minItems, and conditional schemas are not supported.
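These requirements are mechanical, so they can be applied by a small transform. A sketch (hypothetical helper handling only object nesting; in practice the SDK Zod/Pydantic helpers do this conversion for you):

```typescript
type JsonSchema = {
  type: string | string[];
  properties?: Record<string, JsonSchema>;
  required?: string[];
  additionalProperties?: boolean;
  [key: string]: unknown;
};

// Recursively rewrite a schema into strict-mode shape: every object gets
// additionalProperties: false and lists ALL of its properties as required.
// Fields you want optional must instead use a ["type", "null"] union.
function toStrictSchema(schema: JsonSchema): JsonSchema {
  if (schema.type !== "object" || !schema.properties) return schema;
  const properties = Object.fromEntries(
    Object.entries(schema.properties).map(([k, v]) => [k, toStrictSchema(v)])
  );
  return {
    ...schema,
    properties,
    required: Object.keys(properties),
    additionalProperties: false,
  };
}
```

Passing `{ type: "object", properties: { name: { type: "string" } } }` yields `required: ["name"]` and `additionalProperties: false` at every object level.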
Anthropic: output_config.format
Anthropic launched structured outputs in beta (November 2025) and made them generally available in early 2026. The API uses output_config.format with a json_schema type. Supported on Claude Opus 4.6, Sonnet 4.6, Sonnet 4.5, Opus 4.5, and Haiku 4.5.
Anthropic structured output with output_config
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const response = await client.messages.create({
model: "claude-sonnet-4-5-20250514",
max_tokens: 1024,
messages: [
{
role: "user",
content: "Extract: John Smith, john@acme.com, VP Engineering"
}
],
output_config: {
format: {
type: "json_schema",
schema: {
type: "object",
properties: {
name: { type: "string" },
email: { type: "string" },
title: { type: "string" },
department: { type: ["string", "null"] }
},
required: ["name", "email", "title", "department"],
additionalProperties: false
}
}
}
});
const textBlock = response.content.find(b => b.type === "text");
const contact = JSON.parse(textBlock.text);
// Schema-compliant. Same guarantee as OpenAI strict mode.
Anthropic also supports structured output through strict tool use. Setting strict: true on a tool definition guarantees the model's tool input conforms to your schema. This is particularly useful for agent workflows where the final output of a multi-turn tool-use conversation needs to match a specific shape.
Google Gemini: controlled generation
Google calls it "controlled generation." The API uses response_mime_type set to application/json and response_json_schema for the schema definition. Supported on Gemini 2.5 Pro, 2.5 Flash, and the Gemini 3 series.
Gemini structured output with controlled generation
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY });
const response = await ai.models.generateContent({
model: "gemini-2.5-flash",
contents: "Extract: John Smith, john@acme.com, VP Engineering",
config: {
responseMimeType: "application/json",
responseJsonSchema: {
type: "object",
properties: {
name: { type: "string" },
email: { type: "string" },
title: { type: "string" },
department: { type: ["string", "null"] }
},
required: ["name", "email", "title", "department"]
}
}
});
const contact = JSON.parse(response.text);
| Feature | OpenAI | Anthropic | Google Gemini |
|---|---|---|---|
| API parameter | response_format.json_schema | output_config.format | responseJsonSchema |
| Tool strict mode | strict: true per tool | strict: true per tool | Via function declarations |
| SDK helpers | zodResponseFormat (Zod), .parse() (Pydantic) | zodOutputFormat (Zod), .parse() (Pydantic) | zodToJsonSchema conversion |
| Streaming | Partial JSON chunks | Partial JSON chunks | Partial JSON chunks |
| Max schema depth | 5 levels, 100 properties | 5 levels, 100 properties | No documented limit |
| Schema caching | Automatic. First call compiles. | Automatic. First call compiles. | Automatic. |
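Based on the parameter names in the table, a provider-neutral wrapper is straightforward to sketch (request bodies only; SDK calls, model fields, and messages are elided):

```typescript
type Provider = "openai" | "anthropic" | "gemini";

// Wrap one JSON schema in each provider's request shape, using the
// parameter names from the comparison table above.
function formatRequest(provider: Provider, name: string, schema: object) {
  switch (provider) {
    case "openai":
      return {
        response_format: {
          type: "json_schema",
          json_schema: { name, strict: true, schema },
        },
      };
    case "anthropic":
      return {
        output_config: { format: { type: "json_schema", schema } },
      };
    case "gemini":
      return {
        config: {
          responseMimeType: "application/json",
          responseJsonSchema: schema,
        },
      };
  }
}
```

Libraries like Instructor and the Vercel AI SDK exist precisely so you never have to maintain this mapping yourself.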
Zod and Pydantic: Define Once, Validate Everywhere
Writing raw JSON schemas by hand is tedious and error-prone. Zod (TypeScript) and Pydantic (Python) let you define schemas in your native language, then convert to JSON Schema automatically. The provider SDKs have first-class support for both. Define a schema once, get compile-time type safety, runtime validation, and LLM schema enforcement from the same definition.
Zod with OpenAI (TypeScript)
zodResponseFormat: type-safe structured output
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";
// Define the schema once
const CodeReview = z.object({
file: z.string().describe("The file path reviewed"),
issues: z.array(z.object({
line: z.number().int().describe("Line number"),
severity: z.enum(["error", "warning", "info"]),
message: z.string().describe("Description of the issue"),
suggestion: z.string().describe("Suggested fix")
})),
summary: z.string().describe("One-line summary of findings"),
approved: z.boolean()
});
// TypeScript knows the return type
type CodeReviewResult = z.infer<typeof CodeReview>;
const openai = new OpenAI();
const completion = await openai.chat.completions.parse({
model: "gpt-5",
messages: [
{ role: "system", content: "Review the following code for bugs and style issues." },
{ role: "user", content: fileContents }
],
response_format: zodResponseFormat(CodeReview, "code_review")
});
// completion.choices[0].message.parsed is typed as CodeReviewResult
const review = completion.choices[0].message.parsed;
console.log(review.issues.length); // TypeScript autocomplete works
console.log(review.approved); // boolean, guaranteed
Zod with Anthropic (TypeScript)
zodOutputFormat with Anthropic SDK
import Anthropic from "@anthropic-ai/sdk";
import { zodOutputFormat } from "@anthropic-ai/sdk/helpers/zod";
import { z } from "zod";
const ContactSchema = z.object({
name: z.string(),
email: z.string(),
company: z.string(),
role: z.enum(["engineer", "manager", "executive", "other"])
});
const client = new Anthropic();
const response = await client.messages.create({
model: "claude-sonnet-4-5-20250514",
max_tokens: 1024,
messages: [
{ role: "user", content: "Extract contact: Jane Doe, CTO at Stripe, jane@stripe.com" }
],
output_format: zodOutputFormat(ContactSchema, "contact")
});
// The text block is guaranteed to match ContactSchema; parse it directly
const contact = JSON.parse(
response.content.find(b => b.type === "text")?.text ?? "{}"
);
Zod with Vercel AI SDK
generateObject: the cleanest Zod integration
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
const { object } = await generateObject({
model: openai("gpt-5"),
schema: z.object({
tasks: z.array(z.object({
title: z.string().describe("Short task title"),
priority: z.enum(["high", "medium", "low"]),
estimatedHours: z.number().describe("Estimated hours to complete"),
dependencies: z.array(z.string()).describe("Task titles this depends on")
})),
totalHours: z.number()
}),
prompt: "Break down this project into tasks: Build a REST API for user management"
});
// object is fully typed. No JSON.parse. No validation code.
// The AI SDK handles schema conversion, API call, and parsing.
for (const task of object.tasks) {
console.log(`[${task.priority}] ${task.title} - ${task.estimatedHours}h`);
}
Pydantic with OpenAI (Python)
Pydantic models with OpenAI .parse()
from typing import Literal
from pydantic import BaseModel
from openai import OpenAI
class Issue(BaseModel):
line: int
severity: Literal["error", "warning", "info"]
message: str
suggestion: str
class CodeReview(BaseModel):
file: str
issues: list[Issue]
summary: str
approved: bool
client = OpenAI()
completion = client.chat.completions.parse(
model="gpt-5",
messages=[
{"role": "system", "content": "Review the code for bugs."},
{"role": "user", "content": file_contents}
],
response_format=CodeReview
)
review = completion.choices[0].message.parsed
# review is a CodeReview instance with full type hints
print(f"{len(review.issues)} issues found, approved: {review.approved}")
Zod .describe() matters
When using Zod schemas, call .describe() on fields that need context. The description string is sent to the model as part of the JSON schema and directly influences output quality. z.number().describe("Line number in the source file") gives the model better guidance than a bare z.number(). More descriptive schemas produce more accurate structured output.
The Format Tax: When Structured Output Hurts
Structured output is not free. Research published in April 2026 ("The Format Tax") measured the accuracy cost of requiring LLMs to produce structured formats instead of free text. The findings are nuanced and important for anyone building production systems.
Open-weight models (Llama, Mistral, Qwen) consistently lose 3-9 percentage points of accuracy on reasoning benchmarks when generating structured output. The worst case was MATH-500, where specific configurations lost up to 17.8 percentage points. Writing quality degraded similarly when LaTeX formatting was required.
Frontier closed-weight models tell a different story. Claude Haiku 4.5, Grok 4.1 Fast, and recent GPT variants showed near-zero or even positive deltas. This suggests the format tax is not inherent to structured generation. It can be trained away, or it correlates with model scale and instruction tuning quality.
The surprising root cause
The researchers found that format-requesting instructions alone cause most of the accuracy loss, before any decoder constraint is applied. Simply telling a model "respond in JSON" degrades reasoning. The constrained decoder adds only minor additional degradation on top of that. This means the problem is cognitive, not mechanical: formatting concerns compete with reasoning for the model's limited capacity.
Mitigation strategies
Two-pass generation
Generate a freeform answer first, then reformat into the target schema in a second pass. Recovers approximately 6.8 percentage points of lost accuracy on average. Costs 2x the tokens, but preserves reasoning quality.
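The two-pass pattern can be sketched against a generic generate callback so it stays provider-neutral (the callback signature, prompt wording, and schema are placeholders, not any SDK's API):

```typescript
type Generate = (prompt: string, schema?: object) => Promise<string>;

// Pass 1: let the model reason in free text with no format pressure.
// Pass 2: ask it to reformat its own answer under the schema constraint.
// Costs ~2x tokens, but keeps formatting out of the reasoning pass.
async function twoPass(
  generate: Generate,
  task: string,
  schema: object
): Promise<string> {
  const freeform = await generate(task);
  return generate(
    `Reformat the following answer as JSON matching the schema. ` +
      `Do not re-solve the problem.\n\nAnswer:\n${freeform}`,
    schema
  );
}
```

The second pass is cheap for the model: extraction and reformatting, not reasoning.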
Extended thinking
Enable chain-of-thought or extended thinking within a single generation, then constrain only the final output. Recovers approximately 9.2 percentage points on average. Higher variance but single-pass. Claude and OpenAI reasoning models support this natively.
Practical takeaway
If you're using frontier models (Claude Sonnet/Opus, GPT-5, Gemini 2.5 Pro), the format tax is negligible. Use structured output everywhere. If you're using open-weight models, measure your specific task. For classification and extraction, structured output often improves accuracy. For complex reasoning, consider two-pass generation or extended thinking to preserve quality.
Structured Output vs Free Text: When to Use Which
Structured output is not a universal upgrade. It solves a specific problem (reliable parsing for machine consumption) and introduces a specific tradeoff (reduced flexibility, potential reasoning degradation on weaker models). The decision is straightforward.
| Scenario | Use | Why |
|---|---|---|
| API responses consumed by code | Structured output | Downstream code needs predictable types and fields |
| Tool call arguments in agents | Structured output (strict) | Invalid arguments crash tool execution |
| Data extraction from documents | Structured output | Entity types, counts, and classifications need schema enforcement |
| Subagent communication | Structured output | Agent-to-agent messages must be parseable without ambiguity |
| Chat responses shown to users | Free text | Users read prose, not JSON |
| Creative writing, brainstorming | Free text | Structure constrains creative exploration |
| Complex multi-step reasoning | Free text (then reformat) | Format tax degrades reasoning on open-weight models |
| Summarization for human readers | Free text | Structured output forces artificial field boundaries on fluid content |
The general rule: if the output is consumed by code, use structured output. If it is consumed by humans, use free text. If it is consumed by code but requires complex reasoning to produce, consider a two-pass approach where the model reasons in free text and then reformats.
Patterns for Coding Agents
Coding agents are the most structured-output-intensive applications. Every tool call is structured output. Every code edit is structured output. Every file operation has a schema. An agent that makes 50 tool calls per task needs all 50 to be schema-valid, or the task fails. This is where strict mode and constrained decoding pay for themselves.
Tool call schemas
Every tool your agent exposes is defined by a JSON schema for its parameters. With strict mode enabled, the model's arguments are guaranteed to match. Without it, you need defensive parsing, type coercion, and fallback logic for every tool. Here are the core schemas coding agents use.
File operation tools with strict schemas
const tools = [
{
type: "function",
name: "read_file",
strict: true,
description: "Read the contents of a file at the given path",
parameters: {
type: "object",
properties: {
path: { type: "string", description: "File path relative to project root" },
start_line: { type: ["integer", "null"], description: "First line to read (1-indexed). Null for start of file." },
end_line: { type: ["integer", "null"], description: "Last line to read. Null for end of file." }
},
required: ["path", "start_line", "end_line"],
additionalProperties: false
}
},
{
type: "function",
name: "write_file",
strict: true,
description: "Write content to a file, creating it if it doesn't exist",
parameters: {
type: "object",
properties: {
path: { type: "string", description: "File path relative to project root" },
content: { type: "string", description: "The full file content to write" }
},
required: ["path", "content"],
additionalProperties: false
}
},
{
type: "function",
name: "search_code",
strict: true,
description: "Search the codebase using a regex pattern. Returns matching lines with file paths and line numbers.",
parameters: {
type: "object",
properties: {
pattern: { type: "string", description: "Regex pattern to search for" },
file_glob: { type: ["string", "null"], description: "Glob to filter files, e.g. '*.ts'. Null for all files." },
max_results: { type: "integer", description: "Maximum number of matches to return" }
},
required: ["pattern", "file_glob", "max_results"],
additionalProperties: false
}
}
];
Code edit operations
The model needs to express code edits in a structured format that your application can apply reliably. There are three common patterns, each with different tradeoffs.
Full file rewrite
Model returns the entire file with edits applied. Simple to implement. Wasteful on tokens: editing one line in a 500-line file costs 500 lines of output tokens. Works for small files.
Search-and-replace
Model returns the old text and new text. Compact and unambiguous. Fails when the old text appears multiple times. Claude Code and most agents use this pattern because it scales well to large files.
Unified diff
Model returns a standard unified diff with line numbers and context. Compact. Fragile: off-by-one line numbers cause patch failures. Works best with a fuzzy apply step that handles minor misalignment.
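Whichever format the model emits, your application enforces the invariants when applying it. For search-and-replace, the key guard is uniqueness; a minimal applier sketch (not Claude Code's implementation):

```typescript
// Apply an old_text -> new_text edit, refusing ambiguous matches.
function applyEdit(file: string, oldText: string, newText: string): string {
  const first = file.indexOf(oldText);
  if (first === -1) {
    throw new Error("old_text not found; the model may have invented context");
  }
  // If old_text appears again after the first match, the edit is ambiguous.
  if (file.indexOf(oldText, first + 1) !== -1) {
    throw new Error("old_text is not unique; ask the model for more context");
  }
  return file.slice(0, first) + newText + file.slice(first + oldText.length);
}
```

Both error paths feed back to the model as tool errors, prompting it to retry with a longer, unique old_text.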
Search-and-replace edit schema
{
type: "function",
name: "edit_file",
strict: true,
description: "Apply a targeted edit to a file by replacing old_text with new_text",
parameters: {
type: "object",
properties: {
path: {
type: "string",
description: "File path relative to project root"
},
old_text: {
type: "string",
description: "The exact text to find and replace. Must match uniquely."
},
new_text: {
type: "string",
description: "The replacement text"
}
},
required: ["path", "old_text", "new_text"],
additionalProperties: false
}
}
Structured output for agent orchestration
When one agent dispatches work to another, the message format must be structured. The orchestrator needs to know which subagent to route to, what the task is, what context to provide, and what format the result should take. Free text between agents is a recipe for cascading failures.
Orchestrator dispatch schema
const DispatchSchema = z.object({
subtask: z.string().describe("Clear description of what the subagent should do"),
agent_type: z.enum(["coder", "reviewer", "tester", "researcher"]),
context: z.object({
files: z.array(z.string()).describe("File paths relevant to the subtask"),
constraints: z.array(z.string()).describe("Rules the subagent must follow"),
prior_results: z.string().describe("Summary of work done so far")
}),
expected_output: z.enum(["code_edit", "review_report", "test_results", "analysis"]),
timeout_seconds: z.number().int().describe("Max time before the subtask is killed")
});
// The orchestrator model generates this schema,
// your application routes to the correct subagent,
// and the subagent returns its own structured result.
Morph and Structured Responses
Morph's APIs are built for agent workflows where structured input and output are non-negotiable. The Fast Apply API takes structured input (original code + edit description) and returns structured output (the edited file). No prompt engineering. No JSON parsing. The API contract is the schema.
For agent builders, this matters because code editing is the most latency-sensitive structured operation. A general-purpose LLM generating a full-file rewrite at 80 tok/s needs roughly a minute for a 500-line file (on the order of 5,000 output tokens). Morph's specialized model processes the same edit at 10,500 tok/s. The response is structured by construction: input schema, output schema, no ambiguity.
Morph Fast Apply as a structured tool
// Define as a tool in your agent
const fastApplyTool = {
type: "function",
name: "apply_code_edit",
strict: true,
description: "Apply a code edit using Morph Fast Apply. Faster and more accurate than full-file rewrites.",
parameters: {
type: "object",
properties: {
original_code: { type: "string", description: "Current file contents" },
edit_snippet: { type: "string", description: "The edit to apply (diff, snippet, or description)" },
filename: { type: "string", description: "Filename for language detection" }
},
required: ["original_code", "edit_snippet", "filename"],
additionalProperties: false
}
};
// Execute the tool call
async function executeFastApply(args: {
original_code: string;
edit_snippet: string;
filename: string;
}) {
const response = await fetch("https://api.morphllm.com/v1/chat/completions", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: `Bearer ${process.env.MORPH_API_KEY}`
},
body: JSON.stringify({
model: "morph-v3-fast",
messages: [
{ role: "user", content: args.edit_snippet }
],
original_code: args.original_code,
filename: args.filename
})
});
const data = await response.json();
return data.choices[0].message.content; // The edited file
}
WarpGrep follows the same principle. The search query is structured input (query string, filters, scope). The results are structured output (file paths, line numbers, matched content, relevance scores). Your agent's search_code tool maps directly to WarpGrep's API. No parsing intermediate text. No extracting file paths from prose.
Best Practices
Always use strict mode in production
Best-effort JSON is a testing convenience, not a production strategy. Enable strict mode (or its equivalent) on every structured output call. The first-request schema compilation latency is negligible after caching.
Use .describe() on every Zod field
The description string is sent to the model as part of the schema. It directly impacts output quality. z.string().describe('ISO 8601 date') produces better results than a bare z.string(). Treat descriptions as prompts for individual fields.
Keep schemas flat when possible
Deeply nested schemas are harder for models to follow and hit provider depth limits faster. If your schema has more than 3 levels of nesting, consider flattening. Use arrays of objects at the top level rather than deeply nested hierarchies.
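To see the difference, here is a hypothetical nested schema next to its flattened redesign, with a rough helper that counts nesting depth:

```typescript
// Rough depth counter: each object or array level adds one.
function schemaDepth(s: any): number {
  if (typeof s !== "object" || s === null) return 0;
  const children = [
    ...Object.values(s.properties ?? {}),
    ...(s.items ? [s.items] : []),
  ];
  return 1 + Math.max(0, ...children.map(schemaDepth));
}

// Nested: project -> milestones -> items -> tasks -> items.
const nested = {
  type: "object",
  properties: {
    project: {
      type: "object",
      properties: {
        milestones: {
          type: "array",
          items: {
            type: "object",
            properties: {
              tasks: { type: "array", items: { type: "object" } },
            },
          },
        },
      },
    },
  },
};

// Flat: one top-level array whose items carry the parent as a field.
const flat = {
  type: "object",
  properties: {
    tasks: {
      type: "array",
      items: {
        type: "object",
        properties: {
          milestone: { type: "string" }, // parent encoded as a field
          title: { type: "string" },
        },
      },
    },
  },
};
```

The flat version expresses the same data with two fewer levels, leaving headroom under the 5-level limits above.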
Use enums for constrained fields
If a field has a finite set of valid values, use an enum. z.enum(['error', 'warning', 'info']) is better than z.string() with a description. Enums reduce the token space and eliminate invalid values entirely.
Separate reasoning from formatting
For complex tasks on open-weight models, generate freeform reasoning first, then extract structured data in a second pass. This avoids the format tax on reasoning quality. For frontier models, single-pass structured output works fine.
Version your schemas
When your schema changes, downstream consumers break. Use explicit versioning or additive-only changes. Adding a nullable field is safe. Removing a field or changing its type requires coordination with consumers.
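One way to enforce the additive-only rule in CI, sketched as a compatibility check over plain JSON schemas (hypothetical helper; the nullable-new-field allowance matches the strict-mode convention where every field is required and "optional" means a null union):

```typescript
function isAdditive(oldS: any, newS: any): boolean {
  const oldProps = oldS.properties ?? {};
  const newProps = newS.properties ?? {};
  // Every old property must survive with an unchanged type.
  const typesPreserved = Object.entries(oldProps).every(
    ([k, v]: [string, any]) =>
      k in newProps &&
      JSON.stringify(newProps[k].type) === JSON.stringify(v.type)
  );
  // New fields may only be required if they are nullable, so old
  // consumers that ignore them keep working.
  const breakingNew = (newS.required ?? []).filter(
    (k: string) =>
      !(k in oldProps) &&
      !(Array.isArray(newProps[k]?.type) && newProps[k].type.includes("null"))
  );
  return typesPreserved && breakingNew.length === 0;
}
```

Run this between the deployed schema and the proposed one; a false result means the change needs a version bump and consumer coordination.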
Schema design is prompt engineering
The JSON schema you send to the model is part of the prompt. Field names, descriptions, enum values, and nesting structure all influence output quality. A field named severity with description "How critical this issue is: error (blocks deployment), warning (should fix), info (nice to know)" produces more accurate classifications than a field named level with no description. Invest the same care in schema design that you invest in system prompts.
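Applied to the severity example, the two designs look like this (illustrative fragments):

```typescript
// Bare: the model must guess what "level" means and which values are valid.
const vague = { level: { type: "string" } };

// Descriptive: field name, enum, and description all steer the model.
const precise = {
  severity: {
    type: "string",
    enum: ["error", "warning", "info"],
    description:
      "How critical this issue is: error (blocks deployment), " +
      "warning (should fix), info (nice to know)",
  },
};
```

The enum constrains the token space mechanically; the description shapes how the model classifies borderline cases.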
Frequently Asked Questions
What is structured output in LLMs?
Structured output forces an LLM to return responses conforming to a predefined JSON schema. It uses constrained decoding, which restricts token generation at inference time so only schema-valid tokens can be produced. This guarantees 100% schema compliance with no retries needed for malformed output.
What is the difference between JSON mode and structured output?
JSON mode guarantees syntactically valid JSON but not schema compliance. The model returns valid JSON, but fields might be wrong. Structured output (strict mode) guarantees both valid syntax and full schema conformance: correct field names, types, required properties, and enum values. Use structured output for production. JSON mode is a fallback for cases where you have no schema upfront.
How does constrained decoding work?
The provider compiles your JSON schema into a grammar or finite state machine. At each token generation step, the system computes which tokens are valid given the current state and masks invalid tokens to probability zero. The model samples only from valid tokens. This guarantees the output matches the schema by construction, not by validation.
Which LLM providers support structured output?
All major providers as of 2026. OpenAI uses response_format with type: "json_schema". Anthropic uses output_config.format. Google Gemini uses responseMimeType and responseJsonSchema. Open-source inference engines (vLLM, SGLang, llama.cpp) also support constrained decoding via grammar specifications. Provider-agnostic libraries like Instructor and the Vercel AI SDK abstract these differences.
Does structured output hurt LLM reasoning?
It depends on the model. Frontier closed-weight models (Claude Sonnet/Opus, GPT-5, Gemini 2.5 Pro) show near-zero reasoning degradation. Open-weight models lose 3-9 percentage points on reasoning benchmarks. The primary cause is the format-requesting instruction itself, not the constrained decoder. Two-pass generation (reason freely, then format) and extended thinking modes recover most of the lost accuracy.
Should I use Zod or Pydantic for LLM structured output?
Use Zod for TypeScript and Pydantic for Python. Both convert to JSON Schema, which is what providers consume. OpenAI provides zodResponseFormat and .parse() for Pydantic. Anthropic provides zodOutputFormat and .parse(). The Vercel AI SDK's generateObject accepts Zod schemas directly. The choice is a language decision, not a capability decision.
Related Guides
- OpenAI Function Calling: Complete Guide for Agent Builders
- LLM API Comparison 2026: Pricing, Speed, and Features
- Subagents: Why Intelligence Organizes into Hierarchies
- Context Engineering for Coding Agents
- Edit Formats: How Coding Agents Express Code Changes
- Fast Apply: Specialized Models for Code Editing
Structured APIs for agent builders
Morph's Fast Apply and WarpGrep APIs return structured responses by default. Define them as tools, wire them into your agent, and get guaranteed schema-compliant results at 10,500 tok/s.