OpenAI Batch API: 50% Cheaper Inference for Code Generation Pipelines

The OpenAI Batch API cuts inference costs by 50% for any workload that can tolerate a 24-hour turnaround. This guide covers how it works, pricing per model, JSONL file format, code generation use cases, and when to use real-time inference instead.

April 5, 2026

How the Batch API Works

The standard OpenAI API is synchronous: send a request, wait for a response, get charged full price. The Batch API inverts this. You package requests into a JSONL file, upload it, and OpenAI processes everything within 24 hours at half the cost.

Three steps:

  1. Upload a JSONL file via the Files API with purpose: "batch". Each line is one request with a unique custom_id, the HTTP method, the endpoint URL, and the request body.
  2. Create a batch referencing the uploaded file. Set the completion_window to 24h (currently the only option).
  3. Download results when the batch completes. The output is another JSONL file where each line contains the custom_id and the API response. Order is not guaranteed, so you match on custom_id.

Supported Endpoints

The Batch API supports /v1/chat/completions, /v1/embeddings, /v1/responses, /v1/completions, /v1/moderations, /v1/images/generations, /v1/images/edits, and /v1/videos. For code generation, you will use /v1/chat/completions or /v1/responses.

  • 50%: cost reduction vs the standard API
  • 24h: maximum turnaround time
  • 50K: max requests per batch file

Pricing: 50% Off Every Model

The math is straightforward. Whatever the standard per-token price is for a model, the Batch API charges half. This applies to both input and output tokens.

Model | Standard (input/output per 1M tokens) | Batch (input/output per 1M tokens)
GPT-4.1 | $2.00 / $8.00 | $1.00 / $4.00
GPT-4.1 mini | $0.40 / $1.60 | $0.20 / $0.80
GPT-4.1 nano | $0.10 / $0.40 | $0.05 / $0.20
GPT-4o | $2.50 / $10.00 | $1.25 / $5.00
GPT-4o mini | $0.15 / $0.60 | $0.075 / $0.30
o3 | $2.00 / $8.00 | $1.00 / $4.00
o4-mini | $1.10 / $4.40 | $0.55 / $2.20

At scale, these savings compound. A pipeline processing 100M output tokens per month through GPT-4.1 drops from $800 to $400. For GPT-4o, the same volume drops from $1,000 to $500. The discount is flat and predictable, with no volume tiers to negotiate.
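As a sanity check, the arithmetic can be sketched in a few lines. The rates come from the table above; the token volumes are assumptions for illustration:

```python
# GPT-4.1 standard rates from the table above, in dollars per 1M tokens.
INPUT_RATE, OUTPUT_RATE = 2.00, 8.00

# Assumed monthly volume: 20M input tokens, 100M output tokens.
input_tokens, output_tokens = 20e6, 100e6

standard = (input_tokens / 1e6) * INPUT_RATE + (output_tokens / 1e6) * OUTPUT_RATE
batch = standard * 0.5  # flat 50% discount on both input and output

print(standard, batch)  # 840.0 420.0
```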

Separate Rate Limits

Batch API requests use a dedicated rate limit pool. Running a 50,000-request batch job does not consume your synchronous API quota. Production systems can run both batch and real-time workloads simultaneously without interference.

JSONL File Format

The input file is JSONL (JSON Lines), one request per line. Each request has four fields: custom_id, method, url, and body.

batch_input.jsonl

{"custom_id": "review-file-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4.1", "messages": [{"role": "system", "content": "You are a code reviewer. Find bugs, security issues, and suggest improvements."}, {"role": "user", "content": "Review this Python file:\n\ndef process_payment(amount, card):\n    charge = stripe.Charge.create(amount=amount, source=card)\n    return charge"}], "max_completion_tokens": 1000}}
{"custom_id": "review-file-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4.1", "messages": [{"role": "system", "content": "You are a code reviewer. Find bugs, security issues, and suggest improvements."}, {"role": "user", "content": "Review this TypeScript file:\n\nexport async function getUser(id: string) {\n  const res = await fetch('/api/users/' + id);\n  return res.json();\n}"}], "max_completion_tokens": 1000}}

The custom_id is how you match results to requests. Output order is not guaranteed. The response file contains the same custom_id alongside the full API response, including token usage.
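Matching can be as simple as building a dict keyed on custom_id. The two result lines below are illustrative stand-ins for a downloaded output file (note they arrive out of order):

```python
import json

# Illustrative result lines, shaped like the batch output file.
output_lines = [
    '{"custom_id": "review-file-2", "response": {"status_code": 200, "body": {"choices": [{"message": {"content": "Handle non-OK fetch responses before calling res.json()."}}]}}}',
    '{"custom_id": "review-file-1", "response": {"status_code": 200, "body": {"choices": [{"message": {"content": "Wrap the Stripe charge in error handling."}}]}}}',
]

# Key results on custom_id so each response can be matched to its request.
results = {json.loads(line)["custom_id"]: json.loads(line) for line in output_lines}

review_1 = results["review-file-1"]["response"]["body"]["choices"][0]["message"]["content"]
```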

Constraints

  • Maximum 50,000 requests per file
  • Maximum 200 MB file size
  • Upload with purpose: "batch" via the Files API
  • Completion window is fixed at 24 hours
  • Each custom_id must be unique within the batch
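A pre-upload sanity check for these constraints might look like the following sketch (validate_batch_file is a hypothetical helper, not part of the SDK):

```python
import json
import os

MAX_REQUESTS = 50_000
MAX_BYTES = 200 * 1024 * 1024  # 200 MB

def validate_batch_file(path):
    """Hypothetical helper: check the documented batch-file limits before uploading."""
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("input file exceeds 200 MB")
    seen = set()
    with open(path) as f:
        for n, line in enumerate(f, 1):
            req = json.loads(line)  # every line must be valid JSON
            cid = req["custom_id"]
            if cid in seen:
                raise ValueError(f"duplicate custom_id {cid!r} on line {n}")
            seen.add(cid)
    if len(seen) > MAX_REQUESTS:
        raise ValueError("more than 50,000 requests in one file")
    return len(seen)
```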

Code Generation Use Cases

The Batch API is a natural fit for code generation tasks that run offline, at scale, and where latency tolerance is measured in hours rather than seconds.

Bulk Test Generation

Generate unit tests for every function in a codebase. Feed each function signature and body as a batch request with a system prompt specifying your testing framework (Jest, pytest, Go testing). At $1/M input tokens with GPT-4.1 batch pricing, covering a 500-file codebase costs cents.

Codebase-Wide Code Review

Run every file in a PR or entire repository through a code review prompt. Batch lets you review thousands of files at once without hitting synchronous rate limits. The separate rate limit pool means this doesn't affect your production API traffic.

Framework Migration

Converting React class components to hooks, migrating Express to Fastify, upgrading Python 2 to 3. Each file conversion is an independent request. Package 10,000 files into a batch, pay half price, and get all conversions back within 24 hours.

Documentation Generation

Generate JSDoc, docstrings, or README content for every module. Pair the source code with a documentation prompt per file. The 50,000-request limit handles even large monorepos in a single batch.

Code Translation

Translate a codebase from one language to another: TypeScript to Python, Java to Kotlin, Ruby to Go. Each file is a standalone translation task with no cross-file dependencies, making it ideal for parallel batch processing.

Synthetic Training Data

Generate code examples, Q&A pairs, or instruction-following datasets for fine-tuning. The Batch API's 50% discount makes large-scale data generation economically viable where real-time pricing would blow the budget.

The Common Pattern

All these use cases share three properties: each request is independent (no cross-request dependencies), latency tolerance is high (hours, not seconds), and volume is large enough for the 50% discount to matter. If your workload fits this pattern, batch is almost always the right choice.

Batch vs Real-Time: When to Use Which

The decision is about latency tolerance, not quality. Batch and real-time use the same models and produce the same outputs. The only difference is when you get the result and what you pay.

Dimension | Batch API | Standard API
Cost | 50% of standard price | Full price
Latency | Up to 24 hours | Seconds
Rate limits | Separate pool, much higher | Standard pool
Streaming | Not supported | Supported
Interactive use | No | Yes
Max requests/call | 50,000 per file | 1 per call
Output format | JSONL file download | JSON response
Tool use / function calling | Supported (no streaming) | Supported (with streaming)

Use Batch When

  • The task can wait hours for results (overnight pipelines, CI/CD jobs, weekly reports)
  • You are processing hundreds or thousands of requests (the 50% discount compounds)
  • Each request is independent (no conversational context between requests)
  • You want to avoid impacting production rate limits

Use Real-Time When

  • A user is waiting for a response (chatbots, coding agents, interactive tools)
  • You need streaming token output
  • The task involves multi-turn conversation with tool use
  • Latency matters more than cost

When Batch Won't Work: Real-Time Coding Agents

The Batch API cannot serve interactive coding agents. When a developer types a question and expects code back in seconds, a 24-hour queue is not an option. Real-time coding needs streaming responses, multi-turn context, tool execution (running tests, reading files, searching code), and sub-second time-to-first-token.

This is the fundamental split in AI-assisted development: offline pipelines (batch-friendly) vs. interactive coding (real-time only).

Batch: Offline Pipelines

Test generation across a repo, pre-commit code review, migration scripts, documentation generation, training data creation. These run in CI/CD or scheduled jobs. Cost is the primary concern, not latency.

Real-Time: Interactive Coding

Coding agents (Claude Code, Cursor, Copilot), chat-based code assistance, live code review during development, interactive debugging. These need sub-second responses and streaming. Cost optimization comes from prompt caching and model routing, not batch queuing.

Most engineering teams need both. Batch handles the high-volume background work at half price. Real-time handles the interactive developer experience where latency is non-negotiable.

Optimizing Real-Time Costs

For real-time coding agent workloads where batch is not an option, cost optimization comes from different techniques: prompt caching (reusing cached prefixes across requests), model routing (sending simple tasks to cheaper models), and speculative decoding (using a fast draft model to accelerate a strong model). These techniques can achieve 30-70% cost reduction on real-time traffic without sacrificing latency.

Implementation Guide

A minimal Python implementation using the OpenAI SDK:

1. Create the JSONL input file

import json

requests = []
# files_to_review: list of source-file contents (strings) gathered elsewhere
for i, file_content in enumerate(files_to_review):
    requests.append({
        "custom_id": f"review-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4.1",
            "messages": [
                {"role": "system", "content": "Review this code for bugs and security issues."},
                {"role": "user", "content": file_content}
            ],
            "max_completion_tokens": 2000
        }
    })

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

2. Upload the file and create a batch

from openai import OpenAI
client = OpenAI()

# Upload the input file
input_file = client.files.create(
    file=open("batch_input.jsonl", "rb"),
    purpose="batch"
)

# Create the batch
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.status}")

3. Check status and download results

import time

# Poll until the batch reaches a terminal state
while batch.status not in {"completed", "failed", "expired", "cancelled"}:
    time.sleep(60)
    batch = client.batches.retrieve(batch.id)
    print(f"Status: {batch.status}")
# Statuses: validating, in_progress, finalizing, completed, failed, expired, cancelled

# Once completed, download results
if batch.status == "completed":
    result_file = client.files.content(batch.output_file_id)
    results = result_file.text

    for line in results.strip().split("\n"):
        result = json.loads(line)
        custom_id = result["custom_id"]
        response = result["response"]["body"]["choices"][0]["message"]["content"]
        print(f"{custom_id}: {response[:100]}...")

Error Handling

If individual requests fail, they appear in a separate error file accessible via batch.error_file_id. The batch itself can also fail if the input file is malformed. Check batch.errors for batch-level issues and the error file for per-request failures.
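A common follow-up is resubmitting only the failed requests. The sketch below assumes the error file has been downloaded and that the original requests were kept keyed by custom_id; the error-line shape shown is illustrative:

```python
import json

# Illustrative line from a downloaded error file.
error_lines = [
    '{"custom_id": "review-7", "response": {"status_code": 429, "body": null}}',
]

# Original requests, keyed by custom_id when the batch was first built.
original_requests = {
    "review-7": {"custom_id": "review-7", "method": "POST",
                 "url": "/v1/chat/completions",
                 "body": {"model": "gpt-4.1",
                          "messages": [{"role": "user", "content": "Review this file."}]}},
}

# Collect failed IDs and write a retry batch containing only those requests.
failed_ids = [json.loads(line)["custom_id"] for line in error_lines]
with open("batch_retry.jsonl", "w") as f:
    for cid in failed_ids:
        f.write(json.dumps(original_requests[cid]) + "\n")
```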

Frequently Asked Questions

How much does the OpenAI Batch API save?

50% on both input and output tokens, across all supported models. GPT-4.1 drops from $2/$8 to $1/$4 per million tokens. GPT-4o drops from $2.50/$10 to $1.25/$5. The discount is flat, no volume commitments required.

How long does the Batch API take?

The guaranteed window is 24 hours. In practice, smaller batches (hundreds of requests) often complete in minutes to a few hours. OpenAI uses the 24-hour window to schedule processing during off-peak capacity, which is how they fund the 50% discount.

Can I use the Batch API for code generation?

Yes. It supports /v1/chat/completions, the same endpoint used for code generation. Bulk test generation, code review, migration, and documentation generation all work well. It does not work for interactive coding where a developer needs real-time responses.

What is the JSONL format for the Batch API?

Each line is a JSON object with four fields: custom_id (unique string), method ("POST"), url (e.g., "/v1/chat/completions"), and body (the request payload with model, messages, and parameters). Maximum 50,000 requests per file, 200 MB size limit.

Does the Batch API affect my regular API rate limits?

No. Batch uses a completely separate rate limit pool. You can run large batch jobs and real-time traffic simultaneously. This is one of the less obvious benefits: batch offloads volume that would otherwise compete with your synchronous quota.

Can I cancel a batch in progress?

Yes. Call the cancel endpoint with your batch ID. Requests already completed will be available in the output file. Requests not yet processed will be cancelled. You are only charged for completed requests.

Does the Batch API support structured outputs?

Yes. You can use response_format with JSON mode or JSON Schema in your batch request bodies, the same as you would with the synchronous API. This is useful for code generation tasks where you want the output in a specific structure (e.g., a JSON object with separate fields for code, explanation, and test cases).
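For example, a batch request line that forces a JSON Schema result might look like the following sketch (the schema fields code and explanation are assumptions for illustration):

```python
import json

# One batch request using Structured Outputs via response_format.
request = {
    "custom_id": "gen-1",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4.1",
        "messages": [{"role": "user", "content": "Write a Python function that slugifies a string."}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "code_result",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string"},
                        "explanation": {"type": "string"}
                    },
                    "required": ["code", "explanation"],
                    "additionalProperties": False
                }
            }
        }
    }
}

line = json.dumps(request)  # one line of the batch input file
```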

Related Reading

For Real-Time Coding Agent Infrastructure, Try Morph

The Batch API handles offline pipelines at half price. For real-time coding agents that need sub-second inference, Morph optimizes cost through prompt caching, model routing, and speculative decoding. Batch for bulk. Morph for real-time.