Modal vs DeepInfra (2026): GPU Seconds vs Per-Token Pricing, and the Break-Even

Modal and DeepInfra both run AI workloads on serverless GPUs, but they sell different things. Modal is a compute platform: you write Python, decorate a function, and Modal runs it on a B200 or H100 billed per second of active compute, with no idle charges. You bring the model and own the serving code. DeepInfra is a managed inference API: you call Llama 3.3 70B or DeepSeek V4 over an OpenAI-compatible endpoint and pay per token, with zero infrastructure to manage.

That difference drives everything below. Modal wins when you need control, custom models, batch jobs, fine-tuning, or code sandboxes. DeepInfra wins when you want a frontier open model behind one API call and never want to think about a GPU again.

This guide breaks down per-second versus per-token pricing, the utilization break-even, cold-start behavior, compliance, and which platform fits a coding agent. All numbers are as of June 2026 and change often, so treat them as a snapshot.

TL;DR

Pick Modal if you want to run your own model, a batch job, a fine-tune, or an agent sandbox on per-second GPU billing. H100 at about $3.95/hr, B200 at about $6.25/hr, scale to zero, and snapshot cold starts as low as 2 seconds.
Pick DeepInfra if you want a managed open-model API with no infrastructure. Llama 3.3 70B Turbo at about $0.10 input and $0.32 output per million tokens, OpenAI-compatible, SOC 2, ISO 27001, and HIPAA certified.

Who Wins Per Workload

The choice is not which platform is "better" but which shape of workload you have. Read down the left column to your case.

Who Wins Per Workload

Workload / decision	Modal	DeepInfra
Call a stock open model cheaply	DeepInfra	DeepInfra ($0.10/M, no GPU)
Run your own / custom model	Modal (bring any weights)	DeepInfra
Own the serving engine	Modal (your code, your stack)	DeepInfra
Lowest first-token latency	Modal	DeepInfra (0.35s TTFT warm)
Non-LLM or batch GPU work	Modal (transcription, embeds)	DeepInfra
Run untrusted generated code	Modal (gVisor sandboxes)	DeepInfra
Fine-tune full models	Modal (full lifecycle)	DeepInfra (LoRA only)
Zero infra to manage	Modal	DeepInfra (just an API call)
Spiky traffic, scale to zero	Modal (~2s snapshot starts)	DeepInfra

Quick Comparison

Modal sells GPU seconds; DeepInfra sells tokens; Morph sells the coding-agent inner loop.

Modal vs DeepInfra vs Morph at a Glance

Spec	Modal	DeepInfra	Morph
What it is	Serverless GPU platform	Per-token inference API	Coding-agent inference layer
Billing model	Per second of GPU compute	Per token (per inference time for non-LLM)	Per request / per token
Bring your own model	Yes, any model	Open models only (catalog)	Code-specific models
Code-specific apply	No	No	Yes (/v1/code/apply)
Semantic code search	No	No	WarpGrep (morph-warp-grep-v2.1)
Apply throughput	Depends on your code	200-317 tok/s (top models)	~10,500 tok/s
First-pass apply accuracy	N/A	N/A	98%
Cold start	~2s with GPU snapshots	Sub-second TTFT (warm pool)	Warm fleet
Fine-tuning / LoRA	Yes (full lifecycle)	LoRA adapters	N/A
Sandboxes	Yes (gVisor isolated)	No	No
OpenAI-compatible API	You build it	Yes	Yes
Best for	Custom workloads, batch, sandboxes	Drop-in open-model API	Search + apply + compact loop

What Each One Is

Modal is infrastructure; DeepInfra is a product built on infrastructure.

Modal: serverless compute for any GPU workload

Modal lets Python developers deploy inference endpoints, run large-scale batch jobs, fine-tune open models, and execute secure agent sandboxes using function decorators, with no YAML, Kubernetes, or cluster management. It bills per second of active compute and scales to zero when idle. The platform covers the full machine-learning lifecycle in one place: training, inference, batch, and sandboxes.

The catch is that you own the serving code. Modal hands you a GPU and a great developer experience; loading a model, batching requests, and quantizing weights are your job. That is the price of flexibility.

DeepInfra: managed open-model inference

DeepInfra runs a catalog of open models behind one OpenAI-compatible API. You pick a model, send a request, and pay per token. There is no GPU to provision, no serving code to write, and no scaling to manage. It runs on its own US data centers and prices most open models below the per-token rates of Fireworks, Together, and Baseten: DeepSeek V4 Pro is $1.30/$2.60 per M in/out on DeepInfra versus $1.74/$3.48 on Fireworks and $2.10/$4.40 on Together.

DeepInfra also offers dedicated GPU clusters (DeepCluster) and private deployments for teams that need a specific model isolated, but the default product is the shared per-token endpoint.

Pricing: Per Second vs Per Token

The two pricing models are not directly comparable, which is the whole point.

Modal: per-second GPU rates

Modal GPU Pricing (per second / per hour, early 2026)

GPU	Per Second	Per Hour (approx)
B200	$0.001736	~$6.25
H200	$0.001261	~$4.54
H100	$0.001097	~$3.95
A100 80GB	$0.000694	~$2.50
A100 40GB	$0.000583	~$2.10
L40S	$0.000542	~$1.95
L4	$0.000222	~$0.80
T4	$0.000164	~$0.59

CPU bills at $0.0000131 per core-second and memory at $0.00000222 per GiB-second, on top of GPU. The Starter plan is $0 with $30 in monthly credits and 10 GPU concurrency. The Team plan is $250/mo with $100 credits and 50 GPU concurrency. Enterprise is custom.

DeepInfra: per-token model rates

DeepInfra Token Pricing (per 1M tokens, early 2026)

Model	Input	Output
Llama 3.3 70B Turbo	$0.10	$0.32
Llama 3.1 8B Instruct	$0.02	$0.05
DeepSeek V3	$0.32	$0.89
DeepSeek V4 Flash	$0.10 ($0.02 cached)	$0.20
DeepSeek V4 Pro	$1.30 ($0.10 cached)	$2.60
Kimi K2.6	$0.75 ($0.15 cached)	$3.50
GLM-5.1	$1.05 ($0.205 cached)	$3.50
Qwen3-235B-A22B-Instruct	$0.09	$0.10

DeepInfra also sells dedicated GPUs by the hour with minute-level granularity: A100 80GB at about $0.89, H100 at about $1.79, H200 at about $2.19, B200 at about $2.79, and B300 at about $4.20. Those dedicated rates undercut Modal's, but you are renting a fixed GPU rather than scaling to zero, so the right choice depends on utilization.

Which is cheaper?

For a standard open model, DeepInfra's per-token pricing usually wins because you pay only for tokens and DeepInfra packs many users onto each GPU. Modal wins when the workload is not token-shaped (batch transcription, fine-tuning, embeddings at scale, sandboxes) or when you run a custom model at steady throughput. Cheaper is not "it depends": below the utilization where a rented GPU stays busy, per-token wins; above it, a dedicated GPU you own wins. The next section computes where that line sits.

DeepInfra vs Other Per-Token APIs

DeepInfra is one of several per-token open-model APIs. If you are choosing where to call a model rather than whether to rent a GPU, the comparison that matters is DeepInfra against Fireworks, Together, Baseten, and Novita. On the popular open models DeepInfra is consistently the cheapest of the managed APIs.

Per-Token Price by Provider (per 1M tokens, input / output, June 2026)

Model	DeepInfra	Fireworks	Together	Baseten
DeepSeek V4 Pro	$1.30 / $2.60	$1.74 / $3.48	$2.10 / $4.40	$1.74 / $3.48
Kimi K2.6	$0.75 / $3.50	$0.95 / $4.00	$1.20 / $4.50	$0.95 / $4.00
GLM-5.1	$1.05 / $3.50	$1.40 / $4.40	$1.40 / $4.40	$1.30 / $4.30
DeepSeek V4 Flash	$0.10 / $0.20	$0.14 / $0.28	n/a	n/a

On dedicated GPUs by the hour, DeepInfra also undercuts the field: H100 80GB at $1.79/hr versus Modal at about $3.95/hr, Replicate at $5.49/hr, Together at $6.49/hr, Baseten at $6.50/hr, and Fireworks at $7.00/hr. B200 180GB is $2.79/hr on DeepInfra versus about $6.25/hr on Modal and $9.98 to $11.95/hr elsewhere.

The Best Place to Run DeepSeek: 16-Bit Activations

Price is not the only axis. Most serverless providers, including DeepInfra, quantize activations to fp8 to cut serving cost. That lowers the per-token price but moves outputs away from the reference weights, so a quantized DeepSeek does not return exactly what the full-precision model would.

Morph Open Source Models serve DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization. Responses match the reference weights, which makes Morph the best place to run DeepSeek when output fidelity matters. For coding agents specifically, Morph adds codegen-tuned speculative decoding plus custom low-level inference kernels, so it is the fastest and highest-quality option for code generation, not a general-purpose menu.

Running DeepSeek V4 Flash: Morph vs DeepInfra

	Morph (morph-dsv4flash)	DeepInfra (DeepSeek V4 Flash)
Input / output per 1M tokens	$0.139 / $0.278	$0.10 / $0.20
Activation precision	16-bit (bf16), unquantized	fp8 quantized
Codegen tuning	Speculative decoding + code kernels	General-purpose serving

See Morph Open Source Models and pricing.

Cost on a Real Workload

Cost on a real workload (computed from list prices, June 2026)

Scenario: serve Llama 3.3 70B at 50M output tokens per day.

DeepInfra per-token: Llama 3.3 70B Turbo output is $0.32 per million tokens, so 50M/day = 50 × $0.32 = $16/day = about $480/mo, with no GPU to keep warm and nothing idle.

Modal serverless: Modal does not sell tokens, it sells GPU seconds. One H100 at $3.95/hr running flat out is about $2,844/mo. To hit 50M tokens/day on a single H100 you need roughly 580 sustained output tok/s every second of the day (50,000,000 / 86,400). At that rate the dedicated GPU is fully utilized but still costs about 6x the per-token bill, because DeepInfra spreads the same GPU across many tenants and you cannot.

Break-even: a rented GPU only beats per-token once you saturate it. $480/mo of DeepInfra's own dedicated H100 ($1.79/hr) buys about 268 GPU-hours, roughly 11 hours a day. So a dedicated H100 wins only if you keep it busy more than about 11 of every 24 hours at full throughput; below that, per-token wins. Modal's edge is not a lower token price, it is running the workloads that have no token price at all.

Cold Starts & Autoscaling

Modal engineered cold starts as a feature; DeepInfra hides them behind a warm pool.

Modal containers boot in about 1 second on its custom container stack, and memory snapshotting captures a container's memory state at user-controlled points so it can be reused across boots, cutting the cold-start penalty for large models that would otherwise reload weights and recompile. Functions scale to zero when idle, with a default idle window of 60 seconds via scaledown_window (configurable from 2 seconds to 20 minutes). min_containers keeps warm capacity and buffer_containers over-provisions during spikes. This is what makes scale-to-zero usable: you stop paying for idle GPUs without punishing the next user with a long wait.

DeepInfra does not expose cold starts to you at all. It keeps models warm across many tenants and reports sub-second time-to-first-token, as low as 0.35s on top models, with top models reaching 200 to 317 tokens per second output. You never provision capacity; DeepInfra autoscales behind the API. The tradeoff is that you cannot tune the serving stack, because it is not yours, and DeepInfra quantizes activations to fp8 on many models to hit those rates.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Features & Differentiators

Modal's differentiators are about owning compute; DeepInfra's are about a tuned serving stack you do not have to build.

Feature Comparison

Feature	Modal	DeepInfra
GPU memory snapshots	Yes (sub-2s restore)	N/A (managed)
Code sandboxes (gVisor)	Yes	No
Full fine-tuning	Yes	Dedicated only
LoRA adapters	You build it	Yes (text + image)
Warm multi-tenant pool	You build it	Yes (sub-second TTFT)
FP8 quantization	You build it	Yes (default on many models)
JSON mode / structured output	You build it	Yes
Function calling	You build it	Yes
Prompt caching discount	You build it	Yes (e.g. V4 Pro $0.10 cached)
Batch jobs / workflows	Yes (first-class)	No

DeepInfra's serving stack keeps models warm and uses fp8 quantization on many models to reach 200 to 317 tok/s. Function calling, JSON mode, and structured output ship by default, so a tool-using agent works out of the box. On Modal you get all of that only if you build it, which is exactly the point: it is a platform, not a product.

Modal's sandboxes are the standout. They are gVisor-isolated containers that run untrusted code from an LLM or user and return stdout, stderr, and exit codes. For agents that execute generated code, that is a real capability DeepInfra has no answer to.

Compliance & Deployment

Both clear the standard enterprise bars; DeepInfra publishes the longer certification list.

Modal offers SOC 2 from the Starter tier and HIPAA compatibility plus audit logs on Enterprise. DeepInfra is SOC 2 and ISO 27001 certified, with technical and organizational measures for GDPR and HIPAA compliance, and a zero-retention policy: prompts and completions are deleted from disk and memory after a short retention period, with only metadata (request ID, cost, sampling parameters) logged. The one exception is Google models, where Google logs prompts and responses for abuse detection. Both run on US data centers.

DeepInfra limits accounts to 200 concurrent requests by default, with postpaid billing (card or pre-pay), monthly invoices, and mid-month invoicing at usage-tier thresholds of $20, $100, $500, $2,000, and $10,000. There is no free tier. Modal has no per-token context limit because the context window is a property of whatever model you choose to run, and its Starter plan ships $30/month in free credits.

When to Use DeepInfra

Drop-in open-model API. Llama, DeepSeek, Qwen, and more behind one OpenAI-compatible endpoint. Swap a base URL and you are running.
Cost-sensitive token workloads. Llama 3.3 70B Turbo at $0.10 input and $0.32 output per million tokens, with shared GPUs amortizing cost across tenants.
Tool-using agents out of the box. Function calling, JSON mode, and structured output ship by default, so no serving code is required.
Regulated industries. SOC 2 and ISO 27001 certified, with GDPR and HIPAA measures and a zero-retention policy on inference (metadata only, except Google models).
You never want to touch a GPU. No provisioning, no scaling, no serving stack. DeepInfra handles all of it behind the API.

One caveat for DeepSeek users: DeepInfra quantizes activations to fp8 on many models to hit its rates. If you need outputs that match the reference weights, Morph Open Source Models serve DeepSeek with unquantized 16-bit (bf16) activations, and morph-dsv4flash (DeepSeek V4 Flash) is $0.139 input and $0.278 output per 1M tokens.

Related Comparisons

Frequently Asked Questions

What is the difference between Modal and DeepInfra?

Modal is a serverless compute platform: you write Python, deploy a function, and pay per second of GPU time (about $3.95/hr for an H100 as of June 2026), bringing your own model and serving code. DeepInfra is a managed inference API: you call open models like Llama 3.3 70B or DeepSeek V4 over an OpenAI-compatible endpoint and pay per token (Llama 3.3 70B Turbo is $0.10 input and $0.32 output per million tokens). Modal gives you control; DeepInfra gives you a frontier open model with no infrastructure.

Is Modal or DeepInfra cheaper for LLM inference?

For a standard open model at high utilization, DeepInfra usually wins because you pay only for tokens and it packs many users onto each GPU. Modal can be cheaper for steady high-throughput on a model you control, or for workloads that are not token-shaped such as batch jobs, fine-tuning, and sandboxes. The crossover is your utilization, not the list price.

How fast are Modal cold starts?

Modal containers boot in about 1 second on its custom container stack. Memory snapshotting captures a container's memory state at user-controlled points so it can be reused across boots, which cuts the cold-start penalty for large models that would otherwise reload weights and recompile. Functions scale to zero with a default idle window of 60 seconds (configurable 2 seconds to 20 minutes), so scale-to-zero stays practical.

Does DeepInfra support function calling and JSON mode?

Yes. DeepInfra is OpenAI-compatible and supports streaming, function calling, JSON mode, and structured output, with time-to-first-token as low as 0.35 seconds and 200 to 317 tokens per second output on top models. It uses fp8 quantization on many models to hit those rates, and supports LoRA adapters and dedicated GPU clusters for private deployments.

Can I run a model that isn't in DeepInfra's catalog?

Not on the shared per-token API. DeepInfra serves a fixed catalog of open models (Llama, DeepSeek, Qwen, and similar). For a proprietary model, a custom architecture, a non-standard quantization, or a non-LLM workload like batch transcription or embeddings at scale, you need Modal, where you bring the weights and the serving engine and pay per second of GPU compute. DeepInfra's dedicated GPU clusters can host a private model but still run DeepInfra's serving stack, not yours; Modal is the option when you must own the engine itself.

Where is the best place to run DeepSeek for output fidelity?

Most serverless APIs, DeepInfra included, quantize activations to fp8 to cut serving cost, which moves outputs away from the reference weights. Morph Open Source Models serve DeepSeek with unquantized 16-bit (bf16) activations, so responses match the reference weights, and add codegen-tuned speculative decoding plus custom inference kernels for code generation. morph-dsv4flash (DeepSeek V4 Flash) is $0.139 input and $0.278 output per 1M tokens.

Is DeepInfra cheaper than Fireworks or Together?

On the popular open models, yes. DeepSeek V4 Pro is $1.30/$2.60 per M in/out on DeepInfra versus $1.74/$3.48 on Fireworks and $2.10/$4.40 on Together. Kimi K2.6 is $0.75/$3.50 on DeepInfra versus $0.95/$4.00 on Fireworks and $1.20/$4.50 on Together. Its dedicated H100 at $1.79/hr also undercuts Fireworks ($7.00/hr) and Together ($6.49/hr).

Running DeepSeek? Get 16-Bit Activations, Not fp8

Most serverless APIs quantize DeepSeek to fp8 to cut cost. Morph Open Source Models serve DeepSeek at full 16-bit (bf16), with codegen-tuned speculative decoding for coding agents. morph-dsv4flash is $0.139/$0.278 per 1M tokens.

See Morph Models

Pricing

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Modal vs DeepInfra (2026): $3.95/hr GPU Seconds vs $0.10/M Per-Token, and the Break-Even

Who Wins Per Workload

Quick Comparison

What Each One Is

Modal: serverless compute for any GPU workload

DeepInfra: managed open-model inference

Pricing: Per Second vs Per Token

Modal: per-second GPU rates

DeepInfra: per-token model rates

DeepInfra vs Other Per-Token APIs

The Best Place to Run DeepSeek: 16-Bit Activations

Cost on a Real Workload

Cold Starts & Autoscaling

Features & Differentiators

Compliance & Deployment

When to Use DeepInfra

Related Comparisons

Frequently Asked Questions

What is the difference between Modal and DeepInfra?

Is Modal or DeepInfra cheaper for LLM inference?

How fast are Modal cold starts?

Does DeepInfra support function calling and JSON mode?

Can I run a model that isn't in DeepInfra's catalog?

Where is the best place to run DeepSeek for output fidelity?

Is DeepInfra cheaper than Fireworks or Together?

Running DeepSeek? Get 16-Bit Activations, Not fp8

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Modal vs DeepInfra (2026): $3.95/hr GPU Seconds vs $0.10/M Per-Token, and the Break-Even

Who Wins Per Workload

Quick Comparison

What Each One Is

Modal: serverless compute for any GPU workload

DeepInfra: managed open-model inference

Pricing: Per Second vs Per Token

Modal: per-second GPU rates

DeepInfra: per-token model rates

DeepInfra vs Other Per-Token APIs

The Best Place to Run DeepSeek: 16-Bit Activations

Cost on a Real Workload

Cold Starts & Autoscaling

Features & Differentiators

Compliance & Deployment

When to Use Modal

When to Use DeepInfra

Related Comparisons

Frequently Asked Questions

What is the difference between Modal and DeepInfra?

Is Modal or DeepInfra cheaper for LLM inference?

How fast are Modal cold starts?

Does DeepInfra support function calling and JSON mode?

Can I run a model that isn't in DeepInfra's catalog?

Where is the best place to run DeepSeek for output fidelity?

Is DeepInfra cheaper than Fireworks or Together?

Running DeepSeek? Get 16-Bit Activations, Not fp8