Modal vs DeepInfra: Run Your Own Engine, or Just Call Open Models Cheaply

DeepInfra wins when you just want to call an open model cheaply per token (Llama 3.3 70B at $0.10/$0.32 per M, 0.35s TTFT). Modal wins when you must run your own model, engine, or non-LLM and code workloads on per-second GPUs. Here is the break-even.

June 3, 2026 · 1 min read

Modal and DeepInfra both run AI workloads on serverless GPUs, but they sell different things. Modal is a compute platform: you write Python, decorate a function, and Modal runs it on a B200 or H100 billed per second of active compute, with no idle charges. You bring the model and own the serving code. DeepInfra is a managed inference API: you call Llama 3.3 70B or DeepSeek V4 over an OpenAI-compatible endpoint and pay per token, with zero infrastructure to manage.

That difference drives everything below. Modal wins when you need control, custom models, batch jobs, fine-tuning, or code sandboxes. DeepInfra wins when you want a frontier open model behind one API call and never want to think about a GPU again.

This guide breaks down per-second versus per-token pricing, cold-start behavior, compliance, and which platform fits a coding agent. All numbers are as of early 2026 and change often, so treat them as a snapshot.

TL;DR

  • Pick Modal if you want to run your own model, a batch job, a fine-tune, or an agent sandbox on per-second GPU billing. H100 at about $3.95/hr, B200 at about $6.25/hr, scale to zero, and snapshot cold starts as low as 2 seconds.
  • Pick DeepInfra if you want a managed open-model API with no infrastructure. Llama 3.3 70B Turbo at about $0.10 input and $0.32 output per million tokens, OpenAI-compatible, SOC 2, ISO 27001, and HIPAA certified.

Who Wins Per Workload

The choice is not which platform is "better" but which shape of workload you have. Read down the left column to your case.

Workload / decisionModalDeepInfra
Call a stock open model cheaplyDeepInfraDeepInfra ($0.10/M, no GPU)
Run your own / custom modelModal (bring any weights)DeepInfra
Own the serving engineModal (your code, your stack)DeepInfra
Lowest first-token latencyModalDeepInfra (0.35s TTFT warm)
Non-LLM or batch GPU workModal (transcription, embeds)DeepInfra
Run untrusted generated codeModal (gVisor sandboxes)DeepInfra
Fine-tune full modelsModal (full lifecycle)DeepInfra (LoRA only)
Zero infra to manageModalDeepInfra (just an API call)
Spiky traffic, scale to zeroModal (~2s snapshot starts)DeepInfra

Quick Comparison

Modal sells GPU seconds; DeepInfra sells tokens; Morph sells the coding-agent inner loop.

SpecModalDeepInfraMorph
What it isServerless GPU platformPer-token inference APICoding-agent inference layer
Billing modelPer second of GPU computePer token (per inference time for non-LLM)Per request / per token
Bring your own modelYes, any modelOpen models only (catalog)Code-specific models
Code-specific applyNoNoYes (/v1/code/apply)
Semantic code searchNoNoWarpGrep (morph-warp-grep-v2.1)
Apply throughputDepends on your code200-317 tok/s (top models)~10,500 tok/s
First-pass apply accuracyN/AN/A98%
Cold start~2s with GPU snapshotsSub-second TTFT (warm pool)Warm fleet
Fine-tuning / LoRAYes (full lifecycle)LoRA adaptersN/A
SandboxesYes (gVisor isolated)NoNo
OpenAI-compatible APIYou build itYesYes
Best forCustom workloads, batch, sandboxesDrop-in open-model APISearch + apply + compact loop

What Each One Is

Modal is infrastructure; DeepInfra is a product built on infrastructure.

Modal: serverless compute for any GPU workload

Modal lets Python developers deploy inference endpoints, run large-scale batch jobs, fine-tune open models, and execute secure agent sandboxes using function decorators, with no YAML, Kubernetes, or cluster management. It bills per second of active compute and scales to zero when idle. The platform covers the full machine-learning lifecycle in one place: training, inference, batch, and sandboxes.

The catch is that you own the serving code. Modal hands you a GPU and a great developer experience; loading a model, batching requests, and quantizing weights are your job. That is the price of flexibility.

DeepInfra: managed open-model inference

DeepInfra runs a catalog of open models behind one OpenAI-compatible API. You pick a model, send a request, and pay per token. There is no GPU to provision, no serving code to write, and no scaling to manage. DeepInfra raised a $107M Series B to scale this inference infrastructure and runs on its own US data centers, including NVIDIA Blackwell B200 systems.

DeepInfra also offers dedicated GPU clusters (DeepCluster) and private deployments for teams that need a specific model isolated, but the default product is the shared per-token endpoint.

Pricing: Per Second vs Per Token

The two pricing models are not directly comparable, which is the whole point.

Modal: per-second GPU rates

GPUPer SecondPer Hour (approx)
B200$0.001736~$6.25
H200$0.001261~$4.54
H100$0.001097~$3.95
A100 80GB$0.000694~$2.50
A100 40GB$0.000583~$2.10
L40S$0.000542~$1.95
L4$0.000222~$0.80
T4$0.000164~$0.59

CPU bills at $0.0000131 per core-second and memory at $0.00000222 per GiB-second, on top of GPU. The Starter plan is $0 with $30 in monthly credits and 10 GPU concurrency. The Team plan is $250/mo with $100 credits and 50 GPU concurrency. Enterprise is custom.

DeepInfra: per-token model rates

ModelInputOutput
Llama 3.3 70B Turbo$0.10$0.32
Llama 3.1 70B$0.40$0.40
Llama 4 Maverick 17B-128E$0.15$0.60
DeepSeek V3$0.32$0.89
DeepSeek V4 Flash$0.10 ($0.02 cached)$0.20
DeepSeek V4 Pro$1.30 ($0.10 cached)$2.60
Qwen3-235B-A22B$0.071$0.10
Embeddings$0.005 - $0.01N/A

DeepInfra also sells dedicated GPUs by the hour with minute-level granularity: A100 80GB at about $0.89, H100 at about $1.79, H200 at about $2.19, B200 at about $2.79, and B300 at about $4.20. Those dedicated rates undercut Modal's, but you are renting a fixed GPU rather than scaling to zero, so the right choice depends on utilization.

Which is cheaper?

For a standard open model, DeepInfra's per-token pricing usually wins because you pay only for tokens and DeepInfra packs many users onto each GPU. Modal wins when the workload is not token-shaped (batch transcription, fine-tuning, embeddings at scale, sandboxes) or when you run a custom model at steady throughput. Cheaper is not "it depends": below the utilization where a rented GPU stays busy, per-token wins; above it, a dedicated GPU you own wins. The next section computes where that line sits.

Cost on a Real Workload

Cost on a real workload (computed from list prices, June 2026)

Scenario: serve Llama 3.3 70B at 50M output tokens per day.

DeepInfra per-token: Llama 3.3 70B Turbo output is $0.32 per million tokens, so 50M/day = 50 × $0.32 = $16/day = about $480/mo, with no GPU to keep warm and nothing idle.

Modal serverless: Modal does not sell tokens, it sells GPU seconds. One H100 at $3.95/hr running flat out is about $2,844/mo. To hit 50M tokens/day on a single H100 you need roughly 580 sustained output tok/s every second of the day (50,000,000 / 86,400). At that rate the dedicated GPU is fully utilized but still costs about 6x the per-token bill, because DeepInfra spreads the same GPU across many tenants and you cannot.

Break-even: a rented GPU only beats per-token once you saturate it. $480/mo of DeepInfra's own dedicated H100 ($1.79/hr) buys about 268 GPU-hours, roughly 11 hours a day. So a dedicated H100 wins only if you keep it busy more than about 11 of every 24 hours at full throughput; below that, per-token wins. Modal's edge is not a lower token price, it is running the workloads that have no token price at all.

Cold Starts & Autoscaling

Modal engineered cold starts as a feature; DeepInfra hides them behind a warm pool.

Modal's GPU memory snapshots checkpoint a container's full state, including loaded CUDA kernels, captured CUDA graphs, and a compiled model, just before it accepts its first request. Restoring from a snapshot means torch.compile does not run again and weights do not reload. A function that took about 20 seconds to cold boot drops to as low as 2 seconds, roughly 10x faster. This is what makes scale-to-zero usable: you stop paying for idle GPUs without punishing the next user with a 20-second wait. The feature is in alpha as of early 2026.

DeepInfra does not expose cold starts to you at all. It keeps models warm across many tenants and reports sub-second time-to-first-token, as low as 0.35s on top models, with FP8 quantization delivering TTFT around 0.91s. Top models reach 200 to 317 tokens per second output. You never provision capacity; DeepInfra autoscales behind the API. The tradeoff is that you cannot tune the serving stack, because it is not yours.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Features & Differentiators

Modal's differentiators are about owning compute; DeepInfra's are about a tuned serving stack you do not have to build.

FeatureModalDeepInfra
GPU memory snapshotsYes (sub-2s restore)N/A (managed)
Code sandboxes (gVisor)YesNo
Full fine-tuningYesDedicated only
LoRA adaptersYou build itYes (text + image)
Speculative decodingYou build itYes (built-in)
FP8 quantizationYou build itYes (~0.91s TTFT)
JSON mode / structured outputYou build itYes
Function callingYou build itYes
KV-cache-aware routingYou build itYes (TensorRT-LLM)
Batch jobs / workflowsYes (first-class)No

DeepInfra's serving stack runs TensorRT-LLM with speculative decoding, multi-token prediction, FP8 quantization, and KV-cache-aware routing. Function calling, JSON mode, and structured output ship by default, so a tool-using agent works out of the box. On Modal you get all of that only if you build it, which is exactly the point: it is a platform, not a product.

Modal's sandboxes are the standout. They are gVisor-isolated containers that run untrusted code from an LLM or user and return stdout, stderr, and exit codes. For agents that execute generated code, that is a real capability DeepInfra has no answer to.

Compliance & Deployment

Both clear the standard enterprise bars; DeepInfra publishes the longer certification list.

Modal offers SOC 2 across all tiers and HIPAA compatibility on Enterprise, plus audit logs, Okta SSO, and a static IP proxy on Enterprise. DeepInfra is SOC 2 and ISO 27001 certified and HIPAA, PCI, GDPR, FedRAMP, and CSA STAR Level 1 compliant, with a zero-retention policy that keeps inputs, outputs, and user data private. Both run on US data centers; DeepInfra can place dedicated clusters in a preferred region on request.

DeepInfra limits accounts to 200 concurrent requests by default, raisable from the dashboard. Some FP4 models cap context at 66k tokens, while Llama 4 Maverick and DeepSeek V4 Flash variants reach 1M. Modal has no per-token context limit because the context window is a property of whatever model you choose to run.

When to Use Modal

  • Custom or proprietary models. If the model is not in a public catalog, Modal lets you run it on per-second GPUs without standing up Kubernetes.
  • Batch and async workloads. Batch transcription, large-scale embeddings, periodic fine-tunes, and workflows are first-class on Modal and awkward on a token API.
  • Agent sandboxes. gVisor-isolated containers run untrusted generated code and return stdout, stderr, and exit codes. This is the cleanest sandbox primitive of the two.
  • Spiky traffic with scale-to-zero. Snapshot cold starts near 2 seconds make scaling down to zero practical, so you stop paying for idle GPUs between bursts.
  • You want to own the serving stack. Quantization, batching, and routing are yours to tune, which matters when you have squeezed everything out of a managed API.

When to Use DeepInfra

  • Drop-in open-model API. Llama, DeepSeek, Qwen, and more behind one OpenAI-compatible endpoint. Swap a base URL and you are running.
  • Cost-sensitive token workloads. Llama 3.3 70B Turbo at $0.10 input and $0.32 output per million tokens, with shared GPUs amortizing cost across tenants.
  • Tool-using agents out of the box. Function calling, JSON mode, and structured output ship by default, so no serving code is required.
  • Regulated industries. SOC 2, ISO 27001, HIPAA, PCI, GDPR, and FedRAMP, plus a zero-retention policy and dedicated clusters in a preferred region.
  • You never want to touch a GPU. No provisioning, no scaling, no serving stack. DeepInfra handles all of it behind the API.

Related Comparisons

Frequently Asked Questions

What is the difference between Modal and DeepInfra?

Modal is a serverless compute platform: you write Python, deploy a function, and pay per second of GPU time (about $3.95/hr for an H100 as of early 2026), bringing your own model and serving code. DeepInfra is a managed inference API: you call open models like Llama 3.3 70B or DeepSeek V4 over an OpenAI-compatible endpoint and pay per token (Llama 3.3 70B Turbo is about $0.10 input and $0.32 output per million tokens). Modal gives you control; DeepInfra gives you a frontier open model with no infrastructure.

Is Modal or DeepInfra cheaper for LLM inference?

For a standard open model at high utilization, DeepInfra usually wins because you pay only for tokens and it packs many users onto each GPU. Modal can be cheaper for steady high-throughput on a model you control, or for workloads that are not token-shaped such as batch jobs, fine-tuning, and sandboxes. The crossover is your utilization, not the list price.

How fast are Modal cold starts?

Modal's GPU memory snapshots checkpoint a container's full state, including loaded CUDA kernels, captured CUDA graphs, and a compiled model, just before it accepts a request. Restoring drops a cold start that would take about 20 seconds to as low as 2 seconds, roughly 10x. This makes scale-to-zero practical. The feature is in alpha as of early 2026.

Does DeepInfra support function calling and JSON mode?

Yes. DeepInfra is OpenAI-compatible and supports streaming, function calling, JSON mode, and structured output. Its stack uses TensorRT-LLM, speculative decoding, multi-token prediction, FP8 quantization, and KV-cache-aware routing, with time-to-first-token as low as 0.35 seconds on top models. It also supports LoRA adapters and dedicated clusters for private deployments.

Can I run a model that isn't in DeepInfra's catalog?

Not on the shared per-token API. DeepInfra serves a fixed catalog of open models (Llama, DeepSeek, Qwen, and similar). For a proprietary model, a custom architecture, a non-standard quantization, or a non-LLM workload like batch transcription or embeddings at scale, you need Modal, where you bring the weights and the serving engine and pay per second of GPU compute. DeepInfra's dedicated GPU clusters can host a private model but still run DeepInfra's serving stack, not yours; Modal is the option when you must own the engine itself.

Call Open Models on DeepInfra, Run Your Own Engine on Modal

If applying model-generated code edits is the bottleneck instead, that is a separate, code-specific layer. See where Morph Fast Apply lands.