Baseten vs DeepInfra (2026): $1.79/hr vs $6.50/hr H100, and When to Pay More

You want to serve an open model and the question is whether to pay DeepInfra's serverless rate or stand up your own deployment on Baseten. The price gap is not subtle. DeepInfra's dedicated H100 80GB is $1.79/hr; Baseten's is $6.50/hr, 3.6x more. On tokens it is the same story: DeepSeek V4-Pro is $1.30/$2.60 per 1M on DeepInfra, $1.74/$3.48 on Baseten.

So why does anyone pay Baseten more? Because the platforms answer different questions. DeepInfra is "call a popular open model behind one OpenAI-compatible URL for the lowest price per token, no setup." Baseten is "run my own or fine-tuned model on a dedicated GPU with engine selection, autoscaling, scale-to-zero, and HIPAA." DeepInfra is a per-token menu. Baseten is a deployment platform that also sells per-token Model APIs.

This page has the full per-token tables for both, the dedicated GPU rates, the cold-start behavior, and a break-even calculation so you can decide from your own numbers. All prices are list prices as of June 2026.

$1.79/hr

DeepInfra H100

$6.50/hr

Baseten H100

$1.30/$2.60

DeepSeek V4-Pro (DeepInfra)

200 req

DeepInfra concurrency

TL;DR

Pick DeepInfra if you want the lowest price per token with zero setup. DeepSeek V4-Pro at $1.30/$2.60 per 1M, GLM-5.1 at $1.05/$3.50, Kimi K2.6 at $0.75/$3.50, a dedicated H100 at $1.79/hr, plus resold Claude (Sonnet 4.6 $3/$15) on one OpenAI-compatible API. No free tier; 200 concurrent requests per account.
Pick Baseten if you run your own or fine-tuned model and the premium buys something you use: dedicated GPUs billed per replica-minute (H100 $6.50/hr), per-model engine selection (TensorRT-LLM, SGLang), scale-to-zero with configurable autoscaling, free starter credits, SOC 2 Type II, and HIPAA.

Who Wins Per Workload

The decision is per workload, not per platform. DeepInfra wins on price and convenience; Baseten wins wherever running your own model with control over serving is the point.

Who Wins Per Workload

Workload / decision	Baseten	DeepInfra
Bursty / low volume	Pays for warm replicas	DeepInfra, no idle cost
Sustained high volume	Baseten, you keep batching margin	Per-token adds up
Cheapest per token	3x to 4x more	DeepInfra, V4-Pro $1.30/$2.60
Cheapest dedicated H100	$6.50/hr	DeepInfra, $1.79/hr
Run your own model / engine	Baseten, Truss + engine selection	Shared menu only
Scale-to-zero	Baseten, min_replica=0 default	Always-warm shared
Resold Claude / Anthropic	No	DeepInfra, Opus/Sonnet/Haiku
Fastest first call (no cold start)	Scale-to-zero delay	DeepInfra, warm endpoints
HIPAA / regulated	Baseten, HIPAA	SOC 2 + ISO 27001

Quick Comparison

Baseten is the control-and-deployment platform; DeepInfra is the price-and-convenience platform.

Baseten vs DeepInfra vs Morph at a Glance

Spec	Baseten	DeepInfra	Morph
Primary model	Dedicated GPU replicas	Serverless per-token	Code-specific endpoints
Billing	Per replica-minute	Per token (or GPU-hr)	Per request / per token
H100 80GB dedicated	$6.50/hr	$1.79/hr	N/A
B200 180GB dedicated	$9.98/hr	$2.79/hr	N/A
DeepSeek V4-Pro per 1M	$1.74 / $3.48	$1.30 / $2.60	N/A
Free tier	Free credits	None	WarpGrep $0/100k
Engine selection	TensorRT-LLM, SGLang	In-house, fixed menu	Code-tuned
Concurrency cap	You set replicas	200 req / account	N/A
Code-specific apply	No	No	Yes (/v1/code/apply)
Apply throughput	General serving	General serving	~10,500 tok/s
Best for	Custom prod models	Cheap per-token calls	Coding agents

Billing Models: Replica-Minute vs Per-Token

This is the core difference, and it decides which platform is cheaper for you.

Baseten dedicated deployments bill per replica-minute. Every running replica costs the underlying GPU rate continuously, whether it serves one request or a thousand. That is efficient at high, steady utilization where you pack a GPU with batched traffic, and expensive when traffic is low or spiky, because warm replicas still bill. Baseten also offers Model APIs for pre-optimized open models priced per token, so you can mix both inside one account.

DeepInfra serverless is the inverse: you pay per input and output token on shared infrastructure and pay nothing when you send nothing. There is no replica to keep warm and no idle cost, but there is no free tier and billing is postpaid. For dedicated needs, DeepInfra rents whole GPUs by the hour at roughly a third of Baseten's rate.

The Utilization Crossover

Per-token pricing wins until your traffic is high and steady enough that a dedicated replica stays near full utilization. Above that line, a per-minute GPU you saturate yourself can beat per-token rates because you capture the batching margin. Below it, serverless per-token is cheaper and simpler. The crossover math is in the Cost on a Real Workload section.

Per-Token Pricing: DeepInfra Undercuts on Every Shared Model

On serverless per-token rates, DeepInfra is cheaper than Baseten's Model APIs on the models both list. DeepInfra also resells Anthropic models, which Baseten does not.

Serverless Per-Token Pricing (per 1M tokens, input / output)

Model	Baseten	DeepInfra
DeepSeek V4-Pro	$1.74 / $3.48	$1.30 / $2.60
DeepSeek V4-Flash	Not listed	$0.10 / $0.20
Kimi K2.6	$0.95 / $4.00	$0.75 / $3.50
GLM 5.1	$1.30 / $4.30	$1.05 / $3.50
GPT-OSS 120B	$0.10 / $0.50	Not listed
Qwen3-235B-A22B	Not listed	$0.09 / $0.10
Llama 3.1 8B Instruct	Not listed	$0.02 / $0.05
Claude Sonnet 4.6 (resold)	No	$3.00 / $15.00

DeepInfra prices cached input separately on most models (DeepSeek V4-Pro cached input is $0.10/M, GLM-5.1 cached is $0.205/M), which matters for agentic workloads that resend a large system prompt each turn. Its catalog also covers DeepSeek V3 ($0.32/$0.89), DeepSeek R1-0528 ($0.50/$2.15), MiniMax-M2.5 ($0.15/$1.15), and Qwen3-Max ($1.20/$6.00). Baseten's Model API catalog includes NVIDIA Nemotron 3 Ultra ($0.60/$2.40) and Nemotron 3 Super ($0.30/$0.75), which DeepInfra does not headline.

GPT-OSS 120B: Baseten is the cheap one here

The one model where Baseten undercuts is GPT-OSS 120B at $0.10/$0.50 per 1M. Across the broader market that ties or beats Fireworks, Together, and Groq (all $0.15/$0.60). DeepInfra does not headline a GPT-OSS 120B rate, so on this specific model Baseten's Model API is the price leader of the two.

Best place to run DeepSeek: output fidelity, not just price

Price per token is one axis; output quality is the other. Most serverless providers quantize activations to fp8 to cut cost, which degrades output relative to the reference weights. Morph serves DeepSeek with 16-bit (bf16) activations and does not quantize activations to fp8, so output matches the reference model. morph-dsv4flash (DeepSeek V4 Flash) runs at $0.139 per 1M input and $0.278 per 1M output. For coding agents, Morph also runs codegen-specific speculative decoding plus custom low-level inference kernels, which makes it the fastest and highest-quality option for code generation. See Morph models and pricing.

Dedicated GPU Pricing: A 3.6x Gap on the H100

If you rent whole GPUs, DeepInfra is far cheaper on every comparable card.

Dedicated GPU Pricing (per GPU-hour)

GPU	Baseten	DeepInfra
A100 80GB	$4.00/hr	$0.89/hr
H100 80GB	$6.50/hr	$1.79/hr
H100 MIG 40GB	$3.75/hr	N/A
H200 141GB	Available	$2.19/hr
B200 180GB	$9.98/hr	$2.79/hr
B300	N/A	$4.20/hr ($270GB)
T4 16GB	$0.63/hr	N/A
L4 24GB	$0.85/hr	N/A
A10G 24GB	$1.21/hr	N/A

Baseten's posted GPU rates run roughly 3x to 4x DeepInfra's on comparable hardware (H100 $6.50 vs $1.79 is 3.6x; B200 $9.98 vs $2.79 is 3.6x). The gap reflects what you get: Baseten's rate bundles its inference stack, engine selection, autoscaling, and managed deployment. DeepInfra rents closer to raw capacity. Baseten also serves a deeper lineup of small GPUs (T4, L4, A10G, H100 MIG) for cost-tuning smaller models, which DeepInfra does not break out on its dedicated menu.

Cost on a Real Workload

Cost on a real workload (computed from list prices, June 2026)

Serving DeepSeek V4-Pro at 50M output tokens/day, using only the prices above:

DeepInfra serverless: 50M tok/day x $2.60 per 1M output = $130/day = ~$3,900/mo. No idle cost, plus input tokens at $1.30/M.
DeepInfra dedicated H100: $1.79/hr x 24 x 30 = ~$1,289/mo per GPU, regardless of volume.
Baseten dedicated H100: $6.50/hr x 24 x 30 = ~$4,680/mo per GPU, regardless of volume.

Here a saturated DeepInfra dedicated H100 ($1,289/mo) already beats DeepInfra serverless ($3,900/mo) at 50M output tok/day, because V4-Pro output is priced at $2.60/M. Break-even against DeepInfra serverless output: $1,289/mo / $2.60 per 1M = ~16.5M output tok/day for a DeepInfra H100, and $4,680/mo / $2.60 per 1M = ~60M output tok/day for a Baseten H100. So below ~16.5M output tok/day serverless wins; a single saturated DeepInfra H100 wins between roughly 16.5M and 60M; you cross into Baseten-H100 territory only above ~60M output tok/day on one replica, where its engine selection and batching margin have to make up the 3.6x rate gap. The catch on dedicated: a real H100 will not sustain V4-Pro output at the same throughput it sustains a 70B model, so your true break-even depends on the tokens-per-second you actually get, not the list price alone.

Cold Starts and Scale-to-Zero

DeepInfra's popular models are always warm; Baseten's scale-to-zero trades idle cost for a first-request delay.

DeepInfra serves high-traffic open models on shared endpoints that never scale down, so there is no cold start for the popular models. When a model is saturated you can hit a limit, after which DeepInfra adds capacity. There is no replica for you to keep warm and no idle bill.

Baseten's autoscaling default is min_replica=0. Baseten's docs state that a deployment scaled to zero replicas incurs no charges, but the next request triggers a cold start that can take minutes for large models. During wake-up, billing is per minute even though the replica is not yet serving responses. To eliminate cold starts, the docs recommend min_replica of at least 2 for production, which means paying for warm idle capacity. That is the control-versus-cost tradeoff in one knob.

Baseten Autoscaling Defaults

Setting	Default	Range
Min replicas	0 (scale to zero)	0 and up
Max replicas	1	Configurable
Concurrency target	1 request / replica	1 to 256 (model-specific)
Autoscaling window	60s	10s to 3600s
Production guidance	min_replica >= 2	to remove cold starts

Neither platform is built for the coding-agent apply loop. If applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Rate Limits and Account Model

DeepInfra caps concurrency at the account level; Baseten's ceiling is whatever you provision in replicas.

Limits and Billing

Aspect	Baseten	DeepInfra
Concurrency cap	Set by your replica count	200 requests / account
Free tier	Free starter credits	None
Billing	Pay-as-you-go per replica-minute	Postpaid, monthly + mid-month
Mid-month invoice triggers	N/A	$20, $100, $500, $2k, $10k
Plan tiers	Basic $0, Pro, Enterprise	Single PAYG account

DeepInfra's 200 concurrent-request cap is the practical ceiling on serverless throughput per account; sustained high concurrency past that means either contacting them or moving to dedicated GPUs. Baseten has no shared cap because each deployment is your own replicas, and you scale the count to match load.

Compliance and Data Retention

Both are SOC 2 Type II; the differences are HIPAA, ISO 27001, and how each handles your prompts.

Compliance and Retention

Aspect	Baseten	DeepInfra
SOC 2 Type II	Yes	Yes
ISO 27001	Not stated	Yes
HIPAA	Yes	Measures in place
GDPR	Not stated	Measures in place
Prompt retention	Deployment-controlled	Zero retention on inference

DeepInfra states zero retention on inference: prompts and completions are deleted from disk and memory after a short retention period, with only metadata logged (request ID, cost, sampling parameters). The exception is Google models, where Google logs prompts and responses for abuse detection. DeepInfra is SOC 2 and ISO 27001 certified, with technical and organizational measures for GDPR and HIPAA. Baseten is SOC 2 Type II certified and HIPAA compliant, and because you run dedicated deployments, retention is governed by how you configure them.

When to Use Baseten

Custom or fine-tuned models in production. Package the model with Truss, pick the GPU, and control autoscaling and serving logic. This is the platform's core.
You need engine-level control. Per-model selection between TensorRT-LLM and SGLang for a specific model that has to be fast, instead of a fixed shared menu.
Scale-to-zero with a tunable cost knob. min_replica=0 means no idle bill when traffic stops; min_replica of 2 or more removes cold starts when it matters. You decide per deployment.
HIPAA and a free start. SOC 2 Type II, HIPAA, and free starter credits for experimentation before you commit.
High, steady utilization. When a dedicated replica stays near full, per-minute billing you control can beat per-token rates because you keep the batching margin.
GPT-OSS 120B specifically. Baseten's $0.10/$0.50 Model API rate is the cheaper of the two on this model.

When to Use DeepInfra

Lowest cost per token. DeepSeek V4-Pro at $1.30/$2.60, GLM-5.1 at $1.05/$3.50, and Kimi K2.6 at $0.75/$3.50 undercut Baseten's Model API on every shared model both list.
Cheapest dedicated GPUs. H100 at $1.79/hr and B200 at $2.79/hr, roughly a third of Baseten's posted rates.
Zero setup, no idle cost. Call a popular open model behind an OpenAI-compatible URL and pay only for tokens, with no replica to keep warm.
Resold Anthropic models. Claude Opus 4.8 ($5/$25), Sonnet 4.6 ($3/$15), and Haiku 4.5 ($1/$5) under the same account, which Baseten does not offer.
Zero-retention serverless. Prompts and completions deleted after a short window, only metadata logged, with SOC 2 and ISO 27001.

Frequently Asked Questions

Is Baseten or DeepInfra cheaper?

DeepInfra is cheaper on both dedicated GPUs and per-token calls. A dedicated H100 80GB is $1.79/hr on DeepInfra versus $6.50/hr on Baseten, a 3.6x gap. On tokens, DeepSeek V4-Pro is $1.30/$2.60 per 1M on DeepInfra versus $1.74/$3.48 on Baseten, and GLM-5.1 is $1.05/$3.50 versus $1.30/$4.30. The one model where Baseten is cheaper is GPT-OSS 120B at $0.10/$0.50.

What is the difference between Baseten and DeepInfra?

DeepInfra is serverless-first: call a shared per-token endpoint for popular open models with no infrastructure to manage and no free tier. Baseten is a deployment platform: ship a model, pick the GPU, pay per replica-minute, with TensorRT-LLM or SGLang engine selection and scale-to-zero. Baseten also sells per-token Model APIs for pre-optimized open models. DeepInfra gives the lowest price per token; Baseten gives control over how your own model is served.

Does Baseten have cold starts?

Yes, for scaled-to-zero deployments. Baseten's default is min_replica=0, so a deployment with no traffic incurs no charges, but the next request triggers a cold start that the docs say can take minutes for large models, and billing is per minute during wake-up even before the replica serves. The docs recommend min_replica of at least 2 for production to remove cold starts, at the cost of warm idle capacity. DeepInfra's shared endpoints stay warm.

What are DeepInfra's rate limits?

DeepInfra allows 200 concurrent requests per account. Billing is postpaid with monthly invoices plus mid-month invoicing at usage-tier thresholds of $20, $100, $500, $2,000, and $10,000. There is no free tier. Baseten has no shared concurrency cap because you run dedicated replicas and scale them yourself.

Does DeepInfra retain my prompts?

No, except for Google models. DeepInfra states zero retention on inference: prompts and completions are deleted from disk and memory after a short retention period, with only metadata (request ID, cost, sampling parameters) logged. The exception is Google models, where Google logs prompts and responses for abuse detection. DeepInfra is SOC 2 and ISO 27001 certified.

When is Baseten worth 3.6x DeepInfra's GPU price?

When you run your own or fine-tuned model and use what the premium buys: per-model engine selection between TensorRT-LLM and SGLang, custom-model deploys via Truss, scale-to-zero with configurable autoscaling, and HIPAA-eligible deployment. If you only need to call a popular open model behind one API for the lowest price, DeepInfra's $1.30/$2.60 DeepSeek V4-Pro and $1.79/hr H100 win and the Baseten controls go unused.

Related Comparisons

DeepInfra for Cheap Tokens, Baseten for Custom Deploys

Serve open models cheaply behind one API with DeepInfra, or pay Baseten's premium when running your own model with engine selection and HIPAA earns it. If applying model-generated code edits is your bottleneck, that is a separate tool.

Try Morph Free

Fast Apply benchmarks

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Baseten vs DeepInfra (2026): DeepInfra's H100 Is $1.79/hr vs Baseten's $6.50/hr, and Why You'd Still Pay More

Who Wins Per Workload

Quick Comparison

Billing Models: Replica-Minute vs Per-Token

Per-Token Pricing: DeepInfra Undercuts on Every Shared Model

Dedicated GPU Pricing: A 3.6x Gap on the H100

Cost on a Real Workload

Cold Starts and Scale-to-Zero

Rate Limits and Account Model

Compliance and Data Retention

When to Use Baseten

When to Use DeepInfra

Frequently Asked Questions

Is Baseten or DeepInfra cheaper?

What is the difference between Baseten and DeepInfra?

Does Baseten have cold starts?

What are DeepInfra's rate limits?

Does DeepInfra retain my prompts?

When is Baseten worth 3.6x DeepInfra's GPU price?

Related Comparisons

DeepInfra for Cheap Tokens, Baseten for Custom Deploys