Baseten Alternatives (2026): Token Prices, GPU Rates, and Cold Starts Compared

Baseten dedicated H100s cost $6.50/hr. The same H100 80GB runs $1.79/hr on DeepInfra and about $3.95/hr on Modal. Baseten's scale-to-zero default bills per minute through a cold start that "can take minutes for large models," and its docs recommend two warm replicas for production, which cancels the savings. This page compares eight Baseten alternatives on the numbers that decide cost, all dated 2026-06-09 with sources.

$1.79/hr: DeepInfra H100 vs Baseten $6.50/hr

~1s: Modal container cold start

500 tok/s: Groq GPT-OSS 120B throughput

8 alternatives compared on exact prices

The Baseten Cost Problem

Baseten is real infrastructure: dedicated deployments on isolated GPUs, your own autoscaling policy, and SOC 2 Type II plus HIPAA compliance. For a fixed, steady workload on a model you want to tune yourself, that control is worth paying for.

Two things push teams to look elsewhere. First, the GPU rate. Baseten bills dedicated hardware by the minute: H100 80GB at $0.10833/min ($6.50/hr), B200 180GB at $0.16633/min ($9.98/hr), A100 80GB at $0.06667/min ($4.00/hr). DeepInfra serves the same H100 at $1.79/hr and a B200 at $2.79/hr. Second, the scaling model. Baseten's default is min_replica=0: idle costs nothing, but the next request hits a cold start that the docs warn "can take minutes for large models," and billing runs per minute during wake-up even before the replica answers. The fix Baseten recommends is min_replica >= 2, which means paying for two warm GPUs around the clock.
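To put the two-warm-replica recommendation in monthly terms, here is a minimal sketch of the arithmetic. The hourly rates are the ones quoted above; the 30-day month and the function name `monthlyWarmCost` are assumptions made for illustration, and the DeepInfra line simply applies the same replica count for comparison.

```typescript
// Assumption: a 30-day month. Adjust if you bill on a different calendar.
const HOURS_PER_MONTH = 24 * 30;

// Cost of keeping `replicas` GPUs warm around the clock at `ratePerHour` dollars.
function monthlyWarmCost(
  ratePerHour: number,
  replicas: number,
  hours: number = HOURS_PER_MONTH
): number {
  return ratePerHour * replicas * hours;
}

// Two warm H100s at the published hourly rates.
const basetenH100 = monthlyWarmCost(6.5, 2);
const deepinfraH100 = monthlyWarmCost(1.79, 2);

console.log(basetenH100.toFixed(2));   // "9360.00"
console.log(deepinfraH100.toFixed(2)); // "2577.60"
```

Under these assumptions the warm floor alone is about $9,360 a month on Baseten before any traffic arrives, which is the number to weigh against the cold-start latency it removes.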

Quick Comparison: 8 Baseten Alternatives

Baseten Alternatives at a Glance (June 2026)

Provider	Cheapest H100 $/hr	Model API	Free tier	Best for
Baseten	$6.50 dedicated	Per-token + GPU	Free credits	Dedicated, tuned deployments
DeepInfra	$1.79 dedicated	Per-token + GPU	No free tier	Cheapest GPUs + tokens
Modal	~$3.95 per-second	GPU per-second	$30/mo credits	~1s cold starts, serverless code
Replicate	$5.49 per-second	Per-run + per-token	Pay-as-you-go	Image/video, scale to zero
Together AI	$5.49 cluster / $6.49 endpoint	Per-token + GPU	Pay-as-you-go	Fine-tuning + clusters
Fireworks AI	$7.00 on-demand	Per-token + GPU	$1 credit	Serverless + fine-tuning
Groq	No GPU rental	Per-token	Free plan	Fastest tok/s
OpenRouter	No GPU rental	Pass-through	50 req/day free	No markup, 400+ models

Groq and OpenRouter do not rent GPUs by the hour; they sell tokens. Modal sells GPU compute by the second rather than a per-token model API. DeepInfra, Replicate, Together, and Fireworks offer both serverless per-token (or per-run) pricing and dedicated or per-second GPU compute, the same two patterns Baseten sells.

Serverless Token Prices Head to Head

For serverless usage, the model you run sets the price. Here are the four most-requested models priced across providers, in dollars per million input/output tokens (cached input in parentheses where published).

DeepSeek V4 Pro, serverless $/M input / output (June 2026)

Provider	Input	Output
DeepInfra	$1.30 ($0.10 cached)	$2.60
Novita	$1.60	$3.20
Baseten	$1.74 ($0.145 cached)	$3.48
Fireworks AI	$1.74 ($0.145 cached)	$3.48
Together AI	$2.10 ($0.20 cached)	$4.40

Kimi K2.6, serverless $/M input / output (June 2026)

Provider	Input	Output
DeepInfra	$0.75 ($0.15 cached)	$3.50
Novita	$0.80	$3.40
Baseten	$0.95 ($0.16 cached)	$4.00
Fireworks AI	$0.95 ($0.16 cached)	$4.00
Together AI	$1.20 ($0.20 cached)	$4.50

GLM 5.1, serverless $/M input / output (June 2026)

Provider	Input	Output
DeepInfra	$1.05 ($0.205 cached)	$3.50
Baseten	$1.30 ($0.26 cached)	$4.30
Novita	$1.38	$4.40
Fireworks AI	$1.40 ($0.26 cached)	$4.40
Together AI	$1.40 ($0.26 cached)	$4.40

GPT-OSS 120B, serverless $/M input / output (June 2026)

Provider	Input	Output	Speed
Baseten	$0.10	$0.50	Not published
Fireworks AI	$0.15 ($0.015 cached)	$0.60	Not published
Together AI	$0.15	$0.60	Not published
Groq	$0.15 ($0.075 cached)	$0.60	500 tok/s

Reading the token tables

Baseten and Fireworks price DeepSeek V4 Pro and Kimi K2.6 identically. DeepInfra undercuts both on all three models where it is listed (DeepSeek V4 Pro, Kimi K2.6, and GLM 5.1). On GPT-OSS 120B, Baseten is the cheapest on both input ($0.10/M) and output ($0.50/M), but Groq is the only provider that publishes throughput (500 tok/s). Cached input cuts cost sharply: on GPT-OSS 120B, Groq halves the input price ($0.075 vs $0.15) and Fireworks cuts it by 90% ($0.015 vs $0.15).
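The per-million prices only become a bill once you apply them to a workload. The sketch below prices one hypothetical workload against the DeepSeek V4 Pro rows above. The `TokenPrice` and `Usage` shapes, the `requestCost` helper, and the 20% cache-hit workload are all assumptions for illustration; the sketch also assumes that a provider with no published cached rate bills cached tokens at the full input price.

```typescript
// Prices in dollars per million tokens, as listed in the tables above.
interface TokenPrice {
  input: number;
  cachedInput?: number; // omitted when no cached rate is published
  output: number;
}

interface Usage {
  inputTokens: number;  // total prompt tokens, cached and uncached
  cachedTokens: number; // portion of inputTokens served from cache
  outputTokens: number;
}

function requestCost(price: TokenPrice, usage: Usage): number {
  // Assumption: without a published cached rate, cached tokens cost full price.
  const cachedRate = price.cachedInput ?? price.input;
  const uncachedTokens = usage.inputTokens - usage.cachedTokens;
  const dollarsTimesMillion =
    uncachedTokens * price.input +
    usage.cachedTokens * cachedRate +
    usage.outputTokens * price.output;
  return dollarsTimesMillion / 1_000_000;
}

// DeepSeek V4 Pro rows from the table above.
const deepinfra: TokenPrice = { input: 1.3, cachedInput: 0.1, output: 2.6 };
const baseten: TokenPrice = { input: 1.74, cachedInput: 0.145, output: 3.48 };

// Hypothetical workload: 1M prompt tokens, 20% cache hits, 100k output tokens.
const usage: Usage = {
  inputTokens: 1_000_000,
  cachedTokens: 200_000,
  outputTokens: 100_000,
};

console.log(requestCost(deepinfra, usage).toFixed(3)); // "1.320"
console.log(requestCost(baseten, usage).toFixed(3));   // "1.769"
```

Swapping in your own cache-hit rate and input/output ratio matters more than the headline input price, because output tokens cost two to four times as much in every table here.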

Dedicated GPU Rates Per Hour

If you want a reserved GPU instead of serverless tokens, the per-hour spread between providers is large. These are published on-demand rates for the same hardware.

H100 80GB, published on-demand $/hr (June 2026)

Provider	$/hr	Notes
DeepInfra	$1.79	Dedicated
Modal	~$3.95	$0.001097/s, per-second
Replicate	$5.49	$0.001525/s
Together AI	$6.49 endpoint / $5.49 cluster	Dedicated endpoint vs cluster
Baseten	$6.50	$0.10833/min dedicated
Fireworks AI	$7.00	On-demand

B200 180GB, published on-demand $/hr (June 2026)

Provider	$/hr	Notes
DeepInfra	$2.79	Dedicated
Modal	~$6.25	$0.001736/s, per-second
Together AI	$9.95 cluster / $11.95 endpoint	Cluster vs dedicated endpoint
Baseten	$9.98	$0.16633/min dedicated
Fireworks AI	$10.00	On-demand

Replicate publishes no B200 rate. DeepInfra is the cheapest published rate on both H100 and B200, roughly a quarter of Baseten's H100 price. Modal bills per second with no minimum, so a short burst pays only for the seconds it runs.
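The hourly figures in these tables are derived from per-second or per-minute list prices, so it can help to reproduce the conversion and see what a short burst costs. This is a small sketch; the helper names and the 90-second burst are illustrative assumptions, and the burst calculation assumes pure per-second billing with no minimum, as described for Modal above.

```typescript
// Convert a per-second or per-minute list price to dollars per hour.
const perSecondToHourly = (ratePerSecond: number): number => ratePerSecond * 3600;
const perMinuteToHourly = (ratePerMinute: number): number => ratePerMinute * 60;

// Cost of a burst billed per second with no minimum charge.
const burstCost = (ratePerSecond: number, seconds: number): number =>
  ratePerSecond * seconds;

const modalH100 = perSecondToHourly(0.001097);    // about 3.95
const replicateH100 = perSecondToHourly(0.001525); // about 5.49
const basetenH100 = perMinuteToHourly(0.10833);    // about 6.50

// Hypothetical 90-second job on a Modal H100: roughly ten cents.
const shortBurst = burstCost(0.001097, 90);

console.log(modalH100.toFixed(2), replicateH100.toFixed(2), basetenH100.toFixed(2));
console.log(shortBurst.toFixed(4));
```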

Cold Starts and Scale-to-Zero

Scale-to-zero is the feature that decides whether idle deployments cost money, and the cold start is what you pay for it in latency. The behaviors differ sharply.

Scale-to-zero and cold-start behavior

Provider	Scale to zero	Cold start	Billing during wake-up
Baseten	Default (min_replica=0)	"Can take minutes for large models"	Per minute, even before serving
Modal	Yes, idle scales to zero	~1 second (custom container stack)	Memory snapshotting cuts the penalty
Replicate	On by default	"Can take several minutes" for large models	Only running prediction time is billed

Baseten's own docs recommend min_replica >= 2 for production to eliminate cold starts, which means paying for two warm GPUs continuously. Modal keeps cold starts near one second through its container stack and memory snapshotting (it "captures the state of a container's memory at user-controlled points" to reuse across boots). Replicate's cold boots are slow for large models, but it only charges for prediction time, so a cold boot adds latency, not cost. Replicate's "fast booting fine-tunes" bill only active processing time with no idle charge.
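The trade-off between scale-to-zero and warm replicas can be sketched as a daily cost comparison. Everything about the traffic profile below is hypothetical: the 120 active minutes, the 10 wake-ups, and the 3 billed minutes per wake-up are assumptions chosen only to illustrate "can take minutes." The model also ignores any scale-down delay during which an idle replica is still billed.

```typescript
interface DayProfile {
  ratePerMinute: number;    // dollars per replica-minute
  activeMinutes: number;    // minutes spent serving traffic
  coldStarts: number;       // wake-ups from zero per day
  coldStartMinutes: number; // billed minutes per wake-up (assumption)
}

// Scale-to-zero: pay for serving time plus every billed wake-up minute.
function scaleToZeroDailyCost(p: DayProfile): number {
  return (p.activeMinutes + p.coldStarts * p.coldStartMinutes) * p.ratePerMinute;
}

// Always warm: pay for every minute of the day on every replica.
function alwaysWarmDailyCost(ratePerMinute: number, replicas: number): number {
  return ratePerMinute * replicas * 24 * 60;
}

// Baseten H100 at $0.10833/min with a hypothetical bursty day.
const profile: DayProfile = {
  ratePerMinute: 0.10833,
  activeMinutes: 120,
  coldStarts: 10,
  coldStartMinutes: 3,
};

console.log(scaleToZeroDailyCost(profile).toFixed(2)); // "16.25"
console.log(alwaysWarmDailyCost(0.10833, 2).toFixed(2)); // "311.99"
```

In this made-up profile, scale-to-zero is far cheaper, but each of the ten wake-ups is a multi-minute wait for whoever sent the request. The warm option buys that latency back at roughly twenty times the daily cost.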

Rate Limits and Free Tiers

Rate limits decide whether a bursty agent workload hits 429s. The ceilings and free tiers vary widely.

Rate limits and free credits (June 2026)

Provider	Rate limit	Free tier
Fireworks AI	10 RPM with no card; 6,000 RPM ceiling with a card	$1 credit
DeepInfra	200 concurrent requests per account	No free tier
Together AI	Dynamic per-model, scales with traffic; no fixed published limit	Pay-as-you-go
Groq	Free: 30 RPM, model-specific RPD/TPM caps	Free plan; Developer unlocks higher limits
Modal	100 containers + 10 GPU concurrency (Starter)	$30/mo credits (Starter)
OpenRouter	50 req/day free; 1,000 req/day after $10 credits; :free models 20 RPM	25+ free models

Fireworks gates monthly budget behind spending tiers: $50/mo with a valid card (Tier 1), up to $50,000/mo at Tier 4. Together recommends a dedicated endpoint when you need a known fixed throughput, since serverless limits float with sustained traffic. Modal's Starter tier is free with $30/month in credits, 100 containers, and 10 GPU concurrency.
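A quick way to judge whether an RPM ceiling will bite is to work out how long a burst takes to drain through it. The sketch below assumes the limit is enforced as an even per-minute rate, which is a simplification; providers may use sliding windows or token buckets, and the 300-request burst is a hypothetical agent fan-out.

```typescript
// Minimum spacing between requests that keeps a client under `rpm`.
function minIntervalMs(rpm: number): number {
  return 60_000 / rpm;
}

// Seconds needed to push `burst` requests through a fixed `rpm` ceiling.
function secondsToDrain(burst: number, rpm: number): number {
  return (burst * 60) / rpm;
}

const burst = 300; // hypothetical fan-out from one agent run

console.log(minIntervalMs(10));           // 6000 (Fireworks, no card)
console.log(secondsToDrain(burst, 10));   // 1800 (30 minutes)
console.log(secondsToDrain(burst, 30));   // 600  (Groq free plan)
console.log(secondsToDrain(burst, 6000)); // 3    (Fireworks ceiling with a card)
```

If the drain time is longer than your agent's timeout, the client needs to queue and pace requests at `minIntervalMs` or the workload needs a higher tier.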

Batch Discounts and Compliance

For offline jobs, batch APIs cut cost roughly in half. For regulated workloads, compliance certifications gate the decision before price does.

Batch discounts (June 2026)

Provider	Batch discount	Window
Fireworks AI	50% of serverless pricing	Async
Groq	50% lower cost	24h to 7 days
Together AI	Up to 50% off selected models	24h, 50k requests/batch
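As a rough illustration of what a 50% batch discount means for an offline job, the sketch below prices a hypothetical job on Fireworks' GPT-OSS 120B rates from the token tables ($0.15 input, $0.60 output per million). The job size and helper names are assumptions, and the flat 50% applies only where the provider offers it for that model.

```typescript
// Serverless cost in dollars for token volumes given in millions.
function serverlessCost(
  inputMillions: number,
  outputMillions: number,
  inputPrice: number,
  outputPrice: number
): number {
  return inputMillions * inputPrice + outputMillions * outputPrice;
}

// Apply a batch discount expressed as a fraction (0.5 means 50% off).
function batchCost(cost: number, discount: number = 0.5): number {
  return cost * (1 - discount);
}

// Hypothetical offline job: 100M input tokens, 20M output tokens.
const realtime = serverlessCost(100, 20, 0.15, 0.6);
const batched = batchCost(realtime);

console.log(realtime.toFixed(2)); // "27.00"
console.log(batched.toFixed(2));  // "13.50"
```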

Compliance (June 2026)

Provider	SOC 2 Type II	HIPAA	Default retention
Baseten	Yes	Yes	Standard
Fireworks AI	Yes	Yes	Zero retention (open models)
DeepInfra	Yes (+ ISO 27001)	Measures in place	Zero retention
Together AI	Yes	Not stated here	Standard
Groq	Yes	BAA with exclusions	Standard
Modal	Yes (from Starter)	Enterprise only	Standard
OpenRouter	Not stated	Not stated	Zero logging by default

Fireworks does not log or store prompt or generation data for open models without explicit opt-in, with TLS 1.2+ in transit and AES-256 at rest. DeepInfra deletes prompts and completions after a short retention period and logs only metadata, except for Google models where Google logs for abuse detection. OpenRouter does no logging by default "even if an error occurs," and opting into logging earns a 1% usage discount.

Morph for Code Generation

If the workload is a coding agent rather than general inference, throughput on the code path is what decides whether the agent feels instant. Morph runs morph-v3-fast at about 10,500 tok/s using speculative decoding tuned to the token distribution of code, plus custom low-level inference kernels written for code generation. That combination makes it the fastest and highest-quality option for coding agents specifically. The endpoint is always warm, so there is no cold start to design around, and billing is per token, so there is no idle-replica bill.

For DeepSeek, output fidelity is the deciding factor. Most serverless providers quantize activations to fp8 to cut cost, which degrades output quality. Morph serves DeepSeek with full 16-bit (bf16) activations and does not quantize them, so output matches the reference weights. That makes Morph the best place to run DeepSeek when fidelity matters: morph-dsv4flash (DeepSeek V4 Flash) is $0.139 per 1M input tokens and $0.278 per 1M output tokens. See the full Open Source Models lineup and pricing.

It is OpenAI-compatible at https://api.morphllm.com/v1, so switching is a one-string change. The lineup also includes morph-minimax27-230b (agentic, ~140 tok/s) and morph-qwen36-27b (dense, low latency). For code search inside an agent, WarpGrep is free up to 100k requests, then $1 per 1M requests on Pro.

Point an OpenAI client at Morph

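// Requires the official `openai` npm package; run as an ES module (top-level await).
import OpenAI from "openai";

// Same OpenAI client, pointed at Morph's OpenAI-compatible endpoint.
const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY, // read the key from the environment
});

const res = await client.chat.completions.create({
  model: "morph-glm52-744b",
  messages: [{ role: "user", content: "Patch this failing test..." }],
});

// Print the assistant's reply.
console.log(res.choices[0].message.content);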

No cold starts: always-warm endpoint, so there is no min_replica floor to pay for.

Per-token billing: cost tracks tokens generated, not provisioned GPU hours.

OpenAI-compatible: swap the base URL and model string; no migration project.

Which Baseten Alternative to Pick

Match your priority to a provider

Your priority	Pick	Why
Cheapest GPUs and tokens	DeepInfra	H100 $1.79/hr, undercuts Baseten on every serverless model where it is listed
Fastest cold starts	Modal	~1s container boot, per-second billing, $30/mo free credits
Highest token throughput	Groq	500-1,000 tok/s published on GPT-OSS and Llama
No markup, widest model menu	OpenRouter	Pass-through pricing, 400+ models, 5.5% credit fee
Fine-tuning + clusters	Together AI	LoRA SFT from $0.48/M tokens, reserved clusters from $3.99/hr
Serverless + fine-tuning suite	Fireworks AI	Zero-retention open models, 6,000 RPM ceiling, batch at 50%
Image/video by output	Replicate	$0.04/output image, $0.09/sec video, scale to zero
Code-generation throughput	Morph	~10,500 tok/s on morph-v3-fast, always warm, per-token

Keep Baseten when you need a dedicated, tuned deployment of a custom model on isolated hardware with HIPAA, and your load is steady enough to keep replicas busy. Switch when the cold-start tax, the replica-hour bill, or the H100 rate stops matching your traffic shape.
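One way to make "stops matching your traffic shape" concrete is a break-even estimate: the sustained output rate at which a dedicated GPU costs the same as paying per token. This is a deliberately rough sketch. It ignores input tokens, and it assumes a single GPU can host the model and actually sustain that throughput, neither of which the figures above establish. The helper name is illustrative.

```typescript
// Output tokens per hour at which a dedicated GPU and serverless output
// pricing cost the same. Ignores input tokens (assumption).
function breakEvenOutputTokensPerHour(
  gpuPerHour: number,
  outputPricePerMillion: number
): number {
  return (gpuPerHour / outputPricePerMillion) * 1_000_000;
}

// Baseten H100 ($6.50/hr) vs Baseten GPT-OSS 120B output ($0.50/M).
const basetenPerHour = breakEvenOutputTokensPerHour(6.5, 0.5);
console.log(basetenPerHour);                     // 13000000
console.log((basetenPerHour / 3600).toFixed(0)); // "3611" tokens per second

// DeepInfra H100 ($1.79/hr) against the same $0.50/M output price.
const deepinfraPerHour = breakEvenOutputTokensPerHour(1.79, 0.5);
console.log((deepinfraPerHour / 3600).toFixed(0)); // "994" tokens per second
```

Below the break-even rate, per-token billing is cheaper under these assumptions; above it, the dedicated GPU wins, provided the hardware can really deliver that throughput for your model.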

Frequently Asked Questions

What is the cheapest Baseten alternative for dedicated GPUs?

DeepInfra: H100 80GB at $1.79/hr versus Baseten's $6.50/hr, plus H200 at $2.19/hr, B200 at $2.79/hr, and A100 80GB at $0.89/hr. Modal is next at ~$3.95/hr for an H100; Replicate $5.49/hr, Together $6.49/hr endpoint, Fireworks $7.00/hr.

How much does Baseten cost compared to Fireworks and Together?

On DeepSeek V4 Pro serverless ($/M in/out): DeepInfra $1.30/$2.60, Novita $1.60/$3.20, Baseten $1.74/$3.48, Fireworks $1.74/$3.48, Together $2.10/$4.40. Baseten and Fireworks price it identically; DeepInfra is cheapest, Together most expensive.

Does Baseten have cold starts?

Yes. The default min_replica=0 incurs no idle charge but triggers a cold start that "can take minutes for large models," billed per minute during wake-up. Baseten recommends min_replica >= 2 for production to eliminate it. Modal boots in ~1 second; Replicate cold boots are slow but bill only prediction time.

Which Baseten alternative is fastest?

Groq publishes the highest tok/s: GPT-OSS 120B at 500 tok/s, GPT-OSS 20B at 1,000 tok/s, Llama 3.1 8B Instant at 840 tok/s, Qwen3 32B at 662 tok/s. Cerebras claims inference up to 15x faster than on NVIDIA GPUs. For code generation, Morph runs morph-v3-fast at ~10,500 tok/s.

Is there a Baseten alternative with no markup on inference?

OpenRouter charges pass-through pricing with no inference markup; its fee is 5.5% on credit purchases. The first 1M BYOK requests per month are free, then a 5% fee on the model's normal cost.

Which Baseten alternatives are SOC 2 and HIPAA compliant?

SOC 2 Type II: Baseten, Fireworks, Groq, Together, Modal (from Starter), DeepInfra (+ISO 27001). HIPAA: Baseten, Fireworks, Modal (Enterprise), Groq (BAA with exclusions). Zero retention by default: Fireworks (open models), DeepInfra, OpenRouter.

Private deployments

The fastest endpoints are private deployments

Morph's top speeds come from dedicated deployments, not shared public endpoints: speculators trained on your traffic, caching tuned to your workload, and volume discounts over public per-token rates. Over 100 billion tokens per day run this way.

Code-Generation Throughput Without Replica-Hours

Morph is an always-warm, per-token endpoint that runs morph-v3-fast at ~10,500 tok/s with no cold start and no RPM wall. OpenAI-compatible, so switching from Baseten is a one-string change.
