Baseten dedicated H100s cost $6.50/hr. The same H100 80GB runs $1.79/hr on DeepInfra and about $3.95/hr on Modal. Baseten's scale-to-zero default bills per minute through a cold start that "can take minutes for large models," and its docs recommend two warm replicas for production, which cancels the savings. This page compares eight Baseten alternatives on the numbers that decide cost, all dated 2026-06-09 with sources.
The Baseten Cost Problem
Baseten is real infrastructure: dedicated deployments on isolated GPUs, your own autoscaling policy, and SOC 2 Type II plus HIPAA compliance. For a fixed, steady workload on a model you want to tune yourself, that control is worth paying for.
Two things push teams to look elsewhere. First, the GPU rate. Baseten bills dedicated hardware by the minute: H100 80GB at $0.10833/min ($6.50/hr), B200 180GB at $0.16633/min ($9.98/hr), A100 80GB at $0.06667/min ($4.00/hr). DeepInfra serves the same H100 at $1.79/hr and a B200 at $2.79/hr. Second, the scaling model. Baseten's default is min_replica=0: idle costs nothing, but the next request hits a cold start that the docs warn "can take minutes for large models," and billing runs per minute during wake-up even before the replica answers. The fix Baseten recommends is min_replica >= 2, which means paying for two warm GPUs around the clock.
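To make the "two warm GPUs around the clock" point concrete, here is a minimal arithmetic sketch using the per-minute rates quoted above. The function name `warmReplicaCost` and the 30-day month are assumptions for illustration, not part of any Baseten API.

```typescript
// Per-minute dedicated rates quoted above (USD).
const BASETEN_PER_MIN = { h100: 0.10833, b200: 0.16633, a100: 0.06667 } as const;

type Gpu = keyof typeof BASETEN_PER_MIN;

// Cost of keeping `replicas` GPUs warm for `hours` hours.
function warmReplicaCost(gpu: Gpu, replicas: number, hours: number): number {
  return BASETEN_PER_MIN[gpu] * 60 * replicas * hours;
}

// Assumption: a 30-day month of continuous uptime.
const HOURS_PER_MONTH = 24 * 30;

// Two warm H100 replicas, the production setting described above:
// roughly $9,360 per month before serving a single request.
const monthlyTwoH100 = warmReplicaCost("h100", 2, HOURS_PER_MONTH);

// The same two GPUs at DeepInfra's quoted $1.79/hr: roughly $2,578 per month.
const monthlyTwoH100DeepInfra = 1.79 * 2 * HOURS_PER_MONTH;
```

The gap comes entirely from the hourly rate; replica count and hours are the same in both lines.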
Quick Comparison: 8 Baseten Alternatives
| Provider | Cheapest H100 $/hr | Model API | Free tier | Best for |
|---|---|---|---|---|
| Baseten | $6.50 dedicated | Per-token + GPU | Free credits | Dedicated, tuned deployments |
| DeepInfra | $1.79 dedicated | Per-token + GPU | No free tier | Cheapest GPUs + tokens |
| Modal | ~$3.95 per-second | GPU per-second | $30/mo credits | ~1s cold starts, serverless code |
| Replicate | $5.49 per-second | Per-run + per-token | Pay-as-you-go | Image/video, scale to zero |
| Together AI | $5.49 cluster / $6.49 endpoint | Per-token + GPU | Pay-as-you-go | Fine-tuning + clusters |
| Fireworks AI | $7.00 on-demand | Per-token + GPU | $1 credit | Serverless + fine-tuning |
| Groq | No GPU rental | Per-token | Free plan | Fastest tok/s |
| OpenRouter | No GPU rental | Pass-through | 50 req/day free | No markup, 400+ models |
Groq and OpenRouter do not rent GPUs by the hour; they sell tokens. DeepInfra, Modal, Replicate, Together, and Fireworks all offer both serverless tokens and dedicated/per-second GPU compute, the same two patterns Baseten sells.
Serverless Token Prices Head to Head
For serverless usage, the model you run sets the price. Here are the four most-requested models priced across providers, in dollars per million input/output tokens (cached input in parentheses where published).

DeepSeek V4 Pro

| Provider | Input | Output |
|---|---|---|
| DeepInfra | $1.30 ($0.10 cached) | $2.60 |
| Novita | $1.60 | $3.20 |
| Baseten | $1.74 ($0.145 cached) | $3.48 |
| Fireworks AI | $1.74 ($0.145 cached) | $3.48 |
| Together AI | $2.10 ($0.20 cached) | $4.40 |

Kimi K2.6

| Provider | Input | Output |
|---|---|---|
| DeepInfra | $0.75 ($0.15 cached) | $3.50 |
| Novita | $0.80 | $3.40 |
| Baseten | $0.95 ($0.16 cached) | $4.00 |
| Fireworks AI | $0.95 ($0.16 cached) | $4.00 |
| Together AI | $1.20 ($0.20 cached) | $4.50 |
| Provider | Input | Output |
|---|---|---|
| DeepInfra | $1.05 ($0.205 cached) | $3.50 |
| Baseten | $1.30 ($0.26 cached) | $4.30 |
| Novita | $1.38 | $4.40 |
| Fireworks AI | $1.40 ($0.26 cached) | $4.40 |
| Together AI | $1.40 ($0.26 cached) | $4.40 |

GPT-OSS 120B

| Provider | Input | Output | Speed |
|---|---|---|---|
| Baseten | $0.10 | $0.50 | Not published |
| Fireworks AI | $0.15 ($0.015 cached) | $0.60 | Not published |
| Together AI | $0.15 | $0.60 | Not published |
| Groq | $0.15 ($0.075 cached) | $0.60 | 500 tok/s |
Reading the token tables
Baseten and Fireworks price DeepSeek V4 Pro and Kimi K2.6 identically. DeepInfra undercuts both on every model where it is listed. On GPT-OSS 120B, Baseten has the cheapest input at $0.10/M, but Groq is the only provider that publishes throughput (500 tok/s). Cached input cuts cost sharply: on GPT-OSS 120B, Groq halves the input price ($0.15 to $0.075) and Fireworks cuts it by 90% ($0.15 to $0.015).
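To turn the per-million rates into a bill, the sketch below prices one workload with a cache hit rate applied to input tokens. The workload size and the 50% hit rate are hypothetical, and `tokenCost` is an illustrative helper, not a provider API; the rates are the DeepSeek V4 Pro figures quoted in this article.

```typescript
interface TokenPrice {
  input: number;        // USD per 1M uncached input tokens
  output: number;       // USD per 1M output tokens
  cachedInput?: number; // USD per 1M cached input tokens, if published
}

// DeepSeek V4 Pro serverless rates quoted in this article.
const deepInfra: TokenPrice = { input: 1.3, output: 2.6, cachedInput: 0.1 };
const baseten: TokenPrice = { input: 1.74, output: 3.48, cachedInput: 0.145 };

// cacheHitRate is the fraction of input tokens served from cache (0..1).
function tokenCost(
  price: TokenPrice,
  inputTokens: number,
  outputTokens: number,
  cacheHitRate = 0,
): number {
  // If no cached rate is published, cached tokens bill at the full input rate.
  const cachedRate = price.cachedInput ?? price.input;
  const cached = inputTokens * cacheHitRate;
  const uncached = inputTokens - cached;
  return (uncached * price.input + cached * cachedRate + outputTokens * price.output) / 1e6;
}

// Hypothetical month: 100M input tokens, 20M output tokens, half the input cached.
const deepInfraBill = tokenCost(deepInfra, 100e6, 20e6, 0.5); // about $122
const basetenBill = tokenCost(baseten, 100e6, 20e6, 0.5);     // about $163.85
```

Raising the cache hit rate narrows the absolute gap, since cached input is cheap on both providers and output tokens then dominate the bill.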
Dedicated GPU Rates Per Hour
If you want a reserved GPU instead of serverless tokens, the per-hour spread between providers is large. These are published on-demand rates for the same hardware.

H100 80GB

| Provider | $/hr | Notes |
|---|---|---|
| DeepInfra | $1.79 | Dedicated |
| Modal | ~$3.95 | $0.001097/s, per-second |
| Replicate | $5.49 | $0.001525/s |
| Together AI | $6.49 endpoint / $5.49 cluster | Dedicated endpoint vs cluster |
| Baseten | $6.50 | $0.10833/min dedicated |
| Fireworks AI | $7.00 | On-demand |

B200 180GB

| Provider | $/hr | Notes |
|---|---|---|
| DeepInfra | $2.79 | Dedicated |
| Modal | ~$6.25 | $0.001736/s, per-second |
| Together AI | $9.95 cluster / $11.95 endpoint | Cluster vs dedicated endpoint |
| Baseten | $9.98 | $0.16633/min dedicated |
| Fireworks AI | $10.00 | On-demand |
Replicate publishes no B200 rate. DeepInfra has the cheapest published rate on both H100 and B200; at $1.79 versus $6.50, its H100 costs a little over a quarter of Baseten's. Modal bills per second with no minimum, so a short burst pays only for the seconds it runs.
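Billing granularity matters most for short bursts. The sketch below compares a per-second meter with a per-minute meter using the H100 rates quoted above. It assumes that per-minute billing rounds a partial minute up to a whole one, which this article does not confirm; both function names are illustrative.

```typescript
const MODAL_H100_PER_SEC = 0.001097;  // quoted per-second rate (USD)
const BASETEN_H100_PER_MIN = 0.10833; // quoted per-minute rate (USD)

// Per-second metering: pay for exactly the seconds used.
function perSecondCost(seconds: number, ratePerSec: number): number {
  return seconds * ratePerSec;
}

// Assumption: a partial minute is billed as a full minute.
function perMinuteCost(seconds: number, ratePerMin: number): number {
  return Math.ceil(seconds / 60) * ratePerMin;
}

// A hypothetical 90-second burst on one H100:
const modalBurst = perSecondCost(90, MODAL_H100_PER_SEC);     // about $0.099
const basetenBurst = perMinuteCost(90, BASETEN_H100_PER_MIN); // about $0.217 (2 billed minutes)
```

For long-running jobs the rounding effect disappears and only the hourly rate difference remains.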
Cold Starts and Scale-to-Zero
Scale-to-zero is the feature that decides whether idle deployments cost money, and the cold start is what you pay for it in latency. The behaviors differ sharply.
| Provider | Scale to zero | Cold start | Billing during wake-up |
|---|---|---|---|
| Baseten | Default (min_replica=0) | "Can take minutes for large models" | Per minute, even before serving |
| Modal | Yes, idle scales to zero | ~1 second (custom container stack) | Memory snapshotting cuts the penalty |
| Replicate | On by default | "Can take several minutes" for large models | Only running prediction time is billed |
Baseten's own docs recommend min_replica >= 2 for production to eliminate cold starts, which means paying for two warm GPUs continuously. Modal keeps cold starts near one second through its container stack and memory snapshotting (it "captures the state of a container's memory at user-controlled points" to reuse across boots). Replicate's cold boots are slow for large models, but it only charges for prediction time, so a cold boot adds latency, not cost. Replicate's "fast booting fine-tunes" bill only active processing time with no idle charge.
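Whether scale-to-zero or warm replicas is cheaper depends on traffic shape. The following is a rough sketch under stated assumptions: the `TrafficShape` fields and the 3-minute cold start are hypothetical inputs, and the model ignores any idle timeout before a replica scales down. Only the $0.10833/min H100 rate comes from this article.

```typescript
interface TrafficShape {
  activeMinutesPerDay: number; // minutes a replica spends serving requests
  wakeUpsPerDay: number;       // cold starts per day
  coldStartMinutes: number;    // billed minutes per wake-up (hypothetical)
}

const H100_PER_MIN = 0.10833; // Baseten H100 rate quoted above (USD)

// Scale-to-zero: pay for serving time plus the minutes spent waking up.
function scaleToZeroDailyCost(t: TrafficShape, ratePerMin: number): number {
  const billedMinutes = t.activeMinutesPerDay + t.wakeUpsPerDay * t.coldStartMinutes;
  return billedMinutes * ratePerMin;
}

// Warm replicas: pay for every minute of the day, busy or not.
function warmDailyCost(minReplicas: number, ratePerMin: number): number {
  return minReplicas * 24 * 60 * ratePerMin;
}

// Hypothetical bursty workload: 2 hours of serving, 20 wake-ups, 3 minutes each.
const bursty: TrafficShape = { activeMinutesPerDay: 120, wakeUpsPerDay: 20, coldStartMinutes: 3 };

const zeroCost = scaleToZeroDailyCost(bursty, H100_PER_MIN); // about $19.50 per day
const warmCost = warmDailyCost(2, H100_PER_MIN);             // about $311.99 per day
```

The cost gap is large in this example, but the 20 cold starts each add minutes of latency, which is the trade the section describes.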
Rate Limits and Free Tiers
Rate limits decide whether a bursty agent workload hits 429s. The ceilings and free tiers vary widely.
| Provider | Rate limit | Free tier |
|---|---|---|
| Fireworks AI | 10 RPM with no card; 6,000 RPM ceiling with a card | $1 credit |
| DeepInfra | 200 concurrent requests per account | No free tier |
| Together AI | Dynamic per-model, scales with traffic; no fixed published limit | Pay-as-you-go |
| Groq | Free: 30 RPM, model-specific RPD/TPM caps | Free plan; Developer unlocks higher limits |
| Modal | 100 containers + 10 GPU concurrency (Starter) | $30/mo credits (Starter) |
| OpenRouter | 50 req/day free; 1,000 req/day after $10 credits; :free models 20 RPM | 25+ free models |
Fireworks gates monthly budget behind spending tiers: $50/mo with a valid card (Tier 1), up to $50,000/mo at Tier 4. Together recommends a dedicated endpoint when you need a known fixed throughput, since serverless limits float with sustained traffic. Modal's Starter tier is free with $30/month in credits, 100 containers, and 10 GPU concurrency.
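A fixed RPM ceiling is usually easier to respect on the client than to recover from after a 429. Below is a minimal sliding-window sketch; `RpmLimiter` and its methods are hypothetical names, not part of any provider SDK, and the caller supplies timestamps so the logic stays deterministic.

```typescript
class RpmLimiter {
  private readonly maxPerMinute: number;
  private sent: number[] = []; // timestamps (ms) of requests in the last minute

  constructor(maxPerMinute: number) {
    this.maxPerMinute = maxPerMinute;
  }

  // Returns 0 if a request may be sent now, otherwise the milliseconds
  // to wait until the oldest request falls out of the 60-second window.
  msUntilAllowed(nowMs: number): number {
    this.sent = this.sent.filter((t) => nowMs - t < 60000);
    if (this.sent.length < this.maxPerMinute) return 0;
    const oldest = this.sent[0] ?? nowMs;
    return oldest + 60000 - nowMs;
  }

  // Call once for every request actually sent.
  record(nowMs: number): void {
    this.sent.push(nowMs);
  }
}

// Example with the 10 RPM no-card ceiling from the table:
// ten requests sent one second apart, starting at t = 0.
const limiter = new RpmLimiter(10);
for (let i = 0; i < 10; i++) limiter.record(i * 1000);

const waitAt10s = limiter.msUntilAllowed(10000); // 50000 ms until the first request expires
const waitAt60s = limiter.msUntilAllowed(60000); // 0, the window has a free slot again
```

Limits defined per day or by concurrency, like several in the table, need a different counter; this sketch covers only the per-minute case.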
Batch Discounts and Compliance
For offline jobs, batch APIs cut cost roughly in half. For regulated workloads, compliance certifications gate the decision before price does.
| Provider | Batch discount | Window |
|---|---|---|
| Fireworks AI | 50% of serverless pricing | Async |
| Groq | 50% lower cost | 24h to 7 days |
| Together AI | Up to 50% off selected models | 24h, 50k requests/batch |

Compliance and data retention

| Provider | SOC 2 Type II | HIPAA | Default retention |
|---|---|---|---|
| Baseten | Yes | Yes | Standard |
| Fireworks AI | Yes | Yes | Zero retention (open models) |
| DeepInfra | Yes (+ ISO 27001) | Measures in place | Zero retention |
| Together AI | Yes | Not stated here | Standard |
| Groq | Yes | BAA with exclusions | Standard |
| Modal | Yes (from Starter) | Enterprise only | Standard |
| OpenRouter | Not stated | Not stated | Zero logging by default |
Fireworks does not log or store prompt or generation data for open models without explicit opt-in, with TLS 1.2+ in transit and AES-256 at rest. DeepInfra deletes prompts and completions after a short retention period and logs only metadata, except for Google models where Google logs for abuse detection. OpenRouter does no logging by default "even if an error occurs," and opting into logging earns a 1% usage discount.
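Because compliance gates the decision before price, it can help to encode the table as data and filter it first. This is an illustrative sketch: the `"qualified"` status is an assumed label that lumps together "Measures in place", "BAA with exclusions", and "Enterprise only", and the boolean retention flag flattens caveats such as Fireworks' zero retention applying to open models only.

```typescript
type HipaaStatus = "yes" | "qualified" | "not-stated";

interface ProviderCompliance {
  name: string;
  soc2TypeII: boolean | null; // null = not stated in the table
  hipaa: HipaaStatus;
  zeroRetentionByDefault: boolean;
}

// Values transcribed from the compliance table in this article.
const PROVIDERS: ProviderCompliance[] = [
  { name: "Baseten", soc2TypeII: true, hipaa: "yes", zeroRetentionByDefault: false },
  { name: "Fireworks AI", soc2TypeII: true, hipaa: "yes", zeroRetentionByDefault: true },
  { name: "DeepInfra", soc2TypeII: true, hipaa: "qualified", zeroRetentionByDefault: true },
  { name: "Together AI", soc2TypeII: true, hipaa: "not-stated", zeroRetentionByDefault: false },
  { name: "Groq", soc2TypeII: true, hipaa: "qualified", zeroRetentionByDefault: false },
  { name: "Modal", soc2TypeII: true, hipaa: "qualified", zeroRetentionByDefault: false },
  { name: "OpenRouter", soc2TypeII: null, hipaa: "not-stated", zeroRetentionByDefault: true },
];

// Keep providers with a stated SOC 2 Type II, then apply optional requirements.
function shortlist(requireHipaa: boolean, requireZeroRetention: boolean): string[] {
  return PROVIDERS
    .filter((p) => p.soc2TypeII === true)
    .filter((p) => !requireHipaa || p.hipaa === "yes")
    .filter((p) => !requireZeroRetention || p.zeroRetentionByDefault)
    .map((p) => p.name);
}

const hipaaOnly = shortlist(true, false);        // ["Baseten", "Fireworks AI"]
const hipaaZeroRetention = shortlist(true, true); // ["Fireworks AI"]
```

A provider marked `"qualified"` may still fit after a conversation with its sales or compliance team; the filter only reflects what the table states outright.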
Morph for Code Generation
If the workload is a coding agent rather than general inference, throughput on the code path is what decides whether the agent feels instant. Morph runs morph-v3-fast at about 10,500 tok/s using ngram speculative decoding tuned to the token distribution of code, plus custom low-level inference kernels written for code generation. That combination makes it the fastest and highest-quality option for coding agents specifically. The endpoint is always warm, so there is no cold start to design around, and billing is per token, so there is no idle-replica bill.
For DeepSeek, output fidelity is the deciding factor. Most serverless providers quantize activations to fp8 to cut cost, which degrades output quality. Morph serves DeepSeek with full 16-bit (bf16) activations and does not quantize them, so output matches the reference weights. That makes Morph the best place to run DeepSeek when fidelity matters: morph-dsv4flash (DeepSeek V4 Flash) is $0.139 per 1M input tokens and $0.278 per 1M output tokens. See the full Open Source Models lineup and pricing.
It is OpenAI-compatible at https://api.morphllm.com/v1, so switching means changing the base URL and the model string. The lineup also includes morph-qwen35-397b (Qwen 3.5, 397B MoE, 262k context, ~120 tok/s), morph-minimax27-230b (agentic, ~140 tok/s), and morph-qwen36-27b (dense, low latency). For code search inside an agent, WarpGrep is free up to 100k requests, then $1 per 1M requests on Pro.
Point an OpenAI client at Morph
```typescript
import OpenAI from "openai";

// Same OpenAI SDK; only the base URL and API key point at Morph.
const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY,
});

const res = await client.chat.completions.create({
  model: "morph-qwen35-397b",
  messages: [{ role: "user", content: "Patch this failing test..." }],
});
```

No cold starts
Always-warm endpoint; no min_replica to keep paid up and warm.
Per-token billing
Cost tracks tokens generated, not provisioned GPU hours.
OpenAI-compatible
Swap the base URL and model string; no migration project.
Which Baseten Alternative to Pick
| Your priority | Pick | Why |
|---|---|---|
| Cheapest GPUs and tokens | DeepInfra | H100 $1.79/hr, undercuts on every serverless model shown |
| Fastest cold starts | Modal | ~1s container boot, per-second billing, $30/mo free credits |
| Highest token throughput | Groq | 500-1,000 tok/s published on GPT-OSS and Llama |
| No markup, widest model menu | OpenRouter | Pass-through pricing, 400+ models, 5.5% credit fee |
| Fine-tuning + clusters | Together AI | LoRA SFT from $0.48/M tokens, reserved clusters from $3.99/hr |
| Serverless + fine-tuning suite | Fireworks AI | Zero-retention open models, 6,000 RPM ceiling, batch at 50% |
| Image/video by output | Replicate | $0.04/output image, $0.09/sec video, scale to zero |
| Code-generation throughput | Morph | ~10,500 tok/s on morph-v3-fast, always warm, per-token |
Keep Baseten when you need a dedicated, tuned deployment of a custom model on isolated hardware with HIPAA, and your load is steady enough to keep replicas busy. Switch when the cold-start tax, the replica-hour bill, or the H100 rate stops matching your traffic shape.
Frequently Asked Questions
What is the cheapest Baseten alternative for dedicated GPUs?
DeepInfra: H100 80GB at $1.79/hr versus Baseten's $6.50/hr, plus H200 at $2.19/hr, B200 at $2.79/hr, and A100 80GB at $0.89/hr. Modal is next at ~$3.95/hr for an H100; Replicate $5.49/hr, Together $6.49/hr endpoint, Fireworks $7.00/hr.
How much does Baseten cost compared to Fireworks and Together?
On DeepSeek V4 Pro serverless ($/M in/out): DeepInfra $1.30/$2.60, Novita $1.60/$3.20, Baseten $1.74/$3.48, Fireworks $1.74/$3.48, Together $2.10/$4.40. Baseten and Fireworks price it identically; DeepInfra is cheapest, Together most expensive.
Does Baseten have cold starts?
Yes. The default min_replica=0 incurs no idle charge but triggers a cold start that "can take minutes for large models," billed per minute during wake-up. Baseten recommends min_replica >= 2 for production to eliminate it. Modal boots in ~1 second; Replicate cold boots are slow but bill only prediction time.
Which Baseten alternative is fastest?
Groq publishes the highest tok/s: GPT-OSS 120B at 500 tok/s, GPT-OSS 20B at 1,000 tok/s, Llama 3.1 8B Instant at 840 tok/s, Qwen3 32B at 662 tok/s. Cerebras claims up to 15x faster than NVIDIA GPUs. For code generation, Morph runs morph-v3-fast at ~10,500 tok/s.
Is there a Baseten alternative with no markup on inference?
OpenRouter charges pass-through pricing with no inference markup; its fee is 5.5% on credit purchases. The first 1M BYOK requests per month are free, then a 5% fee on the model's normal cost.
Which Baseten alternatives are SOC 2 and HIPAA compliant?
SOC 2 Type II: Baseten, Fireworks, Groq, Together, Modal (from Starter), DeepInfra (+ISO 27001). HIPAA: Baseten, Fireworks, Modal (Enterprise), Groq (BAA with exclusions). Zero retention by default: Fireworks (open models), DeepInfra, OpenRouter.
Code-Generation Throughput Without Replica-Hours
Morph is an always-warm, per-token endpoint that runs morph-v3-fast at ~10,500 tok/s with no cold start and no RPM wall. It is OpenAI-compatible, so switching from Baseten means swapping the base URL and the model string.
