Baseten Alternatives (2026): Token Prices, GPU Rates, and Cold Starts Compared

Baseten H100s run $6.50/hr, and cold starts are billed per minute. DeepInfra runs the same H100 at $1.79/hr; Modal boots containers in ~1s. Eight Baseten alternatives compared on serverless token prices, dedicated GPU rates, rate limits, and compliance, with exact numbers.

June 9, 2026 · 1 min read

Baseten dedicated H100s cost $6.50/hr. The same H100 80GB runs $1.79/hr on DeepInfra and about $3.95/hr on Modal. Baseten's scale-to-zero default bills per minute through a cold start that "can take minutes for large models," and its docs recommend two warm replicas for production, which cancels the savings. This page compares eight Baseten alternatives on the numbers that decide cost, all dated 2026-06-09 with sources.

$1.79/hr: DeepInfra H100 vs Baseten $6.50/hr
~1s: Modal container cold start
500 tok/s: Groq GPT-OSS 120B throughput
8: Alternatives compared on exact prices

The Baseten Cost Problem

Baseten is real infrastructure: dedicated deployments on isolated GPUs, your own autoscaling policy, and SOC 2 Type II plus HIPAA compliance. For a fixed, steady workload on a model you want to tune yourself, that control is worth paying for.

Two things push teams to look elsewhere. First, the GPU rate. Baseten bills dedicated hardware by the minute: H100 80GB at $0.10833/min ($6.50/hr), B200 180GB at $0.16633/min ($9.98/hr), A100 80GB at $0.06667/min ($4.00/hr). DeepInfra serves the same H100 at $1.79/hr and a B200 at $2.79/hr. Second, the scaling model. Baseten's default is min_replica=0: idle costs nothing, but the next request hits a cold start that the docs warn "can take minutes for large models," and billing runs per minute during wake-up even before the replica answers. The fix Baseten recommends is min_replica >= 2, which means paying for two warm GPUs around the clock.
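The arithmetic behind those figures can be sketched in a few lines. The rates below are the ones quoted above; the 30-day month and the function names are assumptions made for illustration, not anything Baseten or DeepInfra publishes.

```typescript
// Baseten per-minute rates quoted above (USD).
const BASETEN_PER_MINUTE = { H100: 0.10833, B200: 0.16633, A100: 0.06667 };

// DeepInfra hourly H100 rate quoted above (USD).
const DEEPINFRA_H100_HOURLY = 1.79;

/** Convert a per-minute rate to an hourly rate, rounded to cents. */
function perMinuteToHourly(perMinute: number): number {
  return Math.round(perMinute * 60 * 100) / 100;
}

/** Cost of keeping `replicas` GPUs running around the clock for `days` days (assumed 30). */
function monthlyCost(hourly: number, replicas: number, days = 30): number {
  return Math.round(hourly * 24 * days * replicas * 100) / 100;
}

const basetenH100Hourly = perMinuteToHourly(BASETEN_PER_MINUTE.H100);

// Two always-warm H100 replicas, as in the min_replica >= 2 recommendation.
const twoWarmOnBaseten = monthlyCost(basetenH100Hourly, 2);
const twoWarmOnDeepInfra = monthlyCost(DEEPINFRA_H100_HOURLY, 2);

console.log(basetenH100Hourly.toFixed(2));   // 6.50
console.log(twoWarmOnBaseten.toFixed(2));    // 9360.00
console.log(twoWarmOnDeepInfra.toFixed(2));  // 2577.60
```

Under these assumptions, two warm H100s cost about $9,360 per 30 days at Baseten's rate, against about $2,578 at DeepInfra's.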

Quick Comparison: 8 Baseten Alternatives

| Provider | Cheapest H100 $/hr | Model API | Free tier | Best for |
| --- | --- | --- | --- | --- |
| Baseten | $6.50 dedicated | Per-token + GPU | Free credits | Dedicated, tuned deployments |
| DeepInfra | $1.79 dedicated | Per-token + GPU | No free tier | Cheapest GPUs + tokens |
| Modal | ~$3.95 per-second | GPU per-second | $30/mo credits | ~1s cold starts, serverless code |
| Replicate | $5.49 per-second | Per-run + per-token | Pay-as-you-go | Image/video, scale to zero |
| Together AI | $5.49 cluster / $6.49 endpoint | Per-token + GPU | Pay-as-you-go | Fine-tuning + clusters |
| Fireworks AI | $7.00 on-demand | Per-token + GPU | $1 credit | Serverless + fine-tuning |
| Groq | No GPU rental | Per-token | Free plan | Fastest tok/s |
| OpenRouter | No GPU rental | Pass-through | 50 req/day free | No markup, 400+ models |

Groq and OpenRouter do not rent GPUs by the hour; they sell tokens. DeepInfra, Replicate, Together, and Fireworks offer both serverless per-token (or per-run) pricing and dedicated or per-second GPU compute, the same two patterns Baseten sells. Modal is listed here for per-second GPU compute only.

Serverless Token Prices Head to Head

For serverless usage, the model you run sets the price. Here are the four most-requested models priced across providers, in dollars per million input/output tokens (cached input in parentheses where published).

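DeepSeek V4 Pro

| Provider | Input ($/M) | Output ($/M) |
| --- | --- | --- |
| DeepInfra | $1.30 ($0.10 cached) | $2.60 |
| Novita | $1.60 | $3.20 |
| Baseten | $1.74 ($0.145 cached) | $3.48 |
| Fireworks AI | $1.74 ($0.145 cached) | $3.48 |
| Together AI | $2.10 ($0.20 cached) | $4.40 |

Kimi K2.6

| Provider | Input ($/M) | Output ($/M) |
| --- | --- | --- |
| DeepInfra | $0.75 ($0.15 cached) | $3.50 |
| Novita | $0.80 | $3.40 |
| Baseten | $0.95 ($0.16 cached) | $4.00 |
| Fireworks AI | $0.95 ($0.16 cached) | $4.00 |
| Together AI | $1.20 ($0.20 cached) | $4.50 |

Third model (name not given in the source)

| Provider | Input ($/M) | Output ($/M) |
| --- | --- | --- |
| DeepInfra | $1.05 ($0.205 cached) | $3.50 |
| Baseten | $1.30 ($0.26 cached) | $4.30 |
| Novita | $1.38 | $4.40 |
| Fireworks AI | $1.40 ($0.26 cached) | $4.40 |
| Together AI | $1.40 ($0.26 cached) | $4.40 |

GPT-OSS 120B

| Provider | Input ($/M) | Output ($/M) | Speed |
| --- | --- | --- | --- |
| Baseten | $0.10 | $0.50 | Not published |
| Fireworks AI | $0.15 ($0.015 cached) | $0.60 | Not published |
| Together AI | $0.15 | $0.60 | Not published |
| Groq | $0.15 ($0.075 cached) | $0.60 | 500 tok/s |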

Reading the token tables

Baseten and Fireworks price DeepSeek V4 Pro and Kimi K2.6 identically. DeepInfra undercuts both on input and output for every model where it is listed. On GPT-OSS 120B, Baseten is cheapest on both input ($0.10/M) and output ($0.50/M), but Groq is the only provider that publishes throughput (500 tok/s). Cached input cuts cost sharply: on GPT-OSS 120B, Groq halves the input price ($0.15 to $0.075) and Fireworks lists cached input at a tenth of the standard rate ($0.15 to $0.015).
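To turn per-million prices into a bill, a rough calculator like the one below can help. It is a sketch: the prices are the DeepSeek V4 Pro figures quoted in this article, while the workload size, the cache hit rate, and the `workloadCost` helper are hypothetical. Where no cached price is published, the sketch assumes cached tokens are billed at the normal input rate.

```typescript
interface TokenPrice {
  input: number;        // USD per 1M uncached input tokens
  output: number;       // USD per 1M output tokens
  cachedInput?: number; // USD per 1M cached input tokens, when published
}

// DeepSeek V4 Pro serverless prices quoted in this article.
const DEEPSEEK_V4_PRO: Record<string, TokenPrice> = {
  DeepInfra: { input: 1.3, output: 2.6, cachedInput: 0.1 },
  Novita: { input: 1.6, output: 3.2 },
  Baseten: { input: 1.74, output: 3.48, cachedInput: 0.145 },
  "Fireworks AI": { input: 1.74, output: 3.48, cachedInput: 0.145 },
  "Together AI": { input: 2.1, output: 4.4, cachedInput: 0.2 },
};

/**
 * Cost in USD for a workload. `cacheHitRate` is the fraction (0..1) of input
 * tokens served from the prompt cache.
 */
function workloadCost(
  price: TokenPrice,
  inputTokens: number,
  outputTokens: number,
  cacheHitRate = 0,
): number {
  // Assumption: with no published cached price, cached tokens cost the same as uncached.
  const cachedRate = price.cachedInput ?? price.input;
  const cachedTokens = inputTokens * cacheHitRate;
  const uncachedTokens = inputTokens - cachedTokens;
  const total =
    uncachedTokens * price.input +
    cachedTokens * cachedRate +
    outputTokens * price.output;
  return total / 1_000_000;
}

// Hypothetical month: 100M input tokens, 20M output tokens.
for (const [provider, price] of Object.entries(DEEPSEEK_V4_PRO)) {
  const noCache = workloadCost(price, 100_000_000, 20_000_000);
  const halfCached = workloadCost(price, 100_000_000, 20_000_000, 0.5);
  console.log(`${provider}: $${noCache.toFixed(2)} uncached, $${halfCached.toFixed(2)} at 50% cache hits`);
}
```

For that hypothetical workload, DeepInfra comes to $182.00 and Baseten to $243.60 without caching; a 50% cache hit rate brings them to $122.00 and $163.85.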

Dedicated GPU Rates Per Hour

If you want a reserved GPU instead of serverless tokens, the per-hour spread between providers is large. These are published on-demand rates for the same hardware.

H100 80GB

| Provider | $/hr | Notes |
| --- | --- | --- |
| DeepInfra | $1.79 | Dedicated |
| Modal | ~$3.95 | $0.001097/s, per-second |
| Replicate | $5.49 | $0.001525/s |
| Together AI | $6.49 endpoint / $5.49 cluster | Dedicated endpoint vs cluster |
| Baseten | $6.50 | $0.10833/min dedicated |
| Fireworks AI | $7.00 | On-demand |

B200

| Provider | $/hr | Notes |
| --- | --- | --- |
| DeepInfra | $2.79 | Dedicated |
| Modal | ~$6.25 | $0.001736/s, per-second |
| Together AI | $9.95 cluster / $11.95 endpoint | Cluster vs dedicated endpoint |
| Baseten | $9.98 | $0.16633/min dedicated |
| Fireworks AI | $10.00 | On-demand |

Replicate publishes no B200 rate. DeepInfra has the cheapest published rate on both H100 and B200, about 28% of Baseten's price in each case. Modal bills per second with no minimum, so a short burst pays only for the seconds it runs.
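The effect of billing granularity on a short burst can be sketched as follows. The two rates are the H100 figures quoted above. Rounding a partial minute up to a full minute is an assumption made here for illustration; the source only says Baseten bills by the minute, so check the provider's billing docs before relying on it.

```typescript
const MODAL_H100_PER_SECOND = 0.001097;  // USD, quoted above
const BASETEN_H100_PER_MINUTE = 0.10833; // USD, quoted above

/** Per-second billing with no minimum: pay for exactly the seconds used. */
function perSecondCost(seconds: number, ratePerSecond: number): number {
  return seconds * ratePerSecond;
}

/**
 * Per-minute billing.
 * ASSUMPTION: partial minutes are rounded up to the next whole minute.
 */
function perMinuteCost(seconds: number, ratePerMinute: number): number {
  return Math.ceil(seconds / 60) * ratePerMinute;
}

// Hypothetical 90-second burst on one H100.
const burstSeconds = 90;
const modalBurst = perSecondCost(burstSeconds, MODAL_H100_PER_SECOND);
const basetenBurst = perMinuteCost(burstSeconds, BASETEN_H100_PER_MINUTE);

console.log(modalBurst.toFixed(4));   // 0.0987
console.log(basetenBurst.toFixed(4)); // 0.2167
```

In this sketch the gap on a 90-second burst comes from both the lower rate and the finer billing unit; for hour-long jobs the rate difference dominates.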

Cold Starts and Scale-to-Zero

Scale-to-zero is the feature that decides whether idle deployments cost money, and the cold start is what you pay for it in latency. The behaviors differ sharply.

| Provider | Scale to zero | Cold start | Billing during wake-up |
| --- | --- | --- | --- |
| Baseten | Default (min_replica=0) | "Can take minutes for large models" | Per minute, even before serving |
| Modal | Yes, idle scales to zero | ~1 second (custom container stack) | Memory snapshotting cuts the penalty |
| Replicate | On by default | "Can take several minutes" for large models | Only running prediction time is billed |

Baseten's own docs recommend min_replica >= 2 for production to eliminate cold starts, which means paying for two warm GPUs continuously. Modal keeps cold starts near one second through its container stack and memory snapshotting (it "captures the state of a container's memory at user-controlled points" to reuse across boots). Replicate's cold boots are slow for large models, but it only charges for prediction time, so a cold boot adds latency, not cost. Replicate's "fast booting fine-tunes" bill only active processing time with no idle charge.
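Whether scale-to-zero or warm replicas is cheaper depends on how often a deployment wakes up. The sketch below compares the two on Baseten's quoted H100 per-minute rate. The traffic profile (busy minutes, wake-ups per day, billed minutes per cold start) is hypothetical, and the model only counts dollars: it ignores the latency each cold start adds for the user.

```typescript
const H100_PER_MINUTE = 0.10833; // Baseten H100 rate quoted in this article (USD)

interface DayProfile {
  busyMinutes: number;      // minutes per day a replica spends serving requests
  wakeUps: number;          // cold starts per day
  coldStartMinutes: number; // billed minutes per cold start (assumed value)
}

/** Daily cost with min_replica=0: serving time plus billed wake-up time. */
function scaleToZeroDailyCost(day: DayProfile, ratePerMinute = H100_PER_MINUTE): number {
  const billedMinutes = day.busyMinutes + day.wakeUps * day.coldStartMinutes;
  return billedMinutes * ratePerMinute;
}

/** Daily cost of keeping `replicas` GPUs warm for all 1,440 minutes. */
function warmDailyCost(replicas: number, ratePerMinute = H100_PER_MINUTE): number {
  return replicas * 24 * 60 * ratePerMinute;
}

// Hypothetical bursty day: 2 hours of serving, 10 wake-ups of 3 billed minutes each.
const bursty: DayProfile = { busyMinutes: 120, wakeUps: 10, coldStartMinutes: 3 };

const coldPath = scaleToZeroDailyCost(bursty); // roughly $16 per day
const warmPath = warmDailyCost(2);             // roughly $312 per day

console.log(coldPath < warmPath ? "scale-to-zero is cheaper" : "warm replicas are cheaper");
```

For this profile scale-to-zero is far cheaper, at the price of ten slow first requests a day; the comparison tips toward warm replicas only as busy time approaches the full day or cold-start latency becomes unacceptable.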

Rate Limits and Free Tiers

Rate limits decide whether a bursty agent workload hits 429s. The ceilings and free tiers vary widely.

| Provider | Rate limit | Free tier |
| --- | --- | --- |
| Fireworks AI | 10 RPM with no card; 6,000 RPM ceiling with a card | $1 credit |
| DeepInfra | 200 concurrent requests per account | No free tier |
| Together AI | Dynamic per-model, scales with traffic; no fixed published limit | Pay-as-you-go |
| Groq | Free: 30 RPM, model-specific RPD/TPM caps | Free plan; Developer unlocks higher limits |
| Modal | 100 containers + 10 GPU concurrency (Starter) | $30/mo credits (Starter) |
| OpenRouter | 50 req/day free; 1,000 req/day after $10 credits; :free models 20 RPM | 25+ free models |

Fireworks gates monthly budget behind spending tiers: $50/mo with a valid card (Tier 1), up to $50,000/mo at Tier 4. Together recommends a dedicated endpoint when you need a known fixed throughput, since serverless limits float with sustained traffic. Modal's Starter tier is free with $30/month in credits, 100 containers, and 10 GPU concurrency.
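One way to stay under a fixed RPM ceiling is to throttle on the client before the provider returns a 429. The sketch below is a sliding-window limiter; the `RpmLimiter` class is hypothetical and not part of any provider SDK. It takes the current time as an argument so the logic is easy to test, and the 10 RPM value is the no-card Fireworks limit listed above.

```typescript
/** Sliding-window limiter: at most `maxPerMinute` requests in any 60-second window. */
class RpmLimiter {
  private readonly maxPerMinute: number;
  private sent: number[] = []; // send timestamps (ms) still inside the window

  constructor(maxPerMinute: number) {
    this.maxPerMinute = maxPerMinute;
  }

  /**
   * Returns 0 and records the request if it may be sent at `nowMs`;
   * otherwise returns how many milliseconds to wait before retrying.
   */
  acquire(nowMs: number): number {
    const windowStart = nowMs - 60_000;
    // Drop timestamps that have aged out of the window.
    this.sent = this.sent.filter((t) => t > windowStart);

    const oldest = this.sent[0];
    if (this.sent.length < this.maxPerMinute || oldest === undefined) {
      this.sent.push(nowMs);
      return 0;
    }
    // The window frees a slot when the oldest request turns 60 seconds old.
    return oldest + 60_000 - nowMs;
  }
}

// Usage: gate each call on the limiter, sleeping when it asks for a wait.
const limiter = new RpmLimiter(10);
const waitMs = limiter.acquire(Date.now());
console.log(waitMs === 0 ? "send now" : `wait ${waitMs} ms`);
```

A limiter like this covers RPM caps only. Concurrency caps (DeepInfra) and dynamic limits (Together) still call for handling 429 responses with retries.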

Batch Discounts and Compliance

For offline jobs, batch APIs cut cost roughly in half. For regulated workloads, compliance certifications gate the decision before price does.

| Provider | Batch discount | Window |
| --- | --- | --- |
| Fireworks AI | 50% of serverless pricing | Async |
| Groq | 50% lower cost | 24h to 7 days |
| Together AI | Up to 50% off selected models | 24h, 50k requests/batch |

| Provider | SOC 2 Type II | HIPAA | Default retention |
| --- | --- | --- | --- |
| Baseten | Yes | Yes | Standard |
| Fireworks AI | Yes | Yes | Zero retention (open models) |
| DeepInfra | Yes (+ ISO 27001) | Measures in place | Zero retention |
| Together AI | Yes | Not stated here | Standard |
| Groq | Yes | BAA with exclusions | Standard |
| Modal | Yes (from Starter) | Enterprise only | Standard |
| OpenRouter | Not stated | Not stated | Zero logging by default |

Fireworks does not log or store prompt or generation data for open models without explicit opt-in, with TLS 1.2+ in transit and AES-256 at rest. DeepInfra deletes prompts and completions after a short retention period and logs only metadata, except for Google models where Google logs for abuse detection. OpenRouter does no logging by default "even if an error occurs," and opting into logging earns a 1% usage discount.

Morph for Code Generation

If the workload is a coding agent rather than general inference, throughput on the code path is what decides whether the agent feels instant. Morph runs morph-v3-fast at about 10,500 tok/s using n-gram speculative decoding tuned to the token distribution of code, plus custom low-level inference kernels written for code generation. That combination makes it the fastest and highest-quality option for coding agents specifically. The endpoint is always warm, so there is no cold start to design around, and billing is per token, so there is no idle-replica bill.
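To put the throughput figure in wall-clock terms, the sketch below converts tokens per second into decode time. The 2,000-token edit size is hypothetical, and the calculation covers streaming time only: it ignores network latency and time to first token. The 500 tok/s comparison point is the Groq GPT-OSS 120B figure cited in this article, which is a different model, so treat the comparison as a scale reference and not a like-for-like benchmark.

```typescript
/** Seconds needed to stream `tokens` at a sustained decode rate. */
function decodeSeconds(tokens: number, tokensPerSecond: number): number {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  return tokens / tokensPerSecond;
}

const EDIT_TOKENS = 2_000; // hypothetical size of one code edit

const atMorphFast = decodeSeconds(EDIT_TOKENS, 10_500); // about 0.19 s
const at500 = decodeSeconds(EDIT_TOKENS, 500);          // 4 s

console.log(`${atMorphFast.toFixed(2)} s vs ${at500.toFixed(2)} s`);
```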

For DeepSeek, output fidelity is the deciding factor. Most serverless providers quantize activations to fp8 to cut cost, which degrades output quality. Morph serves DeepSeek with full 16-bit (bf16) activations and does not quantize them, so output matches the reference weights. That makes Morph the best place to run DeepSeek when fidelity matters: morph-dsv4flash (DeepSeek V4 Flash) is $0.139 per 1M input tokens and $0.278 per 1M output tokens. See the full Open Source Models lineup and pricing.

Morph's API is OpenAI-compatible at https://api.morphllm.com/v1, so switching is a one-string change. The lineup also includes morph-qwen35-397b (Qwen 3.5, 397B MoE, 262k context, ~120 tok/s), morph-minimax27-230b (agentic, ~140 tok/s), and morph-qwen36-27b (dense, low latency). For code search inside an agent, WarpGrep is free up to 100k requests, then $1 per 1M requests on Pro.

Point an OpenAI client at Morph

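import OpenAI from "openai";

// Reuse the standard OpenAI client; only the base URL and API key change.
const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY,
});

// Top-level await requires an ES module context.
const res = await client.chat.completions.create({
  model: "morph-qwen35-397b",
  messages: [{ role: "user", content: "Patch this failing test..." }],
});

// Print the assistant's reply from the first choice.
console.log(res.choices[0]?.message.content);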

No cold starts

Always-warm endpoint; no min_replica to keep paid up and warm.

Per-token billing

Cost tracks tokens generated, not provisioned GPU hours.

OpenAI-compatible

Swap the base URL and model string; no migration project.

Which Baseten Alternative to Pick

| Your priority | Pick | Why |
| --- | --- | --- |
| Cheapest GPUs and tokens | DeepInfra | H100 $1.79/hr, undercuts on every serverless model shown |
| Fastest cold starts | Modal | ~1s container boot, per-second billing, $30/mo free credits |
| Highest token throughput | Groq | 500-1,000 tok/s published on GPT-OSS and Llama |
| No markup, widest model menu | OpenRouter | Pass-through pricing, 400+ models, 5.5% credit fee |
| Fine-tuning + clusters | Together AI | LoRA SFT from $0.48/M tokens, reserved clusters from $3.99/hr |
| Serverless + fine-tuning suite | Fireworks AI | Zero-retention open models, 6,000 RPM ceiling, batch at 50% |
| Image/video by output | Replicate | $0.04/output image, $0.09/sec video, scale to zero |
| Code-generation throughput | Morph | ~10,500 tok/s on morph-v3-fast, always warm, per-token |

Keep Baseten when you need a dedicated, tuned deployment of a custom model on isolated hardware with HIPAA, and your load is steady enough to keep replicas busy. Switch when the cold-start tax, the replica-hour bill, or the H100 rate stops matching your traffic shape.

Frequently Asked Questions

What is the cheapest Baseten alternative for dedicated GPUs?

DeepInfra: H100 80GB at $1.79/hr versus Baseten's $6.50/hr, plus H200 at $2.19/hr, B200 at $2.79/hr, and A100 80GB at $0.89/hr. Modal is next at ~$3.95/hr for an H100; Replicate $5.49/hr, Together $6.49/hr endpoint, Fireworks $7.00/hr.

How much does Baseten cost compared to Fireworks and Together?

On DeepSeek V4 Pro serverless ($/M in/out): DeepInfra $1.30/$2.60, Novita $1.60/$3.20, Baseten $1.74/$3.48, Fireworks $1.74/$3.48, Together $2.10/$4.40. Baseten and Fireworks price it identically; DeepInfra is cheapest, Together most expensive.

Does Baseten have cold starts?

Yes. The default min_replica=0 incurs no idle charge but triggers a cold start that "can take minutes for large models," billed per minute during wake-up. Baseten recommends min_replica >= 2 for production to eliminate it. Modal boots in ~1 second; Replicate cold boots are slow but bill only prediction time.

Which Baseten alternative is fastest?

Groq publishes the highest tok/s: GPT-OSS 120B at 500 tok/s, GPT-OSS 20B at 1,000 tok/s, Llama 3.1 8B Instant at 840 tok/s, Qwen3 32B at 662 tok/s. Cerebras claims up to 15x faster than NVIDIA GPUs. For code generation, Morph runs morph-v3-fast at ~10,500 tok/s.

Is there a Baseten alternative with no markup on inference?

OpenRouter charges pass-through pricing with no inference markup; its fee is 5.5% on credit purchases. The first 1M BYOK requests per month are free, then a 5% fee on the model's normal cost.

Which Baseten alternatives are SOC 2 and HIPAA compliant?

SOC 2 Type II: Baseten, Fireworks, Groq, Together, Modal (from Starter), DeepInfra (+ISO 27001). HIPAA: Baseten, Fireworks, Modal (Enterprise), Groq (BAA with exclusions). Zero retention by default: Fireworks (open models), DeepInfra, OpenRouter.

Related Resources

Code-Generation Throughput Without Replica-Hours

Morph is an always-warm, per-token endpoint that runs morph-v3-fast at ~10,500 tok/s with no cold start and no RPM wall. OpenAI-compatible, so switching from Baseten is a one-string change.