Fireworks vs Baseten Pricing (2026): Same Per-Token Rates, $7.00/hr vs $6.50/hr H100

If you are pricing out Fireworks against Baseten, the surprise is how close the per-token numbers are. DeepSeek V4 Pro is $1.74/M input and $3.48/M output on both. Kimi K2.6 is $0.95/M input and $4.00/M output on both. The list prices barely differ, so the decision is not really about the per-token rate on shared models.

The split is the deployment model. Fireworks sells a managed per-token API. Baseten sells a per-minute GPU you control, scaled to zero by default. Below are the exact numbers for both, the dedicated GPU rates where they diverge, the rate limits, and the cold-start behavior that decides whether scale-to-zero actually saves you money. All prices are as of June 2026; check each provider's page before committing.

TL;DR

Per-token, they tie. DeepSeek V4 Pro $1.74/$3.48 and Kimi K2.6 $0.95/$4.00 are identical on both. Baseten edges ahead on GPT-OSS 120B ($0.10/$0.50 vs $0.15/$0.60) and GLM 5.1 ($1.30/$4.30 vs $1.40/$4.40).
Dedicated GPUs go to Baseten. H100 80GB $6.50/hr vs Fireworks $7.00/hr; B200 $9.98/hr vs $10.00/hr. Baseten also lists smaller, cheaper GPUs (T4 $0.63/hr, L4 $0.85/hr, A100 80GB $4.00/hr) that Fireworks does not surface.
Pick Fireworks if you want a zero-ops per-token API with published rate limits (6,000 RPM ceiling) and no cold start.
Pick Baseten if you want to own the GPU per minute, scale to zero, or deploy a custom model the catalog does not include.

Serverless Per-Token Pricing: Model by Model

Both platforms run the same popular open models on a serverless API. On the models that overlap, the rates are either identical or within about ten cents per million tokens of each other.

Serverless $/M tokens (input / output), June 2026

Model	Fireworks AI	Baseten
DeepSeek V4 Pro	$1.74 / $3.48	$1.74 / $3.48
Kimi K2.6	$0.95 / $4.00	$0.95 / $4.00
GLM 5.1	$1.40 / $4.40	$1.30 / $4.30
GPT-OSS 120B	$0.15 / $0.60	$0.10 / $0.50
DeepSeek V4 Flash	$0.14 / $0.28	Not listed
MiniMax 2.7	$0.30 / $1.20	Not listed
Nemotron 3 Super	Not listed	$0.30 / $0.75

Cached input is discounted on both: Fireworks gives cached input a 50% discount (DeepSeek V4 Pro drops to $0.145/M cached), and Baseten lists cached rates per model (DeepSeek V4 $0.145/M, Kimi K2.6 $0.16/M, GLM 5.1 $0.26/M). Fireworks also bills batch inference at 50% of serverless, which Baseten does not surface as a serverless discount because batch on Baseten runs on the same per-minute dedicated GPU.
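To see what the table means for a bill, here is a minimal Python sketch that multiplies the list prices above by a monthly token volume. The workload (1,000M input and 200M output tokens per month) is a made-up example, and the names `RATES` and `monthly_cost` are illustrative, not part of either provider's SDK. Cached and batch discounts are left out.

```python
# Serverless list prices in $ per 1M tokens (input, output), copied from the
# table above for the four models both providers list.
RATES = {
    "DeepSeek V4 Pro": {"fireworks": (1.74, 3.48), "baseten": (1.74, 3.48)},
    "Kimi K2.6": {"fireworks": (0.95, 4.00), "baseten": (0.95, 4.00)},
    "GLM 5.1": {"fireworks": (1.40, 4.40), "baseten": (1.30, 4.30)},
    "GPT-OSS 120B": {"fireworks": (0.15, 0.60), "baseten": (0.10, 0.50)},
}


def monthly_cost(model, provider, input_m, output_m):
    """Dollars per month for input_m / output_m million tokens at list price."""
    in_rate, out_rate = RATES[model][provider]
    return input_m * in_rate + output_m * out_rate


# Hypothetical workload: 1,000M input and 200M output tokens per month.
for model in RATES:
    fw = monthly_cost(model, "fireworks", 1000, 200)
    bt = monthly_cost(model, "baseten", 1000, 200)
    print(f"{model}: Fireworks ${fw:,.2f}, Baseten ${bt:,.2f}, gap ${fw - bt:,.2f}")
```

On that made-up volume, GPT-OSS 120B comes to about $270 on Fireworks and $200 on Baseten, while DeepSeek V4 Pro and Kimi K2.6 show no gap at all. Swap in your own token mix before reading anything into the difference.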

How they stack up against the cheap end

Both Fireworks and Baseten sit at the premium end of the serverless market on these models. For Kimi K2.6, DeepInfra lists $0.75/$3.50 and Novita $0.80/$3.40, undercutting the $0.95/$4.00 both charge here. For DeepSeek V4 Pro, DeepInfra is $1.30/$2.60 versus the $1.74/$3.48 on both. If raw per-token price on a shared open model is the only thing you care about, see Fireworks vs DeepInfra and Baseten vs DeepInfra. The reason to pay the Fireworks or Baseten rate is the serving stack and deployment control, not the sticker price.
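The size of that premium is easy to compute from the list prices quoted above. A small sketch, with `premium_pct` as an illustrative helper name:

```python
def premium_pct(price, reference):
    """Percent by which `price` exceeds `reference`."""
    return (price / reference - 1) * 100


# Kimi K2.6 output: $4.00/M on Fireworks and Baseten vs $3.50/M on DeepInfra.
kimi_output_premium = premium_pct(4.00, 3.50)

# DeepSeek V4 Pro input: $1.74/M on Fireworks and Baseten vs $1.30/M on DeepInfra.
deepseek_input_premium = premium_pct(1.74, 1.30)

print(f"Kimi K2.6 output premium: {kimi_output_premium:.1f}%")
print(f"DeepSeek V4 Pro input premium: {deepseek_input_premium:.1f}%")
```

That works out to roughly 14% more for Kimi K2.6 output and roughly 34% more for DeepSeek V4 Pro input, based only on the figures listed here.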

Dedicated GPU Pricing: Where They Diverge

This is where the two platforms actually differ. Baseten bills dedicated GPUs per minute and publishes a full ladder from T4 to B200. Fireworks bills on-demand dedicated GPUs per second and lists only the large accelerators.

Dedicated GPU on-demand pricing ($/hr), June 2026

GPU	Fireworks AI	Baseten
T4 16GB	Not listed	$0.63 / hr
L4 24GB	Not listed	$0.85 / hr
A10G 24GB	Not listed	$1.21 / hr
A100 80GB	Not listed	$4.00 / hr
H100 80GB	$7.00 / hr	$6.50 / hr
H200 141GB	$7.00 / hr	Not listed
B200 180GB	$10.00 / hr	$9.98 / hr
B300 288GB	$12.00 / hr	Not listed

Baseten is per-minute billed: H100 80GB is $0.10833/min ($6.50/hr), B200 180GB is $0.16633/min ($9.98/hr), A100 80GB is $0.06667/min ($4.00/hr), and there is an H100 MIG 40GB slice at $0.0625/min ($3.75/hr) for models that fit in 40GB. CPU-only instances run from $0.00058/min. Fireworks dedicated GPUs are billed per second and scale to zero, but the catalog starts at the H100 tier.
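Billing granularity matters most for short jobs. The sketch below assumes runtime is rounded up to the next billing unit (a full minute on Baseten, a full second on Fireworks). That rounding rule is an assumption made for illustration, not something either provider's pricing table states, so check the billing docs before relying on it.

```python
import math


def billed_cost(hourly_rate, runtime_seconds, granularity_seconds):
    """Cost of one run, ASSUMING runtime is rounded up to the billing unit."""
    units = math.ceil(runtime_seconds / granularity_seconds)
    return units * granularity_seconds * hourly_rate / 3600


# A hypothetical 90-second job on an H100 at each provider's list price.
baseten_h100 = billed_cost(6.50, runtime_seconds=90, granularity_seconds=60)
fireworks_h100 = billed_cost(7.00, runtime_seconds=90, granularity_seconds=1)

print(f"Baseten H100, per-minute billing:   ${baseten_h100:.4f}")
print(f"Fireworks H100, per-second billing: ${fireworks_h100:.4f}")
```

Under that assumption the 90-second job is billed as two minutes on Baseten (about $0.217) and as 90 seconds on Fireworks ($0.175), so the lower hourly rate does not automatically mean the lower bill on very short runs. Over hours of runtime the rounding effect disappears and the hourly rate dominates.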

$6.50/hr: Baseten H100 80GB (per minute)
$7.00/hr: Fireworks H100 80GB (per second)
$0.63/hr: Baseten T4, the cheapest dedicated GPU

Neither is cheap by market standards. DeepInfra lists H100 at $1.79/hr and B200 at $2.79/hr; Modal works out to about $3.95/hr for H100. Both Fireworks and Baseten charge for the serving runtime and tooling on top of the silicon, so the per-hour number is not the whole cost story.

Who Wins Per Workload

The deciding question is whether you want to own the GPU or never see it. Match your workload to the column.

Who wins per workload

Workload / decision	Fireworks AI	Baseten
Bursty / low volume on a catalog model	Fireworks, pay nothing when idle	no, GPU sits idle
Fastest first call (no cold start)	Fireworks, endpoints stay warm	no, scale-to-zero cold start
Published, predictable rate limits	Fireworks, 6,000 RPM ceiling	no, set by replica count
Run your own / non-catalog model	no, curated catalog	Baseten, any model deployed
Cheap small GPU (T4 / L4 / A10G)	no, H100 tier and up	Baseten, from $0.63/hr
Per-minute GPU you size and pin	per-second dedicated escape hatch	Baseten, dedicated by default
Scale-to-zero to kill idle spend	dedicated GPUs scale to zero	Baseten, scale-to-zero by default
Lowest dedicated H100 hourly rate	no, $7.00/hr	Baseten, $6.50/hr
Cheapest GPT-OSS 120B serverless	no, $0.15/$0.60	Baseten, $0.10/$0.50

Cost on a Real Workload

Take Kimi K2.6 serving 50M output tokens per day (and assume input is small enough to ignore for the estimate). At the shared serverless rate of $4.00/M output, that is the same bill on both: 50M x $4.00/M = $200/day, about $6,000/mo. Per-token, there is no gap to exploit.

When a dedicated GPU beats per-token (computed from list prices, June 2026)

Fireworks serverless (Kimi K2.6, $4.00/M output): 50M tokens/day = $200/day, ~$6,000/mo. Zero ops, scales with traffic.
Baseten serverless (same Kimi K2.6 rate): identical at ~$6,000/mo.
Baseten dedicated (one H100 at $6.50/hr, 24/7): $6.50 x 24 x 30 = $4,680/mo per GPU. Scale-to-zero only lowers that figure for the hours the GPU is actually idle.
Break-even: a dedicated H100 only wins if you can keep it saturated. The serverless bill scales with tokens, so once your sustained output exceeds roughly $4,680 / $4.00 per M = ~1.17B output tokens/month per GPU, the dedicated GPU is cheaper. A 50M-tokens/day workload is ~1.5B/mo, which clears that bar only if a single H100 can sustain it for your model (about 580 output tokens/s around the clock). A team that can pin one H100 at that throughput should move to dedicated.

The crossover depends entirely on how saturated you keep the GPU. Spiky traffic that idles the card stays cheaper on serverless because there is no GPU to strand. Sustained, predictable load that holds a card busy belongs on a per-minute dedicated GPU, and Baseten's $6.50/hr H100 makes that math slightly better than Fireworks' $7.00/hr. Redo the arithmetic with your own model rate and duty cycle before committing.
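Here is a minimal sketch of that arithmetic so you can plug in your own numbers. It uses the list prices above and a 30-day month. The function names are illustrative, and the sketch does not know how many GPUs your model needs or what throughput they deliver, so treat the tokens-per-second figure as a requirement to verify, not a result.

```python
HOURS_PER_MONTH = 24 * 30      # 30-day month, matching the estimate above
SECONDS_PER_DAY = 24 * 60 * 60


def dedicated_monthly(hourly_rate, gpus=1):
    """Monthly cost of GPUs that run 24/7."""
    return hourly_rate * HOURS_PER_MONTH * gpus


def break_even_tokens_m(monthly_gpu_cost, serverless_rate_per_m):
    """Million tokens/month above which dedicated beats serverless."""
    return monthly_gpu_cost / serverless_rate_per_m


def required_tokens_per_second(tokens_per_day):
    """Sustained throughput the GPU must hold to serve the daily volume."""
    return tokens_per_day / SECONDS_PER_DAY


baseten_h100 = dedicated_monthly(6.50)
fireworks_h100 = dedicated_monthly(7.00)

print(f"Baseten H100:   ${baseten_h100:,.0f}/mo, "
      f"break-even {break_even_tokens_m(baseten_h100, 4.00):,.0f}M tokens/mo")
print(f"Fireworks H100: ${fireworks_h100:,.0f}/mo, "
      f"break-even {break_even_tokens_m(fireworks_h100, 4.00):,.0f}M tokens/mo")
print(f"50M tokens/day needs {required_tokens_per_second(50e6):,.0f} tok/s sustained")
```

At Kimi K2.6's $4.00/M output rate the break-even lands at 1,170M tokens/month on Baseten's H100 and 1,260M on Fireworks'. If your model needs more than one GPU, pass `gpus` accordingly and the bar rises in proportion.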

Rate Limits

Fireworks publishes hard numbers; Baseten throughput is whatever your replica configuration allows.

On Fireworks you get 10 requests per minute with no payment method and a fixed 6,000 RPM ceiling once you add one. Spending tiers gate the monthly budget: Tier 1 is $50/mo with a valid card, Tier 2 is $500/mo after $50 spent or added, Tier 3 is $5,000/mo, and Tier 4 is $50,000/mo. So you can size your expected throughput against published limits before writing a line of code.

Baseten dedicated deployments do not have a shared RPM cap. Each replica targets a concurrency of 1 request by default (model-specific guidance ranges from 1 for image generation up to 256 for Whisper async batch), and the deployment autoscales from a min replica count to a max replica count you set. The default autoscaling window is 60 seconds, configurable from 10 to 3,600 seconds. Throughput is a function of how many replicas you allow, not a quota you share with other tenants.
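To compare the two models of throughput, you can estimate how many replicas it would take to match a given RPM figure. The sketch below uses Little's law (throughput = requests in flight / latency). The 2-second average latency and the concurrency target of 32 are hypothetical inputs, and the estimate assumes latency does not degrade as concurrency rises, which real models often violate.

```python
import math


def max_rpm(replicas, concurrency_target, avg_latency_s):
    """Little's law estimate of requests per minute across all replicas."""
    return replicas * concurrency_target / avg_latency_s * 60


def replicas_for(target_rpm, concurrency_target, avg_latency_s):
    """Replicas needed to reach target_rpm under the same assumptions."""
    per_replica = max_rpm(1, concurrency_target, avg_latency_s)
    return math.ceil(target_rpm / per_replica)


# Hypothetical: 2 s average latency per request.
# Default concurrency target of 1 request per replica:
print(replicas_for(6000, concurrency_target=1, avg_latency_s=2.0))
# A hypothetical higher concurrency target of 32:
print(replicas_for(6000, concurrency_target=32, avg_latency_s=2.0))
```

With those inputs, matching 6,000 RPM takes 200 replicas at a concurrency target of 1 and 7 replicas at a target of 32, which is why the concurrency setting matters more than the headline replica count.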

6,000 RPM: Fireworks ceiling with a payment method
10 RPM: Fireworks without a payment method
1 req/replica: Baseten default concurrency target

Cold Starts and Scale to Zero

Scale-to-zero is the Baseten selling point, and the cold start is the catch.

Fireworks serverless endpoints stay warm on shared infrastructure, so there is effectively no cold start on a catalog model. The trade is that you do not control the GPU.

Baseten scales to zero by default (min_replica=0). The docs are explicit: a deployment scaled to zero incurs no charges, but the next request triggers a cold start that "can take minutes for large models," and during wake-up "billing is per minute even though the replica isn't yet serving responses." To eliminate cold starts in production, Baseten's docs recommend min_replica of at least 2, which keeps GPUs warm and removes the idle savings. So scale-to-zero is real, but for latency-sensitive production you are usually back to paying for warm capacity.

The scale-to-zero tradeoff

Scale-to-zero on Baseten saves money on genuinely bursty or part-time workloads where minutes-long cold starts are acceptable. For anything user-facing, the recommended min_replica of 2 means you pay for at least two warm GPUs around the clock, and at that point Fireworks serverless (no cold start, no GPU to keep warm) is the simpler path unless you specifically need a custom model.
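A rough sketch of both sides of that tradeoff, using Baseten's $6.50/hr H100 rate and a 30-day month. The active hours, wake-ups per day, and 3-minute cold start are hypothetical inputs. The only thing taken from the docs is that wake-up time is billed.

```python
DAYS_PER_MONTH = 30


def warm_floor_monthly(min_replica, hourly_rate):
    """Monthly cost of keeping min_replica GPUs warm around the clock."""
    return min_replica * hourly_rate * 24 * DAYS_PER_MONTH


def scale_to_zero_monthly(active_hours_per_day, wakeups_per_day,
                          cold_start_minutes, hourly_rate):
    """Monthly cost when the GPU is billed only while active or waking up."""
    billed_hours_per_day = active_hours_per_day + wakeups_per_day * cold_start_minutes / 60
    return billed_hours_per_day * DAYS_PER_MONTH * hourly_rate


# Recommended production floor: two warm H100s.
always_warm = warm_floor_monthly(2, 6.50)

# Hypothetical bursty workload: 4 active hours/day, 10 wake-ups/day,
# 3 minutes of billed cold start per wake-up.
bursty = scale_to_zero_monthly(4, 10, 3, 6.50)

print(f"min_replica=2, always warm: ${always_warm:,.2f}/mo")
print(f"scale-to-zero, bursty:      ${bursty:,.2f}/mo")
```

Two warm H100s come to $9,360/mo before serving a single request, while the hypothetical bursty profile comes to $877.50/mo. The gap is the price of removing cold starts.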

Fine-Tuning

Fireworks runs fine-tuning as a managed per-token service; Baseten runs training on the same per-minute GPU you deploy to.

Fine-tuning cost (Fireworks, per 1M training tokens), June 2026

Model size	LoRA SFT	Full SFT
Up to 16B	$0.50	$1.00
16.1B - 80B	$3.00	$6.00
80B - 300B	$6.00	$12.00
Over 300B	$10.00	$20.00

On Fireworks, DPO is 2x the SFT rate at each size, and the fine-tuned model then serves at the base model's per-token rate with no inference premium. Baseten does not bill fine-tuning per token: you train on a dedicated GPU at the same per-minute rate you serve on (H100 $6.50/hr, A100 80GB $4.00/hr), so a fine-tune costs whatever the GPU costs to keep running. Fireworks is the cleaner choice for a LoRA on a catalog base model you call intermittently; Baseten wins when you want to train and serve any architecture on hardware you control.
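The two billing models can be put side by side with a short sketch. The tier table is copied from above. Treating each tier's upper bound as inclusive is an assumption, since the table does not say which tier a model at exactly 80B falls into. The GPU-hours a Baseten training run takes is something you have to measure yourself, and the function names here are illustrative.

```python
# (upper bound in billions of parameters, LoRA SFT $/M tokens, full SFT $/M tokens)
# ASSUMPTION: upper bounds are inclusive; the source table is ambiguous at 80B.
TIERS = [
    (16, 0.50, 1.00),
    (80, 3.00, 6.00),
    (300, 6.00, 12.00),
    (float("inf"), 10.00, 20.00),
]


def fireworks_finetune_cost(params_b, training_tokens_m, full=False, dpo=False):
    """Dollars for a Fireworks fine-tune, billed per 1M training tokens."""
    rate = next(
        (full_rate if full else lora_rate)
        for upper, lora_rate, full_rate in TIERS
        if params_b <= upper
    )
    if dpo:
        rate *= 2  # DPO is 2x the SFT rate at each size
    return rate * training_tokens_m


def baseten_finetune_cost(gpu_hours, hourly_rate=6.50, gpus=1):
    """Dollars for a Baseten training run, billed as GPU time."""
    return gpu_hours * hourly_rate * gpus


# Hypothetical run: LoRA SFT on an 8B model over 100M training tokens.
fw = fireworks_finetune_cost(8, 100)
print(f"Fireworks LoRA SFT: ${fw:,.2f}")
print(f"Equivalent H100 hours on Baseten: {fw / 6.50:.1f}")
```

For that hypothetical run Fireworks charges $50, which buys roughly 7.7 H100-hours on Baseten. Whether Baseten comes out cheaper depends on whether your training job finishes inside that budget.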

Compliance and Data Retention

Both clear the standard enterprise bar. Fireworks adds an explicit zero-retention default.

Compliance & data handling

Capability	Fireworks AI	Baseten
SOC 2 Type II	Yes	Yes
HIPAA	Yes	Yes
Zero data retention	Default on open models (opt-in to log)	Not stated as default
Encryption	TLS 1.2+ in transit, AES-256 at rest	Not specified here
Deployment options	Managed serverless + dedicated	Cloud, dedicated GPUs
OpenAI-compatible API	Yes	Yes

Fireworks states it "does not log or store prompt or generation data for any open models without explicit user opt-in," with TLS 1.2+ in transit and AES-256 at rest. Both providers are SOC 2 Type II certified and HIPAA compliant. If a no-prompt-logging default is a hard requirement, Fireworks documents it on open models; Baseten documents SOC 2 Type II and HIPAA on its pricing page but does not advertise a zero-retention default in the same terms.

When to Use Fireworks AI

Zero-ops token APIs on catalog models. Call DeepSeek V4, Kimi K2.6, GLM 5.1, or GPT-OSS without touching a GPU, on warm endpoints with no cold start.
Published, predictable rate limits. 6,000 RPM ceiling and spending tiers from $50/mo to $50,000/mo let you size capacity before launch.
Managed fine-tuning. LoRA SFT from $0.50/M tokens, served at the base model rate with no inference premium.
Cost-sensitive batch jobs. Batch inference at 50% of serverless and cached input at 50% cut the bill on large offline runs.
Documented zero-retention. No prompt or generation logging on open models without opt-in, with TLS 1.2+ and AES-256.

When to Use Baseten

Custom or non-catalog models. Deploy any architecture to a dedicated GPU instead of being limited to a serverless catalog.
Cheap small GPUs. T4 at $0.63/hr, L4 at $0.85/hr, A10G at $1.21/hr, A100 80GB at $4.00/hr, all per-minute billed, none of which Fireworks surfaces.
Slightly cheaper big GPUs. H100 80GB at $6.50/hr beats Fireworks' $7.00/hr; B200 at $9.98/hr beats $10.00/hr.
Scale-to-zero on bursty workloads. No idle charge at min_replica=0, as long as minutes-long cold starts are acceptable.
Cheaper open-weight serverless. GPT-OSS 120B at $0.10/$0.50 and GLM 5.1 at $1.30/$4.30 undercut Fireworks on those models.

Neither is built for the coding-agent apply loop. If applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Running DeepSeek or codegen? Morph Open Source Models

If you are running DeepSeek for output that has to match the reference weights, Morph serves it with full 16-bit (bf16) activations. Fireworks, Baseten, and most serverless providers quantize activations to fp8 to cut cost, which degrades output quality. That makes Morph the best place to run DeepSeek when fidelity matters. morph-dsv4flash (DeepSeek V4 Flash) is $0.139/M input and $0.278/M output.

For coding agents specifically, Morph runs codegen-tuned speculative decoding plus custom low-level inference kernels built for code generation, so it is the fastest and highest-quality option for codegen. See Morph Open Source Models and pricing.

Frequently Asked Questions

Is Fireworks AI or Baseten cheaper?

On serverless per-token APIs they are nearly identical: DeepSeek V4 Pro is $1.74/M input and $3.48/M output on both, and Kimi K2.6 is $0.95/$4.00 on both. Baseten is slightly cheaper on GPT-OSS 120B ($0.10/$0.50 vs $0.15/$0.60) and GLM 5.1 ($1.30/$4.30 vs $1.40/$4.40). On dedicated GPUs Baseten is cheaper per hour: H100 80GB at $6.50/hr vs Fireworks $7.00/hr, and B200 at $9.98/hr vs $10.00/hr. Both sit well above DeepInfra, which lists H100 at $1.79/hr.

What is the main difference between Fireworks AI and Baseten?

Fireworks is serverless-first: a per-token API across a curated catalog, plus on-demand dedicated GPUs from $7.00/hr that scale to zero. Baseten is dedicated-first: deploy any model to per-minute GPUs ($0.63/hr T4 to $9.98/hr B200) with scale-to-zero by default, plus a serverless Model API. Fireworks optimizes for zero-ops token serving; Baseten optimizes for control over the GPU and custom deployments.

What are the rate limits on Fireworks AI vs Baseten?

Fireworks publishes hard numbers: 10 RPM without a payment method, a 6,000 RPM ceiling with one, and spending tiers from $50/mo (Tier 1) to $50,000/mo (Tier 4). Baseten does not publish a flat RPM cap because dedicated deployments scale on a concurrency target (default 1 request per replica) up to your configured max replicas, so throughput is set by replica count rather than a shared quota.

Does Fireworks AI or Baseten handle cold starts better?

Fireworks serverless endpoints stay warm on shared infrastructure, so there is effectively no cold start. Baseten scales to zero by default (min_replica=0), which means the first request after idle triggers a cold start that the docs say can take minutes for large models, and billing is per minute during wake-up. Baseten's docs recommend min_replica of at least 2 for production to eliminate cold starts, which removes the idle savings.

What does Baseten cost to fine-tune and serve a custom model?

Baseten bills the GPU per minute for both training and serving on dedicated deployments, so a fine-tune costs whatever the GPU costs to keep running: H100 at $6.50/hr, A100 80GB at $4.00/hr. Fireworks runs fine-tuning as a managed per-token service: up to 16B is $0.50/M tokens LoRA SFT or $1.00/M full SFT, 16.1B to 80B is $3.00/$6.00, 80B to 300B is $6.00/$12.00, and over 300B is $10.00/$20.00, with DPO at 2x the SFT rate. On Fireworks the fine-tune then serves at the base model's per-token rate.

Own the GPU or never see it

Pick Fireworks for hands-off per-token serving with published rate limits, Baseten for per-minute GPUs you control with scale-to-zero. If applying model-generated code edits is your bottleneck, that is a separate problem Morph Fast Apply solves at ~10,500 tok/s.
