Most people searching for a Fireworks AI alternative want one of three things: a lower per-token price, higher throughput, or a fixed rate limit they can plan around. Fireworks sits in the middle on price (Kimi K2.6 at $0.95/M input, DeepSeek V4 Pro at $1.74/M input) and caps accounts at 10 RPM without a card and 6,000 RPM with one. Below is every published number across the eight providers buyers compare next, then where Morph fits for code-heavy agent traffic.
Fireworks AI Pricing and Rate Limits
Fireworks serverless per-token rates (input/output per million tokens): Kimi K2.6 is $0.95/M input ($0.16 cached) and $4.00/M output; Kimi K2.5 is $0.60/$3.00; DeepSeek V4 Pro is $1.74/M input ($0.145 cached) and $3.48/M output; DeepSeek V4 Flash is $0.14/$0.28; GLM 5.1 is $1.40/$4.40; Qwen 3.6 Plus is $0.50/$3.00; GPT-OSS 120B is $0.15/$0.60; GPT-OSS 20B is $0.07/$0.30; MiniMax 2.7 is $0.30/$1.20. Fireworks lists a 50% discount on cached input tokens, although the cached rates quoted above for Kimi K2.6 and DeepSeek V4 Pro work out to a deeper cut. Batch inference runs at 50% of serverless pricing.
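To see what a cached rate does to the effective input price, the sketch below blends the fresh and cached rates by cache hit ratio. The helper name `blendedInputRate` and the 80% hit ratio are assumptions for illustration; only the two Kimi K2.6 rates come from the list above.

```typescript
// Blend fresh and cached input rates by the share of input tokens served from cache.
// Rates are USD per 1M input tokens; hitRatio is between 0 and 1.
function blendedInputRate(fresh: number, cached: number, hitRatio: number): number {
  if (hitRatio < 0 || hitRatio > 1) {
    throw new RangeError("hitRatio must be between 0 and 1");
  }
  return cached * hitRatio + fresh * (1 - hitRatio);
}

// Kimi K2.6 on Fireworks: $0.95 fresh, $0.16 cached (from the rates above).
// The 80% hit ratio is an assumed figure for an agent that resends a long prefix.
const kimiBlended = blendedInputRate(0.95, 0.16, 0.8);
console.log(kimiBlended.toFixed(3)); // "0.318"
```

Agents that resend the same system prompt and file context on every turn tend to sit at the high end of the hit-ratio range, so the cached rate can matter more than the headline input rate.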
On-demand GPUs are billed by the hour: H100 80GB and H200 141GB both at $7.00/hr, B200 180GB at $10.00/hr, B300 288GB at $12.00/hr. New accounts get $1 in free credits.
The rate limit most people hit first
Without a payment method, Fireworks caps you at 10 requests per minute. A card raises that to a fixed 6,000 RPM ceiling. Spending tiers separately gate the monthly budget: Tier 1 $50/mo (valid card), Tier 2 $500/mo ($50 spent or added), Tier 3 $5,000/mo, Tier 4 $50,000/mo. A coding agent that fans out dozens of parallel calls per task burns through 10 RPM in seconds.
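As a rough illustration of how quickly a fan-out workload runs into these ceilings, the sketch below counts how many one-minute windows a burst needs under a fixed RPM cap. The helper `windowsNeeded` and the workload (5 tasks of 40 calls each) are assumptions, not Fireworks figures.

```typescript
// Number of one-minute rate-limit windows needed to send `requests` calls
// when at most `rpmLimit` calls are allowed per window.
function windowsNeeded(requests: number, rpmLimit: number): number {
  if (rpmLimit <= 0) throw new RangeError("rpmLimit must be positive");
  return Math.ceil(requests / rpmLimit);
}

// Assumed workload: 5 concurrent agent tasks, 40 model calls each.
const burst = 5 * 40;

console.log(windowsNeeded(burst, 10));   // 20 windows on the no-card limit
console.log(windowsNeeded(burst, 6000)); // 1 window on the card limit
```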
Per-Token Pricing Across 5 Providers
For the four models buyers compare most, here is every published serverless rate (input / output per million tokens). DeepInfra has the lowest input price on the three models it lists (it shows no GPT-OSS 120B rate); Fireworks and Baseten track each other closely.
| Model | DeepInfra | Novita | Fireworks | Baseten | Together |
|---|---|---|---|---|---|
| Kimi K2.6 | $0.75 / $3.50 | $0.80 / $3.40 | $0.95 / $4.00 | $0.95 / $4.00 | $1.20 / $4.50 |
| DeepSeek V4 Pro | $1.30 / $2.60 | $1.60 / $3.20 | $1.74 / $3.48 | $1.74 / $3.48 | $2.10 / $4.40 |
| GLM 5.1 | $1.05 / $3.50 | $1.38 / $4.40 | $1.40 / $4.40 | $1.30 / $4.30 | $1.40 / $4.40 |
| GPT-OSS 120B | — | — | $0.15 / $0.60 | $0.10 / $0.50 | $0.15 / $0.60 |
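The table's rates turn into a monthly bill once you plug in token volumes. The sketch below ranks the five providers for Kimi K2.6 under an assumed volume of 2B input and 200M output tokens per month; the `costUSD` helper and the volume are illustrative, and the rates are copied from the table.

```typescript
type Rate = { input: number; output: number }; // USD per 1M tokens

// Kimi K2.6 serverless rates, copied from the table above.
const kimiK26: Record<string, Rate> = {
  DeepInfra: { input: 0.75, output: 3.5 },
  Novita: { input: 0.8, output: 3.4 },
  Fireworks: { input: 0.95, output: 4.0 },
  Baseten: { input: 0.95, output: 4.0 },
  Together: { input: 1.2, output: 4.5 },
};

function costUSD(rate: Rate, inputTokens: number, outputTokens: number): number {
  return (inputTokens * rate.input + outputTokens * rate.output) / 1_000_000;
}

// Assumed monthly volume: 2B input tokens, 200M output tokens.
const monthly = Object.entries(kimiK26)
  .map(([provider, rate]) => ({ provider, usd: costUSD(rate, 2e9, 2e8) }))
  .sort((a, b) => a.usd - b.usd);

for (const { provider, usd } of monthly) {
  console.log(provider, usd.toFixed(2));
}
```

With this input-heavy mix DeepInfra comes out cheapest. Because Novita's Kimi K2.6 output rate is lower, the ranking flips once output tokens exceed half of input tokens, so it is worth running the numbers with your own ratio.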
DeepInfra runs zero retention on inference (prompts and completions deleted after a short window, only metadata logged) and resells Anthropic models directly: Claude Haiku 4.5 at $1.00/$5.00, Sonnet 4.6 at $3.00/$15.00, Opus 4.8 at $5.00/$25.00. OpenRouter passes through provider pricing with no markup and a 5.5% credit-purchase fee, fronting 400+ models from 60+ providers behind one OpenAI-compatible key.
GPU Pricing by the Hour
If you run dedicated capacity instead of serverless, the spread is wide. DeepInfra is the cheapest published H100 at $1.79/hr; Fireworks is the most expensive at $7.00/hr. Modal and Replicate bill per second, normalized to hourly below.
| Provider | H100 80GB | B200 180GB | A100 80GB |
|---|---|---|---|
| DeepInfra | $1.79 | $2.79 | $0.89 |
| Modal | ~$3.95 | ~$6.25 | ~$2.50 |
| Together (cluster) | $5.49 | $9.95 | — |
| Replicate | $5.49 | — | $5.04 |
| Baseten (dedicated) | $6.50 | $9.98 | $4.00 |
| Fireworks | $7.00 | $10.00 | — |
Baseten, Modal, and Replicate all scale to zero by default, so idle deployments cost nothing, but the next request pays a cold start that Replicate and Baseten warn "can take minutes for large models." Modal boots containers in ~1 second on its custom stack. To eliminate cold starts you set a minimum replica or container count (Baseten recommends min_replica >= 2 for production), which means paying for warm capacity around the clock.
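The cost of keeping replicas warm is simple arithmetic. The sketch below assumes one H100 per replica and a 30-day month, neither of which comes from the providers; the $6.50/hr rate is Baseten's dedicated H100 price from the table above.

```typescript
// Cost of keeping `minReplicas` always-on replicas for `hours` hours.
// Assumes each replica uses exactly one GPU billed at `hourlyRate`.
function warmCostUSD(hourlyRate: number, minReplicas: number, hours: number): number {
  return hourlyRate * minReplicas * hours;
}

const HOURS_PER_30_DAYS = 24 * 30; // assumed billing month

// Two warm H100 replicas at $6.50/hr.
const basetenWarm = warmCostUSD(6.5, 2, HOURS_PER_30_DAYS);
console.log(basetenWarm); // 9360
```

That figure is the fixed floor you pay before serving a single request, which is the trade against scale-to-zero and its cold starts.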
Rate Limits Compared
| Provider | Limit structure |
|---|---|
| Fireworks | 10 RPM without a card, fixed 6,000 RPM with one; spending tiers cap monthly budget at $50 to $50,000 |
| DeepInfra | 200 concurrent requests per account; no free tier, postpaid billing |
| Together | Dynamic per-model limits that scale with sustained traffic; no fixed published caps. For a guaranteed limit, use a dedicated endpoint |
| Groq (free) | llama-3.1-8b-instant 30 RPM / 14.4K RPD / 6K TPM; llama-3.3-70b 30 RPM / 1K RPD. Developer plan unlocks higher limits |
| OpenRouter | :free models 20 RPM; 50 reqs/day without credits, 1,000/day after $10 purchased |
A fixed RPM ceiling is a common reason coding agents see 429 responses under burst. Fireworks customers who need more than 6,000 RPM move to a contact-sales arrangement. Together avoids a fixed number entirely by scaling per-model limits with your traffic, and sells a dedicated endpoint when you need a hard guarantee.
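Whatever the provider, a client that bursts should expect 429s and retry them. Below is a minimal sketch of exponential backoff with jitter; `withBackoff` is a hypothetical helper, and it assumes the thrown error carries a numeric `status` field, which you should verify for your client. The official `openai` client also has its own retry setting, so check that before layering this on top.

```typescript
// Assumed error shape: any thrown value with an optional numeric `status`.
type HttpishError = { status?: number };

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry `call` on HTTP 429 only, with exponential backoff and full jitter.
async function withBackoff<T>(
  call: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = (err as HttpishError | undefined)?.status;
      // Anything other than a rate-limit error, or running out of retries, propagates.
      if (status !== 429 || attempt >= maxRetries) throw err;
      // Wait a random time up to baseDelayMs * 2^attempt before the next try.
      const ceilingMs = baseDelayMs * 2 ** attempt;
      await sleep(Math.random() * ceilingMs);
    }
  }
}
```

Wrapping a request looks like `await withBackoff(() => client.chat.completions.create({ ... }))`. Jitter matters for agents in particular, because parallel calls that fail together will otherwise retry together.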
Batch Discounts and Fine-Tuning
Batch APIs trade latency for cost. Fireworks bills batch at 50% of serverless. Groq's batch rate is also 50% lower, with a completion window of 24 hours to 7 days. Together offers up to 50% off on selected models, with best-effort completion within 24 hours and limits of 50,000 requests and 30B enqueued tokens per batch.
Fireworks has the most granular fine-tuning pricing, per 1M training tokens: up to 16B is $0.50 LoRA SFT / $1.00 Full SFT; 16.1B-80B is $3.00/$6.00; 80B-300B is $6.00/$12.00; above 300B is $10.00/$20.00. DPO is 2x the SFT rate at each size. Together's LoRA SFT runs $0.48 up to 16B, $1.50 for 17B-69B, and $2.90 for 70B-100B.
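The Fireworks tiers above can be read as a lookup table. The sketch below encodes them and prices a run; `fineTuneCostUSD` is a hypothetical helper, and placing a model of exactly 80B in the second tier is an assumption, since the published ranges overlap at that boundary.

```typescript
type Method = "lora" | "full";

interface Tier {
  maxParamsB: number; // upper bound of the tier, in billions of parameters
  lora: number;       // USD per 1M training tokens, LoRA SFT
  full: number;       // USD per 1M training tokens, Full SFT
}

// Fireworks SFT rates from the paragraph above.
const FIREWORKS_SFT_TIERS: Tier[] = [
  { maxParamsB: 16, lora: 0.5, full: 1.0 },
  { maxParamsB: 80, lora: 3.0, full: 6.0 },
  { maxParamsB: 300, lora: 6.0, full: 12.0 },
  { maxParamsB: Infinity, lora: 10.0, full: 20.0 },
];

function fineTuneCostUSD(
  paramsB: number,
  trainingTokens: number,
  method: Method,
  dpo = false,
): number {
  const tier = FIREWORKS_SFT_TIERS.find((t) => paramsB <= t.maxParamsB);
  if (!tier) throw new RangeError("paramsB must be a number");
  const sftRate = tier[method];
  const rate = dpo ? sftRate * 2 : sftRate; // DPO is 2x the SFT rate
  return (trainingTokens / 1_000_000) * rate;
}

// Assumed run: 70B model, 50M training tokens.
console.log(fineTuneCostUSD(70, 50_000_000, "lora"));       // 150
console.log(fineTuneCostUSD(70, 50_000_000, "lora", true)); // 300
```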
Compliance: SOC 2, HIPAA, Zero Retention
SOC 2 Type II covers Baseten, Fireworks, Groq, Together, Modal (from the Starter tier), and DeepInfra (which adds ISO 27001). HIPAA is supported by Baseten, Fireworks, and Modal (Enterprise); Groq supports a BAA but excludes preview/beta features and its compound AI systems. For data retention, Fireworks does not log prompts or generations on open models without opt-in, DeepInfra deletes prompts and completions after a short window, and OpenRouter logs nothing by default (opting into logging earns a 1% discount).
Which Alternative to Pick
Cheapest per token: DeepInfra
Lowest published input rate on Kimi K2.6 ($0.75/$3.50), DeepSeek V4 Pro ($1.30/$2.60), and GLM 5.1 ($1.05/$3.50); H100 dedicated at $1.79/hr. 200 concurrent requests, no free tier.
Fastest tokens: Groq
500 tok/s on GPT-OSS 120B, 1,000 on GPT-OSS 20B, 840 on Llama 3.1 8B, at Fireworks-class prices. Free tier with low daily caps.
No markup, widest menu: OpenRouter
Pass-through pricing, 5.5% credit fee, 400+ models behind one key. First 1M BYOK requests/month free.
Cheapest scale-to-zero GPUs + fast cold start: Modal
Per-second billing, ~$3.95/hr H100, ~1s container boot, $30/month free credits on the Starter tier.
Morph for Coding Agents
The providers above are general-purpose: they serve the same open models you would run anywhere. Morph is built for the code-heavy, bursty traffic of a coding agent. It runs an OpenAI-compatible endpoint with custom GPU kernels and speculative decoding tuned to the token distribution of code generation, where output structure is more predictable than free-form prose.
For DeepSeek specifically, Morph is the place to run it when output fidelity matters. Many serverless providers quantize activations to fp8 to cut cost, which can degrade output quality. Morph keeps full 16-bit (bf16) activations, so output matches the reference weights. morph-dsv4flash (DeepSeek V4 Flash) is $0.139/M input and $0.278/M output. See the full lineup on models and pricing.
Two adjacent products handle the parts of an agent loop that raw inference does not. Fast Apply (morph-v3-fast) applies model-generated edits to a file at ~10,500 tok/s, so you don't pay a frontier model to re-emit an entire file. WarpGrep is semantic code search, free up to 100k requests then $1 per 1M requests. If your workload is mostly chat or steady low-concurrency prose, the general providers above are the right call.
Drop-In Migration
Morph exposes an OpenAI-compatible endpoint at https://api.morphllm.com/v1. If you already call Fireworks through an OpenAI-style client, point the base URL at Morph and change the model name. No SDK rewrite, no new request format.
Migrating from Fireworks to Morph

```typescript
import OpenAI from "openai";

// Before: Fireworks
// const client = new OpenAI({
//   baseURL: "https://api.fireworks.ai/inference/v1",
//   apiKey: process.env.FIREWORKS_API_KEY,
// });

// After: Morph (same OpenAI client, one base URL + model string)
const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY,
});

const res = await client.chat.completions.create({
  model: "morph-qwen35-397b",
  messages: [{ role: "user", content: "Refactor this function..." }],
});
```

Frequently Asked Questions
What is Fireworks AI's rate limit?
Without a payment method, Fireworks caps you at 10 requests per minute. A card raises the ceiling to a fixed 6,000 RPM. Spending tiers gate the monthly budget separately: Tier 1 $50/mo on a valid card, Tier 2 $500/mo after $50 spent or added, Tier 3 $5,000/mo, Tier 4 $50,000/mo.
How much does Fireworks AI cost per token?
Serverless: Kimi K2.6 is $0.95/M input ($0.16 cached) and $4.00/M output; DeepSeek V4 Pro is $1.74/M input ($0.145 cached) and $3.48/M output; GLM 5.1 is $1.40/$4.40; GPT-OSS 120B is $0.15/$0.60. Fireworks lists a 50% discount on cached input, although the cached rates quoted here work out to a deeper cut, and batch is 50% of serverless.
What is the cheapest Fireworks AI alternative?
DeepInfra is the per-token floor on most open models: Kimi K2.6 at $0.75/$3.50 versus Fireworks $0.95/$4.00, DeepSeek V4 Pro at $1.30/$2.60 versus $1.74/$3.48. For dedicated GPUs, DeepInfra lists H100 at $1.79/hr against Fireworks $7.00/hr. OpenRouter charges no markup, only a 5.5% credit-purchase fee.
What is the fastest Fireworks AI alternative?
Groq publishes the highest token rates on its hardware: 500 tok/s on GPT-OSS 120B, 1,000 on GPT-OSS 20B, 840 on Llama 3.1 8B Instant, at the same $0.15/$0.60 class price as Fireworks on GPT-OSS 120B. Cerebras claims up to 15x faster than NVIDIA GPUs but does not publish per-token rates.
Is Fireworks AI SOC 2 and HIPAA compliant?
Yes. Fireworks is SOC 2 Type II and HIPAA compliant, with zero data retention on open models unless you opt in, TLS 1.2+ in transit, and AES-256 at rest. Baseten, Groq, Together, Modal, and DeepInfra are also SOC 2 Type II; DeepInfra adds ISO 27001.
Is Morph a drop-in replacement for Fireworks AI?
Yes for OpenAI-compatible workloads. Point your client at https://api.morphllm.com/v1 and change the model string. If you call Fireworks through an OpenAI-style SDK today, you keep the same request format.
Built a Coding Agent? Try a Codegen-Tuned Endpoint
Morph runs an OpenAI-compatible API tuned to code generation, plus Fast Apply at ~10,500 tok/s for edits and WarpGrep for code search (free to 100k requests). Migrating from Fireworks means changing the base URL and the model string.
