Fireworks AI vs Modal (2026): $0.15/M Tokens vs $3.95/hr H100, and Why You Probably Need Both

Both get filed under "serverless GPU," but they sit at different layers. Fireworks is a per-token model API: you call a catalog model by name and pay per token. Modal is per-second serverless GPU compute: you wrap a Python function in a decorator, bring your own model and engine (vLLM, SGLang, TRT-LLM), and pay for the GPU. Modal also runs non-LLM jobs and sandboxes for arbitrary code. Fireworks has no concept of running your own code.

That layer difference decides almost everything. On Fireworks, GPT-OSS 120B is $0.15 in / $0.60 out per million and DeepSeek V4 Pro is $1.74 / $3.48, with nothing to manage and no idle cost. On Modal, an H100 is $0.001097/sec (about $3.95/hr) and a B200 is $0.001736/sec (about $6.25/hr), and you decide what runs on it. Many teams run both: Modal for custom jobs and off-catalog models, Fireworks for stock LLM calls.

All numbers are from each provider's pricing page as of 2026-06-09. Both change pricing often, so verify before you commit.

TL;DR

Pick Fireworks AI if you want a catalog open model behind a per-token API with zero ops. GPT-OSS 120B at $0.15/$0.60, DeepSeek V4 Pro at $1.74/$3.48, Kimi K2.6 at $0.95/$4.00 per M. OpenAI-compatible, cached input at 50%, batch at 50%, fine-tuning from $0.50/M training tokens, SOC 2 Type II plus HIPAA, zero data retention on open models.
Pick Modal if you need to run your own model or arbitrary code: bring your own engine (vLLM, SGLang, TRT-LLM) and weights, per-second GPU billing (H100 ~$3.95/hr, B200 ~$6.25/hr), scale-to-zero with ~1s container boot, memory snapshotting, and sandboxes for untrusted code. SOC 2 from the Starter tier, HIPAA on Enterprise.

Who Wins Per Workload

These products live at different layers, so the right pick is set by the workload, not by a single winner. Each row is a real decision a team faces.

Who Wins Per Workload

Workload / decision	Fireworks AI	Modal
Just call a stock open model	Fireworks (catalog endpoint)	Overkill, you build it
Run your own model / engine	No (catalog only)	Modal (BYO vLLM/SGLang)
Non-LLM compute (image/audio/ETL)	No	Modal (any Python)
Bursty / low volume	Fireworks (pay per token)	Cold-start risk
Sustained high volume	Per-token markup adds up	Modal (saturate the GPU)
Fastest first call	Fireworks (warm catalog)	Cold start unless snapshot
Fine-tune then serve open model	Fireworks (managed FT + serve)	Modal (your own pipeline)
Untrusted / agent-generated code	No sandbox	Modal (isolated sandboxes)
Strictest compliance (HIPAA)	SOC 2 II + HIPAA serverless	HIPAA on Enterprise only

Quick Comparison

Fireworks sells tokens, Modal sells GPU seconds. Morph hosts open models too, but tuned for output fidelity and codegen: it serves DeepSeek at 16-bit (bf16) activations instead of the fp8 most serverless providers quantize to, and runs codegen-specific speculative decoding plus custom inference kernels.

Fireworks AI vs Modal at a Glance

Spec	Fireworks AI	Modal	Morph
What it is	Per-token inference API	Per-second serverless GPU	Per-token open models + code-apply
Billing model	Per token	Per GPU second	Per request / per token
Run custom code / your own model	No (catalog models)	Yes (any Python, BYO engine)	No (hosted open models)
Reference price	DeepSeek V4 Pro $1.74/$3.48 per M	H100 ~$3.95/hr, B200 ~$6.25/hr	morph-dsv4flash $0.139/$0.278 per M
DeepSeek activation precision	fp8 (typical serverless)	Whatever you configure	16-bit bf16 (no fp8 activations)
Fine-tuning	LoRA + full (SFT/DPO), from $0.50/M	Yes (your own pipeline)	N/A
Cold start	None (warm catalog)	~1s container boot, scale-to-zero	None (warm fleet)
Sandboxes for untrusted code	No	Yes (isolated Linux)	N/A
Code-specific apply	No	No	Yes (/v1/code/apply)
Best for	Stock open models, zero ops	Custom models, sandboxes, batch	DeepSeek fidelity + codegen

What Each One Actually Is

The single most important thing to understand: these are not the same kind of product. One hands you a model, the other hands you a GPU.

Fireworks AI: a managed inference API

Fireworks hosts a catalog of open-weight models (Llama, Qwen, DeepSeek, Kimi, and more) and exposes them through an OpenAI-compatible endpoint. You send a model name and a prompt, you get tokens back, you pay per token. You do not pick the GPU, you do not tune batching, and you cannot run a model that is not in the catalog unless you upload a fine-tune or rent a dedicated deployment.

Modal: serverless GPU compute

Modal hands you a GPU behind a Python decorator. You write a function, annotate it with the GPU type and dependencies, and Modal builds the container, schedules it, autoscales it, and scales it to zero when idle. It runs inference, fine-tuning, batch jobs, and sandboxes. There is no model catalog. You bring the code and the weights.

The practical split

If a model you want is in the Fireworks catalog and your traffic is bursty, Fireworks is less work and you pay nothing when idle. If you need a custom model, a fine-tuned checkpoint, a non-LLM workload, or full control over batching and GPU choice, Modal is the layer that does not box you in.

Pricing: Per Token vs Per Second

The verdict: Fireworks wins on low or bursty volume, Modal wins once you keep a GPU busy. They bill on incompatible units, so the comparison is about your duty cycle.

Fireworks serverless per-token (per model)

Fireworks AI Serverless Pricing, $/M tokens (2026-06-09)

Model	Input	Cached input	Output
GPT-OSS 120B	$0.15	$0.015	$0.60
GPT-OSS 20B	$0.07	-	$0.30
DeepSeek V4 Flash	$0.14	$0.028	$0.28
DeepSeek V4 Pro	$1.74	$0.145	$3.48
Kimi K2.6	$0.95	$0.16	$4.00
Kimi K2.5	$0.60	$0.10	$3.00
GLM 5.1	$1.40	$0.26	$4.40
Qwen 3.6 Plus	$0.50	$0.10	$3.00
MiniMax 2.7	$0.30	-	$1.20

Cached input gets a 50% discount and batch inference is billed at 50% of serverless pricing. Fireworks dedicated on-demand GPUs are also available on its own stack: H100 80GB and H200 141GB at $7.00/hr, B200 180GB at $10.00/hr, B300 288GB at $12.00/hr. Fine-tuning runs $0.50/M training tokens (LoRA SFT, up to 16B) to $20.00/M (Full SFT, over 300B); DPO is double the SFT rate at each size.

Modal serverless per-second GPU

Modal GPU Pricing (2026-06-09)

GPU	Per second	Approx per hour
B200	$0.001736	~$6.25
H200	$0.001261	~$4.54
H100	$0.001097	~$3.95
A100 80GB	$0.000694	~$2.50
L40S	$0.000542	~$1.95
A10	$0.000306	~$1.10
L4	$0.000222	~$0.80
T4	$0.000164	~$0.59

Modal bills per second of active compute with no idle charge, plus CPU at $0.0000131/physical core/sec and memory at $0.00000222/GiB/sec. The Starter plan is $0 with $30/month in free credits (100 containers, 10-GPU concurrency). The Team plan is $250/month plus compute, $100/month in credits, 1,000 containers, and 50-GPU concurrency. Enterprise is custom.

Where the break-even sits

Per-token pricing is a markup over GPU cost that buys you zero idle spend. The right pick is set by duty cycle: once you keep a GPU saturated most of the day, Modal per-second wins; below that, Fireworks per-token wins because the GPU would otherwise sit idle and you would pay for that idle time. The worked numbers are in the next section.

Cost on a Real Workload

Computed from list prices, 2026-06-09

Take DeepSeek V4 Pro at 50M output tokens per day. On Fireworks serverless that is the $3.48/M output rate: 50 x $3.48 = $174/day, about $5,220/month, plus input tokens, with zero idle spend and no GPU to manage. (At a lighter model like GPT-OSS 120B output at $0.60/M, the same 50M tokens/day is about $900/month.)

On Modal you rent the GPU, not the tokens. One H100 at $0.001097/sec is about $3.95/hr, so a single H100 kept busy 24/7 is about $2,846/month. Modal only beats the Fireworks bill if that one H100 actually serves your full token volume. For a large model like DeepSeek V4 Pro at $5,220/month on Fireworks, even one or two saturated H100s on Modal come in well under the per-token bill once traffic is steady.

The pattern holds: on an expensive model with high steady volume, owning the GPU-second is cheaper; on a cheap model or bursty traffic, the per-token API is cheaper because you avoid paying for an idle GPU. Modal also adds operational cost (you run vLLM/SGLang, tune batching, manage scaling) that the per-token API hides. Redo the arithmetic with your own model, token volume, and measured tok/s.

Rate Limits, Free Credits, Compliance

Limits and Trust (2026-06-09)

	Fireworks AI	Modal
Free to start	$1 in credits	$30/month in credits (Starter)
Rate limit	10 RPM no card, 6,000 RPM with card	10 GPU concurrency (Starter), 50 (Team)
Monthly spend tiers	$50 / $500 / $5,000 / $50,000	Pay-as-you-go, no token markup
SOC 2 Type II	Yes	From Starter tier
HIPAA	Yes	Enterprise only
Data retention	Zero retention on open models (opt-in to log)	Standard platform logging
Encryption	TLS 1.2+ in transit, AES-256 at rest	Platform-managed

Fireworks gates monthly budget by spending tier: Tier 1 is $50/month with a valid card, Tier 2 is $500/month after $50 added or spent, then $5,000 and $50,000. It is OpenAI-compatible, SOC 2 Type II and HIPAA compliant, and does not log prompts or generations for open models without explicit opt-in. Modal is pay-as-you-go with no per-token markup, SOC 2 from the Starter tier, with HIPAA compatibility and audit logs on Enterprise.

Cold Starts & Autoscaling

The verdict: Fireworks serverless has no cold start you manage, Modal has one but boots fast and gives you knobs to eliminate it.

On Fireworks serverless, catalog models are already loaded on shared infrastructure, so there is no per-request warmup you control. The tradeoff is that you do not choose the GPU and cannot guarantee tail latency for tight P99 SLOs, because you share capacity with other tenants.

On Modal, scale-to-zero means your container can be cold when a request lands. Modal boots containers in about 1 second on its custom container stack, and functions scale to zero after a default 60-second idle window (scaledown_window, configurable from 2 seconds to 20 minutes). For large models, memory snapshotting captures a container's memory at user-controlled points and reuses it across boots to cut the cold-start penalty. To remove cold starts entirely, set min_containers to keep warm capacity, or use buffer_containers to over-provision during spikes.

~1s

Modal container boot time

60s

Default scale-to-zero idle window

None

Fireworks serverless cold start (warm catalog)

Both autoscale on request volume and scale to zero. Modal gives you the knobs (GPU type, concurrency, batching, warm containers) and the responsibility. Fireworks hides them and handles it for you on shared hardware.

Fireworks Niche Features

The verdict: Fireworks is a managed, OpenAI-compatible API tuned for open models, with the billing discounts and fine-tuning paths that production teams want.

Cached input at 50%. Cached input tokens get a 50% discount. On GPT-OSS 120B that drops input from $0.15 to $0.015 per million; on DeepSeek V4 Pro from $1.74 to $0.145.
Batch inference at 50%. Batch jobs are billed at 50% of serverless pricing, halving cost for throughput-bound work that does not need real-time responses.
Fine-tuning. LoRA and full-parameter, SFT and DPO, per million training tokens: $0.50 (LoRA SFT, up to 16B) up to $20.00 (Full SFT, over 300B), DPO double the SFT rate at each size. Fine-tunes deploy behind the same serverless API.
Dedicated GPUs. On-demand H100 80GB and H200 141GB at $7.00/hr, B200 180GB at $10.00/hr, B300 288GB at $12.00/hr for isolated capacity.
Embeddings. $0.008/M input tokens for models up to 150M params; Qwen3 8B embeddings at $0.10/M.
Compliance. SOC 2 Type II and HIPAA, zero data retention on open models without opt-in, TLS 1.2+ in transit and AES-256 at rest.

When to Use Fireworks AI

Catalog open model, zero ops. If GPT-OSS, Qwen, DeepSeek, Kimi, or GLM is in the catalog and you just want an endpoint, Fireworks is a one-line OpenAI-compatible swap.
Bursty or low traffic. Per-token billing means you pay nothing when idle. No GPU to keep warm, no cold start to manage.
Cache-heavy or batch workloads. Cached input at 50% and batch at 50% of serverless cut cost on repeated prompts and throughput-bound jobs.
Fine-tune then serve. Train a LoRA or full fine-tune (from $0.50/M training tokens) and deploy it behind the same serverless API without managing infrastructure.
Regulated workloads. SOC 2 Type II and HIPAA, zero data retention on open models, with dedicated GPUs when you need data isolation.

Frequently Asked Questions

What is the difference between Fireworks AI and Modal?

Fireworks AI is a per-token model API. You call a catalog model by name and pay per token: GPT-OSS 120B at $0.15 in / $0.60 out per million, DeepSeek V4 Pro at $1.74 / $3.48, Kimi K2.6 at $0.95 / $4.00. Modal is per-second serverless GPU compute. You write a Python function, wrap it in a decorator, and pay for the GPU: H100 at $0.001097/sec (about $3.95/hr), B200 at $0.001736/sec (about $6.25/hr). Fireworks is faster to start if your model is in its catalog. Modal runs any model, any code, and any fine-tuned checkpoint.

Is Fireworks AI or Modal cheaper?

It depends on duty cycle. For bursty or low traffic on a catalog model, Fireworks per-token pricing wins because you pay nothing when idle and keep no GPU warm. For sustained volume that keeps a GPU busy most of the day, Modal per-second billing wins because you saturate the hardware. Worked example: DeepSeek V4 Pro at 50M output tokens/day on Fireworks ($3.48/M output) is about $5,220/month, while one Modal H100 kept busy 24/7 is about $2,846/month. On an expensive model at steady volume, Modal undercuts the per-token bill; on a cheap model or bursty traffic, Fireworks wins.

How fast are cold starts on Modal?

Modal boots containers in about 1 second on its custom container stack. Functions scale to zero when idle, with a default 60-second window (scaledown_window, configurable from 2 seconds to 20 minutes). For large models, memory snapshotting captures a container's memory at user-controlled points and reuses it across boots; min_containers keeps warm capacity and buffer_containers over-provisions during spikes. Fireworks serverless has no cold start you manage, since catalog models stay warm on shared infrastructure.

What are Fireworks AI rate limits and free credits?

Without a payment method, Fireworks allows 10 requests per minute. Adding a card raises the ceiling to 6,000 RPM. Monthly spend is gated by tier: $50 with a valid card, $500 after $50 added or spent, then $5,000 and $50,000. New accounts get $1 in free credits. Cached input tokens get a 50% discount and batch inference is billed at 50% of serverless pricing.

Can I use Fireworks AI and Modal together?

Yes, and many teams do. They solve different problems, so they compose cleanly. Use Fireworks as the per-token endpoint for catalog open models like Kimi, DeepSeek, GLM, or GPT-OSS where you just want fast tokens with zero ops. Use Modal for the workloads Fireworks cannot host: a private fine-tuned checkpoint served on your own vLLM or SGLang engine, non-LLM jobs like image or audio generation, batch ETL, training pipelines, and sandboxes for untrusted code. A common pattern is Modal for custom or off-catalog work and Fireworks for everything that is a plain stock LLM call.

Related Comparisons

Call a Model on Fireworks, Run Your Own on Modal

Pick the layer that matches the workload. If applying model-generated code edits is your bottleneck, that is a separate, code-specific tool.

Try Morph Free

Fast Apply benchmarks

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Fireworks AI vs Modal (2026): $0.15/M Tokens vs $3.95/hr H100, and Why You Probably Need Both

Who Wins Per Workload

Quick Comparison

What Each One Actually Is

Fireworks AI: a managed inference API

Modal: serverless GPU compute

Pricing: Per Token vs Per Second

Fireworks serverless per-token (per model)

Modal serverless per-second GPU

Cost on a Real Workload

Rate Limits, Free Credits, Compliance

Cold Starts & Autoscaling

Fireworks Niche Features

When to Use Fireworks AI

Frequently Asked Questions

What is the difference between Fireworks AI and Modal?

Is Fireworks AI or Modal cheaper?

How fast are cold starts on Modal?

What are Fireworks AI rate limits and free credits?

Can I use Fireworks AI and Modal together?

Related Comparisons

Call a Model on Fireworks, Run Your Own on Modal

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Fireworks AI vs Modal (2026): $0.15/M Tokens vs $3.95/hr H100, and Why You Probably Need Both

Who Wins Per Workload

Quick Comparison

What Each One Actually Is

Fireworks AI: a managed inference API

Modal: serverless GPU compute

Pricing: Per Token vs Per Second

Fireworks serverless per-token (per model)

Modal serverless per-second GPU

Cost on a Real Workload

Rate Limits, Free Credits, Compliance

Cold Starts & Autoscaling

Fireworks Niche Features

Modal Niche Features

When to Use Fireworks AI

When to Use Modal

Frequently Asked Questions

What is the difference between Fireworks AI and Modal?

Is Fireworks AI or Modal cheaper?

How fast are cold starts on Modal?

What are Fireworks AI rate limits and free credits?

Can I use Fireworks AI and Modal together?

Related Comparisons

Call a Model on Fireworks, Run Your Own on Modal