Fireworks AI vs Modal: Call a Model vs Run Your Own GPU

These are different layers, not rivals. Fireworks is a per-token model API you call at $0.90/M for 70B-class. Modal is per-second serverless GPU where you bring your own model and engine. Pick by whether you call a model or run one.

June 3, 2026 · 1 min read

This is not a head-to-head, it is a layers comparison. Pick Fireworks to just call a fast stock model. Pick Modal when you need to run your own model, a custom workload, or non-LLM compute. Fireworks is a finished per-token inference API: you call a model by name and pay per token. Modal is serverless GPU compute: you wrap a Python function in a decorator, bring your own model and engine (vLLM, SGLang, TRT-LLM), and pay per second of GPU time. You also get sandboxes for arbitrary code on Modal.

That layer difference decides almost everything. Fireworks at $0.90 per million tokens for a 70B-class model is the simplest path if your model is already in their catalog. Modal at roughly $3.95/hr for an H100 with sub-second scale-up is the path when you need custom code, a fine-tuned checkpoint, or a model nobody hosts. Many teams run both: Modal for custom jobs, Fireworks for stock LLM calls.

All numbers are as of early 2026, and both providers change pricing often, so verify against their pages before you commit.

TL;DR

  • Pick Fireworks AI if you want a public open model behind a per-token API with zero ops. $0.90/M for 70B-class, $0.10/M under 4B, OpenAI-compatible, with FireAttention kernels, speculative decoding, fine-tuning, and SOC 2 Type II plus HIPAA.
  • Pick Modal if you need to run your own model or arbitrary code: bring your own engine (vLLM, SGLang, TRT-LLM) and weights, per-second GPU billing (about $3.95/hr H100), scale-to-zero, GPU memory snapshots that cut cold starts roughly six-fold, and sandboxes for untrusted code.

Who Wins Per Workload

These products live at different layers, so the right pick is set by the workload, not by a single winner. Each row is a real decision a team faces.

Workload / decisionFireworks AIModal
Just call a stock open modelFireworks (catalog endpoint)Overkill, you build it
Run your own model / engineNo (catalog only)Modal (BYO vLLM/SGLang)
Non-LLM compute (image/audio/ETL)NoModal (any Python)
Bursty / low volumeFireworks (pay per token)Cold-start risk
Sustained high volumePer-token markup adds upModal (saturate the GPU)
Fastest first callFireworks (warm catalog)Cold start unless snapshot
Fine-tune then serve open modelFireworks (managed FT + serve)Modal (your own pipeline)
Untrusted / agent-generated codeNo sandboxModal (bVisor sandboxes)
Strictest compliance (HIPAA)SOC 2 II + HIPAA serverlessHIPAA on Enterprise only

Quick Comparison

Fireworks sells tokens, Modal sells GPU seconds. Morph is in the third column only as a code-specific apply-and-search layer, not as a general host, so most of its cells read N/A.

SpecFireworks AIModalMorph
What it isPer-token inference APIPer-second serverless GPUCode-apply layer (not a host)
Billing modelPer tokenPer GPU secondPer request / per token
Run custom code / your own modelNo (catalog models)Yes (any Python, BYO engine)N/A, not a general host
70B-class price$0.90/M tokens~$3.95/hr H100N/A, not a general host
Fine-tuningLoRA + full (SFT/DPO)Yes (your own pipeline)N/A
Cold startNone (warm catalog)Sub-second to ~12s w/ snapshotN/A
Sandboxes for untrusted codeNoYes (bVisor)N/A
Code-specific applyNoNoYes (/v1/code/apply)
Best forStock open models, zero opsCustom models, sandboxes, batchApplying model-generated edits

What Each One Actually Is

The single most important thing to understand: these are not the same kind of product. One hands you a model, the other hands you a GPU.

Fireworks AI: a managed inference API

Fireworks hosts a catalog of open-weight models (Llama, Qwen, DeepSeek, Kimi, and more) and exposes them through an OpenAI-compatible endpoint. You send a model name and a prompt, you get tokens back, you pay per token. You do not pick the GPU, you do not tune batching, and you cannot run a model that is not in the catalog unless you upload a fine-tune or rent a dedicated deployment.

Modal: serverless GPU compute

Modal hands you a GPU behind a Python decorator. You write a function, annotate it with the GPU type and dependencies, and Modal builds the container, schedules it, autoscales it, and scales it to zero when idle. It runs inference, fine-tuning, batch jobs, and sandboxes. There is no model catalog. You bring the code and the weights.

The practical split

If a model you want is in the Fireworks catalog and your traffic is bursty, Fireworks is less work and you pay nothing when idle. If you need a custom model, a fine-tuned checkpoint, a non-LLM workload, or full control over batching and GPU choice, Modal is the layer that does not box you in.

Pricing: Per Token vs Per Second

The verdict: Fireworks wins on low or bursty volume, Modal wins once you keep a GPU busy. They bill on incompatible units, so the comparison is about your duty cycle.

Fireworks serverless per-token (size-tiered)

Model tierPrice per 1M tokens
Under 4B params$0.10
4B to 16B params$0.20
Over 16B params (Llama 70B class)$0.90
MoE up to 56B$0.50
MoE 56.1B to 176B$1.20
Batch inference50% of serverless
Cached input50% of input price

These size tiers apply to most catalog models. Featured models get their own rates, for example DeepSeek V4 Flash at $0.14 in / $0.28 out per million, and DeepSeek V4 Pro at $1.74 in / $3.48 out. Fireworks fine-tuning runs about $0.50 to $40 per million training tokens depending on model size and method (LoRA vs full, SFT vs DPO).

Modal serverless per-second GPU

GPUPer secondApprox per hour
B200$0.001736~$6.25
H200$0.001261~$4.54
H100$0.001097~$3.95
A100 80GB$0.000694~$2.50
A100 40GB$0.000583~$2.10
L40S$0.000542~$1.95
L4$0.000222~$0.80
T4$0.000164~$0.59

Modal bills per second of active compute with no idle charge. The Starter plan is $0 with $30/month in free credits and 10-GPU concurrency. The Team plan is $250/month plus compute, $100/month in credits, and 50-GPU concurrency. Fireworks dedicated on-demand GPUs are also available (H100/H200 around $7/hr, B200 around $10/hr) if you want isolated capacity on the Fireworks stack.

Where the break-even sits

Per-token pricing is a markup over GPU cost that buys you zero idle spend. Cheaper is set by duty cycle: above roughly 40 to 50% sustained H100 utilization Modal per-second wins, below it Fireworks per-token wins because the GPU would otherwise sit idle. The worked numbers are in the next section.

Cost on a Real Workload

Computed from list prices, early 2026

Take a 70B-class open model at 50M output tokens per day. On Fireworks serverless that is the $0.90/M tier: 50 x $0.90 = $45/day, about $1,350/month, with zero idle spend and no GPU to manage.

On Modal you rent the GPU, not the tokens. One H100 at $0.001097/sec is about $3.95/hr, so a single H100 kept busy 24/7 is about $2,846/month. Modal only beats the Fireworks bill if that one H100 actually serves all 50M tokens/day, which is roughly 580 output tok/s sustained. If your serving stack clears that on one H100, Modal at ~$2,846/mo undercuts Fireworks at ~$1,350/mo only once volume scales past the point where Fireworks would charge more than the GPU costs to keep busy.

Concretely: at 50M tokens/day Fireworks (~$1,350/mo) is cheaper than a dedicated H100 (~$2,846/mo). Modal pulls ahead once daily volume roughly doubles to where the per-token bill exceeds the fixed GPU cost, i.e. above about 105M output tokens/day on a single saturated H100. Below that, the per-token API is cheaper; above it, owning the GPU-second is cheaper. Redo the arithmetic with your own token volume and measured tok/s.

Cold Starts & Autoscaling

The verdict: Fireworks serverless has no cold start you manage, Modal has one but has spent serious engineering shrinking it.

On Fireworks serverless, catalog models are already loaded on shared infrastructure, so there is no per-request warmup you control. The tradeoff is that you do not choose the GPU and cannot guarantee tail latency for tight P99 SLOs, because you share capacity with other tenants.

On Modal, scale-to-zero means your container can be cold when a request lands. Modal quotes sub-second cold starts for CPU and a few seconds for typical GPU models. For large models, its GPU memory snapshotting (alpha) captures model weights in VRAM, CUDA kernels, and execution context, cutting cold starts roughly six-fold. On a Ministral 3 3B test, median cold start dropped from about 118s to about 12s.

~12s
Modal large-model cold start w/ GPU snapshot
6x
Cold-start reduction from snapshotting
None
Fireworks serverless cold start (warm catalog)

Both autoscale on request volume and scale to zero. Modal gives you the knobs (GPU type, concurrency, batching) and the responsibility. Fireworks hides them and handles it for you on shared hardware.

Fireworks Niche Features

The verdict: Fireworks is the speed-and-quality optimizer for open models, built by ex-PyTorch engineers around custom kernels.

  • FireAttention. Custom CUDA attention kernels. FireAttention V1 claimed 4x faster serving than vLLM via quantization with minimal quality loss. V2 extended that to long context with up to 12x faster long-context inference.
  • FireOptimizer + adaptive speculative decoding. Profile-driven serving that tunes the draft model to your workload, claiming 2x to 3x generation speedups without quality loss, searching over 100,000 serving configurations.
  • Long context throughput. Reported 167 to 174 tok/s on DeepSeek V4 Pro at full 1M token context.
  • Production features. Streaming, function calling, structured JSON output, prompt caching (cached input at 50%), predicted outputs for edit/rewrite workloads, and batch inference at 50% of serverless price.
  • Fine-tuning. LoRA and full-parameter, SFT and DPO, priced per million training tokens, with fine-tunes deployable behind the same serverless API.
  • Compliance. SOC 2 Type II and HIPAA, plus GDPR and ISO, with dedicated GPU deployments for data isolation.

When to Use Fireworks AI

  • Public open model, zero ops. If Llama, Qwen, DeepSeek, or Kimi is in the catalog and you just want an endpoint, Fireworks is a one-line OpenAI-compatible swap.
  • Bursty or low traffic. Per-token billing means you pay nothing when idle. No GPU to keep warm, no cold start to manage.
  • Speed-sensitive open-model serving. FireAttention kernels and adaptive speculative decoding push throughput past a stock vLLM deployment.
  • Fine-tune then serve. Train a LoRA or full fine-tune and deploy it behind the same serverless API without managing infrastructure.
  • Regulated workloads. SOC 2 Type II and HIPAA, with dedicated deployments when you need data isolation.

When to Use Modal

  • A model nobody hosts. Custom architecture, a private checkpoint, or a niche open model that is not in any catalog. Modal runs whatever you can put in a container.
  • Not just LLMs. Image, audio, and video generation, embeddings, batch ETL, training pipelines. Modal is general GPU compute, not a model API.
  • High, steady volume. When you keep a GPU busy, per-second billing beats per-token markup.
  • Untrusted code execution. Sandboxes with bVisor isolation for running agent-generated or user-submitted code safely.
  • Full control. You choose the GPU, the batching, and the concurrency, which matters for tight P99 latency targets.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Frequently Asked Questions

What is the difference between Fireworks AI and Modal?

Fireworks AI is a managed inference API. You call open models like Llama 3.1 70B by name and pay per token, around $0.90 per million for 70B-class models. Modal is serverless GPU infrastructure. You write a Python function, wrap it in a decorator, and pay per second of GPU time, around $3.95/hr for an H100. Fireworks is faster to start if your model is in its catalog. Modal lets you run any code, any model, and any fine-tuned checkpoint.

Is Fireworks AI or Modal cheaper?

Cheaper depends on duty cycle, not a flat answer. For low or bursty traffic on a public open model, Fireworks per-token pricing wins because you pay nothing when idle. For high, steady volume that keeps a GPU busy most of the day, Modal per-second GPU billing is cheaper because you saturate the hardware and avoid per-token markup. At a 70B-class model and 50M output tokens/day, Fireworks runs about $1,350/month versus about $2,846/month for a dedicated H100, so Fireworks wins until volume roughly doubles past where the per-token bill exceeds the fixed GPU cost.

How fast are cold starts on Modal?

Modal advertises sub-second container cold starts for CPU and a few seconds for typical GPU models. For large models, its GPU memory snapshotting (alpha) captures model weights in VRAM, CUDA kernels, and execution context, cutting cold starts roughly six-fold. On a Ministral 3 3B test, median cold start dropped from about 118s to about 12s. Fireworks serverless has no cold start you manage, since catalog models stay warm on shared infrastructure.

Does Fireworks AI support fine-tuning and structured output?

Yes. Fireworks offers LoRA and full-parameter fine-tuning (SFT and DPO) priced per million training tokens, from about $0.50 to $40 depending on model size. Inference supports streaming, function calling, structured JSON output, prompt caching, speculative decoding, and batch inference at 50% of serverless pricing. It is OpenAI-compatible, so it drops into existing OpenAI client code.

Can I use Fireworks AI and Modal together?

Yes, and many teams do. They solve different problems, so they compose cleanly. Use Fireworks as the per-token endpoint for stock open models like Llama or Qwen where you just want fast tokens with zero ops. Use Modal for the workloads Fireworks cannot host: a private fine-tuned checkpoint served on your own vLLM or SGLang engine, non-LLM jobs like image or audio generation, batch ETL, training pipelines, and sandboxes for untrusted code. A common pattern is Modal for custom or off-catalog work and Fireworks for everything that is a plain stock LLM call.

Related Comparisons

Call a Model on Fireworks, Run Your Own on Modal

Pick the layer that matches the workload. If applying model-generated code edits is your bottleneck, that is a separate, code-specific tool.