Fireworks AI vs Groq (2026): Groq Runs Llama 70B at 394 tok/s for $0.59/$0.79, Fireworks Hosts 100+ Models You Can Fine-Tune

You need open-model inference behind an OpenAI-compatible API, and Fireworks AI and Groq both sell it on opposite hardware. Fireworks runs Nvidia H100, H200, B200, and B300 GPUs and gives you flexibility: 100+ open models, custom weights, LoRA and reinforcement fine-tuning, structured output, and dedicated deployments. Groq built an HBM-free deterministic LPU with 230MB of on-chip SRAM that serves a fixed menu of models at a flat, very fast latency floor with no cold start, but no custom weights.

The choice is sharp. Pick Groq for the lowest, most predictable latency on a model already on its menu. Pick Fireworks when you need model choice, fine-tuning, or your own weights. Groq runs Llama 3.3 70B at 394 tok/s for $0.59 input and $0.79 output per million tokens. Fireworks does not host Llama 3.3 70B on serverless, but it serves DeepSeek V4, GLM 5.1, Qwen 3.6, Kimi K2.6, and MiniMax 2.7 with named per-model rates and dedicated GPUs from $7/hr.

All prices are verified as of June 9, 2026 and move often, so confirm against each provider's pricing page before you commit.

TL;DR

Pick Groq if you need the lowest, most predictable latency on a curated model set with no cold start. 394 tok/s on Llama 3.3 70B, 840 tok/s on Llama 3.1 8B, 1,000 tok/s on GPT-OSS 20B, deterministic LPU execution, and $0.59/$0.79 per million tokens on the 70B. The catch: a fixed menu and no custom weights.
Pick Fireworks AI if you need breadth and customization. 100+ open models, your own weights, LoRA SFT from $0.50 per million training tokens, structured output, dedicated H100/H200 at $7/hr, plus SOC 2 Type II, HIPAA, and a Zero Data Retention policy. The catch: GPU batching means latency varies more than an LPU's flat floor.

Who Wins Per Workload

The hardware split decides most of these. Groq owns the latency floor and stock-model simplicity; Fireworks owns anything that needs flexibility.

Who wins per workload

Workload / decision	Fireworks AI	Groq
Lowest latency floor	GPU batching varies	Groq: deterministic LPU
Fastest first call (no cold start)	Autoscale, minor cold start	Groq: always-warm menu
Cheapest Llama 3.3 70B	Not on serverless	Groq: $0.59/$0.79
Run your own / custom weights	Fireworks: any open model	Not supported
Fine-tune & deploy adapters	Fireworks: LoRA SFT from $0.50/1M	Not offered
Widest model catalog	Fireworks: 100+ models	Curated (~20)
Multimodal (vision/audio/image)	Fireworks: full stack	Whisper speech only
Embeddings	Fireworks: from $0.008/1M	Not offered
Sustained high volume	Fireworks: dedicated $7/hr H100	Enterprise GroqCloud
Regulated / needs a BAA	Fireworks: HIPAA + SOC 2 + ZDR	SOC 2, BAA with exclusions

Quick Comparison

Groq wins on raw speed and headline price for the models it hosts. Fireworks wins on model count, fine-tuning, and enterprise deployment options.

Fireworks AI vs Groq vs Morph at a Glance

Spec	Fireworks AI	Groq	Morph
Focus	Broad open-model serving + tuning	Fastest serving on LPU	Coding-agent inner loop
Hardware	Nvidia H100/H200/B200/B300	Custom LPU (no HBM)	Code-tuned GPU kernels
Open Models Hosted	100+	Curated (~20)	Apply/search/compact models
Llama 3.3 70B Speed	Not on serverless	394 tok/s	n/a (code models)
Llama 3.3 70B Price (/1M)	Not on serverless	$0.59 in / $0.79 out	n/a
GPT-OSS 120B Price (/1M)	$0.15 / $0.60	$0.15 / $0.60 @ 500 tok/s	n/a
Fine-Tuning	LoRA SFT from $0.50/1M	Not offered	n/a
Code-Specific Apply	No	No	Yes (/v1/code/apply)
Semantic Code Search	No	No	WarpGrep ($0/100k)
Apply Throughput	General serving	General serving	~10,500 tok/s
First-Pass Apply Accuracy	n/a	n/a	98%
Dedicated GPUs	From $7/hr (H100/H200)	Enterprise / GroqCloud	Managed fleet
Compliance	SOC 2 II + HIPAA + ZDR	SOC 2 II, BAA (exclusions)	Enterprise options

Hardware: LPU vs GPU

The whole comparison flows from one decision: Groq replaced the GPU, Fireworks optimized it.

Groq's Language Processing Unit is a deterministic dataflow chip. Each clock cycle runs the same operations in the same order, with no cache hierarchy, no prefetch logic, and no speculative execution. It uses about 230MB of on-chip SRAM instead of HBM, which removes the memory-bandwidth wall that caps GPU token generation. The cost of that design: a single LPU cannot hold a 70B model, so Groq links hundreds of chips across racks to serve one large model. That is GroqCloud's job to manage, not yours.

Fireworks runs Nvidia H100, H200, B200, and B300 GPUs and squeezes them with a proprietary inference stack rather than off-the-shelf vLLM or TensorRT-LLM. FireAttention CUDA kernels, speculative decoding, continuous batching, and hardware-specific quantization are how it competes on speed without custom silicon. The trade is that GPU throughput varies with batch size and deployment shape, so latency is not as flat as an LPU's, but the same flexibility lets Fireworks host 100+ open models and let you fine-tune them.

230MB

Groq LPU on-chip SRAM

0 HBM

Groq memory architecture

4 GPUs

Fireworks Nvidia tiers (H100/H200/B200/B300)

The practical difference: Groq's determinism means latency does not vary with batch size up to capacity, which is excellent for predictable real-time apps. Fireworks' GPU flexibility means it can host almost any open model and let you fine-tune it, which an LPU pipeline is not built to do.

Speed: Groq Leads Single-Stream

Groq is the fastest provider for the models it hosts. Independent benchmarks consistently put it at the top of single-stream throughput.

Groq Published Output Speed (tokens/sec)

Model	Groq tok/s	Groq price (in/out per 1M)
GPT-OSS 20B	1,000 tok/s	$0.075 / $0.30
Llama 3.1 8B Instant	840 tok/s	$0.05 / $0.08
Qwen3 32B	662 tok/s	$0.29 / $0.59
Llama 4 Scout (17Bx16E)	594 tok/s	$0.11 / $0.34
GPT-OSS 120B	500 tok/s	$0.15 / $0.60
Llama 3.3 70B Versatile	394 tok/s	$0.59 / $0.79

Groq publishes the per-model speeds above directly on its pricing page, all at 128k or 131k context. Fireworks does not publish a fixed tok/s per model because GPU throughput varies with batch size and deployment shape. For single-stream workloads where one user waits on one stream, Groq's LPU is hard to beat. For a tuned model or one Groq does not host, Fireworks' flexibility matters more than headline tok/s.

For a chatbot or voice agent where latency is the product, Groq's determinism keeps response times flat regardless of batch size up to capacity. For a batch pipeline, a model Groq does not host, or a model you fine-tuned, Fireworks is the better fit.

Pricing: Groq Cheaper Per Token, Fireworks Cheaper at Scale

The two catalogs only overlap on a few models, so most rows are not a head-to-head. The clean shared comparison is GPT-OSS 120B, where both charge $0.15 input / $0.60 output per million tokens and Groq adds a published 500 tok/s. On the curated models Groq hosts, Groq is cheap and fast. For DeepSeek, GLM, Qwen, Kimi, and MiniMax, Fireworks is where you go.

Groq Serverless Token Pricing (per 1M, verified June 2026)

Model	Input	Output	Speed
GPT-OSS 20B	$0.075	$0.30	1,000 tok/s
Llama 3.1 8B Instant	$0.05	$0.08	840 tok/s
Llama 4 Scout	$0.11	$0.34	594 tok/s
Qwen3 32B	$0.29	$0.59	662 tok/s
GPT-OSS 120B	$0.15 ($0.075 cached)	$0.60	500 tok/s
Llama 3.3 70B Versatile	$0.59	$0.79	394 tok/s
Kimi K2 Instruct	$1.00 ($0.50 cached)	$3.00	n/p

Fireworks AI Serverless Token Pricing (per 1M, verified June 2026)

Model	Input	Output
GPT-OSS 20B	$0.07	$0.30
GPT-OSS 120B	$0.15 ($0.015 cached)	$0.60
DeepSeek V4 Flash	$0.14 ($0.028 cached)	$0.28
MiniMax 2.7	$0.30	$1.20
Qwen 3.6 Plus	$0.50 ($0.10 cached)	$3.00
Kimi K2.5	$0.60 ($0.10 cached)	$3.00
Kimi K2.6	$0.95 ($0.16 cached)	$4.00
GLM 5.1	$1.40 ($0.26 cached)	$4.40
DeepSeek V4 Pro	$1.74 ($0.145 cached)	$3.48

Both providers discount cached input and batch jobs. Fireworks bills cached input at a 50% discount and runs batch inference at 50% of serverless pricing. Groq's prompt caching halves input cost on supported models, and its Batch API runs at 50% lower cost on a 24-hour to 7-day window. Fireworks gives new accounts $1 in free credits; Groq has a free tier at console.groq.com.

Fireworks Dedicated GPU Pricing (on-demand, per hour)

GPU	Fireworks	For reference: DeepInfra
H100 80GB	$7.00	$1.79
H200 141GB	$7.00	$2.19
B200 180GB	$10.00	$2.79
B300 288GB	$12.00	$4.20

When dedicated beats per-token

Per-token pricing wins until volume gets high. Fireworks dedicated GPUs convert a per-token bill into a fixed hourly rate with autoscaling and minimal cold starts, which wins for steady high-throughput traffic. Groq sells on-demand tokens with enterprise GroqCloud for committed capacity; it does not expose a public per-GPU hourly menu the way Fireworks does. If your only goal is the cheapest dedicated H100, Fireworks at $7/hr is not the floor: DeepInfra publishes $1.79/hr for the same card. See the full Fireworks vs DeepInfra breakdown.

Rate Limits and Spending Tiers

Limits decide whether you can ship at scale, not just whether the demo works. The two providers gate differently: Fireworks gates monthly budget by spending tier, Groq gates requests-per-minute by plan and model.

Rate Limits and Account Gating

Limit	Fireworks AI	Groq
Without a payment method	10 RPM	Free plan per-model caps
With a payment method	6,000 RPM fixed ceiling	Developer plan, higher limits
Free-tier example	$1 free credits	8B Instant: 30 RPM / 14.4K RPD
Budget gating	Tier 1 $50/mo to Tier 4 $50K/mo	Plan-based, Batch + Flex on Developer
Batch processing	50% of serverless	50% off, 24h to 7d window

Fireworks spending tiers step up automatically: Tier 1 is $50/mo with a valid card, Tier 2 is $500/mo once you have spent or added $50, Tier 3 is $5,000/mo, and Tier 4 is $50,000/mo. Groq's free plan caps each model individually, for example llama-3.1-8b-instant at 30 RPM, 14,400 requests/day, 6,000 tokens/minute, and 500,000 tokens/day; the Developer plan lifts those and adds Batch and Flex processing.

Cost on a Real Workload

Cost on a real workload (computed from verified list prices, June 2026)

Take the one model both providers host on serverless, GPT-OSS 120B, at 50M output tokens per day, output-only for a clean comparison. Both charge $0.60 per million output tokens.

Groq serverless: 50 × $0.60 = $30.00/day = ~$900/mo, at a published 500 tok/s.
Fireworks serverless: 50 × $0.60 = $30.00/day = ~$900/mo.
Fireworks dedicated H100 at $7/hr: 24 × $7 = $168/day = ~$5,040/mo running 24/7.

On the shared model the serverless price is identical, so the tiebreaker is speed and limits, not headline cost: Groq publishes 500 tok/s and a free tier; Fireworks gives you a 6,000 RPM ceiling and a path to fine-tune. A 24/7 dedicated H100 only beats serverless once it saturates: $168 / $0.60 per 1M = 280M output tokens/day, roughly 3,240 output tok/s sustained on that one GPU. Below that you are paying for idle GPU time. So below ~3,240 sustained output tok/s, serverless wins on price; above it, a saturated dedicated GPU wins, and Fireworks is the only one of the two that exposes the dedicated option publicly.

Model Selection: Fireworks by a Wide Margin

Fireworks hosts far more models. Groq trades breadth for speed.

Fireworks runs 100+ open models across text, vision, audio, image generation, and embeddings, with named serverless tiers for DeepSeek V4 Pro ($1.74/$3.48) and V4 Flash ($0.14/$0.28), GLM 5.1 ($1.40/$4.40), Qwen 3.6 Plus ($0.50/$3.00), Kimi K2.6 ($0.95/$4.00), and MiniMax 2.7 ($0.30/$1.20). Anything not pre-listed can run on a dedicated deployment. It also serves embeddings from $0.008 per million input tokens (Qwen3 8B embeddings at $0.10/M), which Groq does not offer.

Groq curates a smaller set tuned to its LPU: Llama variants, GPT-OSS 20B and 120B, Qwen3 32B, Kimi K2, Llama 4 Scout, plus Whisper for speech. If your model is on Groq, it is fast. If it is not, you wait until Groq adds it. Across the open weights both providers carry, the per-token spread is real: Kimi K2.6 runs $0.75/$3.50 on DeepInfra, $0.95/$4.00 on Fireworks, and $1.20/$4.50 on Together, so if a model is the whole job, the cheapest host is worth a check.

For DeepSeek specifically, the host you pick changes the output, not just the price. Most serverless providers quantize activations to fp8 to cut cost, which degrades quality. Morph serves DeepSeek with 16-bit (bf16) activations and never quantizes activations to fp8, so output matches the reference weights. morph-dsv4flash (DeepSeek V4 Flash) is $0.139 input and $0.278 output per million tokens. For coding agents, Morph adds codegen-specific speculative decoding plus custom low-level inference kernels built for code generation, making it the fastest and highest-fidelity place to run DeepSeek when the workload is code. See pricing.

100+

Fireworks open models

~20

Groq curated models

Fine-Tuning & Customization: Fireworks Only

Fireworks is a training-and-serving platform. Groq is a serving platform.

Fireworks ships full LoRA supervised fine-tuning, DPO, and reinforcement fine-tuning. The price scales with model size, billed per million training tokens.

Fireworks Fine-Tuning Pricing (per 1M training tokens)

Model size	LoRA SFT	Full SFT	DPO
Up to 16B	$0.50	$1.00	2x SFT
16.1B to 80B	$3.00	$6.00	2x SFT
80B to 300B	$6.00	$12.00	2x SFT
Over 300B	$10.00	$20.00	2x SFT

Groq does not offer public fine-tuning. You bring open weights and Groq serves them fast; it is not where you train. If your production model is a fine-tune, Fireworks is the only option of the two, and its rates undercut Together AI's LoRA SFT ($0.48 up to 16B, $1.50 for 17B to 69B) only at the smallest size, so price the exact model size before committing.

Dedicated & Enterprise

Fireworks exposes more of the deployment stack to you; Groq keeps it managed.

Deployment & Compliance

Capability	Fireworks AI	Groq
Serverless per-token	Yes	Yes
Dedicated GPU (hourly)	Yes ($7-$12/hr)	Enterprise only
Fine-tuning	LoRA SFT/Full/DPO + RFT	Not offered
Embeddings	Yes (from $0.008/1M)	No
Speech (Whisper)	Via models	Yes
SOC 2 Type II	Yes	Yes
HIPAA	Yes	BAA with exclusions
Zero Data Retention	Open models, no opt-in	Not stated as default
Encryption	TLS 1.2+ in transit, AES-256 at rest	Not specified here

Both are SOC 2 Type II compliant. Fireworks is also HIPAA compliant and runs a Zero Data Retention policy that does not log or store prompts or generations for open models without explicit opt-in, with TLS 1.2+ in transit and AES-256 at rest. Groq can process PHI under a BAA on certain GroqCloud services, but preview and beta features and its compound AI systems are excluded from the BAA. For a regulated workload that needs a clean BAA across the whole surface, Fireworks has the more explicit public posture.

When to Use Fireworks AI

You need a model Groq does not host. DeepSeek V4 Pro/Flash, GLM 5.1, Qwen 3.6 Plus, Kimi K2.6, MiniMax 2.7, or any of 100+ open models, including vision, image, audio, and embeddings.
You want to fine-tune. LoRA SFT from $0.50 per million training tokens, plus Full SFT, DPO, and reinforcement fine-tuning that scale with model size.
You run steady high volume. Dedicated H100/H200 at $7/hr converts a per-token bill into a fixed rate.
You need higher request throughput. A 6,000 RPM ceiling with a payment method, gated only by spending tier.
You are in a regulated industry. SOC 2 Type II, HIPAA, and Zero Data Retention on open models with no opt-in required.

When to Use Groq

You need the lowest latency. 394 tok/s on Llama 3.3 70B, 840 tok/s on Llama 3.1 8B, and 1,000 tok/s on GPT-OSS 20B, all published on the pricing page.
You want predictable response times. Deterministic LPU execution keeps latency flat regardless of batch size, ideal for real-time voice and chat.
Your model is on the menu. Llama, GPT-OSS 20B/120B, Qwen3 32B, Kimi K2, Llama 4 Scout, and Whisper all run fast.
You want a cheap Llama 70B. $0.59/$0.79 on Llama 3.3 70B and $0.05/$0.08 on Llama 3.1 8B, models Fireworks does not run on serverless.
You want zero infrastructure and a free tier. No GPU sizing, just an OpenAI-compatible endpoint that is fast by default, with a free plan at console.groq.com.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Frequently Asked Questions

Is Groq faster than Fireworks AI?

For single-stream token generation on the models Groq hosts, yes. Groq publishes 394 tok/s on Llama 3.3 70B, 840 tok/s on Llama 3.1 8B, 1,000 tok/s on GPT-OSS 20B, 662 tok/s on Qwen3 32B, and 500 tok/s on GPT-OSS 120B. Fireworks runs Nvidia GPUs and does not publish a fixed tok/s per model because throughput varies with batch size. Groq leads on raw single-stream latency for its menu.

How much does Llama 3.3 70B cost on Groq?

Groq prices Llama 3.3 70B Versatile at $0.59 input and $0.79 output per million tokens, at 394 tok/s with 128k context. Llama 3.1 8B Instant is $0.05/$0.08 at 840 tok/s. Fireworks does not host Llama 3.3 70B on serverless, so a same-model price comparison is not possible. On the shared model GPT-OSS 120B, both charge $0.15 input / $0.60 output.

What are Fireworks AI's rate limits?

Fireworks allows 10 requests per minute without a payment method, and a fixed 6,000 RPM ceiling once you add a valid card. Spending tiers gate monthly budget: Tier 1 $50/mo, Tier 2 $500/mo after $50 spent, Tier 3 $5,000/mo, Tier 4 $50,000/mo. New accounts get $1 in free credits. Groq's free plan caps individual models, for example llama-3.1-8b-instant at 30 RPM and 14,400 requests per day.

Can I fine-tune models on Fireworks AI and Groq?

Fireworks has full LoRA SFT, Full SFT, DPO, and reinforcement fine-tuning. LoRA SFT runs $0.50 per million training tokens up to 16B, $3.00 for 16.1B to 80B, $6.00 for 80B to 300B, and $10.00 above 300B; DPO is 2x the SFT rate. Groq does not offer public fine-tuning; it serves existing open weights rather than training new ones.

Which provider has more open models?

Fireworks, by a wide margin. It hosts 100+ open text, vision, audio, image, and embedding models, plus per-model serverless tiers for DeepSeek V4, GLM 5.1, Qwen 3.6, Kimi K2.6, and MiniMax 2.7. Groq runs a curated set tuned for its LPU and optimized for speed over breadth.

Are Fireworks AI and Groq SOC 2 and HIPAA compliant?

Both are SOC 2 Type II compliant. Fireworks is also HIPAA compliant with a Zero Data Retention policy on open models, TLS 1.2+ in transit, and AES-256 at rest. Groq can process PHI under a BAA on certain GroqCloud services, but preview and beta features and its compound AI systems are excluded from the BAA.

Related Comparisons

Groq for the latency floor, Fireworks for flexibility

If applying model-generated code edits is your bottleneck, that is a separate job. Morph Fast Apply runs at ~10,500 tok/s with published benchmarks.

Try Morph Free

Fast Apply benchmarks

Kimi K3

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers