Groq vs DeepInfra (2026): 500 tok/s LPU Speed vs $1.79/hr H100 Pricing

You are choosing between two open-model inference providers that optimize for opposite things. Groq sells single-request speed on custom silicon. DeepInfra sells price and dedicated GPU access. The decision comes down to whether a user is waiting on each token, or whether you are paying a bill at the end of the month.

Groq built the LPU to finish one request as fast as physically possible. Published output speeds: GPT-OSS 20B at 1,000 tok/s, Llama 3.1 8B Instant at 840, Qwen3 32B at 662, Llama 4 Scout at 594, GPT-OSS 120B at 500, Llama 3.3 70B at 394. The tradeoff is a curated menu and no self-serve custom deployments.

DeepInfra runs a standard GPU fleet and competes on cost. DeepSeek V4-Flash is $0.10 input and $0.20 output per 1M tokens, Qwen3-235B-A22B-Instruct is $0.09/$0.10, and Llama 3.1 8B is $0.02/$0.05. It also publishes self-serve dedicated GPU rates from $0.89/hr (A100) to $4.20/hr (B300), which Groq does not. Every price below is verified as of June 2026.

TL;DR

Pick Groq if single-request latency is the priority. The LPU delivers fixed, published output speeds (500 tok/s on GPT-OSS 120B, 1,000 on GPT-OSS 20B, 394 on Llama 3.3 70B) that stay flat under load, plus a generous free tier. You accept a curated model list and no self-serve dedicated GPUs.
Pick DeepInfra if per-token price or dedicated GPU cost outranks the latency floor. DeepSeek V4-Flash at $0.10/$0.20, Llama 3.1 8B at $0.02/$0.05, dedicated H100 at $1.79/hr, plus Anthropic model resale and SOC 2 + ISO 27001 compliance. No free tier; 200 concurrent requests per account.

Who Wins Per Workload

The choice is rarely "which provider is better." It is "which provider wins for the request I am about to send." This is that table.

Pick by Decision, Not by Spec

Workload / decision	Groq	DeepInfra
Lowest single-stream latency	Groq, fixed LPU speed	GPU fleet, no published speed
Fastest first call (no cold start)	Groq, always warm	Cold start on dedicated scale-up
Cheapest small model	Llama 3.1 8B $0.05/$0.08	Llama 3.1 8B $0.02/$0.05
Run DeepSeek serverless	Not on menu	DeepInfra, V4 to R1
Self-serve dedicated GPU	Enterprise sales only	DeepInfra, $0.89-$4.20/hr
Resold Anthropic models	Not offered	DeepInfra, Haiku to Opus
Free tier to prototype	Groq, 14.4k req/day	No free tier
ISO 27001 certification listed	SOC 2 + BAA	DeepInfra, SOC 2 + ISO 27001
Highest output speed on GPT-OSS	Groq, 500 tok/s published	No published speed

Quick Comparison

Groq sells speed and determinism on a fixed menu. DeepInfra sells price, dedicated GPUs, and a broader catalog. Morph is in the third column for one job: serving DeepSeek and open-source models for coding agents, not a general chat menu.

Groq vs DeepInfra at a Glance

Spec	Groq	DeepInfra	Morph
Hardware	Custom LPU	GPU (A100/H100/B200)	Code-tuned GPU kernels
Primary focus	Lowest latency floor	Lowest price + dedicated GPU	DeepSeek + codegen
GPT-OSS 120B (per 1M in/out)	$0.15 / $0.60	$0.15 / $0.60	N/A
GPT-OSS 120B speed	500 tok/s published	not published	N/A
DeepSeek serverless	Not on menu	V4 to R1	morph-dsv4flash
Self-serve dedicated GPU	Enterprise only	$0.89-$4.20/hr	N/A
Cold start	None, always warm	On dedicated scale-up	N/A
Free tier	Yes, 14.4k req/day	None	Yes
Activation precision	Provider default	fp8 (most models)	16-bit (bf16)
Best for	Real-time chat, voice	Cheap serverless, DeepSeek, dedicated GPU	Coding agents on DeepSeek

Speed: Groq Publishes a Fixed LPU Floor

The hardware is the whole story. Groq runs its own chip, the LPU, which stores model weights on-chip and statically schedules the entire execution graph. There is no runtime speculation, so output speed stays flat under load. Groq publishes a per-model token rate for every model on its menu. DeepInfra runs commodity GPUs and does not publish per-model token speeds, so the honest comparison is Groq's published numbers against DeepInfra's price.

Groq Published Output Speed (per model)

Model	Context	Output speed	Input / output per 1M
GPT-OSS 20B	128k	1,000 tok/s	$0.075 / $0.30
Llama 3.1 8B Instant	128k	840 tok/s	$0.05 / $0.08
Qwen3 32B	131k	662 tok/s	$0.29 / $0.59
Llama 4 Scout (17Bx16E)	128k	594 tok/s	$0.11 / $0.34
GPT-OSS 120B	128k	500 tok/s	$0.15 / $0.60
Llama 3.3 70B Versatile	128k	394 tok/s	$0.59 / $0.79

Key figures: 1,000 tok/s (Groq, GPT-OSS 20B) · 500 tok/s (Groq, GPT-OSS 120B) · 200 concurrent requests per account (DeepInfra)

On Groq, prompt caching halves input cost on supported models, and the Batch API costs 50% less in exchange for a 24-hour to 7-day processing window. DeepInfra does not publish per-model token rates; its pitch is the price line, not the speed line.
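To make the published table concrete, here is a minimal sketch that turns Groq's listed speeds and prices into a per-request estimate. The figures are copied from the table above; the function name `estimate_request` and the example token counts are illustrative assumptions, and the time estimate covers output generation only.

```python
# Groq's published per-model figures, copied from the table in this article.
# Prices are USD per 1M tokens; tok_s is the published output speed.
GROQ_MODELS = {
    "GPT-OSS 20B": {"tok_s": 1000, "in": 0.075, "out": 0.30},
    "Llama 3.1 8B Instant": {"tok_s": 840, "in": 0.05, "out": 0.08},
    "Qwen3 32B": {"tok_s": 662, "in": 0.29, "out": 0.59},
    "Llama 4 Scout": {"tok_s": 594, "in": 0.11, "out": 0.34},
    "GPT-OSS 120B": {"tok_s": 500, "in": 0.15, "out": 0.60},
    "Llama 3.3 70B Versatile": {"tok_s": 394, "in": 0.59, "out": 0.79},
}


def estimate_request(model, input_tokens, output_tokens):
    """Return (generation_seconds, cost_usd) for one on-demand request.

    generation_seconds is output_tokens divided by the published speed.
    It ignores network time, queueing, and prompt processing, none of
    which the article quantifies.
    """
    m = GROQ_MODELS[model]
    seconds = output_tokens / m["tok_s"]
    cost = (input_tokens * m["in"] + output_tokens * m["out"]) / 1_000_000
    return seconds, cost


if __name__ == "__main__":
    # Hypothetical request shape: 2,000 prompt tokens, 500 generated tokens.
    for name in GROQ_MODELS:
        secs, usd = estimate_request(name, input_tokens=2_000, output_tokens=500)
        print(f"{name:26s} {secs:5.2f} s  ${usd:.6f}")
```

For a 500-token reply, the gap between the slowest and fastest listed model is well under a second of generation time, which is why the per-token price often matters more than the speed ranking within Groq's own menu.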

Serverless Pricing: DeepInfra Undercuts at the Bottom

On the one large model both list, GPT-OSS 120B, the price is identical at $0.15 input and $0.60 output per 1M tokens. The difference is the catalog around it. DeepInfra runs much cheaper on small and mid-size open models, and it serves model families Groq does not list, including DeepSeek and resold Anthropic models.

Serverless Per-Token Pricing (per 1M tokens, June 2026)

Model	Groq (in/out)	DeepInfra (in/out)
GPT-OSS 120B	$0.15 / $0.60	$0.15 / $0.60
GPT-OSS 20B	$0.075 / $0.30	available
Llama 3.1 8B	$0.05 / $0.08	$0.02 / $0.05
Qwen3 (Groq: 32B; DeepInfra: 235B-A22B-Instruct)	$0.29 / $0.59	$0.09 / $0.10
DeepSeek V4-Pro	Not on menu	$1.30 / $2.60
DeepSeek V4-Flash	Not on menu	$0.10 / $0.20
DeepSeek V3	Not on menu	$0.32 / $0.89
DeepSeek R1-0528	Not on menu	$0.50 / $2.15
Kimi K2 / K2.6	$1.00 / $3.00	$0.75 / $3.50
GLM-5.1	Not on menu	$1.05 / $3.50
Qwen3-Max	Not on menu	$1.20 / $6.00

DeepInfra also resells Anthropic models: Claude Haiku 4.5 at $1.00/$5.00, Sonnet 4.6 at $3.00/$15.00, and Opus 4.8 at $5.00/$25.00 per 1M tokens. Groq narrows its own gap with the 50% Batch API discount and 50% prompt caching on supported models. The two stack to roughly a quarter of the on-demand input price for cached, async traffic; output tokens get the batch discount only, so they land at half. At full on-demand rates on small open models, DeepInfra is the cheaper provider.
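How far the two Groq discounts stack depends on the input/output mix. The sketch below models them under stated assumptions: cached input tokens bill at 50% of the input price, the Batch API takes a further 50% off the whole bill, and the two combine multiplicatively. The function `effective_groq_cost` and its parameters are hypothetical, not part of any Groq SDK.

```python
def effective_groq_cost(input_tokens, output_tokens, in_price, out_price,
                        cached_fraction=0.0, batch=False):
    """Estimated cost in USD; prices are USD per 1M tokens.

    Assumptions (a simplified model of the discounts described above):
    - cached input tokens are billed at 50% of the input price
    - batch processing halves the total bill
    - the two discounts multiply when both apply
    """
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    input_cost = (uncached * in_price + cached * in_price * 0.5) / 1_000_000
    output_cost = output_tokens * out_price / 1_000_000
    total = input_cost + output_cost
    return total * 0.5 if batch else total


if __name__ == "__main__":
    # GPT-OSS 120B list prices from the table: $0.15 in, $0.60 out.
    on_demand = effective_groq_cost(1_000_000, 1_000_000, 0.15, 0.60)
    discounted = effective_groq_cost(1_000_000, 1_000_000, 0.15, 0.60,
                                     cached_fraction=1.0, batch=True)
    print(f"on-demand ${on_demand:.4f}, cached + batch ${discounted:.4f}")
    print(f"ratio {discounted / on_demand:.2f}")
```

With equal input and output volume the combined discount lands near 45% of on-demand rather than 25%, because only the input side receives both discounts. The quarter-price figure applies to input-heavy, fully cached traffic.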

The speed-versus-cost trade

Groq charges for latency. If a user is waiting on the output (chat, voice, an IDE completion), the fixed 394 to 1,000 tok/s floor justifies the per-token rate. If the work is offline or batched (summarization, data labeling, evals), DeepInfra's lower price compounds and the slower per-token speed rarely matters.

Dedicated GPU Pricing: DeepInfra Lists It, Groq Does Not

This is the clearest dividing line. DeepInfra publishes self-serve hourly rates for dedicated GPUs. Groq's dedicated and custom-model capacity runs through enterprise sales with no public hourly price.

Dedicated GPU Pricing (per hour)

GPU	Groq	DeepInfra
A100 80GB	Enterprise only	$0.89
H100 80GB	Enterprise only	$1.79
H200 141GB	Enterprise only	$2.19
B200 180GB	Enterprise only	$2.79
B300 270GB	Enterprise only	$4.20

DeepInfra's H100 at $1.79/hr is among the lowest published on-demand H100 rates: Modal lists about $3.95/hr, Replicate $5.49/hr, Together $6.49/hr on a dedicated endpoint, Baseten $6.50/hr, and Fireworks $7.00/hr. If your workload sustains enough throughput to amortize a flat hourly GPU, DeepInfra's rate is the floor. See the Together vs DeepInfra and Modal vs DeepInfra breakdowns for the dedicated-GPU math against other providers.

Cost on a Real Workload

Take a concrete case: serving GPT-OSS 120B at 50M output tokens per day. Both providers price output at $0.60 per 1M tokens, so on serverless the per-token bill is the same. The difference shows up at scale, where DeepInfra's dedicated H100 becomes a flat alternative.

Cost on a real workload (GPT-OSS 120B, 50M output tokens/day)

Serverless (either provider): 50 x $0.60 per 1M output = $30/day = about $900/mo.
DeepInfra dedicated H100: $1.79/hr x 24 x 30 = about $1,288/mo flat.
Groq with Batch API (50% off): 50 x $0.30 per 1M output = $15/day = about $450/mo for async-tolerant work.

At 50M output tokens/day, serverless (about $900/mo) beats a dedicated H100 (about $1,288/mo). A dedicated H100 wins only above about $1,288/mo of serverless spend, which is roughly 2,147M output tokens/mo (about 72M per day), and only if the dedicated deployment can actually sustain that throughput. Below that, serverless is cheaper; above it, the flat hourly rate amortizes. If the work tolerates a delay, Groq's Batch API at half price is the cheapest path of all.
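The break-even arithmetic above can be reproduced with a short sketch. It follows the article's simplifications: a 30-day month, output tokens only, and a single GPU assumed able to serve the whole load. The function names are illustrative.

```python
HOURS_PER_MONTH = 24 * 30  # the article's 30-day month


def serverless_monthly(output_tokens_per_day_m, out_price_per_m, days=30):
    """Monthly serverless cost in USD; token volume is in millions per day."""
    return output_tokens_per_day_m * out_price_per_m * days


def dedicated_monthly(hourly_rate, gpus=1):
    """Flat monthly cost in USD of keeping `gpus` dedicated GPUs running."""
    return hourly_rate * HOURS_PER_MONTH * gpus


def break_even_tokens_m(hourly_rate, out_price_per_m, gpus=1):
    """Monthly output tokens (millions) at which serverless equals dedicated.

    Assumes the dedicated GPUs can actually sustain that volume, which
    the pricing alone does not establish.
    """
    return dedicated_monthly(hourly_rate, gpus) / out_price_per_m


if __name__ == "__main__":
    print(f"serverless:      ${serverless_monthly(50, 0.60):,.0f}/mo")
    print(f"dedicated H100:  ${dedicated_monthly(1.79):,.0f}/mo")
    print(f"Groq batch:      ${serverless_monthly(50, 0.30):,.0f}/mo")
    print(f"break-even:      {break_even_tokens_m(1.79, 0.60):,.0f}M tokens/mo")
```

The break-even comes out at about 2,148M output tokens per month before rounding, in line with the rough figure in the text. Adding input-token charges to the serverless side would pull the break-even lower.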

Rate Limits and Free Tier

Groq publishes per-model free-plan limits and a Developer plan that lifts them. DeepInfra runs a single concurrency cap with postpaid billing and no free tier.

Rate Limits and Billing

Item	Groq	DeepInfra
Free tier	Yes (console.groq.com)	None
Free limit example	Llama 3.1 8B: 30 RPM, 14.4k RPD	N/A
Free token cap example	6k TPM, 500k TPD	N/A
Concurrency	Per-model RPM limits	200 concurrent / account
Paid tier boost	Developer plan, Batch + Flex	Postpaid, no per-model boost
Billing model	Pay-per-use + Batch	Postpaid invoices
Invoicing thresholds	Standard pay-per-use	$20 / $100 / $500 / $2k / $10k

DeepInfra bills by card, either postpaid or pre-paid, with monthly invoices plus mid-month invoicing when usage crosses $20, $100, $500, $2,000, or $10,000. There is no free experimentation tier. Groq's free plan publishes hard per-model ceilings (for Llama 3.3 70B Versatile, 30 RPM, 1,000 RPD, 12k TPM, 100k TPD), which is enough to prototype before paying.
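Whether a prototype fits the free plan depends on all four ceilings at once, not just requests per minute. The sketch below checks a steady workload against the limits quoted in this section. The class and function names are hypothetical, and it assumes `tokens_per_request` is whatever total Groq counts against TPM and TPD, which the article does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreeTierLimits:
    rpm: int  # requests per minute
    rpd: int  # requests per day
    tpm: int  # tokens per minute
    tpd: int  # tokens per day


# Free-plan limits as quoted in this article.
LLAMA_31_8B_FREE = FreeTierLimits(rpm=30, rpd=14_400, tpm=6_000, tpd=500_000)
LLAMA_33_70B_FREE = FreeTierLimits(rpm=30, rpd=1_000, tpm=12_000, tpd=100_000)


def exceeded_limits(limits, requests_per_min, tokens_per_request,
                    active_minutes_per_day):
    """Return the names of the limits a steady workload would exceed."""
    requests_per_day = requests_per_min * active_minutes_per_day
    tokens_per_min = requests_per_min * tokens_per_request
    tokens_per_day = tokens_per_min * active_minutes_per_day
    usage = {
        "rpm": requests_per_min,
        "rpd": requests_per_day,
        "tpm": tokens_per_min,
        "tpd": tokens_per_day,
    }
    return [name for name, used in usage.items() if used > getattr(limits, name)]


if __name__ == "__main__":
    # 10 requests/min of 500 tokens each, one hour a day: fits.
    print(exceeded_limits(LLAMA_31_8B_FREE, 10, 500, 60))
    # The full 30 RPM for an 8-hour day: the token caps bind first.
    print(exceeded_limits(LLAMA_31_8B_FREE, 30, 500, 480))
```

At 500 tokens per request, the token caps are exhausted long before the request caps, so the 14,400 requests/day headline is reachable only with very short requests.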

Compliance and Data Retention

DeepInfra lists the broader certification set. Both keep inference data out of long-term storage.

Compliance and Data Handling

Item	Groq	DeepInfra
SOC 2 Type II	Yes	Yes
ISO 27001	Not listed	Yes
HIPAA	BAA, with exclusions	Frameworks listed (TOMs)
Inference data retention	Per terms	Zero retention, metadata only
GDPR	Per terms	TOMs in place

DeepInfra states SOC 2 and ISO 27001 certification with technical and organizational measures for GDPR and HIPAA, and zero retention on inference: prompts and completions are deleted from disk and memory after a short window, with only metadata (request ID, cost, sampling parameters) logged. The one exception is Google models, where Google logs prompts and responses for abuse detection. Groq is SOC 2 Type II compliant and can process PHI under a BAA on certain GroqCloud services, but preview and beta features and its compound AI systems are excluded from that BAA.

Running DeepSeek for Codegen

Groq does not list DeepSeek on its published menu, so for DeepSeek the comparison is DeepInfra against the rest of the market. DeepInfra serves DeepSeek V4-Pro at $1.30/$2.60, V4-Flash at $0.10/$0.20, V3 at $0.32/$0.89, and R1-0528 at $0.50/$2.15 per 1M tokens, which are among the lowest published DeepSeek rates.

One detail decides quality. Most serverless providers quantize activations to fp8 to cut cost, which moves outputs away from the reference model and can degrade quality on hard prompts. Morph Open Source Models serve DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization, so responses match the reference model. For coding agents specifically, Morph also runs codegen-tuned speculative decoding plus custom low-level inference kernels, which makes it the fastest and highest-fidelity option for the apply loop rather than a general menu. morph-dsv4flash (DeepSeek V4 Flash) is $0.139 input and $0.278 output per 1M tokens; see pricing.

When to Use Groq

Real-time, latency-bound apps. Fixed output speed (1,000 tok/s on GPT-OSS 20B, 500 on GPT-OSS 120B, 394 on Llama 3.3 70B) makes voice agents and live chat feel instant.
Deterministic performance under load. The LPU's static scheduling keeps token rate flat instead of degrading when traffic spikes.
Cacheable or batched traffic. 50% prompt caching and the 50%-off Batch API (24h to 7d window) stack to about a quarter of the on-demand input price on cached tokens; output is half price under batch.
You are happy on the curated menu. If Llama, GPT-OSS, Qwen3, or Kimi K2 covers your need, you get top single-stream speed with no infrastructure to manage.
Free experimentation. 14,400 requests/day on the free tier is enough to prototype seriously before paying.

When to Use DeepInfra

Lowest per-token price on small open models. Llama 3.1 8B at $0.02/$0.05 and Qwen3-235B-A22B-Instruct at $0.09/$0.10. For batch and offline work, the savings compound.
DeepSeek serverless. V4-Pro to R1, with V4-Flash at $0.10/$0.20, which Groq does not list at all.
Self-serve dedicated GPUs. H100 at $1.79/hr, A100 at $0.89/hr, up to B300 at $4.20/hr, with no enterprise sales call.
Resold Anthropic models. Claude Haiku 4.5, Sonnet 4.6, and Opus 4.8 through the same API as the open models.
Compliance on paper. SOC 2 Type II plus ISO 27001, zero retention on inference, and metadata-only logging.

Neither is tuned for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, about 10,500 tok/s, with published benchmarks).

Frequently Asked Questions

Is Groq or DeepInfra faster?

Groq is faster for single-stream latency. Its LPU publishes fixed output speeds: GPT-OSS 20B at 1,000 tok/s, GPT-OSS 120B at 500, Llama 3.1 8B Instant at 840, Qwen3 32B at 662, Llama 4 Scout at 594, and Llama 3.3 70B at 394, all flat under load because the LPU statically schedules execution. DeepInfra runs a standard GPU fleet and does not publish per-model token speeds; its advantage is price, not single-stream latency.

Is Groq or DeepInfra cheaper?

It depends on the model. On GPT-OSS 120B both charge $0.15/$0.60 per 1M tokens. DeepInfra is cheaper at the bottom of the catalog: Qwen3-235B-A22B-Instruct at $0.09/$0.10, Llama 3.1 8B at $0.02/$0.05, DeepSeek V4-Flash at $0.10/$0.20. Groq closes the gap with a 50% Batch API discount and prompt caching that halves input cost on supported models. DeepInfra has no free tier; Groq does.

How much does a dedicated H100 cost on each?

DeepInfra publishes self-serve dedicated GPU pricing: H100 80GB at $1.79/hr, H200 at $2.19/hr, B200 at $2.79/hr, B300 at $4.20/hr, A100 80GB at $0.89/hr. Groq does not publish self-serve dedicated GPU rates; its dedicated and custom capacity runs through enterprise sales. If you need a known hourly GPU rate without a sales call, DeepInfra lists it.

What are the rate limits on Groq and DeepInfra?

DeepInfra allows 200 concurrent requests per account on postpaid billing, with mid-month invoicing at usage thresholds of $20, $100, $500, $2,000, and $10,000, and no free tier. Groq's free plan publishes per-model limits, for example Llama 3.1 8B Instant at 30 RPM, 14,400 requests/day, 6,000 TPM, and 500,000 TPD; the Developer plan unlocks higher limits plus Batch and Flex processing.

Where should I run DeepSeek, Groq or DeepInfra?

Groq does not list DeepSeek on its published menu, so the practical choice is DeepInfra, which serves DeepSeek V4-Pro at $1.30/$2.60, V4-Flash at $0.10/$0.20, V3 at $0.32/$0.89, and R1-0528 at $0.50/$2.15 per 1M tokens. If output fidelity matters, Morph serves DeepSeek with 16-bit (bf16) activations instead of the fp8 quantization most providers use, and morph-dsv4flash is $0.139/$0.278 per 1M tokens.

Latency floor or cheapest breadth: pick the one your workload needs

Groq buys a fixed, fast latency floor; DeepInfra buys the lowest per-token price and self-serve dedicated GPUs. If you are running DeepSeek and output fidelity matters, Morph serves it at full 16-bit precision.
