Replicate vs Groq (2026): $5.49/hr Per-Second GPU vs 1,000 tok/s Per-Token Text

You are choosing between two billing models, not two versions of the same product. Replicate sells GPU time: you package any model with Cog, push it, and pay per second of T4, L40S, A100, or H100 wall clock. Groq sells tokens on a fixed text menu, served on its own LPU silicon, with Llama 3.3 70B at 394 tok/s and GPT-OSS 20B at 1,000 tok/s. They barely overlap, so the choice is rarely close.

That split decides almost everything: who controls the model, who eats the cold start, and what you actually pay. Replicate gives you any-model flexibility and per-second hardware billing, with the catch that idle GPUs and multi-minute cold boots cost real money on warm deployments. Groq gives you a flat latency floor and per-token pricing with no cold start, with the catch that you run what is on the menu and nothing else: text and speech only, no custom weights.

TL;DR

Pick Replicate if you need to run a custom model, a private fine-tune, or an image/video/audio model. Cog packages any weights, and you pay per second of GPU time ($5.04/hr A100 80GB, $5.49/hr H100, $3.51/hr L40S, $0.81/hr T4). The cost is scale-to-zero cold boots that can take several minutes, plus idle time on any warm deployment.
Pick Groq if you need raw speed on a supported open text model and flat per-token pricing. Llama 3.3 70B runs at 394 tok/s for $0.59/$0.79 per million tokens; GPT-OSS 20B at 1,000 tok/s for $0.075/$0.30. A Batch API cuts cost 50% (24h to 7d window) and prompt caching halves cached input. The cost is a fixed catalog: no custom weights, text and speech only.
Running DeepSeek or open models for codegen? Neither hosts DeepSeek at full fidelity for coding. Morph Open Source Models serves DeepSeek in 16-bit (bf16) with no fp8 quantization; morph-dsv4flash is $0.139/$0.278 per 1M in/out.

Who Wins Per Workload

The two rarely compete for the same job. This table maps the decision you are actually making to the winner and why.

Who Wins Per Workload

Workload / decision	Replicate	Groq
Lowest latency on a stock text LLM	Groq, LPU latency floor	Groq, LPU latency floor
Run your own model / checkpoint	Replicate, Cog packages anything	No, fixed catalog
Fine-tune & host private weights	Replicate, dedicated GPUs	No custom weights
Multimodal (image / video / audio)	Replicate, FLUX / Veo / Kling	No, text and speech only
Fastest first call (no cold start)	Cold boot up to several minutes	Groq, always-on menu
Bursty / low-duty traffic	Pays for idle GPU time	Groq, per-token, no idle charge
Sustained high-utilization serving	Replicate, cheap above ~60-70% duty	Per-token adds up at full duty
Tool-calling agents (schema-valid)	Assemble it yourself	Groq, 128 tools, constrained decode
Zero infrastructure to manage	Pick GPU tier, set autoscaling	Groq, send tokens, get tokens

Quick Comparison

Replicate is flexible and pays by the second. Groq is fast and pays by the token. Morph is the column most coding agents actually need.

Replicate vs Groq vs Morph at a Glance

Spec	Replicate	Groq	Morph
Core focus	Any-model GPU hosting	Fast fixed-menu serving	Coding-agent inner loop
Billing model	Per second of GPU time	Per token	Per token / per request
Hardware	Nvidia T4 / L40S / A100 / H100	Custom LPU silicon	Code-tuned GPU kernels
Custom / private models	Yes, via Cog	No, fixed catalog	Code-specific models
Code apply endpoint	No	No	/v1/code/apply
Semantic code search	No	No	WarpGrep ($0/100k)
Top published speed	GPU-dependent	1,000 tok/s (GPT-OSS 20B)	~10,500 tok/s (fast-apply)
DeepSeek activations	Provider-dependent	Not hosted	16-bit (bf16), no fp8
Cold start	Several minutes (scale-to-zero)	None (always-on menu)	Always-on
Llama 3.3 70B price	Per-second GPU	$0.59 / $0.79 per 1M	N/A
Best for	Custom & multimodal models	High-throughput open models	Search, apply, compaction

Billing Model: Per Second vs Per Token

The single biggest difference is how you get charged, and it shapes which workloads make sense on each.

Replicate bills GPU wall-clock time per second. Public serverless models only charge while they process your request; setup and idle time on a public model are free. But most custom or private models run on dedicated hardware, where you pay for setup time, idle time, and active processing time. So a private deployment that sits idle between bursts still bills you for every second the instance is online. The exception is fast-booting fine-tunes, which charge only for active processing.

Groq bills per token, full stop. There is no idle charge, no GPU-hour, no setup fee. You send a request, you pay for the input and output tokens it consumes. A Batch API cuts cost 50% for async jobs (24h to 7d window), and prompt caching halves cached input tokens on supported models.

$5.04/hr

Replicate A100 80GB, billed per second

$0.59/$0.79

Groq Llama 3.3 70B, per 1M in/out tokens

The practical rule: Replicate wins when a GPU stays near 100% busy, because per-second time is cheap when fully utilized. Groq wins for bursty or low-duty-cycle traffic, because you never pay for idle silicon.

Pricing: Concrete Numbers

Direct comparison is awkward because the units differ, so here are both, as of early 2026.

Replicate Hardware Pricing (per second / per hour)

Hardware	Per second	Per hour
CPU	$0.000100	$0.36
Nvidia T4	$0.000225	$0.81
Nvidia L40S	$0.000975	$3.51
Nvidia A100 80GB	$0.001400	$5.04
Nvidia H100	$0.001525	$5.49

Replicate also bills some language models per token (DeepSeek R1 at $3.75/M input and $10/M output; Claude 3.7 Sonnet at $3.00/M input and $15/M output) and image/video models per output ($0.04 per output image, $0.09 per second of output video), but the platform's native unit is per-second hardware. Multi-GPU configs scale proportionally and require committed-spend contracts: 2x H100 at $10.98/hr, 4x A100 at $20.16/hr, 8x L40S at $28.08/hr. Replicate publishes no B200 or H200 rate.

Groq Per-Token Pricing (per 1M tokens)

Model	Input	Output	Speed
GPT-OSS 20B	$0.075	$0.30	1,000 tok/s
Llama 3.1 8B Instant	$0.05	$0.08	840 tok/s
Qwen3 32B (131k ctx)	$0.29	$0.59	662 tok/s
Llama 4 Scout (128k ctx)	$0.11	$0.34	594 tok/s
GPT-OSS 120B	$0.15	$0.60	500 tok/s
Llama 3.3 70B Versatile	$0.59	$0.79	394 tok/s
Kimi K2 Instruct	$1.00 ($0.50 cached)	$3.00	N/A

When per-second beats per-token

A model that runs flat-out 24/7 on a single A100 costs about $3,629/month on Replicate. If that same model serves enough tokens to exceed that figure at Groq's per-token rate, Replicate is cheaper. The crossover lives in utilization: dedicated hardware only pays off above roughly 60 to 70% sustained GPU duty. Below that, every idle second on Replicate is money Groq would not have charged.

Cost on a Real Workload

Cost on a real workload (computed from list prices, June 2026)

Serving Llama 3.3 70B, 20M output tokens/day. On Groq, output is $0.79 per 1M tokens, so 20 × $0.79 = $15.80/day = about $474/month, with no idle charge and no instance to keep warm.

On Replicate, the equivalent is a dedicated A100 80GB at $5.04/hr = about $3,629/month if held warm 24/7. The break-even is volume: $3,629 ÷ $0.79 per 1M = roughly 4.59B output tokens/month, or about 153M tokens/day. So the dedicated A100 only wins above ~153M output tokens/day, nearly 8x the 20M in this scenario. A single Groq stream at the published 394 tok/s produces about 34M tokens/day, so reaching the Replicate break-even on one box also requires heavy batched concurrency, not a single stream.

Below ~153M output tokens/day, Groq's per-token rate wins outright. Above it, with high enough batch utilization, a held-warm Replicate A100 wins. Redo the arithmetic with your own daily token count and the prices above.

Speed: Groq's LPU vs Replicate's GPUs

On the models Groq supports, Groq is the faster of the two by a wide margin.

Groq does not use GPUs. Its LPU (Language Processing Unit) is custom silicon built for sequential, single-stream token generation, which is exactly the part of inference where GPUs leave latency on the table. Published speeds: 1,000 tok/s on GPT-OSS 20B, 840 tok/s on Llama 3.1 8B Instant, 662 tok/s on Qwen3 32B, 594 tok/s on Llama 4 Scout, 500 tok/s on GPT-OSS 120B, 394 tok/s on Llama 3.3 70B Versatile.

Replicate speed is whatever the model and GPU tier deliver. An H100 is fast, but you are still on general Nvidia hardware running a general serving stack, and you tune throughput by choosing hardware and batch settings yourself. For interactive single-user latency, Groq wins. For batched throughput on a custom model, a well-tuned Replicate H100 deployment can be competitive.

1,000 tok/s

Groq GPT-OSS 20B

394 tok/s

Groq Llama 3.3 70B Versatile

GPU-tier

Replicate, you pick the hardware

Cold Starts: The Hidden Replicate Tax

Groq has no cold-start problem; Replicate's serverless design has a real one.

Replicate scales to zero by default: when a model has not been used for a little while, it turns off, so a model that has not run recently must reload into GPU memory before it responds. Replicate's own docs say cold boots can take several minutes for large models. The cost detail in your favor: only running prediction time is charged, so a cold boot adds latency but not direct dollars. To eliminate the boot delay you set a minimum number of warm instances on a deployment, and then you do pay for idle time. Private models also bill for setup and idle, except fast-booting fine-tunes, which bill active processing only.

Groq serves an always-on catalog, so there is no scale-to-zero and no boot latency on a request. The trade is that you cannot bring your own weights to get that behavior; the always-on guarantee only covers models Groq already hosts.

Cold starts compound in bursty loops

Any workload that fans out many short calls pays the cold-start latency on the first call of every cold session. On Replicate that can be a multi-minute stall before the first token unless you keep an instance warm. Always-on serving removes that stall entirely, which matters more for latency-sensitive loops than headline throughput does. Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Model Catalog: Open vs Fixed

Replicate runs anything you can package; Groq runs what it has optimized.

Replicate's whole pitch is the open catalog plus Cog. You take arbitrary model code, wrap it in a Cog container, push it, and Replicate generates the API server and handles the GPU deployment. That covers language models, image generators (FLUX, Veo, Kling), audio, video, embeddings, and private fine-tunes. The cost of that flexibility is that Cog packaging is non-standard: your model is not a plain Docker image you can lift and run elsewhere.

Groq ships a curated menu: Llama 3.1/3.3, GPT-OSS 20B and 120B, Kimi K2, Qwen3 32B, and a few others, each hand-tuned for the LPU. You get speed and predictable behavior, but you cannot upload custom weights or a private fine-tune. If your model is not on the list, Groq is not an option.

Catalog & Modalities

Capability	Replicate	Groq
Custom weights / Cog	Yes	No
Private fine-tunes	Yes	No
Image / video / audio models	Yes	No (text + speech)
Curated fast LLM menu	Partial	Yes
Bring-your-own-architecture	Yes	No

Rate Limits & Compliance

Groq publishes hard per-model limits on its free plan; the Developer plan raises them and unlocks Batch and Flex processing. Replicate caps spend through deployment min/max instances rather than per-minute request ceilings. Both are SOC 2 Type II audited.

Limits & Compliance

Item	Replicate	Groq
Throughput control	Min/max instances per deployment	Per-model RPM / RPD / TPM / TPD
Free-plan example	Pay-per-second, no free serving tier	llama-3.1-8b-instant 30 RPM / 14.4K RPD / 6K TPM
Scaling to spikes	Zero to hundreds of instances	Developer plan raises limits, adds Batch/Flex
Batch discount	No native batch	50% lower, 24h to 7d window
SOC 2 Type II	Yes	Yes
HIPAA	Not stated	BAA covers some services, preview/compound excluded

On Groq, llama-3.3-70b-versatile sits at 30 RPM / 1K RPD / 12K TPM / 100K TPD on the free plan. If you need a fixed, predictable ceiling on Replicate, you set warm minimum instances; otherwise the platform scales from zero on demand.

Features & API Surface

Both expose modern API features, but aimed at different jobs.

Groq is built for agentic LLM use. It supports structured outputs with a JSON schema and strict: true constrained decoding that guarantees 100% schema adherence, parallel function calling with up to 128 tools, an OpenAI-compatible endpoint, a Batch API, prompt caching, and compound models with automatic built-in tool use. For tool-calling agents that need fast, schema-valid responses, this surface is strong.

Replicate is built for model deployment. Its API is prediction-centric: submit input, poll or stream the prediction, get output. It handles autoscaling on dedicated deployments, webhooks, and per-model schemas generated from your Cog definition. The strength is operational control over any model; the weakness is that you assemble the agent features (JSON schema enforcement, tool routing) yourself on top.

Feature Surface

Feature	Replicate	Groq
OpenAI-compatible endpoint	Partial (official models)	Yes
Structured output (JSON schema)	Model-dependent	Yes, constrained decoding
Function / tool calling	Model-dependent	Yes, up to 128 tools
Batch API	No native batch	Yes (50% off)
Prompt caching	No	Yes (50% off cached)
Autoscaling deployments	Yes	Managed (no instances)
Fine-tune hosting	Yes	No

Running DeepSeek & Open Models for Codegen

Neither provider is the right home for DeepSeek when output quality matters. Groq does not host DeepSeek at all. Replicate runs it per token (DeepSeek R1 at $3.75/M input and $10/M output) on standard Nvidia GPUs, and quality there depends on how the model is packaged.

Most serverless providers quantize activations to fp8 to cut cost, which degrades output quality. Morph Open Source Models serves DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization, so responses match the reference weights. That makes Morph the best place to run DeepSeek when output fidelity matters. For coding agents specifically, Morph adds codegen-tuned speculative decoding plus custom low-level inference kernels, which makes it the fastest and highest-quality option for code generation rather than a general-purpose menu.

Running DeepSeek: Replicate vs Groq vs Morph

Spec	Replicate	Groq	Morph
Hosts DeepSeek	Yes (per token)	No	Yes
DeepSeek activations	Package-dependent	N/A	16-bit (bf16), no fp8
DeepSeek V4 Flash price (in/out per 1M)	N/A	N/A	$0.139 / $0.278
Codegen speculative decoding	No	No	Tuned to code
Code apply endpoint	No	No	/v1/code/apply (~10,500 tok/s)

DeepSeek R1 pricing on Replicate from its pricing page; morph-dsv4flash pricing from Morph pricing.

When to Use Replicate

Custom or private models. If you need arbitrary weights, a private fine-tune, or an architecture Groq does not host, Cog packaging and dedicated GPUs are the reason to be here.
Multimodal generation. Image, video, and audio models (FLUX, Veo, Kling) live on Replicate. Groq is text and speech only.
Steady, high-utilization workloads. Per-second GPU billing is cheap when the hardware stays busy. Above roughly 60 to 70% sustained duty, dedicated instances beat per-token rates.
Operational control. You pick the GPU tier, set autoscaling bounds, and own the serving stack. For teams that want to tune the deployment, that control is the point.
Rapid model prototyping. Push a Cog container and get a working API in minutes without standing up your own GPU infrastructure.

When to Use Groq

Latency-sensitive serving. 1,000 tok/s on GPT-OSS 20B, 840 tok/s on Llama 3.1 8B, 394 tok/s on Llama 3.3 70B. For interactive chat or real-time agents, the LPU is the fastest of the two.
Bursty or low-duty traffic. Per-token billing with no idle charge means you never pay for silence. Spiky workloads are far cheaper than keeping a dedicated GPU warm.
Tool-calling agents. Constrained-decoding structured outputs and up to 128 parallel function calls make schema-valid agent responses reliable.
Cost-sensitive open-model use. $0.05/$0.08 for Llama 3.1 8B, with a Batch API at 50% off and prompt caching halving cached input, lands among the cheapest fast serving available.
Zero infrastructure. No instances, no cold starts, no autoscaling config. Send tokens, get tokens.

Frequently Asked Questions

Is Replicate or Groq cheaper?

Cheaper depends on utilization, not on the provider. Groq is cheaper for bursty or high-throughput open-model serving: Llama 3.3 70B is $0.59/$0.79 per million input/output tokens, Llama 3.1 8B is $0.05/$0.08. Replicate bills GPU time per second ($5.04/hr A100 80GB, $5.49/hr H100), which only wins above roughly 60 to 70% sustained GPU duty. For idle or spiky workloads, Replicate's per-second dedicated billing usually costs more.

Why is Groq so much faster than Replicate?

Groq runs custom LPU silicon designed for sequential token generation, not GPUs. Published speeds: 1,000 tok/s on GPT-OSS 20B, 840 tok/s on Llama 3.1 8B Instant, 662 tok/s on Qwen3 32B, 594 tok/s on Llama 4 Scout, 500 tok/s on GPT-OSS 120B, 394 tok/s on Llama 3.3 70B Versatile. Replicate runs standard Nvidia hardware (T4, L40S, A100 80GB, H100), so speed depends on the model and tier you choose. For single-stream latency on supported models, Groq is faster.

Can I run any model on Groq?

No. Groq serves a fixed catalog (Llama 3.1/3.3, GPT-OSS 20B/120B, Kimi K2, Qwen3 32B, and a few more). You cannot upload custom weights or a private fine-tune. Replicate is the opposite: package any model with Cog and run it on dedicated GPUs, including custom architectures and private fine-tunes.

Does Replicate have cold starts?

Yes. Replicate scales models to zero by default, so a model that has not run recently must reload into GPU memory before it responds, and Replicate's docs say cold boots can take several minutes for large models. Only running prediction time is charged, so a cold boot adds latency but not direct cost. To eliminate the delay you keep a minimum number of warm instances on a deployment, which means paying for idle time.

Can I move a model from Replicate to Groq or vice versa?

Only if the model is on Groq's menu. Replicate packages models as Cog containers, a non-standard format that does not lift-and-shift to any other host, including Groq, which accepts no custom weights at all. Going the other direction is easy: any open model Groq serves (Llama 3.1/3.3, GPT-OSS, Qwen3 32B) can also be packaged for Replicate, you just trade Groq's LPU latency floor for whatever a Replicate GPU tier delivers. The two are not interchangeable runtimes; they are different deployment models that happen to overlap on a few open LLMs.

What is the best place to run DeepSeek models?

For output fidelity, Morph Open Source Models. Most serverless providers quantize activations to fp8 to cut cost, which degrades quality. Morph serves DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization, so responses match the reference weights. Replicate runs DeepSeek per token (DeepSeek R1 at $3.75/M input and $10/M output) on standard GPUs; Groq does not host DeepSeek. morph-dsv4flash (DeepSeek V4 Flash) is $0.139/M input and $0.278/M output.

Related Comparisons

Pick Groq for Speed, Replicate for Reach

Groq is the fastest stock text model; Replicate runs anything you can package. If applying model-generated code edits is your bottleneck, that is a separate job.

Try Morph Free

Fast Apply benchmarks

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Replicate vs Groq (2026): $5.49/hr Per-Second GPU vs 1,000 tok/s Per-Token Text

Who Wins Per Workload

Quick Comparison

Billing Model: Per Second vs Per Token

Pricing: Concrete Numbers

Cost on a Real Workload

Speed: Groq's LPU vs Replicate's GPUs

Cold Starts: The Hidden Replicate Tax

Model Catalog: Open vs Fixed

Rate Limits & Compliance

Features & API Surface

Running DeepSeek & Open Models for Codegen

When to Use Replicate

When to Use Groq

Frequently Asked Questions

Is Replicate or Groq cheaper?

Why is Groq so much faster than Replicate?

Can I run any model on Groq?

Does Replicate have cold starts?

Can I move a model from Replicate to Groq or vice versa?

What is the best place to run DeepSeek models?

Related Comparisons

Pick Groq for Speed, Replicate for Reach