Modal vs Groq (2026): GPU-Seconds vs Tokens, Full Pricing and Break-Even

Modal and Groq both sell "fast AI inference," but they sell two completely different things. Modal is a serverless GPU platform: you write a Python function, decorate it, and Modal runs it on H100s or A100s, billing per second of compute. You bring the model and the engine. Groq is the opposite. It is a tokens-as-a-service API running a fixed menu of open models on custom LPU silicon, billed per million tokens, with no GPUs to manage.

The choice comes down to one question. Do you need to run your own model, your own engine, or a custom workload (fine-tuned weights, image generation, embeddings, batch jobs)? Pick Modal. Do you just want to call Llama 3.3 70B or GPT-OSS as fast and cheap as possible through an OpenAI-compatible endpoint? Pick Groq.

Comparing their per-token price only makes sense for the narrow case where both serve the same stock model. Everywhere else they answer different questions: run your own compute, or call a stock model as fast and cheap as possible.

TL;DR

Pick Modal if you need to run your own model, fine-tuned weights, or a non-LLM workload (image, audio, embeddings, batch). Python-native, billed per second of GPU compute, H100 at $0.001097/sec (about $3.95/hr), scales to zero with no idle charges.
Pick Groq if you want the fastest, cheapest path to a stock open model. LPU silicon runs Llama 3.3 70B at 394 tok/s for $0.59/$0.79 per million tokens, no infrastructure, no cold start, OpenAI-compatible, with a built-in agentic Compound system.

Who Wins Per Workload

The two products rarely compete head to head. Each row below is a real decision a developer makes, with the winner and the reason.

Who Wins Per Workload

Workload / decision	Modal	Groq
Run your own / fine-tuned model	Modal: any model, any engine	No, fixed catalog
Call a stock open model fast	You build and tune it	Groq: LPU, hand-optimized
Lowest latency floor	GPU + engine dependent	Groq: deterministic LPU
Fastest first call (no cold start)	Snapshots, still seconds	Groq: always warm, 0s
Bursty / low volume	Scales to zero, pays cold start	Groq: per-token, zero idle
Sustained high utilization	Modal: GPU-seconds beat margin	Pays per-token margin
Image / audio / embeddings / batch	Modal: any GPU workload	No, LLM token API
On-prem / air-gapped silicon	Cloud only	Groq: GroqRack on-prem
Built-in agentic tooling	Build it on your engine	Groq: Compound (web/code/browser)

Quick Comparison

Modal and Groq sit on opposite ends of the inference spectrum. Modal hands you raw GPUs and a Python SDK; Groq hands you a token endpoint and never shows you the hardware. Morph is shown only as a code-specific reference point, not a general host like the other two.

Modal vs Groq vs Morph at a Glance

Spec	Modal	Groq	Morph
Product type	Serverless GPU platform	Tokens-as-a-service API	Code-specific apply/search layer
Billing model	Per second of compute	Per million tokens	Per token / per request
Run your own model	Yes, any model/engine	No, fixed catalog	N/A, not a general host
Hardware	H100/A100/B200/L40S	Custom LPU silicon	Tuned GPU fleet
General LLM inference	Yes (you deploy)	Yes (catalog)	N/A, not a general host
Code-specific apply	No	No	Yes, /v1/code/apply
Semantic code search	No	No	WarpGrep, $0/100k req
Cold start	Seconds (snapshots cut ~10x)	None (always warm)	None (always warm)
OpenAI-compatible	Optional (you build it)	Yes	Yes
Best for	Custom/own-model workloads	Stock open models, fast & cheap	Applying model-generated code edits

Two Different Products, Not Two Versions of One

The most common mistake is comparing Modal and Groq as if they are competing token APIs. They are not.

Modal is infrastructure. You write a Python class, decorate it with Modal's primitives, and it runs on a GPU you select. Modal handles container build, scheduling, autoscaling, and scale-to-zero. You decide the model, the inference engine (vLLM, SGLang, or TensorRT-LLM), the quantization, and the hardware. Modal also runs anything that is not an LLM: image and video generation, audio, embeddings (one example embeds 30 million Amazon reviews at 575k tokens/sec on Qwen2-7B), and your own custom models. You pay for GPU-seconds, not tokens.

Groq is a destination API. There are no GPUs to manage and no engine to configure. You send an OpenAI-style request to a model on Groq's catalog, and the LPU streams tokens back. The tradeoff is that you run only what Groq hosts. You cannot deploy your own fine-tuned weights on the public on-demand tier.

The one-line rule

If your workload is "run this specific model or engine I control," use Modal. If your workload is "call a stock open model as fast and cheap as possible," use Groq. Comparing their per-token cost directly only makes sense for the narrow case where both can serve the exact same model.

Pricing: GPU-Seconds vs Tokens

Modal bills compute time; Groq bills output. These are not directly comparable until you fix a model and a utilization level.

Modal: per-second GPU pricing (as of early 2026)

Modal GPU Pricing

GPU	Per second	Approx per hour
B200	$0.001736	~$6.25
H200	$0.001261	~$4.54
H100	$0.001097	~$3.95
A100 80GB	$0.000694	~$2.50
A100 40GB	$0.000583	~$2.10
L40S	$0.000542	~$1.95
A10	$0.000306	~$1.10
L4	$0.000222	~$0.80
T4	$0.000164	~$0.59

CPU is about $0.0000131 per core-second and memory about $0.00000222 per GiB-second. Plans run $0 (Starter, $30/mo credits, 10 GPU concurrency), $250/mo (Team, $100/mo credits, 50 GPU concurrency), and custom (Enterprise). Region pinning costs 1.5 to 1.75x base. Crucially, Modal does not charge for idle time, so a function that scales to zero between bursts costs nothing while idle.

Groq: per-token pricing (as of early 2026)

Groq Per-Million-Token Pricing

Model	Input	Output	Speed
Llama 3.3 70B Versatile	$0.59	$0.79	~394 tok/s
Llama 3.1 8B Instant	$0.05	$0.08	~840 tok/s
GPT-OSS 120B	$0.15	$0.60	~500 tok/s
GPT-OSS 20B	$0.075	$0.30	~1,000 tok/s
Qwen3 32B	$0.29	$0.59	~662 tok/s
Kimi K2 (uncached in)	$1.00	$3.00	fast

Groq's Batch API cuts rates 50% for asynchronous jobs (24-hour to 7-day windows), and prompt caching cuts cached input tokens to 50%. Stacked, that can land an effective rate near 25% of on-demand for cache-heavy batch workloads.

Which is cheaper?

Cheaper depends on throughput. Groq is cheaper for bursty or low-volume traffic because you never pay for idle GPUs. Modal's raw GPU-seconds win once a single H100 sustains more than about 1,390 output tok/s on Llama 3.3 70B (Modal $0.001097/sec versus Groq $0.79/million output), or whenever you run a model Groq does not host. Below that throughput, Groq wins; above it, Modal wins.

Cost on a Real Workload

The only fair price comparison is a model both can serve. Take Llama 3.3 70B at 50 million output tokens per day, computed from the list prices above (early 2026).

Cost on a real workload (computed from list prices, June 2026)

Groq (per token): 50M output tokens/day x $0.79/million = $39.50/day, about $1,185/mo. Input tokens add $0.59/million on top. No idle cost, nothing to provision.
Modal (per GPU-second): one dedicated H100 at $0.001097/sec runs about $94.78/day, roughly $2,843/mo if pinned 24/7. To serve 50M tokens/day on one H100 you need a sustained 579 output tok/s, well within an H100's range, so the GPU sits underused.
Break-even: Modal's H100 only undercuts Groq once it actually sustains about 1,390 output tok/s ($0.001097/sec / $0.00000079/token). At 50M tokens/day that is roughly 579 tok/s, below break-even, so Groq is cheaper here. Modal turns cheaper when you push the same H100 past ~120M output tokens/day, or run a model Groq does not host.

The lesson is simple: at low to moderate volume on a stock model, Groq's per-token price beats a dedicated GPU. Modal wins when you saturate the hardware, batch heavily, or serve weights Groq has never seen.

Cold Starts & Scaling

This is where the serverless-GPU model shows its one real tax, and where Groq has no tax at all.

On Modal, a function that has scaled to zero has to cold-start: pull the container, load the model into GPU memory, and warm the engine. For a large LLM, that first request can take from a few seconds to over a minute on an unoptimized container. Modal's answer is GPU memory snapshots (2025, currently alpha), which capture full GPU state via CUDA checkpoint/restore so a snapshotted container restores up to 10x faster. You opt in by setting `enable_memory_snapshot=True` plus the `enable_gpu_snapshot` experimental option, and mark warm-up code with the `@modal.enter(snap=True)` decorator. Modal also uses pre-warmed idle GPU buffers and lazy-loading filesystem caching to shave startup.

Groq has no cold start. The model is always resident on LPU hardware, so the first token of your first request arrives as fast as the millionth. You trade away the ability to scale your own model to zero, but you never pay a warm-up penalty or manage a min-containers setting.

~10x

Modal cold-start speedup with GPU snapshots

Groq cold start (always warm)

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Speed: LPU Streaming vs Tunable GPU Throughput

Groq wins raw per-stream token latency on its catalog; Modal wins on flexibility and aggregate throughput.

Groq's LPU is deterministic silicon designed for single-stream token generation, which is why Llama 3.3 70B runs at about 394 tok/s, Llama 3.1 8B at about 840 tok/s, and GPT-OSS 20B at about 1,000 tok/s. Groq also ships Llama 3.3 70B with speculative decoding for higher throughput on supported endpoints. For a chat or agent loop that streams to a user, that latency is hard to beat.

Modal's speed is whatever your GPU and engine deliver. SGLang has lower overhead for decode-heavy and smaller models; vLLM is stronger for mixed and prefill-heavy workloads. You can saturate an H100 or burst to thousands of GPUs for embeddings and batch, where aggregate throughput matters more than single-stream latency. Modal will not beat a hand-tuned LPU on per-stream latency for a model Groq has optimized, but it will run a model Groq has never seen.

Models & Flexibility

Modal runs anything; Groq runs a curated, hand-optimized list.

Model & Workload Support

Capability	Modal	Groq
Stock open LLMs	Yes (you deploy)	Yes (catalog)
Your own fine-tuned weights	Yes	Enterprise/GroqRack only
LoRA fine-tuning jobs	Yes, train + deploy	No
Image / video / audio	Yes	Limited (TTS/STT models)
Embeddings	Yes (e.g. Qwen2-7B)	No first-party embeddings
Choice of engine	vLLM / SGLang / TensorRT	None (managed)
Structured outputs (JSON schema)	Depends on your engine	Yes, strict constrained decode
Parallel function calling	Depends on your engine	Yes
Built-in agentic tooling	No	Yes (Compound: web, code, browser)

Groq's structured outputs use `strict: true` constrained decoding to guarantee schema adherence, and it supports parallel function calling. Its Compound system (now generally available) packages web search, code execution, and browser control into a single agentic endpoint, with `groq/compound` for multi-tool requests and `groq/compound-mini` for single-tool calls.

Modal does not give you any of that out of the box, because Modal is not a model. Whatever structured-output or tool-calling support you want comes from the engine you deploy. The upside is total control: train a LoRA on Modal, save it to a volume, merge, and serve it, all in the same Python codebase.

Running DeepSeek and Open-Source Models

Groq does not host DeepSeek V4 on its public catalog, and on Modal you would deploy and tune DeepSeek yourself on a per-second GPU. If the goal is running DeepSeek or other open-source weights at production quality without managing the deploy, there is a third option built specifically for it: Morph Open Source Models.

Morph serves DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization. Most serverless providers quantize activations to fp8 to cut cost, which degrades output quality. Keeping full 16-bit activations means responses match the reference weights, so Morph is the place to run DeepSeek when output fidelity matters. For coding, Morph runs codegen-specific speculative decoding plus custom low-level inference kernels, which makes it the fastest and highest-quality option for coding agents specifically.

Where to Run DeepSeek V4 Flash

Option	Activations	Per 1M input / output	Best for
Morph (morph-dsv4flash)	16-bit (bf16), no quantization	$0.139 / $0.278	Fidelity + codegen
Modal	Your choice (you deploy)	Per GPU-second (~$3.95/hr H100)	Custom engine control
Groq	Not on public catalog	N/A	Stock models Groq hosts

Pricing and the full model list are on Morph models and pricing.

Compliance & Deployment

Both have enterprise paths; the on-prem story differs.

Modal offers SOC 2 compliance and HIPAA compatibility on its Enterprise plan, with audit logs and private support. Everything runs in Modal's cloud on managed GPUs.

Groq is SOC 2 Type II compliant, with GDPR and HIPAA coverage on eligible services (note: the Compound agentic system is explicitly not a HIPAA Covered Cloud Service today). For regulated, air-gapped, or sovereign deployments, GroqRack puts the LPU on-premises or in colocation, which Modal does not offer. If you must own the silicon, GroqRack is the only on-prem option between these two.

When to Use Groq

Stock open models, maximum speed. Llama 3.3 70B at 394 tok/s, GPT-OSS 20B at 1,000 tok/s, no infrastructure to manage.
Low-latency chat and agents. Always-warm LPUs mean zero cold start and fast first-token latency for user-facing streaming.
Bursty or low-volume traffic. Per-token billing means you pay nothing when idle and never provision GPUs.
Built-in agentic tooling. Compound gives you web search, code execution, and browser control behind one endpoint.
Strict structured output. `strict: true` constrained decoding guarantees JSON-schema adherence, plus parallel function calling.

Frequently Asked Questions

What is the difference between Modal and Groq?

Modal is a serverless GPU platform: you write Python, deploy your own model and engine (vLLM, SGLang, TensorRT-LLM), and pay per second of GPU compute (H100 at about $0.001097/sec, roughly $3.95/hr). Groq is a tokens-as-a-service API that runs a fixed catalog of open models on custom LPU silicon and bills per million tokens (Llama 3.3 70B at $0.59/$0.79). Modal gives you control over the model; Groq gives you raw speed on a curated menu with zero infrastructure.

Is Modal or Groq cheaper for LLM inference?

Cheaper depends on throughput. Groq's per-token pricing is cheaper for bursty or low-volume traffic because you never pay for idle GPUs. Modal's per-second GPU billing wins once a single H100 sustains more than about 1,390 output tok/s on Llama 3.3 70B (Modal $0.001097/sec versus Groq $0.79/million output), or whenever you run a model Groq does not host. Below that throughput Groq wins; above it Modal wins.

Does Groq let you run your own fine-tuned model?

Not on the public on-demand API. Groq serves a fixed catalog (Llama, GPT-OSS, Qwen, Kimi K2, DeepSeek and others) optimized for its LPU. Custom or fine-tuned weights and dedicated capacity go through enterprise and GroqRack on-premises deployments. If you need self-serve arbitrary weights, Modal is built for that.

How fast is Groq compared to a GPU on Modal?

Groq's LPU runs Llama 3.3 70B at about 394 tok/s, Llama 3.1 8B at about 840 tok/s, and GPT-OSS 20B at about 1,000 tok/s as of early 2026. On Modal, throughput depends on your GPU and engine; a tuned vLLM or SGLang deployment on an H100 is fast but typically will not match Groq's per-stream latency on the models Groq has hand-optimized.

Can I run the exact same model on both Modal and Groq?

Only where the model is on Groq's catalog. If you want Llama 3.3 70B or GPT-OSS, you can call it on Groq's LPU as a token API or deploy it yourself on a Modal GPU and pay per second. For those overlapping models the comparison is clean: Groq is cheaper and lower-latency at bursty or low volume, Modal wins once you keep a GPU saturated or want a quantization or engine Groq does not offer. For any model Groq does not host (a fine-tuned checkpoint, an image or audio model, a niche open model), there is no overlap and Modal is the only option of the two.

Related Comparisons

Run your own model on GPU-seconds, or buy tokens on a fixed menu

Pick Modal for your own model or arbitrary compute, Groq for a stock model as fast and cheap as possible. If applying model-generated code edits is your bottleneck instead, that is what Morph Fast Apply does at ~10,500 tok/s.

Try Morph Free

Fast Apply benchmarks

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Modal vs Groq (2026): GPU-Seconds at $0.001097/sec vs Tokens at $0.59/$0.79 per Million on Llama 3.3 70B

Who Wins Per Workload

Quick Comparison

Two Different Products, Not Two Versions of One

Pricing: GPU-Seconds vs Tokens

Modal: per-second GPU pricing (as of early 2026)

Groq: per-token pricing (as of early 2026)

Cost on a Real Workload

Cold Starts & Scaling

Speed: LPU Streaming vs Tunable GPU Throughput

Models & Flexibility

Running DeepSeek and Open-Source Models

Compliance & Deployment

When to Use Groq

Frequently Asked Questions

What is the difference between Modal and Groq?

Is Modal or Groq cheaper for LLM inference?

Does Groq let you run your own fine-tuned model?

How fast is Groq compared to a GPU on Modal?

Can I run the exact same model on both Modal and Groq?

Related Comparisons

Run your own model on GPU-seconds, or buy tokens on a fixed menu

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Modal vs Groq (2026): GPU-Seconds at $0.001097/sec vs Tokens at $0.59/$0.79 per Million on Llama 3.3 70B

Who Wins Per Workload

Quick Comparison

Two Different Products, Not Two Versions of One

Pricing: GPU-Seconds vs Tokens

Modal: per-second GPU pricing (as of early 2026)

Groq: per-token pricing (as of early 2026)

Cost on a Real Workload

Cold Starts & Scaling

Speed: LPU Streaming vs Tunable GPU Throughput

Models & Flexibility

Running DeepSeek and Open-Source Models

Compliance & Deployment

When to Use Modal

When to Use Groq

Frequently Asked Questions

What is the difference between Modal and Groq?

Is Modal or Groq cheaper for LLM inference?

Does Groq let you run your own fine-tuned model?

How fast is Groq compared to a GPU on Modal?

Can I run the exact same model on both Modal and Groq?

Related Comparisons

Run your own model on GPU-seconds, or buy tokens on a fixed menu