Modal vs Groq: Rent GPU-Seconds and Run Your Own Model, or Buy Tokens on a Fixed LPU Menu

Modal rents serverless GPUs by the second so you run any model or engine. Groq sells tokens on a fixed LPU catalog at flat fast latency with no cold start. Their per-token prices only meet when both serve the same stock model.

June 3, 2026 · 1 min read

Modal and Groq both sell "fast AI inference," but they sell two completely different things. Modal is a serverless GPU platform: you write a Python function, decorate it, and Modal runs it on H100s or A100s, billing per second of compute. You bring the model and the engine. Groq is the opposite. It is a tokens-as-a-service API running a fixed menu of open models on custom LPU silicon, billed per million tokens, with no GPUs to manage.

The choice comes down to one question. Do you need to run your own model, your own engine, or a custom workload (fine-tuned weights, image generation, embeddings, batch jobs)? Pick Modal. Do you just want to call Llama 3.3 70B or GPT-OSS as fast and cheap as possible through an OpenAI-compatible endpoint? Pick Groq.

Comparing their per-token price only makes sense for the narrow case where both serve the same stock model. Everywhere else they answer different questions: run your own compute, or call a stock model as fast and cheap as possible.

TL;DR

  • Pick Modal if you need to run your own model, fine-tuned weights, or a non-LLM workload (image, audio, embeddings, batch). Python-native, billed per second of GPU compute, H100 at $0.001097/sec (about $3.95/hr), scales to zero with no idle charges.
  • Pick Groq if you want the fastest, cheapest path to a stock open model. LPU silicon runs Llama 3.3 70B at 394 tok/s for $0.59/$0.79 per million tokens, no infrastructure, no cold start, OpenAI-compatible, with a built-in agentic Compound system.

Who Wins Per Workload

The two products rarely compete head to head. Each row below is a real decision a developer makes, with the winner and the reason.

Workload / decisionModalGroq
Run your own / fine-tuned modelModal: any model, any engineNo, fixed catalog
Call a stock open model fastYou build and tune itGroq: LPU, hand-optimized
Lowest latency floorGPU + engine dependentGroq: deterministic LPU
Fastest first call (no cold start)Snapshots, still secondsGroq: always warm, 0s
Bursty / low volumeScales to zero, pays cold startGroq: per-token, zero idle
Sustained high utilizationModal: GPU-seconds beat marginPays per-token margin
Image / audio / embeddings / batchModal: any GPU workloadNo, LLM token API
On-prem / air-gapped siliconCloud onlyGroq: GroqRack on-prem
Built-in agentic toolingBuild it on your engineGroq: Compound (web/code/browser)

Quick Comparison

Modal and Groq sit on opposite ends of the inference spectrum. Modal hands you raw GPUs and a Python SDK; Groq hands you a token endpoint and never shows you the hardware. Morph is shown only as a code-specific reference point, not a general host like the other two.

SpecModalGroqMorph
Product typeServerless GPU platformTokens-as-a-service APICode-specific apply/search layer
Billing modelPer second of computePer million tokensPer token / per request
Run your own modelYes, any model/engineNo, fixed catalogN/A, not a general host
HardwareH100/A100/B200/L40SCustom LPU siliconTuned GPU fleet
General LLM inferenceYes (you deploy)Yes (catalog)N/A, not a general host
Code-specific applyNoNoYes, /v1/code/apply
Semantic code searchNoNoWarpGrep, $0/100k req
Cold startSeconds (snapshots cut ~10x)None (always warm)None (always warm)
OpenAI-compatibleOptional (you build it)YesYes
Best forCustom/own-model workloadsStock open models, fast & cheapApplying model-generated code edits

Two Different Products, Not Two Versions of One

The most common mistake is comparing Modal and Groq as if they are competing token APIs. They are not.

Modal is infrastructure. You write a Python class, decorate it with Modal's primitives, and it runs on a GPU you select. Modal handles container build, scheduling, autoscaling, and scale-to-zero. You decide the model, the inference engine (vLLM, SGLang, or TensorRT-LLM), the quantization, and the hardware. Modal also runs anything that is not an LLM: image and video generation, audio, embeddings (one example embeds 30 million Amazon reviews at 575k tokens/sec on Qwen2-7B), and your own custom models. You pay for GPU-seconds, not tokens.

Groq is a destination API. There are no GPUs to manage and no engine to configure. You send an OpenAI-style request to a model on Groq's catalog, and the LPU streams tokens back. The tradeoff is that you run only what Groq hosts. You cannot deploy your own fine-tuned weights on the public on-demand tier.

The one-line rule

If your workload is "run this specific model or engine I control," use Modal. If your workload is "call a stock open model as fast and cheap as possible," use Groq. Comparing their per-token cost directly only makes sense for the narrow case where both can serve the exact same model.

Pricing: GPU-Seconds vs Tokens

Modal bills compute time; Groq bills output. These are not directly comparable until you fix a model and a utilization level.

Modal: per-second GPU pricing (as of early 2026)

GPUPer secondApprox per hour
B200$0.001736~$6.25
H200$0.001261~$4.54
H100$0.001097~$3.95
A100 80GB$0.000694~$2.50
A100 40GB$0.000583~$2.10
L40S$0.000542~$1.95
A10$0.000306~$1.10
L4$0.000222~$0.80
T4$0.000164~$0.59

CPU is about $0.0000131 per core-second and memory about $0.00000222 per GiB-second. Plans run $0 (Starter, $30/mo credits, 10 GPU concurrency), $250/mo (Team, $100/mo credits, 50 GPU concurrency), and custom (Enterprise). Region pinning costs 1.5 to 1.75x base. Crucially, Modal does not charge for idle time, so a function that scales to zero between bursts costs nothing while idle.

Groq: per-token pricing (as of early 2026)

ModelInputOutputSpeed
Llama 3.3 70B Versatile$0.59$0.79~394 tok/s
Llama 3.1 8B Instant$0.05$0.08~840 tok/s
GPT-OSS 120B$0.15$0.60~500 tok/s
GPT-OSS 20B$0.075$0.30~1,000 tok/s
Qwen3 32B$0.29$0.59~662 tok/s
Kimi K2 (uncached in)$1.00$3.00fast

Groq's Batch API cuts rates 50% for asynchronous jobs (24-hour to 7-day windows), and prompt caching cuts cached input tokens to 50%. Stacked, that can land an effective rate near 25% of on-demand for cache-heavy batch workloads.

Which is cheaper?

Cheaper depends on throughput. Groq is cheaper for bursty or low-volume traffic because you never pay for idle GPUs. Modal's raw GPU-seconds win once a single H100 sustains more than about 1,390 output tok/s on Llama 3.3 70B (Modal $0.001097/sec versus Groq $0.79/million output), or whenever you run a model Groq does not host. Below that throughput, Groq wins; above it, Modal wins.

Cost on a Real Workload

The only fair price comparison is a model both can serve. Take Llama 3.3 70B at 50 million output tokens per day, computed from the list prices above (early 2026).

Cost on a real workload (computed from list prices, June 2026)

  • Groq (per token): 50M output tokens/day x $0.79/million = $39.50/day, about $1,185/mo. Input tokens add $0.59/million on top. No idle cost, nothing to provision.
  • Modal (per GPU-second): one dedicated H100 at $0.001097/sec runs about $94.78/day, roughly $2,843/mo if pinned 24/7. To serve 50M tokens/day on one H100 you need a sustained 579 output tok/s, well within an H100's range, so the GPU sits underused.
  • Break-even: Modal's H100 only undercuts Groq once it actually sustains about 1,390 output tok/s ($0.001097/sec / $0.00000079/token). At 50M tokens/day that is roughly 579 tok/s, below break-even, so Groq is cheaper here. Modal turns cheaper when you push the same H100 past ~120M output tokens/day, or run a model Groq does not host.

The lesson is simple: at low to moderate volume on a stock model, Groq's per-token price beats a dedicated GPU. Modal wins when you saturate the hardware, batch heavily, or serve weights Groq has never seen.

Cold Starts & Scaling

This is where the serverless-GPU model shows its one real tax, and where Groq has no tax at all.

On Modal, a function that has scaled to zero has to cold-start: pull the container, load the model into GPU memory, and warm the engine. For a large LLM, that first request can take from a few seconds to over a minute on an unoptimized container. Modal's answer is GPU memory snapshots (2025, currently alpha), which capture full GPU state via CUDA checkpoint/restore so a snapshotted container restores up to 10x faster. You opt in by setting `enable_memory_snapshot=True` plus the `enable_gpu_snapshot` experimental option, and mark warm-up code with the `@modal.enter(snap=True)` decorator. Modal also uses pre-warmed idle GPU buffers and lazy-loading filesystem caching to shave startup.

Groq has no cold start. The model is always resident on LPU hardware, so the first token of your first request arrives as fast as the millionth. You trade away the ability to scale your own model to zero, but you never pay a warm-up penalty or manage a min-containers setting.

~10x
Modal cold-start speedup with GPU snapshots
0s
Groq cold start (always warm)

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Speed: LPU Streaming vs Tunable GPU Throughput

Groq wins raw per-stream token latency on its catalog; Modal wins on flexibility and aggregate throughput.

Groq's LPU is deterministic silicon designed for single-stream token generation, which is why Llama 3.3 70B runs at about 394 tok/s, Llama 3.1 8B at about 840 tok/s, and GPT-OSS 20B at about 1,000 tok/s. Groq also ships Llama 3.3 70B with speculative decoding for higher throughput on supported endpoints. For a chat or agent loop that streams to a user, that latency is hard to beat.

Modal's speed is whatever your GPU and engine deliver. SGLang has lower overhead for decode-heavy and smaller models; vLLM is stronger for mixed and prefill-heavy workloads. You can saturate an H100 or burst to thousands of GPUs for embeddings and batch, where aggregate throughput matters more than single-stream latency. Modal will not beat a hand-tuned LPU on per-stream latency for a model Groq has optimized, but it will run a model Groq has never seen.

Models & Flexibility

Modal runs anything; Groq runs a curated, hand-optimized list.

CapabilityModalGroq
Stock open LLMsYes (you deploy)Yes (catalog)
Your own fine-tuned weightsYesEnterprise/GroqRack only
LoRA fine-tuning jobsYes, train + deployNo
Image / video / audioYesLimited (TTS/STT models)
EmbeddingsYes (e.g. Qwen2-7B)No first-party embeddings
Choice of enginevLLM / SGLang / TensorRTNone (managed)
Structured outputs (JSON schema)Depends on your engineYes, strict constrained decode
Parallel function callingDepends on your engineYes
Built-in agentic toolingNoYes (Compound: web, code, browser)

Groq's structured outputs use `strict: true` constrained decoding to guarantee schema adherence, and it supports parallel function calling. Its Compound system (now generally available) packages web search, code execution, and browser control into a single agentic endpoint, with `groq/compound` for multi-tool requests and `groq/compound-mini` for single-tool calls.

Modal does not give you any of that out of the box, because Modal is not a model. Whatever structured-output or tool-calling support you want comes from the engine you deploy. The upside is total control: train a LoRA on Modal, save it to a volume, merge, and serve it, all in the same Python codebase.

Compliance & Deployment

Both have enterprise paths; the on-prem story differs.

Modal offers SOC 2 compliance and HIPAA compatibility on its Enterprise plan, with audit logs and private support. Everything runs in Modal's cloud on managed GPUs.

Groq is SOC 2 Type II compliant, with GDPR and HIPAA coverage on eligible services (note: the Compound agentic system is explicitly not a HIPAA Covered Cloud Service today). For regulated, air-gapped, or sovereign deployments, GroqRack puts the LPU on-premises or in colocation, which Modal does not offer. If you must own the silicon, GroqRack is the only on-prem option between these two.

When to Use Modal

  • You run your own model. Fine-tuned weights, a model your team trained, or a niche open model Groq does not host. Modal runs any model on any engine.
  • Non-LLM workloads. Image generation, video, audio, embeddings, batch pipelines. Modal bursts to thousands of GPUs and scales to zero between jobs.
  • High, sustained utilization. If you can keep a GPU busy, per-second GPU pricing beats paying a per-token margin.
  • Full engine control. You want to pick vLLM vs SGLang vs TensorRT-LLM, set quantization, and tune cold starts with memory snapshots.
  • Python-native infra. You want infrastructure defined in your application code, not a separate YAML or Terraform layer.

When to Use Groq

  • Stock open models, maximum speed. Llama 3.3 70B at 394 tok/s, GPT-OSS 20B at 1,000 tok/s, no infrastructure to manage.
  • Low-latency chat and agents. Always-warm LPUs mean zero cold start and fast first-token latency for user-facing streaming.
  • Bursty or low-volume traffic. Per-token billing means you pay nothing when idle and never provision GPUs.
  • Built-in agentic tooling. Compound gives you web search, code execution, and browser control behind one endpoint.
  • Strict structured output. `strict: true` constrained decoding guarantees JSON-schema adherence, plus parallel function calling.

Frequently Asked Questions

What is the difference between Modal and Groq?

Modal is a serverless GPU platform: you write Python, deploy your own model and engine (vLLM, SGLang, TensorRT-LLM), and pay per second of GPU compute (H100 at about $0.001097/sec, roughly $3.95/hr). Groq is a tokens-as-a-service API that runs a fixed catalog of open models on custom LPU silicon and bills per million tokens (Llama 3.3 70B at $0.59/$0.79). Modal gives you control over the model; Groq gives you raw speed on a curated menu with zero infrastructure.

Is Modal or Groq cheaper for LLM inference?

Cheaper depends on throughput. Groq's per-token pricing is cheaper for bursty or low-volume traffic because you never pay for idle GPUs. Modal's per-second GPU billing wins once a single H100 sustains more than about 1,390 output tok/s on Llama 3.3 70B (Modal $0.001097/sec versus Groq $0.79/million output), or whenever you run a model Groq does not host. Below that throughput Groq wins; above it Modal wins.

Does Groq let you run your own fine-tuned model?

Not on the public on-demand API. Groq serves a fixed catalog (Llama, GPT-OSS, Qwen, Kimi K2, DeepSeek and others) optimized for its LPU. Custom or fine-tuned weights and dedicated capacity go through enterprise and GroqRack on-premises deployments. If you need self-serve arbitrary weights, Modal is built for that.

How fast is Groq compared to a GPU on Modal?

Groq's LPU runs Llama 3.3 70B at about 394 tok/s, Llama 3.1 8B at about 840 tok/s, and GPT-OSS 20B at about 1,000 tok/s as of early 2026. On Modal, throughput depends on your GPU and engine; a tuned vLLM or SGLang deployment on an H100 is fast but typically will not match Groq's per-stream latency on the models Groq has hand-optimized.

Can I run the exact same model on both Modal and Groq?

Only where the model is on Groq's catalog. If you want Llama 3.3 70B or GPT-OSS, you can call it on Groq's LPU as a token API or deploy it yourself on a Modal GPU and pay per second. For those overlapping models the comparison is clean: Groq is cheaper and lower-latency at bursty or low volume, Modal wins once you keep a GPU saturated or want a quantization or engine Groq does not offer. For any model Groq does not host (a fine-tuned checkpoint, an image or audio model, a niche open model), there is no overlap and Modal is the only option of the two.

Related Comparisons

Run your own model on GPU-seconds, or buy tokens on a fixed menu

Pick Modal for your own model or arbitrary compute, Groq for a stock model as fast and cheap as possible. If applying model-generated code edits is your bottleneck instead, that is what Morph Fast Apply does at ~10,500 tok/s.