Together AI vs Groq (2026): Pricing, Speed, and Rate Limits Compared

You are choosing an inference provider for an open-weight model and you need two numbers: the per-token price and the speed. Together AI and Groq both expose an OpenAI-compatible API, but they answer those two numbers differently.

Groq runs its own chip, the LPU, and serves a tight catalog at a fixed per-token price with deterministic latency and no cold starts. Llama 3.3 70B is $0.59 input / $0.79 output per 1M tokens at 394 tok/s. Together runs an NVIDIA GPU cloud with a far wider model list, fine-tuning, dedicated endpoints, and a batch API, but it does not undercut Groq on a model both serve.

$0.59/$0.79

Groq Llama 70B (in/out per 1M)

394 tok/s

Groq Llama 70B speed

$6.49/hr

Together dedicated H100

$0.15/$0.60

GPT-OSS 120B (both)

TL;DR

Pick Groq if your model is on its catalog and you want the lowest fixed latency. The LPU runs Llama 3.3 70B at 394 tok/s and GPT-OSS 20B at 1,000 tok/s with no cold starts, $0.59/$0.79 on Llama 70B, and a 50% batch discount.
Pick Together AI if you need a model Groq does not serve (GLM-5.1, DeepSeek V4, Qwen3.7-Max), fine-tuning, or self-serve dedicated GPUs from $6.49/hr.
GPT-OSS 120B costs the same on both: $0.15/$0.60 per 1M. On that model the tiebreaker is Groq's published 500 tok/s and no cold start.

Serverless Pricing: Same Model, Same or Lower on Groq

For a model both providers serve, Groq matches or beats Together's per-token rate. The gap that matters is the models Groq does not serve at all.

Serverless per-token pricing (per 1M tokens, June 2026)

Model	Together AI (in / out)	Groq (in / out)
Llama 3.3 70B	$1.04 / $1.04	$0.59 / $0.79
GPT-OSS 120B	$0.15 / $0.60	$0.15 / $0.60
GPT-OSS 20B	$0.05 / $0.20	$0.075 / $0.30
GLM-5.1	$1.40 / $4.40	Not offered
DeepSeek V4 Pro	$2.10 / $4.40	Not offered
Kimi K2.6	$1.20 / $4.50	K2: $1.00 / $3.00
MiniMax M2.7	$0.30 / $1.20	Not offered
Qwen3.7-Max	$1.25 / $3.75	Qwen3 32B: $0.29 / $0.59

On Llama 3.3 70B, Groq is the cheaper option at $0.59 input / $0.79 output per 1M tokens, versus $1.04 flat on Together. On GPT-OSS 120B the rates are identical at $0.15 / $0.60. Together is cheaper on GPT-OSS 20B ($0.05 / $0.20 versus $0.075 / $0.30). Where Together wins outright is catalog: GLM-5.1, DeepSeek V4 Pro, MiniMax M2.7, and Qwen3.7-Max have no Groq equivalent, so if your stack depends on one of them the price comparison is moot.

Batch and Caching Discounts

Groq runs a Batch API at 50% lower cost with a 24-hour to 7-day processing window, and prompt caching halves input cost on supported models (Kimi K2 input drops from $1.00 to $0.50 per 1M on a cache hit). Together's Batch API gives up to 50% off serverless on selected models with a fixed 24-hour completion window, up to 50,000 requests per batch and 100 MB per input file.

Cost on a Real Workload

Llama 3.3 70B at 50M output tokens/day (computed from list prices, June 2026)

On Groq serverless, 50M output tokens/day at $0.79 per 1M is 50 × $0.79 = $39.50/day, about $1,185/mo, before any input cost. The same model on a single Together dedicated H100 80GB at $6.49/hr is 24 × 30 × $6.49 = $4,673/mo flat, no matter how much you serve.

Break-even runs the other way only at scale. A dedicated H100 at $4,673/mo equals Groq's $0.79/M output rate at about 5,915M output tokens/mo, roughly 197M output tokens/day. Below that, serverless wins; above it, a dedicated GPU you keep saturated wins. Groq does not sell self-serve dedicated capacity, so the only way to buy the high-volume side of that curve is Together (or a reserved cluster from $3.99/hr).

For most teams running well below 197M output tokens/day, Groq serverless is both cheaper and faster on Llama 70B. The crossover only favors a dedicated GPU when you can keep it busy around the clock, which is exactly the workload Together sells dedicated endpoints for and Groq does not.

Speed: Groq Publishes Fixed tok/s, Together Depends on Your GPU

Groq publishes a per-model output speed because the LPU runs the same way every time. Together's speed is a function of the GPU and replica count you provision, so there is no single published number.

Groq published output speed (tokens per second)

Model	Context	Output speed
GPT-OSS 20B	—	1,000 tok/s
Llama 3.1 8B Instant	128k	840 tok/s
Qwen3 32B	131k	662 tok/s
Llama 4 Scout (17Bx16E)	128k	594 tok/s
GPT-OSS 120B	—	500 tok/s
Llama 3.3 70B Versatile	128k	394 tok/s

Because the LPU compiler statically schedules every operation, there are no cache misses, no branch mispredicts, and no cold-start variance. Tail latency is flat under load, which is what interactive UX, voice, and real-time agents need. Together runs NVIDIA H100, H200, and B200 GPUs, so any model that fits on a GPU can run, but throughput and tail latency move with the hardware and replica count you choose.

394 tok/s

Groq Llama 3.3 70B

1,000 tok/s

Groq GPT-OSS 20B

No cold start

Groq LPU, deterministic

Rate Limits: Fixed and Published vs Dynamic

Groq publishes exact free-tier limits per model. Together's serverless limits are dynamic and scale with your sustained traffic, with no fixed per-model numbers published.

Groq free-plan rate limits

Model	RPM	RPD	TPM	TPD
llama-3.1-8b-instant	30	14,400	6,000	500,000
llama-3.3-70b-versatile	30	1,000	12,000	100,000

Groq's Developer plan unlocks higher limits plus Batch and Flex processing. Together's limits rise as you send steady traffic and throttle on sudden spikes (HTTP 429); for a known fixed ceiling Together recommends a dedicated endpoint, which gives reserved capacity with an SLA. If you need a guaranteed throughput number written down before you build, Groq publishes it and Together sells it as dedicated capacity.

Dedicated GPUs: Together Self-Serve, Groq Enterprise-Only

Together rents GPUs by the hour without a sales call. Groq does not publish self-serve dedicated pricing.

Together dedicated and cluster GPU pricing (per hour, June 2026)

GPU	Dedicated endpoint	On-demand cluster
H100 80GB	$6.49/hr	$5.49/hr
H200 141GB	Contact sales	$6.79/hr
B200 180GB	$11.95/hr	$9.95/hr
Reserved (7-180+ days)	—	$3.99-$9.65/hr

Reserved clusters drop the rate as low as $3.99/hr depending on hardware and commitment length. Groq's dedicated capacity and VPC-style isolation are enterprise conversations, not self-serve, so if you want to rent a GPU this afternoon, Together is the only one of the two that lets you.

Fine-Tuning: Together Trains, Groq Serves a Fixed Catalog

Together prices fine-tuning per 1M training tokens and lets you export the result. Groq serves a fixed catalog and does not offer self-serve training.

Together fine-tuning pricing (per 1M training tokens)

Model size	LoRA SFT	LoRA DPO
Up to 16B	$0.48	$0.54
17B to 69B	$1.50	$1.65
70B to 100B	$2.90	$3.20

If you need to train a custom model and own the weights, Together is the option of the two. Groq's value is the opposite trade: a smaller, fixed list where every model is already tuned for the LPU.

Compliance

Both clear the standard enterprise bar.

Compliance and deployment

Feature	Together AI	Groq
SOC 2 Type II	Yes	Yes
HIPAA	Yes	BAA, with exclusions
Self-serve dedicated GPUs	Yes ($6.49/hr H100)	Enterprise only
Fine-tuning	LoRA SFT + DPO	Not self-serve
OpenAI-compatible API	Yes	Yes
Published per-model speed	No	Yes

Together is SOC 2 Type 2 certified. Groq is SOC 2 Type II compliant; certain GroqCloud services can process PHI under Groq's BAA, but preview and beta features and Groq's compound AI systems are excluded from that BAA.

Running DeepSeek for Codegen: A Note on Quality

Neither Together nor Groq is built for the coding-agent inner loop, and there is a quality trap worth naming before you pick a DeepSeek host. Most serverless providers quantize activations to fp8 to cut cost, which degrades output relative to the reference weights. Morph serves DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization, so responses match the reference model. That makes Morph the option when DeepSeek output fidelity matters.

For code specifically, Morph runs codegen-tuned speculative decoding plus custom low-level inference kernels built for code generation, not a general-purpose menu. morph-dsv4flash (DeepSeek V4 Flash) is $0.139 per 1M input tokens and $0.278 per 1M output tokens. If the bottleneck is applying model-generated edits rather than generating them, Morph Fast Apply runs that step at ~10,500 tok/s, and WarpGrep handles code search free up to 100k requests, then $1 per 1M requests. See full pricing.

When to Use Together AI

You need a model Groq does not serve. GLM-5.1 ($1.40 / $4.40), DeepSeek V4 Pro ($2.10 / $4.40), MiniMax M2.7 ($0.30 / $1.20), Qwen3.7-Max ($1.25 / $3.75), and many more have no Groq equivalent.
You fine-tune. LoRA SFT from $0.48 per 1M training tokens (up to 16B), $2.90 for 70B to 100B, with weights you own.
You want self-serve dedicated GPUs. H100 80GB at $6.49/hr dedicated or $5.49/hr cluster, reserved from $3.99/hr.
You run very high volume on one model. Past ~197M output tokens/day on Llama 70B, a saturated dedicated H100 beats serverless per-token pricing.

When to Use Groq

Latency is the product. Deterministic, no-cold-start LPU latency. 394 tok/s on Llama 3.3 70B, 1,000 tok/s on GPT-OSS 20B, 840 tok/s on Llama 3.1 8B Instant.
Your model is on the catalog. If it is on Groq's list it is already tuned for the hardware; on Llama 70B you also get the lower price at $0.59 / $0.79.
You want published, fixed limits. Free-tier limits are written down per model (Llama 70B: 30 RPM, 1,000 RPD, 12,000 TPM, 100,000 TPD), not dynamic.
You run async at scale. The Batch API cuts cost 50% with a 24-hour to 7-day window, and prompt caching halves input cost on supported models.

Frequently Asked Questions

Is Groq cheaper than Together AI?

On models both serve, Groq matches or beats Together. Llama 3.3 70B is $0.59 input / $0.79 output per 1M on Groq versus $1.04 flat on Together. GPT-OSS 120B is $0.15 / $0.60 on both. GPT-OSS 20B is $0.05 / $0.20 on Together versus $0.075 / $0.30 on Groq. Together's value is the models Groq does not serve at all, plus fine-tuning and dedicated GPUs.

How fast is Groq on Llama 3.3 70B?

Groq publishes 394 tok/s for Llama 3.3 70B Versatile, 500 tok/s for GPT-OSS 120B, 1,000 tok/s for GPT-OSS 20B, 840 tok/s for Llama 3.1 8B Instant, and 662 tok/s for Qwen3 32B. Together runs NVIDIA GPUs and does not publish fixed per-model speeds; throughput depends on the GPU and replica count you provision.

What are Groq's free-tier rate limits?

On Groq's free plan, llama-3.1-8b-instant allows 30 RPM, 14,400 RPD, 6,000 TPM, and 500,000 TPD. llama-3.3-70b-versatile allows 30 RPM, 1,000 RPD, 12,000 TPM, and 100,000 TPD. The Developer plan unlocks higher limits plus Batch and Flex processing. Together's serverless limits are dynamic and scale with sustained traffic; for a fixed ceiling Together recommends a dedicated endpoint.

Does Groq support fine-tuning?

Groq serves a fixed catalog and does not offer self-serve fine-tuning. Together AI offers LoRA SFT at $0.48 per 1M training tokens (up to 16B), $1.50 (17B to 69B), and $2.90 (70B to 100B), with LoRA DPO at $0.54 / $1.65 / $3.20 for the same bands.

Does Groq sell dedicated GPUs like Together?

No self-serve dedicated GPU pricing is published by Groq; dedicated capacity is an enterprise conversation. Together rents dedicated endpoints by the hour: H100 80GB at $6.49/hr and HGX B200 180GB at $11.95/hr, plus on-demand clusters at HGX H100 $5.49/hr, HGX H200 $6.79/hr, and HGX B200 $9.95/hr, with reserved rates from $3.99 to $9.65/hr.

Which is better for running DeepSeek?

Groq does not serve DeepSeek and Together quantizes activations to fp8 on most serverless models. If DeepSeek output fidelity matters, Morph serves DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization, so responses match the reference weights. morph-dsv4flash (DeepSeek V4 Flash) is $0.139 / $0.278 per 1M in/out.

Related Comparisons

Groq for Fixed Latency, Together for Model Choice and Fine-Tuning

Pick by workload, not by brand. For running DeepSeek at full 16-bit fidelity or applying model-generated code edits at ~10,500 tok/s, Morph runs the coding loop.

See Morph Models

View pricing

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Together AI vs Groq (2026): Groq Is Cheaper on Llama 70B ($0.59/$0.79), Together Adds Fine-Tuning and Dedicated GPUs