Together AI vs Groq: Pick Groq for a Flat Latency Floor, Together for Model Choice and Fine-Tuning

Groq's LPU gives the lowest predictable per-token latency on a fixed model menu, no cold start, no fine-tuning. Together has 200+ models, fine-tuning, weight export, and ATLAS speculative decoding that ramps as it learns your traffic.

June 3, 2026 · 1 min read

Together AI and Groq both sell tokens from open-weight models behind an OpenAI-compatible API, but they bet on opposite ends of the stack. Together runs a broad GPU cloud: 200+ models, serverless and dedicated endpoints, fine-tuning, and its ATLAS adaptive speculator that learns from your traffic at runtime. Groq built its own chip, the LPU, and serves a tight catalog of open models at deterministic latency with no cold starts.

The choice comes down to what you optimize for. Groq wins raw single-stream speed and predictable tail latency. Together wins breadth, customization, and control over the model and deployment. Pick Groq for the lowest predictable per-token latency on a stock model; pick Together for model choice, fine-tuning, and portability.

TL;DR

  • Pick Together AI if you need model breadth, fine-tuning, or dedicated GPUs. 200+ models, full and LoRA fine-tuning, serverless Multi-LoRA, ATLAS adaptive speculative decoding, and private VPC deployment.
  • Pick Groq if you need the lowest latency on a popular open model. The LPU runs Llama 3.3 70B at ~394 tok/s with deterministic, no-cold-start latency, and a 50% batch discount, on a tighter model catalog.

Who Wins Per Workload

The verdict is not one provider. It splits cleanly by what your workload needs.

Workload / decisionTogether AIGroq
Lowest latency floorGroq, flat LPU latencyGroq, flat LPU latency
Fastest first call (no cold start)Groq, deterministic, no warmupGroq, deterministic, no warmup
Cheapest on a shared modelGroq, $0.59 / $0.79 on Llama 70BGroq, $0.59 / $0.79 on Llama 70B
Largest model catalogTogether, 200+ open modelsTogether, 200+ open models
Fine-tune and export weightsTogether, full + LoRA, weights you ownTogether, full + LoRA, weights you own
Multimodal (image / audio / video)Together, FLUX + transcription + videoTogether, FLUX + transcription + video
Self-serve dedicated GPUsTogether, autoscaling from $4.99/hrTogether, autoscaling from $4.99/hr
Sustained speed once warmedTogether, ATLAS ramps to ~500 tok/sTogether, ATLAS ramps to ~500 tok/s
Avoid hardware lock-inTogether, portable weightsTogether, portable weights

Quick Comparison

Groq optimizes for speed on a narrow catalog. Together optimizes for breadth and control. Morph optimizes for the coding-agent inner loop.

SpecTogether AIGroqMorph
Primary FocusBroad GPU cloudCustom LPU speedCoding-agent loop
HardwareNVIDIA H100/H200/B200Custom LPUTuned GPU fleet
Model Catalog200+ open modelsTight curated setCode-specific models
Llama 3.3 70B (in/out per 1M)~$0.88-$1.04$0.59 / $0.79N/A
Peak Output Speed~500 tok/s (ATLAS)~500-1000 tok/s~10,500 tok/s (apply)
Code-Specific ApplyNoNoYes (/v1/code/apply)
Code SearchNoNoWarpGrep semantic
Fine-TuningFull + LoRA trainingLoRA inference onlyN/A (apply layer)
Dedicated EndpointsYes (autoscaling)Enterprise onlyN/A
Cold StartsPossible on serverlessNone (deterministic)Warm fleet
Best ForCustom models at scaleLow-latency open modelsCoding agents

Pricing: Groq Undercuts on the Same Model

For an identical model, Groq is the cheaper per-token option. Together's pricing buys you more models and deployment control, not the lowest rate.

ModelTogether AI (in / out)Groq (in / out)
Llama 3.3 70B~$0.88-$1.04 flat$0.59 / $0.79
GPT-OSS 120BVaries$0.15 / $0.60
GPT-OSS 20BVaries$0.075 / $0.30
GLM-5.1$1.40 / $4.40Not offered
DeepSeek V4 Pro$2.10 / $4.40Not offered
Batch API discount~50%50%
Prompt cachingYes (select models)50% on cache hits

Llama 3.3 70B costs $0.59 input and $0.79 output per million tokens on Groq, versus roughly $0.88 to $1.04 per million on Together AI. Groq also discounts its Batch API and prompt caching by 50%, and the two stack, so async workloads can run near 25% of on-demand pricing.

Together's catalog reaches models Groq does not serve at all, like GLM-5.1 at $1.40 / $4.40 and DeepSeek V4 Pro at $2.10 / $4.40 per million tokens. If your model is not on Groq's list, the price comparison is moot.

Dedicated GPU Pricing

Together rents dedicated GPUs by the hour: a single H100 80GB runs $6.49/hr, an HGX B200 180GB runs $11.95/hr, and reserved H100 clusters drop to $4.99/hr. Groq does not publish on-demand dedicated GPU pricing; dedicated capacity is an enterprise-only conversation.

Cost on a Real Workload

Cost on a real workload (computed from list prices, June 2026)

Serving Llama 3.3 70B at 50M output tokens/day on Groq serverless: 50 × $0.79 = $39.50/day = ~$1,185/mo. The same volume on a single Together dedicated H100 80GB at $6.49/hr runs 24 × 30 × $6.49 = ~$4,673/mo flat. So at this volume serverless is roughly 4× cheaper.

Break-even runs the other way only at scale: a dedicated H100 at ~$4,673/mo equals Groq's $0.79/M output rate at about 5,915M output tokens/mo, or ~197M output tokens/day. Below that, serverless wins; above it, a dedicated GPU you saturate wins, and Together's reserved H100 at $4.99/hr (~$3,593/mo) drops the break-even to ~152M tokens/day.

Cheaper depends on volume: below ~197M output tokens/day on Llama 70B, Groq serverless undercuts a Together dedicated H100; above it, the saturated dedicated GPU wins. Groq does not sell self-serve dedicated capacity, so the only way to capture the high-volume side of that curve today is Together.

Speed: Groq Owns Single-Stream Latency

On raw single-request latency, Groq wins. Its LPU is built for it.

ModelTogether AIGroq
Llama 3.3 70BGPU-dependent~394 tok/s
GPT-OSS 120BGPU-dependent~500 tok/s
GPT-OSS 20BGPU-dependent~1,000 tok/s
DeepSeek-V3.1 (ATLAS)up to ~500 tok/sNot offered
Kimi-K2 (ATLAS)up to ~460 tok/sAvailable (preview)
Cold-start latencyPossibleNone (deterministic)

Groq runs Llama 3.3 70B at about 394 tok/s and the smaller GPT-OSS-20B at about 1,000 tok/s. Because the LPU compiler statically schedules every operation, there are no cache misses, no branch misprediction, and no cold-start variance. Tail latency is predictable, which matters for interactive UX and real-time voice.

Together counters with ATLAS, its adaptive speculative decoder. ATLAS learns from live traffic and reaches up to 500 tok/s on DeepSeek-V3.1 and up to 460 tok/s on Kimi-K2 once fully adapted, which Together measured at 2.65x faster than standard decoding. The catch is that the speedup ramps as the speculator learns your distribution, so a cold or highly varied workload sees less of it than the headline number.

Architecture: Custom Silicon vs GPU Cloud

The hardware split explains every other difference between these two.

Groq designed the LPU, a Language Processing Unit with a deterministic, statically scheduled architecture. The compiler knows where every byte sits at every clock cycle, so there is no speculative cache and no runtime scheduling overhead. That determinism is why a trillion-parameter model like Kimi K2 can stream tokens in real time on GroqCloud, and why tail latency stays flat under load.

Together runs standard NVIDIA hardware (H100, H200, B200) and wins on software. Its stack layers the Together Kernel Collection, custom FP8 kernels that it measures at 75%+ faster than base PyTorch, and the ATLAS runtime-learning speculator on top of stock GPUs. The advantage: any model that runs on a GPU can run on Together, and you can pick the exact GPU and replica count.

394 tok/s
Groq Llama 3.3 70B
2.65x
Together ATLAS vs standard decode
75%+
Together FP8 kernels vs PyTorch

Model Catalog: Breadth vs Curation

Together serves nearly everything; Groq serves a curated set fast.

Together AI lists 200+ open-source models spanning text, vision, image, audio, and video: Llama 4 Maverick and Scout, DeepSeek V4, Qwen3, Mistral, FLUX image generation, plus embeddings and audio transcription. If you need a niche or brand-new open model, Together usually has it within days.

Groq runs only open models, and a tighter list: Llama, GPT-OSS, Qwen3, Kimi K2, and a handful of others that its team has ported and tuned for the LPU. The tradeoff is intentional. Every model on Groq is fast because the team optimized it; you trade catalog size for guaranteed performance.

The model-availability tradeoff

If your stack depends on a specific model, check Groq's live list before committing. Together is the safer default for "serve whatever model my team picks next quarter"; Groq is the better choice for "make this one popular model as fast as physically possible."

Fine-Tuning: Together Trains, Groq Only Serves

This is the clearest functional gap. Together trains and serves custom models; Groq only serves adapters you trained elsewhere.

CapabilityTogether AIGroq
Full fine-tuningYesNo
LoRA trainingYesNo
LoRA inferenceYes (Multi-LoRA)Yes (enterprise)
DPO / preference tuningYesNo
Serve adapter on base priceYes (Multi-LoRA)No
Structured outputs / JSONYesYes (strict schema)
Function callingYesYes

Together offers full and LoRA fine-tuning, DPO, and serverless Multi-LoRA, which serves hundreds of adapters on the same infrastructure at base-model per-token prices. Groq supports LoRA inference only, gated to enterprise accounts, and you must train the adapter on another platform first.

Both support function calling and structured outputs. Groq's structured outputs use constrained decoding with strict schema mode, which guarantees 100% JSON-schema adherence and never emits invalid JSON.

Deployment & Enterprise

Both clear the standard compliance bar; Together gives you more deployment knobs.

FeatureTogether AIGroq
SOC 2 Type IIYesYes
HIPAAYes (BAA)Yes
ServerlessYesYes
Dedicated endpointsYes (autoscaling)Enterprise only
Private VPCYesEnterprise
OpenAI-compatible APIYesYes
Max context (typical)Model-dependent~128K-131K

Together's dedicated endpoints autoscale on two axes: GPUs per replica (vertical) and replica count (horizontal), with min/max bounds you set. That maps cleanly to spiky production traffic. It is SOC 2 Type II certified with HIPAA-aligned BAAs and private VPC deployment.

Groq is SOC 2, GDPR, and HIPAA compliant, exposes an OpenAI-compatible API, and offers regional endpoint selection. Dedicated capacity and VPC-style isolation are enterprise conversations rather than self-serve. Most Groq models cap context around 128K to 131K tokens.

When to Use Together AI

  • You need a specific or niche model. 200+ open models across text, vision, image, audio, and video. If your team will swap models often, Together usually already has the next one.
  • You fine-tune. Full and LoRA training, DPO, and serverless Multi-LoRA that serves hundreds of adapters at base-model prices. Groq cannot train at all.
  • You want dedicated GPUs. Self-serve dedicated endpoints with two-axis autoscaling, plus reserved clusters from $4.99/hr for predictable spend at scale.
  • You run image, audio, or video. FLUX, transcription, and video generation live alongside text under one API and one bill.
  • You want VPC isolation. Private VPC deployment with SOC 2 Type II and HIPAA BAAs for regulated workloads.

When to Use Groq

  • Latency is the product. Deterministic, no-cold-start latency on the LPU. ~394 tok/s on Llama 3.3 70B and ~1,000 tok/s on GPT-OSS-20B for interactive UX, voice, and real-time agents.
  • You serve a popular open model. If your model is on Groq's curated list, it is already tuned for the hardware and runs faster than the same model on a general GPU.
  • You want the lowest per-token price on that model. Llama 3.3 70B at $0.59 / $0.79 per million, with a 50% batch discount that stacks with prompt caching.
  • You need predictable tail latency. Static scheduling means no cache misses or branch mispredicts, so p99 stays flat under load.
  • You need guaranteed JSON. Strict structured-output mode uses constrained decoding for 100% schema adherence.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Frequently Asked Questions

Is Groq cheaper than Together AI?

For the same model, usually yes. Llama 3.3 70B is $0.59 input / $0.79 output per million on Groq, versus roughly $0.88 to $1.04 per million on Together AI as of early 2026. Groq also discounts its Batch API and prompt caching by 50%. Together's value is breadth and customization, not the lowest per-token price.

Which is faster, Together AI or Groq?

For single-stream latency, Groq, on most models. The LPU runs Llama 3.3 70B at ~394 tok/s with deterministic, no-cold-start latency. Together's ATLAS speculator reaches up to 500 tok/s on DeepSeek-V3.1 and 460 tok/s on Kimi-K2 once it has learned your traffic, so it closes the gap on specific GPU workloads.

Does Groq support fine-tuning?

No, only LoRA inference, gated to enterprise accounts. You train the adapter elsewhere and serve it on Groq. Together AI offers full and LoRA training plus serverless Multi-LoRA that serves hundreds of adapters at base-model prices.

How many models do they support?

Together AI lists 200+ open models across text, vision, image, audio, and video. Groq runs a tighter curated set (Llama, GPT-OSS, Qwen3, Kimi K2, and others) optimized for the LPU. Together for breadth, Groq for speed on a popular model.

Can I move a model off Groq or Together if I need to?

Together is the more portable choice. It serves open-weight models you can also download and run yourself, and a model you fine-tune on Together exports as weights you own and can host elsewhere. Groq runs the same open models, but there is no fine-tuning to export, and the LPU speed advantage is tied to Groq's own hardware, so the fast version of your model lives only on GroqCloud. If avoiding lock-in matters, Together gives you the model and the weights; Groq gives you the speed but keeps it on its silicon.

Related Comparisons

Groq for a Flat Latency Floor, Together for Model Choice and Fine-Tuning

Pick by workload, not by brand. If applying model-generated code edits is the bottleneck, Morph Fast Apply runs that loop at ~10,500 tok/s.