Fireworks vs Together AI: 70B Pricing Is a Wash, So Pick on Lock-In

On Llama 70B the prices are within pennies, so the decision is engine philosophy and lock-in. Fireworks tunes speculative decoding ahead of time and keeps you on its stack; Together's ATLAS learns from live traffic and lets you export weights and rent raw GPUs.

June 3, 2026 · 1 min read

On Llama 70B, Fireworks AI and Together AI price within pennies of each other, so cost is not the deciding factor. The real choice is engine philosophy and lock-in. Fireworks tunes speculative decoding ahead of time with FireOptimizer and keeps you on its FireAttention stack. Together's ATLAS speculator learns from your live traffic, and Together lets you download fine-tuned weights and rent raw GPUs. Pick on tuned-and-managed versus adaptive-and-portable, not on price.

Both run an OpenAI-compatible API, both serve Llama, Qwen, DeepSeek, and GLM, and both let you rent dedicated GPUs by the hour when serverless stops being economical. The difference is in the engine. Fireworks built FireAttention, a custom CUDA inference stack with FP8 quantization and FireOptimizer, which tunes speculative decoding per workload from over 100,000 serving configurations. Together built ATLAS, an adaptive speculator that retrains on your live traffic and claims up to 4x vLLM throughput as it warms up.

TL;DR

  • Pick Fireworks if you want tuned-and-managed: FireAttention FP8 kernels and FireOptimizer's ahead-of-time speculative decoding give predictable low latency from day one, plus predicted outputs for edit and rewrite workloads. The tuning lives on Fireworks' stack and stays there.
  • Pick Together if you want adaptive-and-portable: ATLAS speculative decoding that learns from your live traffic, the broadest catalog (text, image, audio, embeddings, rerank, code interpreter), downloadable fine-tuned weights, and rentable raw GPU clusters if you want to run your own stack.

Who Wins Per Workload

The averages hide the real decision. Pick by what you are actually running.

Workload / decisionFireworks AITogether AI
Lowest latency floor on day oneFireworks: FireOptimizer tuned upfrontTogether: warms up over time
Best on long, stable trafficFireworks: predictable, fixed configTogether: ATLAS keeps speeding up
Bursty / spiky trafficFireworks: per-second scale-to-zeroTogether: less ATLAS benefit
Fine-tune and export weightsFireworks: plan-dependentTogether: download weights
Run your own engine on raw GPUsFireworks: no raw clustersTogether: HGX clusters from $3.49/hr
Multimodal (image / audio / rerank)Fireworks: limitedTogether: full multimodal bill
Edit / rewrite workloadsFireworks: predicted outputsTogether: no equivalent primitive
Cheapest small models (<16B)Fireworks: ~$0.10+ per 1MTogether: from ~$0.03 per 1M
Strictest compliance listFireworks: SOC 2 II, HIPAA, GDPR, ISOTogether: SOC 2 II, HIPAA (BAA)

Quick Comparison

Both providers are mature, OpenAI-compatible, and price-competitive. The contrast is engine philosophy and catalog breadth. Morph appears in the third column only for the narrow coding-agent apply slice, where neither general host competes.

SpecFireworks AITogether AIMorph
FocusHigh-throughput text + vision servingBroadest open-model catalog + GPU cloudCoding-agent inner loop
Inference engineFireAttention + FireOptimizerTogether Inference Engine + ATLASCode-tuned CUDA + ngram spec decode
Llama 70B serverless (per 1M)~$0.90~$0.88-$1.04N/A (not a general host)
Dedicated GPU (H100/hr)$7.00$6.49N/A
Code-specific apply endpointNoNoYes (/v1/code/apply)
Semantic code searchNoNoWarpGrep ($0/100k)
Apply throughputGeneral token-by-tokenGeneral token-by-token~10,500 tok/s
First-pass apply accuracyN/AN/A98%
Fine-tuningLoRA + full, serverless adaptersLoRA + full, downloadable weightsN/A
Image / audio modelsLimitedYes (image, audio, embeddings)Embeddings + rerank
ComplianceSOC 2 II, HIPAA, GDPR, ISOSOC 2 II, HIPAA (BAA)Standard
Best forFast text serving at scaleMultimodal + flexible deploymentCoding agents (apply/search/compact)

Numbers are list prices as of early 2026 and change often. Verify on each provider's pricing page before committing volume.

Inference Engine: FireAttention vs ATLAS

Both providers beat stock vLLM, but they get there in opposite ways. Fireworks tunes ahead of time. Together adapts continuously.

Fireworks: FireAttention + FireOptimizer

FireAttention is a custom CUDA inference engine with handwritten kernels and FP8 quantization on Hopper-class GPUs (H100, H200). Fireworks reports up to 4x faster serving than self-hosted vLLM, and up to 12x faster on long-context tasks with FireAttention V2. FireOptimizer sits on top and tunes serving configuration, including adaptive speculative decoding, by searching over 100,000 options to match your latency and quality targets.

Fireworks also ships predicted outputs, which speed up edit and rewrite workloads where most of the new text matches the original. That is a useful primitive if a large fraction of your output is unchanged from the input.

Together: ATLAS adaptive speculator

ATLAS (AdapTive-LeArning Speculator System) is Together's answer to static speculators. A small draft model proposes tokens that the main model verifies in parallel, and ATLAS keeps retraining that draft model on your live traffic. Throughput climbs as the speculator learns your prompt patterns, with Together claiming up to 4x vLLM once warmed. The tradeoff: the gain depends on traffic regularity, so bursty or highly varied workloads see less benefit.

The practical difference: Fireworks gives you a tuned config from day one, predictable across traffic shapes, and the tuning stays on Fireworks. Together gets faster the longer it runs on a stable workload, and you can take a fine-tuned model with you. Neither changes the fundamental model, so accuracy on your task is set by the model you pick, not the engine.

Serverless Pricing: Roughly a Wash on 70B

For mainstream open models the two providers land within pennies of each other. The cost differences show up at the tails: very small models and very large MoEs.

Model tierFireworks AITogether AI
Small (under 16B)From ~$0.10-$0.20From ~$0.03-$0.20
Llama 3.x 70B (above 16B)~$0.90~$0.88-$1.04 (in/out)
Large MoE (e.g. DeepSeek/GLM)Tier-based$1.40-$4.40 (in/out)
Embeddings (per 1M)$0.008-$0.10~$0.02
Batch inference50% of serverless~50% of serverless
Cached input50% of input priceModel-dependent

Both discount batch inference 50%. Fireworks additionally prices cached input at 50% for all text and vision models, which helps prompt-heavy workloads with stable system prompts. Together's budget tier reaches lower on the smallest models, with options like gpt-oss-20B near $0.05 input and $0.20 output per 1M.

Where the bill actually moves

On a single 70B model, picking the cheaper provider saves you cents. The real cost lever is the engine: faster speculative decoding means fewer GPU-seconds per request, and dedicated GPUs win once your sustained throughput crosses the break-even point versus per-token serverless. Together estimates dedicated H100s beat serverless above roughly 130,000 tokens per minute of sustained load.

Cost on a Real Workload

Cost on a real workload (computed from list prices, early 2026)

Serving Llama 70B at 50M output tokens per day, using only the list prices above:

  • Fireworks serverless: 50 x $0.90 = $45/day = ~$1,350/mo.
  • Together serverless: at ~$0.90 per 1M output, the same ~$45/day = ~$1,350/mo. Within rounding of Fireworks.
  • One dedicated H100, full month: Fireworks at $7.00/hr x 24 x 30 = ~$5,040/mo; Together at $6.49/hr = ~$4,673/mo; a Together raw HGX H100 cluster reservation at $3.49/hr = ~$2,513/mo.

At 50M tokens/day, serverless wins on both providers. A single managed H100 only beats Fireworks serverless above ~$5,040/mo of serverless spend, which at $0.90 per 1M is roughly 187M output tokens/day of sustained load. Together's raw cluster H100 at $3.49/hr breaks even far earlier, around 93M output tokens/day, the payoff for managing your own stack. Below those thresholds, stay serverless; the 70B serverless prices are a wash, so decide on lock-in, not cents.

Dedicated GPUs: Together Edges Fireworks on Price

Both let you reserve GPUs by the hour or second when serverless stops being economical. Together lists a slightly lower H100 rate. Fireworks scales to zero on idle.

GPUFireworks AITogether AI
H100 80GB$7.00$6.49
H200$7.00Custom quote
B200$10.00$11.95
B300$12.00N/A
Billing granularityPer second, scale to zeroPer minute
Raw GPU clustersNoYes (HGX H100/H200/B200)

Fireworks bills dedicated deployments per second and auto-scales to zero, which fits spiky traffic with long idle gaps. Together additionally rents raw HGX GPU clusters (H100 at $3.49/hr, H200 at $4.19/hr, B200 at $7.49/hr for cluster reservations), so it doubles as a GPU cloud if you want to run your own stack.

Fine-Tuning: Similar Price, Different Ownership

Both support LoRA and full fine-tuning priced per training token. The split is what you get at the end.

TierFireworks (LoRA / Full)Together (LoRA / Full)
Up to 16B$0.50 / $1.00$0.48 / $1.20
16B-80B / 17B-69B$3.00 / $6.00$1.50 / $3.75
70B-100B / 80B-300B$6.00 / $12.00$2.90 / $7.25
Serve fine-tuned adapterServerless LoRADedicated or serverless
Download weightsPlan-dependentYes

Together is cheaper on mid-to-large fine-tunes and lets you download your fine-tuned weights, which matters if you want to leave the platform or self-host later. Fireworks serves LoRA adapters serverless, so a fine-tuned model can share the base-model pool instead of holding a dedicated GPU. For many small adapters served on demand, that is the more efficient path.

Model Catalog: Together Is Broader, Fireworks Is Deeper on Text

Together is the better one-stop shop. Fireworks concentrates its engineering on text and vision throughput.

CapabilityFireworks AITogether AI
Text + vision LLMsYes (FireAttention)Yes
Image generationLimitedYes ($0.0006-$0.134/image)
Audio transcriptionLimitedYes (~$0.0015/min)
EmbeddingsYesYes (~$0.02/1M)
Rerank endpointLimitedYes (~$0.20/1M)
Code interpreterNoYes (~$0.03/session)
Structured output / JSONJSON schema + grammarsYes
Function callingYesYes

Fireworks does ship strong structured-output support, enforcing JSON schemas or custom grammars at decode time, which is valuable for agent tool calls and strict API responses. But if your product spans modalities, Together's single bill for image, audio, embeddings, and rerank is the simpler integration.

Compliance & Enterprise

Both clear the bar for regulated buyers. Fireworks publishes the longer certification list.

  • Fireworks: SOC 2 Type II, HIPAA, GDPR, and ISO certifications. Founded by ex-PyTorch engineers from Meta; reports processing roughly 15 trillion tokens per day at scale.
  • Together: SOC 2 Type 2, plus HIPAA with encryption in transit and at rest, audit logging, and Business Associate Agreements for healthcare customers.

For HIPAA workloads, confirm a signed BAA on your specific plan with either provider before sending protected health information. Both offer OpenAI-compatible endpoints, so migrating between them, or away from either, is mostly a base-URL and API-key change.

When to Use Fireworks AI

  • Fastest single-model text serving. FireAttention FP8 kernels and FireOptimizer's tuned speculative decoding deliver predictable low latency from day one, up to 4x vLLM and 12x on long context.
  • Edit and rewrite workloads. Predicted outputs accelerate generations where most of the output matches the input, common in document rewriting and code rewriting.
  • Strict structured output. JSON-schema and custom-grammar enforcement at decode time keeps agent tool calls and API responses well-formed.
  • Regulated industries. SOC 2 Type II, HIPAA, GDPR, and ISO give procurement a short path to yes.
  • Spiky traffic. Per-second dedicated billing with scale-to-zero avoids paying for idle GPUs between bursts.

When to Use Together AI

  • Multimodal products. Text, image, audio, embeddings, rerank, and a code interpreter on one bill, instead of stitching providers together.
  • Adaptive throughput on stable traffic. ATLAS keeps learning your prompt patterns and speeds up over time, up to 4x vLLM once warmed.
  • Fine-tune and own the weights. Cheaper mid-to-large fine-tuning, with downloadable weights if you want to self-host later.
  • You also need raw GPUs. Reserve HGX H100/H200/B200 clusters and run your own stack alongside the serverless API.
  • Cheapest small models. The budget tier reaches as low as $0.03-$0.05 per 1M input on the smallest models.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Frequently Asked Questions

Is Fireworks AI or Together AI cheaper?

On Llama 70B they are effectively tied: Fireworks lists about $0.90 per 1M tokens for models above 16B, Together roughly $0.88 to $1.04 per 1M tokens for Llama 3.3 70B (as of early 2026). Together is cheaper at the tails, with a budget tier reaching about $0.03 per 1M input on the smallest models. Both discount batch inference 50%. Cheaper depends on your model mix, not the provider: Together wins on small models, parity on 70B.

What is the difference between FireAttention and Together ATLAS?

FireAttention is Fireworks' custom CUDA inference engine with FP8 quantization, paired with FireOptimizer, which tunes speculative decoding per workload from over 100,000 serving configurations ahead of time. Together ATLAS is an adaptive speculator that retrains on your live traffic, so throughput climbs as it learns your prompt patterns, up to 4x vLLM once warmed. Fireworks tunes ahead of time; Together adapts continuously.

Do Fireworks and Together support fine-tuning?

Yes, both support LoRA and full fine-tuning priced per training token. Fireworks charges about $3.00 per 1M tokens for LoRA SFT in the 16B to 80B range and serves fine-tuned LoRA adapters serverless. Together charges about $2.90 per 1M tokens for LoRA on 70B to 100B models and lets you download your fine-tuned weights. If you need raw checkpoint ownership, Together is the safer default.

Which provider has more models and modalities?

Together AI. Beyond text LLMs it serves image generation, audio transcription, embeddings, a rerank endpoint, and a code interpreter, plus rentable GPU clusters. Fireworks focuses more tightly on high-throughput text and vision serving with FireAttention, embeddings, and fine-tuning. For a single multimodal bill, Together covers more.

Does either provider lock you in?

Both expose OpenAI-compatible endpoints, so switching is mostly a base-URL and API-key change. The lock-in is at the layers above the API. Fireworks tunes FireOptimizer and FireAttention to your workload and keeps that tuning on its stack. Together is the more portable choice: it lets you download fine-tuned weights and rent raw HGX GPU clusters, so you can move a fine-tuned model off-platform or run your own engine. If keeping the option to self-host later matters, Together is the safer default; if you want the platform to own the tuning, Fireworks is fine.

Related Comparisons

On 70B, Pick on Lock-In, Not Price

Fireworks tunes ahead of time and keeps you on its stack; Together adapts to live traffic and lets you export weights. Separately, if applying model-generated code edits is your bottleneck, that is a different tool.