Together AI vs DeepInfra: Full-Stack Training Cloud vs the Price Floor

DeepInfra serves Llama 3.3 70B about 10x cheaper on input with ~0.35s TTFT and dedicated GPUs from $0.89/hr. Together AI costs more per token but adds ATLAS speculative decoding, full fine-tune and export, InfiniBand clusters, and SOC 2 / HIPAA controls. Which one fits your workload.

June 3, 2026 · 1 min read

Together AI and DeepInfra both serve open-weight models through an OpenAI-compatible API, both charge per token, and both have no minimums. That is where the similarity ends.

DeepInfra is the price floor. Llama 3.3 70B Turbo runs at $0.10 per million input tokens and $0.32 per million output, billed per minute with sub-half-second time-to-first-token. The pitch is blunt: cheapest serverless tokens, plus raw on-demand GPU instances from $0.89/hr for an A100. Founded in 2022, it grew through a self-serve, developer-led motion with almost no marketing.

Together AI sells performance and platform depth. Its ATLAS adaptive speculator learns from your live traffic and pushes DeepSeek-V3.1 past 500 tok/s. It runs SOC 2 Type II inference with HIPAA-aligned options, offers LoRA and full fine-tuning, and ships self-service GPU clusters that scale from 16 to over 100,000 InfiniBand-connected GPUs.

All pricing below is as of early 2026 and changes often. Check the live pricing pages before you commit.

TL;DR

  • Pick DeepInfra if cost is the deciding factor. Roughly 3x to 10x cheaper per token on common models, per-minute billing, sub-0.5s TTFT, and on-demand A100s at $0.89/hr. The lowest-friction way to serve open models.
  • Pick Together AI if you need speed at scale, full fine-tuning with weight export, SOC 2 / HIPAA-aligned inference, or training-scale GPU clusters. ATLAS speculative decoding and dedicated endpoints make it a platform, not just an endpoint.

Who Wins Per Workload

The decision rarely comes down to one number. Map your actual workload to the row below.

Workload / decisionTogether AIDeepInfra
Cheapest open-model tokensLoses on list priceDeepInfra, ~10x cheaper input
Fastest first call (low TTFT)Fast, not headlinedDeepInfra, ~0.35s TTFT
Sustained high throughputTogether, ATLAS to 500 tok/sStandard decoding
Cheapest raw GPU box$6.49/hr H100 endpointDeepInfra, $0.89/hr A100
Fine-tune and export weightsTogether, full + LoRA to 100BAdapter serving only
Training-scale clustersTogether, 16 to 100k+ GPUsOn-demand instances only
Strictest complianceTogether, SOC 2 + HIPAA + BAASelf-serve, limited
Widest model catalogBroadDeepInfra, 190+ models
Zero-friction indie startMore platform surfaceDeepInfra, base-URL swap

Quick Comparison

DeepInfra wins on raw cost. Together AI wins on speed mechanisms, fine-tuning, and enterprise compliance. Morph is a different layer entirely, tuned for code edits rather than general token serving.

SpecTogether AIDeepInfraMorph
FocusGeneral inference + training cloudCheapest serverless tokensCoding-agent inner loop
Llama 3.3 70B (in/out per 1M)$1.04 / $1.04$0.10 / $0.32N/A (apply model)
BillingPer token + per-hour endpointsPer token, per-minute GPUsPer request / per token
Speculative decodingATLAS adaptive (learns from traffic)Standardngram k=64, code-tuned
Code-specific apply endpointNoNoYes (/v1/code/apply)
Semantic code searchNoNoWarpGrep ($0/100k)
Apply throughputGeneral token servingGeneral token serving~10,500 tok/s
First-pass apply accuracyN/AN/A98%
Fine-tuningLoRA + full, up to 100BLoRA adaptersN/A
Dedicated GPU floor$6.49/hr (H100)$0.89/hr (A100)N/A
SOC 2 / HIPAA-alignedYesSelf-serve, limitedN/A
Best forSpeed + training + complianceCheap high-volume servingSearch, apply, compact loop

Cost on a Real Workload

Computed from list prices, June 2026

Serving Llama 3.3 70B with a chat-style mix of 50M input and 50M output tokens per day:

  • DeepInfra serverless: 50 x $0.10 (input) + 50 x $0.32 (output) = $5.00 + $16.00 = $21/day, about $630/mo.
  • Together AI serverless: 50 x $1.04 + 50 x $1.04 = $52.00 + $52.00 = $104/day, about $3,120/mo. Roughly 5x the DeepInfra bill on the same tokens.
  • DeepInfra dedicated H100: $1.79/hr x 24 x 30 = about $1,289/mo per GPU. Dedicated only beats DeepInfra serverless ($630/mo) once you saturate the box, which at this 100M tokens/day mix you do not. Serverless wins until you are pushing roughly 2x this volume on one GPU around the clock.

Break-even read: at 100M tokens/day, DeepInfra serverless is the floor at about $630/mo, Together serverless is about 5x that for the speculative-decoding and platform depth, and a dedicated GPU only pays off above sustained, near-24/7 saturation. List prices move often; redo the arithmetic with the live numbers before committing.

Pricing: DeepInfra Is the Floor

On serverless tokens, DeepInfra is consistently cheaper, often by a wide margin.

ModelTogether AIDeepInfra
Llama 3.3 70B (input)$1.04$0.10
Llama 3.3 70B (output)$1.04$0.32
DeepSeek (input)$2.10 ($0.20 cached)$0.32
DeepSeek (output)$4.40$0.89
Billing granularityPer tokenPer token, per-minute GPUs
Batch discount50% on batch APINot advertised

For Llama 3.3 70B, DeepInfra is roughly 10x cheaper on input and 3x cheaper on output. The gap narrows on premium models and reverses when you factor in speed: a faster provider finishes the same job in fewer wall-clock seconds, and Together AI bills for tokens, not time, on serverless.

$0.10 / $0.32
DeepInfra Llama 3.3 70B per 1M (in/out)
$1.04 / $1.04
Together AI Llama 3.3 70B per 1M (in/out)

Read the Caveat

These are list serverless prices as of early 2026 and both providers move them frequently. DeepInfra's cheaper tokens come with the standard decoder; Together's higher list price buys ATLAS speculative decoding and a deeper platform. Benchmark on your own traffic before locking in.

Speed: ATLAS Is Together AI's Real Edge

Together AI's speed story is not a faster GPU, it is a smarter decoder.

ATLAS, the Adaptive-Learning Speculator System, pairs a heavyweight static speculator trained on a broad corpus with a lightweight adaptive speculator that updates from your live traffic in real time. The longer it serves your workload, the better its draft predictions get. Together reports up to 500 tok/s on DeepSeek-V3.1 and 460 tok/s on Kimi-K2 in fully adapted scenarios, about 2.65x faster than standard decoding, and faster than dedicated speed hardware on those runs.

DeepInfra optimizes a different number: time-to-first-token. It reports TTFT as low as 0.35 seconds, which effectively removes cold-start delay for real-time apps. Its throughput uses standard decoding, so on long generations Together's adaptive speculation can pull ahead while DeepInfra stays cheaper per token.

500 tok/s
Together ATLAS on DeepSeek-V3.1
2.65x
ATLAS vs standard decoding
0.35s
DeepInfra time-to-first-token

The practical read: if your workload has steady, repeating patterns (a production agent or chatbot), ATLAS compounds and Together gets faster over time. If your workload is bursty and cost-sensitive, DeepInfra's low TTFT and cheap tokens win.

Dedicated GPUs and Clusters

Both offer dedicated compute, but they aim at different scales.

ComputeTogether AIDeepInfra
A100 80GBCluster-tier$0.89
H100 80GB$6.49 (dedicated endpoint)$1.79
H200 141GBCluster-tier$2.19
B200 180GB$11.95 (dedicated endpoint)$2.79
B300 270GBCluster-tier$4.20
BillingPer hourPer minute
Max scale100,000+ GPUs (InfiniBand)On-demand instances

DeepInfra's GPU instances are raw, per-minute, on-demand boxes: cheapest way to grab an A100 or B200 and run your own server. Together AI's dedicated endpoints package a managed serving stack at a higher rate, and its real differentiator is GPU Clusters, generally available since 2025, which scale from 16 to over 100,000 GB200/H200/H100 GPUs interconnected with InfiniBand and NVLink. That is training-scale infrastructure DeepInfra does not target.

Features and Developer Surface

On day-to-day API ergonomics, the two are closer than the pricing suggests.

FeatureTogether AIDeepInfra
OpenAI-compatible APIYesYes
JSON mode / structured outputYes (schema in response_format)Yes (83 of 84 models)
Function calling / tool useYesYes (79 of 84 models)
LoRA adapter servingYesYes
Batch APIYes (50% discount)Not advertised
Self-service GPU clustersYesOn-demand instances
Cold-start TTFTFast~0.35s

DeepInfra supports JSON mode on 83 of 84 models and function calling on 79 of 84, all through the OpenAI SDK with just a base-URL and key swap. Together AI matches structured output and function calling, then adds a batch API at a 50% token discount for non-time-sensitive jobs. For most app code, switching between the two is a base-URL change.

Fine-Tuning: Together AI Is the Full Platform

If you need to train, not just serve, Together AI is the more complete platform.

CapabilityTogether AIDeepInfra
LoRA fine-tuningYesAdapter serving
Full fine-tuningYes (up to 100B)No
Training price (up to 16B)$0.48-$1.35 per 1M tokensN/A
Training price (70-100B)$2.90-$8.00 per 1M tokensN/A
Long-context trainingYesN/A

Together AI prices fine-tuning per training token: roughly $0.48 to $1.35 per million for models up to 16B, scaling to $2.90 to $8.00 per million for 70B to 100B models, with both LoRA and full fine-tuning. DeepInfra focuses on serving: you can deploy a LoRA adapter on a supported base model and call it through the standard API, but it is not a from-scratch fine-tuning platform.

Compliance and Enterprise

Together AI is the safer pick for regulated workloads.

Together AI runs SOC 2 Type II compliant inference, offers HIPAA-aligned options with business associate agreements, encrypts data in transit and at rest, and lets you pin storage to North America, Europe, or Asia/Middle East for data residency. DeepInfra is built around a low-friction self-serve motion and does not foreground the same enterprise compliance surface. If you are in healthcare, finance, or any regulated vertical, that gap matters.

The Practical Split

DeepInfra optimizes for the indie developer and cost-sensitive startup: cheapest tokens, fewest sales calls. Together AI optimizes for the scaling company and the enterprise: speed mechanisms, fine-tuning, compliance, and clusters. Your stage and regulatory posture usually decide this more than any single benchmark.

When to Use Together AI

  • You need speed at scale. ATLAS adaptive speculative decoding hits 500 tok/s on DeepSeek-V3.1 and gets faster as it learns your traffic. For steady production workloads, it compounds.
  • You are fine-tuning, not just serving. LoRA and full fine-tuning up to 100B, priced per training token, with long-context support.
  • You have compliance requirements. SOC 2 Type II, HIPAA-aligned options, BAAs, and regional data residency.
  • You need training-scale GPUs. Self-service clusters from 16 to 100,000+ InfiniBand-connected GB200/H200/H100 GPUs.
  • You want a batch discount. The batch API cuts token cost 50% for non-time-sensitive jobs.

When to Use DeepInfra

  • Cost is the deciding factor. Llama 3.3 70B at $0.10/$0.32 per million tokens is roughly 3x to 10x cheaper than Together for the same model.
  • You want cheap raw GPUs. On-demand A100 at $0.89/hr, H100 at $1.79/hr, B200 at $2.79/hr, all billed per minute.
  • Latency matters for real-time apps. Time-to-first-token as low as 0.35s effectively removes cold-start delay.
  • You want zero friction. Self-serve signup, OpenAI SDK with a base-URL swap, JSON mode and function calling on nearly every model.
  • You are early-stage and price-sensitive. No sales calls, no minimums, lowest token cost in the market for many open models.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Frequently Asked Questions

Is Together AI or DeepInfra cheaper?

DeepInfra is cheaper on serverless tokens for most models. Llama 3.3 70B Turbo runs at $0.10 per million input and $0.32 per million output on DeepInfra, versus $1.04/$1.04 on Together AI as of early 2026. DeepInfra also has lower dedicated GPU rates, from $0.89/hr for an A100 versus Together's $6.49/hr H100 endpoint. Together justifies the premium with ATLAS speculative decoding, fine-tuning, and SOC 2 compliance.

What is Together AI's ATLAS speculative decoding?

ATLAS pairs a heavyweight static speculator with a lightweight adaptive one that updates from your live traffic, so it gets faster the more it sees your workload. Together reports up to 500 tok/s on DeepSeek-V3.1 and 460 tok/s on Kimi-K2 in fully adapted scenarios, about 2.65x faster than standard decoding.

Does DeepInfra support fine-tuning and LoRA?

DeepInfra supports deploying LoRA adapters on top of supported base models, served through the OpenAI-compatible API. It is lighter than Together AI's fine-tuning platform, which offers LoRA and full fine-tuning up to 100B with per-token training pricing. For full custom fine-tuning, Together is the more complete option.

Do both providers offer dedicated GPUs?

Yes. DeepInfra offers per-minute on-demand instances: A100 at $0.89/hr, H100 at $1.79/hr, H200 at $2.19/hr, B200 at $2.79/hr. Together AI offers dedicated endpoints (H100 at $6.49/hr, B200 at $11.95/hr) plus self-service InfiniBand clusters that scale from 16 to over 100,000 GPUs.

Can I fine-tune on Together AI and serve the result on DeepInfra?

Mostly yes, and that split plays to each provider's strength. Together AI supports full fine-tuning up to 100B and lets you export the weights, so you can train there and serve the open-weight checkpoint on DeepInfra's cheaper tokens or its on-demand GPUs from $0.89/hr. DeepInfra itself only serves LoRA adapters on supported base models, not from-scratch fine-tuning. The catch: DeepInfra serves a fixed catalog plus your adapters, so a fully custom merged checkpoint may need a dedicated GPU instance rather than the serverless endpoint.

Related Comparisons

DeepInfra for the Floor, Together for the Platform

Serve open models as cheap as possible on DeepInfra, or train, export, and run regulated workloads on Together AI. If applying model-generated edits is your bottleneck, that is a different layer.