Together AI and DeepInfra both serve open-weight models through an OpenAI-compatible API, both charge per token, and both have no minimums. That is where the similarity ends.
DeepInfra is the price floor. Llama 3.3 70B Turbo runs at $0.10 per million input tokens and $0.32 per million output, billed per minute with sub-half-second time-to-first-token. The pitch is blunt: cheapest serverless tokens, plus raw on-demand GPU instances from $0.89/hr for an A100. Founded in 2022, it grew through a self-serve, developer-led motion with almost no marketing.
Together AI sells performance and platform depth. Its ATLAS adaptive speculator learns from your live traffic and pushes DeepSeek-V3.1 past 500 tok/s. It runs SOC 2 Type II inference with HIPAA-aligned options, offers LoRA and full fine-tuning, and ships self-service GPU clusters that scale from 16 to over 100,000 InfiniBand-connected GPUs.
All pricing below is as of early 2026 and changes often. Check the live pricing pages before you commit.
TL;DR
- Pick DeepInfra if cost is the deciding factor. Roughly 3x to 10x cheaper per token on common models, per-minute billing, sub-0.5s TTFT, and on-demand A100s at $0.89/hr. The lowest-friction way to serve open models.
- Pick Together AI if you need speed at scale, full fine-tuning with weight export, SOC 2 / HIPAA-aligned inference, or training-scale GPU clusters. ATLAS speculative decoding and dedicated endpoints make it a platform, not just an endpoint.
Who Wins Per Workload
The decision rarely comes down to one number. Map your actual workload to the row below.
| Workload / decision | Together AI | DeepInfra |
|---|---|---|
| Cheapest open-model tokens | Loses on list price | DeepInfra, ~10x cheaper input |
| Fastest first call (low TTFT) | Fast, not headlined | DeepInfra, ~0.35s TTFT |
| Sustained high throughput | Together, ATLAS to 500 tok/s | Standard decoding |
| Cheapest raw GPU box | $6.49/hr H100 endpoint | DeepInfra, $0.89/hr A100 |
| Fine-tune and export weights | Together, full + LoRA to 100B | Adapter serving only |
| Training-scale clusters | Together, 16 to 100k+ GPUs | On-demand instances only |
| Strictest compliance | Together, SOC 2 + HIPAA + BAA | Self-serve, limited |
| Widest model catalog | Broad | DeepInfra, 190+ models |
| Zero-friction indie start | More platform surface | DeepInfra, base-URL swap |
Quick Comparison
DeepInfra wins on raw cost. Together AI wins on speed mechanisms, fine-tuning, and enterprise compliance. Morph is a different layer entirely, tuned for code edits rather than general token serving.
| Spec | Together AI | DeepInfra | Morph |
|---|---|---|---|
| Focus | General inference + training cloud | Cheapest serverless tokens | Coding-agent inner loop |
| Llama 3.3 70B (in/out per 1M) | $1.04 / $1.04 | $0.10 / $0.32 | N/A (apply model) |
| Billing | Per token + per-hour endpoints | Per token, per-minute GPUs | Per request / per token |
| Speculative decoding | ATLAS adaptive (learns from traffic) | Standard | ngram k=64, code-tuned |
| Code-specific apply endpoint | No | No | Yes (/v1/code/apply) |
| Semantic code search | No | No | WarpGrep ($0/100k) |
| Apply throughput | General token serving | General token serving | ~10,500 tok/s |
| First-pass apply accuracy | N/A | N/A | 98% |
| Fine-tuning | LoRA + full, up to 100B | LoRA adapters | N/A |
| Dedicated GPU floor | $6.49/hr (H100) | $0.89/hr (A100) | N/A |
| SOC 2 / HIPAA-aligned | Yes | Self-serve, limited | N/A |
| Best for | Speed + training + compliance | Cheap high-volume serving | Search, apply, compact loop |
Cost on a Real Workload
Computed from list prices, June 2026
Serving Llama 3.3 70B with a chat-style mix of 50M input and 50M output tokens per day:
- DeepInfra serverless: 50 x $0.10 (input) + 50 x $0.32 (output) = $5.00 + $16.00 = $21/day, about $630/mo.
- Together AI serverless: 50 x $1.04 + 50 x $1.04 = $52.00 + $52.00 = $104/day, about $3,120/mo. Roughly 5x the DeepInfra bill on the same tokens.
- DeepInfra dedicated H100: $1.79/hr x 24 x 30 = about $1,289/mo per GPU. Dedicated only beats DeepInfra serverless ($630/mo) once you saturate the box, which at this 100M tokens/day mix you do not. Serverless wins until you are pushing roughly 2x this volume on one GPU around the clock.
Break-even read: at 100M tokens/day, DeepInfra serverless is the floor at about $630/mo, Together serverless is about 5x that for the speculative-decoding and platform depth, and a dedicated GPU only pays off above sustained, near-24/7 saturation. List prices move often; redo the arithmetic with the live numbers before committing.
Pricing: DeepInfra Is the Floor
On serverless tokens, DeepInfra is consistently cheaper, often by a wide margin.
| Model | Together AI | DeepInfra |
|---|---|---|
| Llama 3.3 70B (input) | $1.04 | $0.10 |
| Llama 3.3 70B (output) | $1.04 | $0.32 |
| DeepSeek (input) | $2.10 ($0.20 cached) | $0.32 |
| DeepSeek (output) | $4.40 | $0.89 |
| Billing granularity | Per token | Per token, per-minute GPUs |
| Batch discount | 50% on batch API | Not advertised |
For Llama 3.3 70B, DeepInfra is roughly 10x cheaper on input and 3x cheaper on output. The gap narrows on premium models and reverses when you factor in speed: a faster provider finishes the same job in fewer wall-clock seconds, and Together AI bills for tokens, not time, on serverless.
Read the Caveat
These are list serverless prices as of early 2026 and both providers move them frequently. DeepInfra's cheaper tokens come with the standard decoder; Together's higher list price buys ATLAS speculative decoding and a deeper platform. Benchmark on your own traffic before locking in.
Speed: ATLAS Is Together AI's Real Edge
Together AI's speed story is not a faster GPU, it is a smarter decoder.
ATLAS, the Adaptive-Learning Speculator System, pairs a heavyweight static speculator trained on a broad corpus with a lightweight adaptive speculator that updates from your live traffic in real time. The longer it serves your workload, the better its draft predictions get. Together reports up to 500 tok/s on DeepSeek-V3.1 and 460 tok/s on Kimi-K2 in fully adapted scenarios, about 2.65x faster than standard decoding, and faster than dedicated speed hardware on those runs.
DeepInfra optimizes a different number: time-to-first-token. It reports TTFT as low as 0.35 seconds, which effectively removes cold-start delay for real-time apps. Its throughput uses standard decoding, so on long generations Together's adaptive speculation can pull ahead while DeepInfra stays cheaper per token.
The practical read: if your workload has steady, repeating patterns (a production agent or chatbot), ATLAS compounds and Together gets faster over time. If your workload is bursty and cost-sensitive, DeepInfra's low TTFT and cheap tokens win.
Dedicated GPUs and Clusters
Both offer dedicated compute, but they aim at different scales.
| Compute | Together AI | DeepInfra |
|---|---|---|
| A100 80GB | Cluster-tier | $0.89 |
| H100 80GB | $6.49 (dedicated endpoint) | $1.79 |
| H200 141GB | Cluster-tier | $2.19 |
| B200 180GB | $11.95 (dedicated endpoint) | $2.79 |
| B300 270GB | Cluster-tier | $4.20 |
| Billing | Per hour | Per minute |
| Max scale | 100,000+ GPUs (InfiniBand) | On-demand instances |
DeepInfra's GPU instances are raw, per-minute, on-demand boxes: cheapest way to grab an A100 or B200 and run your own server. Together AI's dedicated endpoints package a managed serving stack at a higher rate, and its real differentiator is GPU Clusters, generally available since 2025, which scale from 16 to over 100,000 GB200/H200/H100 GPUs interconnected with InfiniBand and NVLink. That is training-scale infrastructure DeepInfra does not target.
Features and Developer Surface
On day-to-day API ergonomics, the two are closer than the pricing suggests.
| Feature | Together AI | DeepInfra |
|---|---|---|
| OpenAI-compatible API | Yes | Yes |
| JSON mode / structured output | Yes (schema in response_format) | Yes (83 of 84 models) |
| Function calling / tool use | Yes | Yes (79 of 84 models) |
| LoRA adapter serving | Yes | Yes |
| Batch API | Yes (50% discount) | Not advertised |
| Self-service GPU clusters | Yes | On-demand instances |
| Cold-start TTFT | Fast | ~0.35s |
DeepInfra supports JSON mode on 83 of 84 models and function calling on 79 of 84, all through the OpenAI SDK with just a base-URL and key swap. Together AI matches structured output and function calling, then adds a batch API at a 50% token discount for non-time-sensitive jobs. For most app code, switching between the two is a base-URL change.
Fine-Tuning: Together AI Is the Full Platform
If you need to train, not just serve, Together AI is the more complete platform.
| Capability | Together AI | DeepInfra |
|---|---|---|
| LoRA fine-tuning | Yes | Adapter serving |
| Full fine-tuning | Yes (up to 100B) | No |
| Training price (up to 16B) | $0.48-$1.35 per 1M tokens | N/A |
| Training price (70-100B) | $2.90-$8.00 per 1M tokens | N/A |
| Long-context training | Yes | N/A |
Together AI prices fine-tuning per training token: roughly $0.48 to $1.35 per million for models up to 16B, scaling to $2.90 to $8.00 per million for 70B to 100B models, with both LoRA and full fine-tuning. DeepInfra focuses on serving: you can deploy a LoRA adapter on a supported base model and call it through the standard API, but it is not a from-scratch fine-tuning platform.
Compliance and Enterprise
Together AI is the safer pick for regulated workloads.
Together AI runs SOC 2 Type II compliant inference, offers HIPAA-aligned options with business associate agreements, encrypts data in transit and at rest, and lets you pin storage to North America, Europe, or Asia/Middle East for data residency. DeepInfra is built around a low-friction self-serve motion and does not foreground the same enterprise compliance surface. If you are in healthcare, finance, or any regulated vertical, that gap matters.
The Practical Split
DeepInfra optimizes for the indie developer and cost-sensitive startup: cheapest tokens, fewest sales calls. Together AI optimizes for the scaling company and the enterprise: speed mechanisms, fine-tuning, compliance, and clusters. Your stage and regulatory posture usually decide this more than any single benchmark.
When to Use Together AI
- You need speed at scale. ATLAS adaptive speculative decoding hits 500 tok/s on DeepSeek-V3.1 and gets faster as it learns your traffic. For steady production workloads, it compounds.
- You are fine-tuning, not just serving. LoRA and full fine-tuning up to 100B, priced per training token, with long-context support.
- You have compliance requirements. SOC 2 Type II, HIPAA-aligned options, BAAs, and regional data residency.
- You need training-scale GPUs. Self-service clusters from 16 to 100,000+ InfiniBand-connected GB200/H200/H100 GPUs.
- You want a batch discount. The batch API cuts token cost 50% for non-time-sensitive jobs.
When to Use DeepInfra
- Cost is the deciding factor. Llama 3.3 70B at $0.10/$0.32 per million tokens is roughly 3x to 10x cheaper than Together for the same model.
- You want cheap raw GPUs. On-demand A100 at $0.89/hr, H100 at $1.79/hr, B200 at $2.79/hr, all billed per minute.
- Latency matters for real-time apps. Time-to-first-token as low as 0.35s effectively removes cold-start delay.
- You want zero friction. Self-serve signup, OpenAI SDK with a base-URL swap, JSON mode and function calling on nearly every model.
- You are early-stage and price-sensitive. No sales calls, no minimums, lowest token cost in the market for many open models.
Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).
Frequently Asked Questions
Is Together AI or DeepInfra cheaper?
DeepInfra is cheaper on serverless tokens for most models. Llama 3.3 70B Turbo runs at $0.10 per million input and $0.32 per million output on DeepInfra, versus $1.04/$1.04 on Together AI as of early 2026. DeepInfra also has lower dedicated GPU rates, from $0.89/hr for an A100 versus Together's $6.49/hr H100 endpoint. Together justifies the premium with ATLAS speculative decoding, fine-tuning, and SOC 2 compliance.
What is Together AI's ATLAS speculative decoding?
ATLAS pairs a heavyweight static speculator with a lightweight adaptive one that updates from your live traffic, so it gets faster the more it sees your workload. Together reports up to 500 tok/s on DeepSeek-V3.1 and 460 tok/s on Kimi-K2 in fully adapted scenarios, about 2.65x faster than standard decoding.
Does DeepInfra support fine-tuning and LoRA?
DeepInfra supports deploying LoRA adapters on top of supported base models, served through the OpenAI-compatible API. It is lighter than Together AI's fine-tuning platform, which offers LoRA and full fine-tuning up to 100B with per-token training pricing. For full custom fine-tuning, Together is the more complete option.
Do both providers offer dedicated GPUs?
Yes. DeepInfra offers per-minute on-demand instances: A100 at $0.89/hr, H100 at $1.79/hr, H200 at $2.19/hr, B200 at $2.79/hr. Together AI offers dedicated endpoints (H100 at $6.49/hr, B200 at $11.95/hr) plus self-service InfiniBand clusters that scale from 16 to over 100,000 GPUs.
Can I fine-tune on Together AI and serve the result on DeepInfra?
Mostly yes, and that split plays to each provider's strength. Together AI supports full fine-tuning up to 100B and lets you export the weights, so you can train there and serve the open-weight checkpoint on DeepInfra's cheaper tokens or its on-demand GPUs from $0.89/hr. DeepInfra itself only serves LoRA adapters on supported base models, not from-scratch fine-tuning. The catch: DeepInfra serves a fixed catalog plus your adapters, so a fully custom merged checkpoint may need a dedicated GPU instance rather than the serverless endpoint.
Related Comparisons
DeepInfra for the Floor, Together for the Platform
Serve open models as cheap as possible on DeepInfra, or train, export, and run regulated workloads on Together AI. If applying model-generated edits is your bottleneck, that is a different layer.