Together AI vs DeepInfra (2026): Cheapest Tokens vs Full Platform

You are deciding where to serve open-weight models, and the shortlist is Together AI and DeepInfra. Both expose an OpenAI-compatible API, both charge per token, and both have no minimum spend. The split is cost versus platform depth.

DeepInfra is the price floor. DeepSeek V4 Pro runs at $1.30 per 1M input and $2.60 per 1M output, against Together's $2.10/$4.40 on the same model. On-demand GPUs start at $0.89/hr for an A100 and $1.79/hr for an H100.

Together AI charges more per token and earns it with LoRA and full fine-tuning, a batch API at up to 50% off, and self-service InfiniBand GPU clusters DeepInfra does not offer. All prices below are verified as of June 9, 2026, and both providers move them often. Confirm on the live pricing pages before you commit.

TL;DR

Pick DeepInfra if cost is the deciding factor. It is 30-60% cheaper per token on every shared model below, rents an A100 at $0.89/hr and an H100 at $1.79/hr, and runs zero-retention inference. No free tier, 200 concurrent requests per account.
Pick Together AI if you need LoRA or full fine-tuning with weight export, a 50%-off batch API, or training-scale InfiniBand GPU clusters. Higher token prices buy a deeper platform, not just an endpoint.
For DeepSeek or codegen specifically, see the fidelity section: most serverless providers quantize activations to fp8 to cut cost. Morph serves DeepSeek at full bf16 and tunes inference for code.

Who Wins Per Workload

The decision rarely comes down to one number. Map your actual workload to the row below.

Together AI vs DeepInfra by Decision

Workload / decision	Together AI	DeepInfra
Cheapest open-model tokens	Higher list price	DeepInfra, 30-60% cheaper
Cheapest raw GPU box	$6.49/hr H100 endpoint	DeepInfra, $0.89/hr A100
Fine-tune and export weights	Together, LoRA + full	Serverless catalog only
Batch / non-urgent jobs	Together, up to 50% off	Not advertised
Training-scale clusters	Together, InfiniBand clusters	On-demand instances only
Zero-retention inference	SOC 2 Type 2	DeepInfra, deletes prompts
Known fixed rate limit	Dedicated endpoint	DeepInfra, 200 concurrent
Zero-friction indie start	More platform surface	DeepInfra, base-URL swap
DeepSeek at full fidelity	Serverless catalog	Serverless catalog

Serverless Per-Token Pricing: DeepInfra Is the Floor

On every model both providers share, DeepInfra is cheaper, usually by 30-60%. Prices are per 1M tokens, input / output, June 2026.

Serverless Per-Token Pricing (per 1M, in / out, June 2026)

Model	Together AI	DeepInfra
DeepSeek V4 Pro	$2.10 / $4.40	$1.30 / $2.60
Kimi K2.6	$1.20 / $4.50	$0.75 / $3.50
GLM-5.1	$1.40 / $4.40	$1.05 / $3.50
Qwen3-Max	$1.25 / $3.75	$1.20 / $6.00
Llama 3.1 8B	Not listed	$0.02 / $0.05
Cached input discount	Yes (model-specific)	Yes (model-specific)

DeepInfra leads on most shared models, but check the exact pair you run. On Qwen3-Max, DeepInfra is cheaper on input ($1.20 vs $1.25) yet more expensive on output ($6.00 vs $3.75), so an output-heavy agent workload flips the math toward Together. DeepInfra also resells Anthropic models (Claude Haiku 4.5 $1.00/$5.00, Sonnet 4.6 $3.00/$15.00, Opus 4.8 $5.00/$25.00) if you want closed and open models under one key.

$1.30 / $2.60

DeepInfra DeepSeek V4 Pro per 1M (in/out)

$2.10 / $4.40

Together AI DeepSeek V4 Pro per 1M (in/out)

Cost on a Real Workload

Computed from list prices, June 2026

Serving DeepSeek V4 Pro with 50M input and 50M output tokens per day:

DeepInfra serverless: 50 x $1.30 + 50 x $2.60 = $65 + $130 = $195/day, about $5,850/mo.
Together AI serverless: 50 x $2.10 + 50 x $4.40 = $105 + $220 = $325/day, about $9,750/mo. Roughly 1.67x the DeepInfra bill on the same tokens.
DeepInfra dedicated H100: $1.79/hr x 24 x 30 = about $1,289/mo per GPU. A single H100 only beats serverless once you saturate it; below that line, serverless is cheaper because you pay only for tokens used.

Break-even read: DeepInfra serverless is the floor for sub-saturation traffic, Together serverless costs about 1.67x more on this model for the platform depth, and a dedicated GPU pays off only above sustained near-24/7 saturation. Redo the arithmetic with the model you actually run and the live numbers before committing.

Dedicated GPUs and Clusters

Both offer dedicated compute, but they aim at different scales. DeepInfra rents raw, cheap on-demand boxes. Together packages a managed serving stack and sells training-scale clusters.

Dedicated GPU and Endpoint Pricing ($/hr, June 2026)

Compute	Together AI	DeepInfra
A100 80GB	Cluster-tier	$0.89 on-demand
H100 80GB	$6.49 dedicated endpoint	$1.79 on-demand
H200 141GB	Contact sales	$2.19 on-demand
B200 180GB	$11.95 dedicated endpoint	$2.79 on-demand
B300	Cluster-tier	$4.20 on-demand
Billing	Per hour	Per minute

Together also rents GPU clusters on-demand: HGX H100 at $5.49/hr, HGX H200 at $6.79/hr, HGX B200 at $9.95/hr, with reserved rates from $3.99 to $9.65/hr for 7 to 180-plus-day commitments. Those clusters are InfiniBand-connected, training-scale infrastructure DeepInfra does not target. If your job is renting a single cheap box to run your own server, DeepInfra wins outright; an H100 there is $1.79/hr versus Together's $6.49/hr dedicated endpoint or $5.49/hr cluster rate.

Rate Limits and Billing

The two handle scaling very differently.

Rate Limits and Billing

Aspect	Together AI	DeepInfra
Free tier	Limited free usage	None
Serverless rate limit	Dynamic, scales with traffic	200 concurrent requests/account
Fixed-limit option	Dedicated endpoint	Dedicated GPU instance
Billing model	Per token + per-hour endpoints	Postpaid, per-minute GPUs
Invoicing	Usage-based	Monthly + mid-month at thresholds

DeepInfra publishes a hard 200-concurrent-request cap per account and bills postpaid, with mid-month invoices triggered at usage thresholds of $20, $100, $500, $2,000, and $10,000. There is no free tier. Together does not publish fixed per-model serverless limits; its limits are dynamic and scale with sustained traffic, and it points you to a dedicated endpoint when you need a guaranteed ceiling.

Fine-Tuning and Batch: Together AI Is the Full Platform

If you need to train, not just serve, Together is the more complete platform and lets you export the resulting weights.

Fine-Tuning and Batch (June 2026)

Capability	Together AI	DeepInfra
LoRA SFT, up to 16B	$0.48 / 1M tokens	Not a training platform
LoRA SFT, 17B-69B	$1.50 / 1M tokens	N/A
LoRA SFT, 70-100B	$2.90 / 1M tokens	N/A
LoRA DPO, up to 16B	$0.54 / 1M tokens	N/A
Weight export	Yes	N/A
Batch API	Up to 50% off, 24h window	Not advertised

Together prices LoRA SFT fine-tuning per training token: $0.48 per 1M up to 16B, $1.50 for 17B-69B, and $2.90 for 70-100B, with LoRA DPO at $0.54 / $1.65 / $3.20 across the same size bands. Its batch API runs up to 50% off serverless on selected models with a 24-hour best-effort window, up to 50,000 requests per batch and 100 MB per input file. DeepInfra is built around cheap serving, not training, so the common pattern is to fine-tune on Together, export, then serve the open-weight checkpoint on DeepInfra's cheaper tokens or a $0.89/hr A100.

Compliance and Data Retention

Both are SOC 2 audited; their data-handling stories differ.

Compliance and Data Retention

Aspect	Together AI	DeepInfra
SOC 2	Type 2 certified	Certified
ISO 27001	Not stated here	Certified
GDPR / HIPAA	Evaluate per workload	Technical + org measures
Prompt retention	Per data policy	Deleted after short window
Metadata logging	Per data policy	Request ID, cost, params only

DeepInfra runs zero-retention inference: prompts and completions are deleted from disk and memory after a short retention period, and only metadata (request ID, cost, sampling parameters) is logged. The exception is Google models, where Google logs prompts and responses for abuse detection. DeepInfra also lists ISO 27001 alongside SOC 2 and measures for GDPR and HIPAA. Together AI is SOC 2 Type 2 certified, covering access management, encryption, incident response, and change management. Read each trust center against your exact framework before committing a regulated workload.

Running DeepSeek at Full Fidelity (and Codegen)

One factor neither price table shows: most serverless providers quantize activations to fp8 to cut cost, which degrades output quality. If you are running DeepSeek specifically and output fidelity matters, that quantization is the variable to watch.

Morph Open Source Models serve DeepSeek with 16-bit (bf16) activations, no fp8 or int8 quantization, so responses match the reference weights. For coding agents, Morph adds codegen-specific speculative decoding plus custom low-level inference kernels, which makes it the fastest and highest-quality option for code generation rather than a general-purpose menu. morph-dsv4flash (DeepSeek V4 Flash) is $0.139 per 1M input and $0.278 per 1M output. See pricing for the full list.

bf16

Morph DeepSeek activations (no fp8/int8)

$0.139 / $0.278

morph-dsv4flash per 1M (in/out)

Code-tuned

Speculative decoding + custom kernels

When to Use Together AI

You are fine-tuning, not just serving. LoRA SFT from $0.48 per 1M up to 16B, $2.90 for 70-100B, plus LoRA DPO and weight export.
You want a batch discount. Up to 50% off serverless on selected models with a 24-hour window and up to 50,000 requests per batch.
You need training-scale GPUs. Self-service InfiniBand clusters (HGX H100 $5.49/hr, HGX H200 $6.79/hr, HGX B200 $9.95/hr on-demand) with reserved rates from $3.99/hr.
Your model skews output-heavy. On a few models (Qwen3-Max output at $3.75 vs DeepInfra's $6.00), Together can be the cheaper choice.

When to Use DeepInfra

Cost is the deciding factor. 30-60% cheaper per token on DeepSeek V4 Pro, Kimi K2.6, and GLM-5.1, with Llama 3.1 8B at $0.02/$0.05.
You want cheap raw GPUs. On-demand A100 at $0.89/hr, H100 at $1.79/hr, H200 at $2.19/hr, B200 at $2.79/hr, billed per minute.
You need zero-retention inference. Prompts and completions deleted after a short window, only metadata logged, plus SOC 2 and ISO 27001.
You want zero friction. Self-serve signup, OpenAI SDK with a base-URL and key swap, no minimums, 200 concurrent requests per account.

Neither is built for the coding-agent apply loop. If applying model-generated code edits is the bottleneck, that is a different layer (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Frequently Asked Questions

Is Together AI or DeepInfra cheaper?

DeepInfra is cheaper on serverless tokens for every shared model in this comparison. DeepSeek V4 Pro is $1.30/$2.60 per 1M in/out on DeepInfra versus $2.10/$4.40 on Together. Kimi K2.6 is $0.75/$3.50 versus $1.20/$4.50. GLM-5.1 is $1.05/$3.50 versus $1.40/$4.40. DeepInfra also rents dedicated GPUs far cheaper: an A100 at $0.89/hr and an H100 at $1.79/hr versus Together's $6.49/hr H100 endpoint. Check output-heavy models like Qwen3-Max, where Together's $3.75 output beats DeepInfra's $6.00.

What rate limits do Together AI and DeepInfra enforce?

DeepInfra caps accounts at 200 concurrent requests and has no free tier; billing is postpaid with mid-month invoices at $20, $100, $500, $2,000, and $10,000 thresholds. Together does not publish fixed per-model serverless limits; they are dynamic and scale with sustained traffic, and Together recommends a dedicated endpoint when you need a known ceiling.

Does DeepInfra support fine-tuning like Together AI?

Together is the fuller training platform. It prices LoRA SFT at $0.48 per 1M up to 16B, $1.50 for 17B-69B, and $2.90 for 70-100B (LoRA DPO slightly higher) and lets you export the weights. DeepInfra is built around cheap serving plus on-demand GPUs, not from-scratch fine-tuning. Fine-tune on Together, export, then serve the open-weight checkpoint on DeepInfra's cheaper tokens.

Do both providers offer dedicated GPUs?

Yes. DeepInfra on-demand, billed per minute: A100 $0.89/hr, H100 $1.79/hr, H200 $2.19/hr, B200 $2.79/hr, B300 $4.20/hr. Together dedicated endpoints: 1x H100 $6.49/hr, 1x HGX B200 $11.95/hr, plus GPU clusters (HGX H100 $5.49/hr, HGX H200 $6.79/hr, HGX B200 $9.95/hr on-demand; reserved from $3.99/hr).

Which has better data retention and compliance?

DeepInfra runs zero-retention inference: prompts and completions are deleted after a short window, only metadata is logged, and it lists SOC 2 and ISO 27001 plus GDPR and HIPAA measures (Google models are the logging exception). Together AI is SOC 2 Type 2 certified. Evaluate each trust center against your exact framework before moving a regulated workload.

Where should I run DeepSeek for the best output quality?

Most serverless providers quantize activations to fp8 to cut cost, which degrades output. Morph serves DeepSeek with full 16-bit (bf16) activations so responses match the reference weights, and adds codegen-specific speculative decoding and custom kernels for coding agents. morph-dsv4flash is $0.139/$0.278 per 1M. See Morph models and pricing.

Related Comparisons

DeepInfra for the Floor, Together for the Platform

Serve open models as cheap as possible on DeepInfra, or fine-tune, export, and rent InfiniBand clusters on Together. For DeepSeek at full bf16 fidelity or the coding-agent apply loop, that is a different layer.

See Morph Models

Fast Apply benchmarks

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Together AI vs DeepInfra (2026): DeepInfra Is 30-60% Cheaper Per Token, Together Adds Fine-Tuning and Clusters

Who Wins Per Workload

Serverless Per-Token Pricing: DeepInfra Is the Floor

Cost on a Real Workload

Dedicated GPUs and Clusters

Rate Limits and Billing

Fine-Tuning and Batch: Together AI Is the Full Platform

Compliance and Data Retention

Running DeepSeek at Full Fidelity (and Codegen)

When to Use Together AI

When to Use DeepInfra

Frequently Asked Questions

Is Together AI or DeepInfra cheaper?

What rate limits do Together AI and DeepInfra enforce?

Does DeepInfra support fine-tuning like Together AI?

Do both providers offer dedicated GPUs?

Which has better data retention and compliance?

Where should I run DeepSeek for the best output quality?

Related Comparisons

DeepInfra for the Floor, Together for the Platform