Together AI vs Replicate (2026): Per-Token Warm Endpoints vs Per-GPU-Second With Cold Starts

Together AI bills per token on always-warm endpoints. Replicate bills per GPU-second and turns idle models off, so the first request after a quiet period pays a cold boot that can run minutes on large models. That single axis, per-token-warm versus per-second-cold, decides your cost at volume, your latency on the first call, and which workloads each platform fits.

Together is an open-model API and training cloud: serverless per-token serving, dedicated endpoints, raw GPU clusters, and per-training-token fine-tuning. Replicate, now owned by Cloudflare, is a run-any-container catalog billed mostly by runtime, strongest for image, video, and one-line Cog deploys of arbitrary models.

TL;DR

Pick Together if you serve high-volume text and want per-token pricing on warm endpoints with no cold starts. Llama 3.3 70B is $1.04/M in and out, GLM-5.1 is $1.40/$4.40, GPT-OSS-120B is $0.15/$0.60. Graduate to dedicated H100 at $6.49/hr or H100 clusters at $5.49/hr when serverless stops paying off. Together is SOC 2 Type 2 certified.
Pick Replicate if you need the widest catalog of community models, especially image and video, or you want to ship your own model with Cog. Per-GPU-second billing (H100 $5.49/hr, A100 $5.04/hr) with scale-to-zero fits spiky and one-off jobs, at the cost of a multi-minute cold boot on idle large models.
Pick Morph if the workload is DeepSeek or a coding agent. Morph serves DeepSeek with 16-bit activations (no fp8 quantization) and codegen-tuned speculative decoding. morph-dsv4flash is $0.139/$0.278 per 1M in/out.

Who Wins Per Workload

The question is rarely "which is better" and almost always "which fits this workload." Here is the call by the decision a developer actually faces.

Workload / decision	Together AI	Replicate
Sustained high-volume text	Together: per-token warm endpoints	Loses: billed per second on every generation
Bursty / one-off jobs	Loses: warm GPU sits idle	Replicate: scale-to-zero, pay nothing idle
Fastest first call (no cold start)	Together: warm catalog, 0s	Loses: minutes on idle large models
Multimodal (image/video/audio)	Curated set + Whisper $0.0015/min	Replicate: thousands, widest breadth
Fine-tune & export text-LLM weights	Together: per-token tunes, export weights	Per-GPU-second tunes
Ship an arbitrary custom model	Bring-your-own within families	Replicate: Cog packages any container
Raw GPU clusters at scale	Together: HGX H100/H200/B200 clusters	On-demand single GPUs; multi-GPU on committed contracts only
Cheapest per request at volume	Together: per-token amortizes	Loses on steady traffic
Run DeepSeek at full quality	Serverless (fp8 typical)	Per-token, no bf16 guarantee

Quick Comparison

The headline split is the billing axis. Together bills per token on warm models; Replicate bills per GPU-second across a far larger but colder catalog. Morph is listed for the DeepSeek and coding-agent slice only.

Together vs Replicate vs Morph at a Glance

Spec	Together AI	Replicate	Morph
Focus	High-volume text + multimodal serving	Run any community model on demand	DeepSeek + coding-agent inner loop
Billing axis	Per token (warm)	Per GPU-second (cold-starts)	Per token (code-tuned)
Llama 3.3 70B serverless (per 1M)	$1.04 in / $1.04 out	Per GPU-second (model dependent)	Not a general host
GLM-5.1 serverless (per 1M)	$1.40 in / $4.40 out	Not listed	Not listed
Dedicated H100 / hr	$6.49 endpoint, $5.49 cluster	$5.49 on-demand	N/A
Cold starts	No (warm catalog)	Minutes on idle large models	No
Catalog	Curated open models	Thousands (community)	Code apply, search, compact, DeepSeek
DeepSeek activations	fp8 typical (serverless)	Provider-dependent	16-bit bf16 (no quant)
Code apply endpoint	No	No	Yes (/v1/code/apply)
Semantic code search	No	No	WarpGrep (free up to 100k requests)
Apply throughput	General token-by-token	General + cold start	~10,500 tok/s
Best for	Per-token text serving at scale	Image/video + custom models	DeepSeek + coding agents

All numbers are list prices as of June 2026 and change often. Verify on each platform's pricing page before committing volume.

Billing Model: Per Token vs Per GPU-Second

This decision drives everything else. Together meters tokens; Replicate meters wall-clock GPU time.

Together: per-token on warm endpoints

Together runs a curated catalog of open models on always-on serverless endpoints and charges per input and output token. You never pay for the GPU sitting idle between your requests, and you never pay for model load time, because the endpoint stays warm. When sustained traffic outgrows serverless, you move to a dedicated endpoint (1x H100 80GB at $6.49/hr) or a raw HGX cluster (H100 at $5.49/hr on-demand, down to $3.99/hr reserved on long commitments).
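One way to judge when serverless "stops paying off" is to compare a month of always-on dedicated spend with the per-token bill. The sketch below uses the list prices quoted above; the 30-day month and the function names are assumptions made for illustration, and it does not check whether a single H100 can actually serve that volume for a given model.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month, an assumption used for this estimate


def dedicated_monthly_cost(hourly_rate: float, gpus: int = 1) -> float:
    """Monthly cost of endpoints that stay on around the clock."""
    return hourly_rate * HOURS_PER_MONTH * gpus


def breakeven_tokens_m(hourly_rate: float, price_per_m: float, gpus: int = 1) -> float:
    """Monthly volume (millions of tokens) where serverless spend equals dedicated spend."""
    return dedicated_monthly_cost(hourly_rate, gpus) / price_per_m


dedicated = dedicated_monthly_cost(6.49)    # 1x H100 dedicated endpoint
breakeven = breakeven_tokens_m(6.49, 1.04)  # Llama 3.3 70B serverless, same rate in and out
print(f"dedicated endpoint: ${dedicated:,.2f}/month")
print(f"break-even volume: {breakeven:,.0f}M tokens/month, about {breakeven / 30:,.0f}M/day")
```

At these prices the dedicated endpoint only beats serverless above roughly 4.5B tokens a month, and only if the hardware can sustain that load, so measured throughput should drive the decision rather than the price alone.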

Replicate: per-GPU-second across any model

Replicate charges for the time your model spends running on its hardware. H100 runs at $0.001525 per second ($5.49/hr), A100 80GB at $0.0014 per second ($5.04/hr), L40S at $0.000975 per second ($3.51/hr), and T4 at $0.000225 per second ($0.81/hr). Models scale to zero when idle, so you pay nothing while a model is off, but the next request cold-boots, which the docs say can take several minutes on large models. On public models only running prediction time is billed, so a cold boot adds latency, not cost; most private models also bill setup and idle time.
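The per-second and hourly figures above are the same rates in different units. A minimal sketch of the conversion, plus the cost of a single prediction, follows; the dictionary keys and function names are illustrative, and the 12-second run is a made-up example.

```python
# Per-second list prices quoted above (USD per GPU-second).
RATES_PER_SECOND = {
    "H100": 0.001525,
    "A100-80GB": 0.0014,
    "L40S": 0.000975,
    "T4": 0.000225,
}


def hourly_rate(gpu: str) -> float:
    """Convert a per-second rate to dollars per hour."""
    return RATES_PER_SECOND[gpu] * 3600


def prediction_cost(gpu: str, seconds: float) -> float:
    """Cost of one prediction that runs for `seconds` of billed GPU time."""
    return RATES_PER_SECOND[gpu] * seconds


for gpu in RATES_PER_SECOND:
    print(f"{gpu}: ${hourly_rate(gpu):.2f}/hr")

# Hypothetical 12-second prediction on an H100.
print(f"12 s on H100: ${prediction_cost('H100', 12):.4f}")
```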

Key numbers: Together Llama 3.3 70B at $1.04/M (in and out); Replicate H100 at $5.49/hr (billed per second); Replicate idle-model cold boot: minutes.

The practical consequence: on steady, high-volume text traffic, Together's per-token model wins because a Replicate GPU held warm for that traffic bills every second, whether or not it is producing tokens. On bursty or one-off jobs, Replicate's scale-to-zero wins because you are not paying to keep a GPU warm. Some hosted LLMs on Replicate (Claude, DeepSeek) are billed per token instead, so check the model page.

Serverless Token Pricing

Together publishes per-token rates for its open-model catalog. Replicate prices most models per GPU-second, but bills a handful of large language models per token. The per-token comparison below uses Together's catalog and Replicate's per-token language models.

Per-Token Pricing (per 1M tokens, June 2026)

Model	Together AI	Replicate
Llama 3.3 70B	$1.04 in / $1.04 out	Per GPU-second
DeepSeek V4 Pro	$2.10 in / $4.40 out	Not listed per-token
DeepSeek R1	Not listed	$3.75 in / $10.00 out
Kimi K2.6	$1.20 in / $4.50 out	Not listed
GLM-5.1	$1.40 in / $4.40 out	Not listed
Qwen3.5-397B-A17B	$0.60 in / $3.60 out	Not listed
GPT-OSS-120B	$0.15 in / $0.60 out	Not listed
MiniMax M2.7	$0.30 in / $1.20 out	Not listed
Claude 3.7 Sonnet (proxied)	Not hosted	$3.00 in / $15.00 out

Together also runs a Batch API at up to 50% off serverless on selected models with a 24-hour completion window (up to 50,000 requests per batch). Cached input tokens cost less on supported models. For a fixed, known rate limit, Together recommends a dedicated endpoint, since serverless rate limits are dynamic per-model and scale with sustained traffic.
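To see what the batch discount means for a bill, here is a small estimator. It treats the full 50% as an upper bound, since the discount is "up to" 50% and applies only to selected models; the function name and the 10M/2M token job are hypothetical.

```python
def serverless_cost(input_m: float, output_m: float,
                    price_in: float, price_out: float,
                    discount: float = 0.0) -> float:
    """Cost in USD for token volumes given in millions, with an optional discount (0 to 1)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return (input_m * price_in + output_m * price_out) * (1.0 - discount)


# Hypothetical job: 10M input and 2M output tokens on GLM-5.1 ($1.40 in / $4.40 out).
realtime = serverless_cost(10, 2, 1.40, 4.40)
batch_best_case = serverless_cost(10, 2, 1.40, 4.40, discount=0.5)
print(f"real-time: ${realtime:.2f}, batch at the maximum discount: ${batch_best_case:.2f}")
```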

GPU Pricing

For dedicated capacity, Together sells both single-GPU endpoints and full HGX clusters. Replicate sells per-second hardware, with multi-GPU configurations gated behind committed contracts.

GPU Pricing (June 2026)

Hardware	Together AI	Replicate
H100 80GB	$6.49/hr endpoint, $5.49/hr cluster	$5.49/hr ($0.001525/s)
A100 80GB	Cluster (contact sales)	$5.04/hr ($0.0014/s)
H200 141GB	$6.79/hr cluster	Not published
B200 180GB	$11.95/hr endpoint, $9.95/hr cluster	Not published
L40S	Cluster	$3.51/hr ($0.000975/s)
T4	Cluster	$0.81/hr ($0.000225/s)
Multi-GPU	HGX clusters (8x standard)	2x H100 $10.98/hr, committed only
Reserved discount	7-180+ days, $3.99-$9.65/hr	Committed contracts

Replicate's single-GPU on-demand H100 ($5.49/hr) undercuts Together's dedicated endpoint ($6.49/hr) and matches its on-demand cluster price. Replicate does not publish B200 or H200 rates, so for newer Blackwell and Hopper-refresh hardware Together is the only one of the two with a public number.

Cost on a Real Workload

Cost on a real workload (computed from June 2026 list prices)

Serving Llama 3.3 70B at 50M input and 50M output tokens per day. On Together serverless at $1.04 per 1M input and $1.04 per 1M output: (50 + 50) × $1.04 = $104/day = ~$3,120/month, with no GPU to provision and no cold starts.

Replicate is GPU-priced, not token-priced, so the comparison is utilization. One H100 at $5.49/hr held warm to absorb steady traffic costs $5.49 × 24 × 30 = ~$3,953/month per GPU, and scale-to-zero saves nothing because steady traffic never lets the GPU idle. If one H100 cannot sustain the throughput, each additional GPU adds another ~$3,953/month, while Together stays per-token.
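The arithmetic in this example can be reproduced in a few lines. The sketch below uses the same list prices and a 30-day month; how many H100s the workload actually needs is not stated in the comparison, so the GPU count is left as a parameter.

```python
def together_serverless_monthly(in_m_per_day: float, out_m_per_day: float,
                                price_in: float, price_out: float,
                                days: int = 30) -> float:
    """Monthly serverless bill for daily token volumes given in millions."""
    daily = in_m_per_day * price_in + out_m_per_day * price_out
    return daily * days


def held_warm_gpu_monthly(hourly_rate: float, gpus: int = 1, days: int = 30) -> float:
    """Monthly bill for GPUs kept running around the clock."""
    return hourly_rate * 24 * days * gpus


together = together_serverless_monthly(50, 50, 1.04, 1.04)
print(f"Together serverless: ${together:,.0f}/month")
for gpus in (1, 2, 3):  # GPU count is an unknown; shown for a few values
    cost = held_warm_gpu_monthly(5.49, gpus)
    print(f"Replicate, {gpus}x H100 held warm: ${cost:,.0f}/month")
```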

The break-even flips when utilization drops. If that same workload ran a few minutes an hour instead of continuously, Replicate's scale-to-zero would cost a fraction of a held-warm GPU, though Together's per-token bill shrinks proportionally too. As a rough rule of thumb, below about one-third GPU utilization Replicate's pay-per-second wins, and above it Together's warm per-token endpoint wins; the exact crossover depends on how many tokens per second a GPU sustains for your model. The cold-boot penalty on Replicate is the tax you pay for scale-to-zero economics.
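The crossover is easier to reason about as cost per token. The sketch below converts a per-second GPU price into dollars per 1M tokens at a given throughput, and computes the throughput at which a billed GPU-second matches a per-token price. The 500 tok/s figure is a hypothetical placeholder, not a benchmark, and the model ignores batching, multi-GPU sharding, and cold boots.

```python
def scale_to_zero_monthly(hourly_rate: float, utilization: float, days: int = 30) -> float:
    """Monthly GPU bill when only the busy fraction of each hour is billed."""
    return hourly_rate * 24 * days * utilization


def cost_per_million_tokens(rate_per_second: float, tokens_per_second: float) -> float:
    """Effective $/1M tokens on a per-second GPU at a sustained throughput."""
    return rate_per_second / tokens_per_second * 1_000_000


def breakeven_throughput(rate_per_second: float, price_per_million: float) -> float:
    """Tokens/sec at which per-second billing equals the per-token price."""
    return rate_per_second / price_per_million * 1_000_000


print(f"H100 at one-third utilization: ${scale_to_zero_monthly(5.49, 1 / 3):,.2f}/month")
print(f"H100 at 500 tok/s (hypothetical): ${cost_per_million_tokens(0.001525, 500):.2f}/M")
print(f"break-even vs $1.04/M: {breakeven_throughput(0.001525, 1.04):,.0f} tok/s")
```

Under these assumptions an H100 has to sustain roughly 1,466 tokens per second of billed time to match $1.04 per 1M tokens, which is why measured throughput matters more than any fixed utilization threshold.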

Cold Starts & Latency

Together avoids cold starts on its catalog; Replicate trades latency for scale-to-zero economics on idle models.

On Replicate, a model that has not been used for a while turns off. The next request triggers a cold boot the docs describe as several minutes for large models, depending on container image size and GPU availability at that moment. That is fine for an async image job and unacceptable for an interactive chat. Replicate deployments support a minimum-instances setting to keep a model warm and eliminate cold starts, scaling from zero to hundreds of instances based on traffic, but warm instances bill for idle time. Private models bill for setup, idle, and active time, except fast-booting fine-tunes, which bill only active processing time.

Together keeps its serverless catalog warm, so listed models respond without a load penalty. For latency-sensitive, always-on traffic, this is the cleaner default. For a fixed rate-limit guarantee, Together points heavy users to a dedicated endpoint, where capacity is reserved rather than shared.
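Cold-start behavior is easy to measure before committing to a platform. The sketch below times a sequence of calls and reports the first-call penalty against the median of the rest. It is provider-agnostic: `call` is any zero-argument function you supply that sends one request, and `fake_endpoint` is a stand-in that simulates a slow first call so the example runs without network access.

```python
import statistics
import time
from typing import Callable, List


def measure_latencies(call: Callable[[], object], n: int = 5) -> List[float]:
    """Invoke `call` n times and return each wall-clock latency in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def cold_start_penalty(latencies: List[float]) -> float:
    """First-call latency minus the median of the remaining (warm) calls."""
    return latencies[0] - statistics.median(latencies[1:])


def make_fake_endpoint(cold_seconds: float) -> Callable[[], None]:
    """Stand-in endpoint: sleeps on the first call only, simulating a cold boot."""
    state = {"warm": False}

    def fake_endpoint() -> None:
        if not state["warm"]:
            time.sleep(cold_seconds)
            state["warm"] = True

    return fake_endpoint


latencies = measure_latencies(make_fake_endpoint(0.05))
penalty = cold_start_penalty(latencies)
print(f"first call: {latencies[0]:.3f}s, cold-start penalty: {penalty:.3f}s")
```

To test a real endpoint, replace the stand-in with a function that issues one request through the provider's client, and let the model sit idle long enough to scale down before the first call.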

Neither is built for the coding-agent apply loop. If applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Together warm-catalog cold start: none. Replicate idle-model cold boot: minutes.

Model Catalog: Replicate Is Broader, Together Is Production-Grade

Replicate has the larger raw count; Together has the more reliable serving guarantees and a unified per-token API.

Catalog & Modalities

Capability	Together AI	Replicate
Text LLMs	Curated (Llama, Qwen, DeepSeek, GLM, Kimi)	Per-token + per-second (incl. Claude, DeepSeek)
Image generation	Yes	Yes (~$0.04/output image, thousands)
Video generation	Yes	Yes (~$0.09/sec of output, large selection)
Audio / transcription	Whisper Large v3 $0.0015/min	Yes (Whisper, others)
Total models	Curated open-model set	Thousands (community)
Code interpreter	$0.03/session (60 min)	No
Sandbox / storage	vCPU $0.0446/hr, storage $0.16/GiB/mo	No
Production reliability	Hardened serverless endpoints	Varies by community model

Replicate built its name on accessibility: upload a model, get an API endpoint. That created the widest selection of experimental, fine-tuned, and niche variants anywhere, especially for image and video. The tradeoff is that it is a community-model platform first, so production reliability varies by model. Together curates fewer models but runs each as a hardened serverless endpoint with a unified per-token API.

Custom Deployment: Cog vs Bring-Your-Own

Replicate is the easier path to ship an arbitrary custom model; Together favors fine-tuning within its supported set.

Replicate's Cog packages any model, including custom pipelines, preprocessing, and non-standard architectures, into a container with a defined input and output schema, then runs it on managed GPUs. If you have a research model or a multi-step pipeline that does not fit a standard LLM or diffusion template, Replicate is the most direct way to get an API in front of it. Deployments give you dedicated hardware with min and max instances to balance warm capacity against spend.

Together is built around its curated catalog plus bring-your-own fine-tuning. You can fine-tune supported base models and serve the result serverless or dedicated, with the option to export your weights. It is less of a general container host and more of a managed inference and fine-tuning cloud for known model families.

Fine-Tuning

Both fine-tune, but they aim at different model types. Together prices text fine-tuning per training token; Replicate prices it per GPU-second and leads on image fine-tunes.

Fine-Tuning Comparison (June 2026)

Aspect	Together AI	Replicate
Text LoRA SFT (per 1M tokens)	$0.48 (up to 16B), $1.50 (17-69B), $2.90 (70-100B)	Per GPU-second
Text LoRA DPO (per 1M tokens)	$0.54, $1.65, $3.20 by size	Per GPU-second
Image fine-tuning	Limited	Yes (FLUX LoRA)
Export weights	Yes	Yes (LoRA weights)
Serve fine-tune	Serverless or dedicated	Warm runnable model

Together is the stronger choice for text fine-tuning: per-training-token pricing across LoRA and DPO with exportable weights if you want to self-host later. Replicate leads when the target is an image LoRA or any non-standard architecture you would rather package with Cog than fit into a text-LLM template.
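Per-training-token pricing makes a fine-tuning budget straightforward to estimate. The sketch below encodes the LoRA tiers from the table; it assumes billed tokens equal dataset tokens multiplied by epochs, which should be confirmed against the pricing page, and the job sizes are hypothetical.

```python
# (max model size in billions of parameters, USD per 1M training tokens)
LORA_PRICES = {
    "sft": [(16, 0.48), (69, 1.50), (100, 2.90)],
    "dpo": [(16, 0.54), (69, 1.65), (100, 3.20)],
}


def price_per_million(params_b: float, method: str = "sft") -> float:
    """Look up the listed LoRA price tier for a model of `params_b` billion parameters."""
    for max_b, price in LORA_PRICES[method]:
        if params_b <= max_b:
            return price
    raise ValueError("no listed price above 100B parameters")


def finetune_cost(params_b: float, dataset_tokens_m: float,
                  epochs: int = 1, method: str = "sft") -> float:
    """Estimated cost, assuming billed tokens = dataset tokens x epochs."""
    return price_per_million(params_b, method) * dataset_tokens_m * epochs


# Hypothetical jobs: 70B SFT on 20M tokens for 3 epochs; 8B DPO on 5M tokens for 2 epochs.
print(f"70B SFT: ${finetune_cost(70, 20, epochs=3):.2f}")
print(f"8B DPO:  ${finetune_cost(8, 5, epochs=2, method='dpo'):.2f}")
```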

Compliance

Together AI is SOC 2 Type 2 certified, with an independent audit covering access management, encryption, incident response, and change management. Replicate's compliance posture varies by the underlying model and provider, since much of its catalog is community-contributed. For a regulated text workload with a single attested vendor, Together is the clearer choice.

Best for DeepSeek and Codegen: Morph

Neither Together nor Replicate is tuned for running DeepSeek at full fidelity or for the coding-agent inner loop. Most serverless providers quantize activations to fp8 to cut cost, which can degrade output quality. Morph serves DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization, so responses match the reference weights. That makes Morph the best place to run DeepSeek when output fidelity matters.

For coding agents specifically, Morph runs codegen-tuned speculative decoding plus custom low-level inference kernels optimized for code generation. morph-dsv4flash (DeepSeek V4 Flash) is $0.139 per 1M input tokens and $0.278 per 1M output tokens. The apply model runs at ~10,500 tok/s, and WarpGrep semantic code search is free up to 100k requests, then $1 per 1M. See Morph models and pricing.

When to Use Together AI

High-volume text serving. Per-token billing on warm endpoints amortizes across steady traffic. Llama 3.3 70B is $1.04/M in and out, GPT-OSS-120B is $0.15/$0.60.
Latency-sensitive, always-on traffic. The serverless catalog stays warm, so listed models respond with no cold-start penalty.
Multimodal on one bill. Text, image, video, Whisper transcription ($0.0015/min), embeddings, a code interpreter ($0.03/session), and sandboxes through one API.
A path to dedicated capacity. Reserve a dedicated H100 endpoint ($6.49/hr) or HGX clusters (H100 $5.49/hr on-demand, down to $3.99/hr reserved) once serverless stops paying off.
Regulated workloads. SOC 2 Type 2 with audited access management, encryption, and incident response.

When to Use Replicate

Widest model selection. Thousands of community models, especially image and video, including experimental and niche variants you will not find on a curated catalog.
Ship a custom model fast. Cog packages any architecture or pipeline into an API on managed GPUs, with no model-family restrictions.
Spiky or one-off jobs. Per-GPU-second billing with scale-to-zero means you pay nothing while a model is off, ideal for occasional batch image or video runs.
Output-billed media. Image at ~$0.04 per output image and video at ~$0.09 per second of output, so cost tracks what you produce.
Prototyping across many models. One API key to try thousands of models without provisioning anything, accepting a multi-minute cold boot on idle large models.
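For output-billed media, cost scales with what you generate rather than with GPU time. A minimal sketch using the approximate prices quoted above; actual rates vary by model, and the batch sizes are hypothetical.

```python
def image_job_cost(images: int, price_per_image: float = 0.04) -> float:
    """Cost of an image batch billed per output image."""
    return images * price_per_image


def video_job_cost(clips: int, seconds_per_clip: float,
                   price_per_second: float = 0.09) -> float:
    """Cost of a video batch billed per second of output."""
    return clips * seconds_per_clip * price_per_second


# Hypothetical jobs: 500 images, and 40 clips of 5 seconds each.
print(f"500 images: ${image_job_cost(500):.2f}")
print(f"40 x 5 s clips: ${video_job_cost(40, 5):.2f}")
```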

Frequently Asked Questions

Is Together AI or Replicate cheaper for LLM inference?

For steady text generation, Together is usually cheaper because it bills per token on warm endpoints. Together lists Llama 3.3 70B at $1.04 per 1M input and $1.04 per 1M output. Replicate bills per GPU-second: an H100 is $5.49/hr ($0.001525/sec), so a held-warm GPU absorbing steady traffic costs about $3,953 per month per GPU. Replicate wins on cost only for spiky or one-off jobs where a warm GPU would sit idle, because its scale-to-zero charges nothing while a model is off.

What is the main difference between Together AI and Replicate?

Together is a per-token serverless inference cloud for a curated catalog of open models, kept warm for low latency, plus dedicated endpoints and raw GPU clusters. Replicate is a per-GPU-second platform for running any Cog-packaged container, including thousands of image and video models, but it scales idle models to zero and cold-boots them on the next request, which can take minutes on large models. Together optimizes for high-volume warm serving; Replicate optimizes for breadth and running custom or niche models on demand.

Does Replicate have cold starts?

Yes. Replicate scales models to zero by default: when a model has not been used for a while, it turns off. The next request triggers a cold boot the docs say can take several minutes on large models. You are only billed for running prediction time, so cold boots add latency, not cost. You can pin a deployment to minimum instances to keep it warm and remove cold starts, but then you pay for idle time. Together keeps its serverless catalog warm, so listed models respond without a load penalty.

How does Replicate bill, per token or per GPU-second?

Most Replicate models are billed by runtime in GPU-seconds: H100 $0.001525/sec ($5.49/hr), A100 80GB $0.0014/sec ($5.04/hr), L40S $0.000975/sec ($3.51/hr), T4 $0.000225/sec ($0.81/hr). Some hosted language models bill per token instead, for example DeepSeek R1 at $3.75/M input and $10/M output, and Claude 3.7 Sonnet at $3/M input and $15/M output. Image and video models bill per output, around $0.04 per output image or $0.09 per second of output video. The unit changes by model, so check each model page.

Where can I run DeepSeek with the highest output quality?

Most serverless providers quantize activations to fp8 to cut cost, which can degrade DeepSeek output. Morph serves DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization, so responses match the reference weights, and it adds codegen-tuned speculative decoding and custom kernels for coding agents. morph-dsv4flash (DeepSeek V4 Flash) is $0.139/M input and $0.278/M output. Together does not list a DeepSeek V4 Flash serverless price, and Replicate bills DeepSeek per token at higher rates. See Morph models.

Together for Per-Token Text, Replicate for Multimodal Breadth, Morph for DeepSeek

Pick Together for high-volume warm per-token serving, Replicate for image, video, and one-line Cog deploys. For DeepSeek at 16-bit activations or a coding-agent apply loop, Morph serves morph-dsv4flash at $0.139/$0.278 per 1M and applies code at ~10,500 tok/s.
