Together AI vs Replicate (2026): Together for Text-LLM Throughput and Training, Replicate for Multimodal Breadth and Cog Deploys

Together is the open-LLM API and training cloud: per-token serving, ATLAS speculative decoding, fine-tune-and-export, raw GPU clusters. Replicate is the run-any-container catalog: image, video, audio, and one-line Cog deploys billed per GPU-second with cold starts.

June 3, 2026 · 1 min read

Pick Together AI for serious text-LLM throughput and training, and pick Replicate for multimodal breadth and frictionless model packaging. Together is an open-model API and training cloud tuned for production LLMs: per-token serving, ATLAS speculative decoding, fine-tune-and-export, and raw GPU clusters when serverless stops paying off. Replicate, now owned by Cloudflare, is a run-any-container catalog that shines for image, video, and audio and for one-line Cog deploys of arbitrary models, billed per GPU-second with cold starts on idle models.

That split, an open-LLM cloud versus a universal model catalog, decides cost at scale, latency on the first request, and which workloads each platform fits. Together is built for high-volume text and warm-endpoint economics. Replicate is built for running anything anyone has packaged, on demand.

TL;DR

  • Pick Together if you serve high-volume text or multimodal traffic and want per-token pricing on warm endpoints. 200+ curated models, ATLAS speculative decoding up to 4x vLLM once warmed, dedicated GPUs and raw HGX clusters when serverless stops being economical, and SOC 2 Type 2 plus HIPAA.
  • Pick Replicate if you need the widest catalog of community models, especially image and video, or you want to package and ship your own model with Cog. Per-GPU-second billing with scale-to-zero fits spiky and one-off jobs, and FLUX LoRA fine-tunes train in under 2 minutes for under $2.

Who Wins Per Workload

The decision is rarely "which is better" and almost always "which fits this workload." Here is the call by the decision a developer actually faces.

Workload / decision | Together AI | Replicate
Sustained high-volume text | Together: per-token warm endpoints | Loses: per-second on every gen
Bursty / one-off jobs | Loses: warm endpoint sits idle | Replicate: scale-to-zero, pay nothing idle
Fastest first call (no cold start) | Together: warm catalog, 0s | Loses: 30-120s on idle models
Multimodal (image/video/audio) | Curated set only | Replicate: thousands, widest breadth
Fine-tune & export text-LLM weights | Together: per-token tunes, download weights | Image LoRA only, per-second
Ship an arbitrary custom model | Bring-your-own within families | Replicate: Cog packages any container
Raw GPU clusters at scale | Together: reserved H100/H200 + HGX | On-demand GPUs, no HGX clusters
Cheapest per request at volume | Together: ~order of magnitude lower | Loses on steady traffic
Strictest compliance | Together: SOC 2 Type 2 + HIPAA + BAA | Varies by model

Quick Comparison

The headline split is the billing axis. Together bills per token on warm models; Replicate bills per GPU-second across a far larger but colder catalog. Morph is added here as the recommended pick for the coding-agent slice only.

Spec | Together AI | Replicate | Morph
Focus | High-volume text + multimodal serving | Run any community model on demand | Coding-agent inner loop
Billing axis | Per token (warm) | Per GPU-second (cold-starts) | Per token (code-tuned)
Llama 70B serverless (per 1M) | ~$0.88-$1.04 | ~$0.01-$0.03 per request (GPU-sec) | N/A (not a general host)
Dedicated GPU (H100/hr) | ~$2.99 reserved | $5.49 on-demand | N/A
Cold starts | No (warm catalog) | 30-120s on idle models | No
Model catalog | 200+ curated | Thousands (community) | Code apply, search, compact
Code-specific apply endpoint | No | No | Yes (/v1/code/apply)
Semantic code search | No | No | WarpGrep ($0/100k)
Apply throughput | General token-by-token | General + cold start | ~10,500 tok/s
First-pass apply accuracy | N/A | N/A | 98%
Custom model deploy | Bring-your-own + fine-tune | Cog packaging, any model | N/A
Best for | Per-token serving at scale | Image/video + custom models | Coding agents (apply/search/compact)

Numbers are list prices as of early 2026 and change often. Verify on each platform's pricing page before committing volume.

Billing Model: Per Token vs Per GPU-Second

This is the decision that drives everything else. Together meters tokens; Replicate meters wall-clock GPU time.

Together: per-token on warm endpoints

Together runs a curated catalog of 200+ models on always-on serverless endpoints and charges per input and output token. You never pay for the GPU sitting idle between your requests, and you never pay for model load time, because the endpoint stays warm. The Together Inference Engine adds ATLAS, an adaptive speculator that retrains on your live traffic and claims up to 4x vLLM throughput once warmed, so steady workloads get faster the longer they run.

Replicate: per-GPU-second across any model

Replicate charges for the time your model spends running on its hardware, including setup and cold-start time on public models. H100 runs at $0.001525 per second ($5.49/hr), A100 80GB at $0.0014 per second ($5.04/hr), L40S at $0.000975 per second ($3.51/hr), and T4 at $0.000225 per second ($0.81/hr). For an LLM, you pay for every second of generation rather than per token, which means a slow or cold model costs more for the same output.
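The per-second and hourly figures are the same rates in different units. A minimal Python sketch of the conversion, using only the rates quoted above; the dictionary and function names are illustrative and not part of any Replicate SDK:

```python
# Per-second GPU list prices quoted above, in USD. Names are illustrative.
RATE_PER_SECOND = {
    "H100": 0.001525,
    "A100-80GB": 0.0014,
    "L40S": 0.000975,
    "T4": 0.000225,
}


def hourly_rate(gpu: str) -> float:
    """Convert a per-second rate to a per-hour rate."""
    return RATE_PER_SECOND[gpu] * 3600


def run_cost(gpu: str, billed_seconds: float) -> float:
    """Cost of one run, counting every billed second (setup included)."""
    return RATE_PER_SECOND[gpu] * billed_seconds


for gpu in RATE_PER_SECOND:
    print(f"{gpu}: ${hourly_rate(gpu):.2f}/hr")
```

Because every billed second counts, `run_cost` grows linearly with generation time, which is the reason a slow or cold model costs more for the same output.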

200+ Together curated models · $5.49/hr Replicate H100 (per-second) · 4x Together ATLAS vs vLLM (warmed)

The practical consequence: on steady, high-volume text traffic, Together's per-token model wins because GPU-seconds on Replicate include idle and cold-start overhead. On bursty or one-off jobs, Replicate's scale-to-zero wins because you are not paying to keep an endpoint warm. Some hosted LLMs on Replicate (Claude, DeepSeek) are billed per token instead, so check the model page.

Pricing in Practice

For mainstream LLM serving at volume, Together is cheaper per request by roughly an order of magnitude. Replicate's pricing shines on image, video, and infrequent custom jobs.

Workload | Together AI | Replicate
Llama 3.3 70B (per 1M tokens) | ~$0.88-$1.04 | ~$0.01-$0.03 per request
Small model (under 16B) | From ~$0.03-$0.20 / 1M | GPU-second (model dependent)
Image generation | ~$0.0006-$0.134 / image | GPU-second (e.g. FLUX)
H100 GPU-hour | ~$2.99 reserved | $5.49 on-demand
A100 80GB GPU-hour | Custom | $5.04
Scale to zero | Dedicated only | Yes (public models)
Idle cost | None on serverless | Paid on private deployments

One worked example: a Llama 3.3 70B call with 1K input and 500 output tokens is 1,500 billed tokens, which at the listed $0.88 to $1.04 per 1M tokens costs roughly $0.0013 to $0.0016 on Together. The same call on Replicate runs $0.01 to $0.03 once GPU-second and cold-start time are included, roughly an order of magnitude more. That gap only matters at volume, but at volume it dominates the bill.
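One way to read the Replicate range is to ask how many billed seconds it implies. The sketch below assumes the request runs on an H100 at the $0.001525 per second list rate quoted earlier. That hardware choice is an assumption for illustration, since the text does not say which GPU serves the 70B model, and the function name is hypothetical.

```python
H100_PER_SECOND = 0.001525  # USD per billed second (list rate quoted earlier)


def implied_billed_seconds(request_cost: float,
                           rate_per_second: float = H100_PER_SECOND) -> float:
    """Billed GPU time that a given per-request price corresponds to."""
    return request_cost / rate_per_second


low = implied_billed_seconds(0.01)
high = implied_billed_seconds(0.03)
print(f"{low:.1f}s to {high:.1f}s of billed GPU time per request")
```

Under that assumption the range corresponds to somewhere between about 7 and 20 billed seconds per request, so a few extra seconds of generation or setup move the price noticeably.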

Where each platform wins on cost

Which platform is cheaper depends on volume, not preference. Together wins when a warm endpoint is busy enough to amortize: high-volume chat, RAG, batch generation. Replicate wins when an endpoint would otherwise sit idle: occasional image jobs, a niche model you call a few times a day, a custom pipeline you do not want to keep online. Together also estimates that dedicated H100s beat its own serverless tier above roughly 130,000 tokens per minute of sustained load, so heavy users graduate to dedicated either way.

Cost on a Real Workload

Computed from list prices as of June 2026.

Serving Llama 70B at 50M output tokens per day. On Together serverless at the low end of its listed range (~$0.88 per 1M tokens): 50 × $0.88 = $44/day = ~$1,320/month. That 50M/day works out to an average of ~34,700 tokens/minute, well under Together's own ~130k tok/min break-even, so serverless is the right tier and dedicated would not pay off at this load.
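The arithmetic above can be reproduced in a few lines of Python. This is a sketch under the stated figures: a 30-day month, the low-end $0.88 per 1M price, and the ~130k tokens per minute break-even attributed to Together. The variable names are illustrative.

```python
TOKENS_PER_DAY = 50_000_000
PRICE_PER_MILLION = 0.88             # USD, low end of the listed range
DEDICATED_BREAK_EVEN_TPM = 130_000   # tokens/minute, estimate cited in the text

daily_cost = TOKENS_PER_DAY / 1_000_000 * PRICE_PER_MILLION
monthly_cost = daily_cost * 30       # assumes a 30-day month

# Average load only; real traffic peaks above this figure.
avg_tokens_per_minute = TOKENS_PER_DAY / (24 * 60)

tier = ("dedicated" if avg_tokens_per_minute > DEDICATED_BREAK_EVEN_TPM
        else "serverless")

print(f"${daily_cost:.2f}/day, ${monthly_cost:,.0f}/month")
print(f"{avg_tokens_per_minute:,.0f} tokens/minute on average -> {tier}")
```

The tier check uses the daily average. A workload whose peaks run several times higher than its average could cross the break-even during busy hours even when the average does not.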

Replicate is GPU-priced, not token-priced, so the comparison comes down to utilization. One H100 at $5.49/hr held warm to absorb steady traffic costs $5.49 × 24 × 30 = ~$3,953/month per GPU, and scale-to-zero saves nothing because steady traffic never lets the GPU idle. Together's per-token serverless is roughly 3x cheaper here, and the gap only widens as volume climbs, since Together stays per-token while each extra Replicate GPU adds another ~$3,953/month.

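The comparison flips only when utilization drops. The one-third threshold is the ratio of the two monthly bills: $1,320 / $3,953 ≈ 0.33. If the same token volume could be served with the H100 billed for less than about a third of the hours in the month, Replicate's pay-per-second would undercut Together's per-token bill; above that, Together's warm per-token endpoint wins. This break-even holds token volume fixed. If traffic itself shrinks, Together's per-token bill shrinks proportionally too, so a lighter workload alone does not change which platform is cheaper.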
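The one-third figure can be checked from the two monthly totals. The sketch below treats the Together bill as fixed at $1,320 for the same token volume and ignores cold-start seconds on the Replicate side. Both are simplifying assumptions, and the names are illustrative.

```python
H100_PER_HOUR = 5.49          # USD, Replicate on-demand list price
HOURS_PER_MONTH = 24 * 30     # 30-day month
TOGETHER_MONTHLY = 1320.0     # USD, serverless total for 50M tokens/day


def replicate_monthly(utilization: float, gpus: int = 1) -> float:
    """Monthly cost if each GPU is billed only for its busy fraction of hours."""
    return H100_PER_HOUR * HOURS_PER_MONTH * utilization * gpus


always_on = replicate_monthly(1.0)
cost_ratio = always_on / TOGETHER_MONTHLY
break_even_utilization = TOGETHER_MONTHLY / always_on

print(f"always-on H100: ${always_on:,.0f}/month ({cost_ratio:.1f}x Together)")
print(f"break-even utilization: {break_even_utilization:.0%}")
```

Counting billed cold-start seconds would raise the Replicate side and push the break-even utilization lower than this estimate.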

Cold Starts & Latency

Together avoids cold starts on its catalog; Replicate trades latency for scale-to-zero economics on idle models.

On Replicate, public models scale to zero when idle. A request after an idle period pays a cold start, 30 to 120 seconds for large image or 70B models depending on container image size and GPU availability at that moment. That is fine for an async image job and unacceptable for an interactive chat. You can pin a private deployment to always-on instances to remove cold starts, but then you pay for the idle time you were trying to avoid.
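To see why the same cold start is tolerable for batch work and not for chat, it helps to separate mean latency from worst-case latency. A minimal sketch follows; the warm generation time and the share of requests that arrive after an idle period are hypothetical inputs chosen for illustration, and only the 30 to 120 second range comes from the text.

```python
def expected_latency(warm_seconds: float, cold_start_seconds: float,
                     cold_fraction: float) -> float:
    """Mean response time when a fraction of requests hit a cold model."""
    return warm_seconds + cold_fraction * cold_start_seconds


WARM_SECONDS = 5.0     # hypothetical warm generation time
COLD_FRACTION = 0.10   # hypothetical share of requests arriving after idle

for cold_start in (30, 120):  # cold-start range quoted above
    mean = expected_latency(WARM_SECONDS, cold_start, COLD_FRACTION)
    worst = WARM_SECONDS + cold_start
    print(f"cold start {cold_start}s: mean {mean:.1f}s, worst case {worst:.0f}s")
```

The mean looks acceptable even with long cold starts, but an interactive user experiences the worst case, which is the number to budget for.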

Together keeps its serverless catalog warm, so listed models respond without a load penalty, and ATLAS speculative decoding pushes steady-state throughput higher the longer a workload runs. For latency-sensitive, always-on traffic, this is the cleaner default.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

0s Together warm-catalog cold start · 30-120s Replicate idle-model cold start

Model Catalog: Replicate Is Broader, Together Is Production-Grade

Replicate has the larger raw count; Together has the more reliable serving guarantees.

Capability | Together AI | Replicate
Total models | 200+ curated | Thousands (community)
Text LLMs | Yes (Llama, Qwen, DeepSeek, GLM) | Yes (incl. GPT, Claude proxied)
Image generation | Yes (FLUX, others) | Yes (FLUX, SDXL, thousands)
Video generation | Yes | Yes (large selection)
Audio / transcription | Yes (~$0.0015/min) | Yes (Whisper, others)
Embeddings | Yes (~$0.02/1M) | Yes (community)
Rerank endpoint | Yes (~$0.20/1M) | Limited
Code interpreter | Yes (~$0.03/session) | No

Replicate built its name on accessibility: upload a model, get an API endpoint. That created the widest selection of experimental, fine-tuned, and niche variants anywhere, especially for image and video. The tradeoff is that it is a community-model platform first, so production reliability varies by model. Together curates fewer models but runs each as a hardened serverless endpoint with a unified per-token API.

Custom Deployment: Cog vs Bring-Your-Own

Replicate is the easier path to ship an arbitrary custom model; Together favors fine-tuning within its supported set.

Replicate's Cog packages any model, including custom pipelines, preprocessing, and non-standard architectures, into a container with a defined input and output schema, then runs it on managed GPUs. If you have a research model or a multi-step pipeline that does not fit a standard LLM or diffusion template, Replicate is the most direct way to get an API in front of it. Private deployments give you dedicated hardware with scale-to-zero on fast-booting fine-tunes.

Together is built around its curated catalog plus bring-your-own fine-tuning. You can fine-tune supported base models and serve the result serverless or dedicated, with full data control and the option to download your weights. It is less of a general container host and more of a managed inference and fine-tuning cloud for known model families.

Fine-Tuning: Different Targets

Both fine-tune, but they aim at different model types. Together leads on text LoRA and full fine-tunes; Replicate leads on image fine-tunes.

Aspect | Together AI | Replicate
Text LoRA (per 1M tokens) | ~$0.48-$2.90 by size | Per GPU-second
Full fine-tuning | Yes (per training token) | Per GPU-second
Image fine-tuning | Limited | Yes (FLUX, under 2 min / ~$2)
Download weights | Yes | Yes (LoRA weights)
Serve fine-tune | Serverless or dedicated | Warm runnable model

Replicate's FLUX fine-tuner trains a LoRA in under 2 minutes for under $2 on 8x H100, and hands back a warm, runnable model plus downloadable LoRA weights. That is the fastest path to a custom image model. Together is the stronger choice for text fine-tuning, with per-training-token pricing across LoRA and full tunes and downloadable weights if you want to self-host later.
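The two pricing styles can be compared with a short estimate. The sketch assumes the FLUX job is billed as 8 H100s for the full run at the $0.001525 per second list rate quoted earlier, which may not match how the fine-tuner is actually priced, and it uses a hypothetical 10M-token dataset for the text LoRA side. Function names are illustrative.

```python
H100_PER_SECOND = 0.001525  # USD, list rate quoted earlier


def gpu_job_cost(gpus: int, seconds: float,
                 rate: float = H100_PER_SECOND) -> float:
    """Per-GPU-second billing: every GPU is charged for every second."""
    return gpus * seconds * rate


def text_lora_cost(training_tokens: int, price_per_million: float) -> float:
    """Per-training-token billing at a given price per 1M tokens."""
    return training_tokens / 1_000_000 * price_per_million


# Upper bound for a run that finishes in under 2 minutes on 8x H100.
flux_upper_bound = gpu_job_cost(gpus=8, seconds=120)

# Hypothetical 10M-token dataset at both ends of the listed LoRA range.
lora_low = text_lora_cost(10_000_000, 0.48)
lora_high = text_lora_cost(10_000_000, 2.90)

print(f"FLUX LoRA, 8x H100 for 120s: ${flux_upper_bound:.2f}")
print(f"Text LoRA, 10M tokens: ${lora_low:.2f} to ${lora_high:.2f}")
```

Under these assumptions the FLUX bound lands below $2, in line with the figure in the text, while the text LoRA bill scales with dataset size rather than wall-clock time.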

When to Use Together AI

  • High-volume text serving. Per-token billing on warm endpoints amortizes across steady traffic, and ATLAS speculative decoding claims up to 4x vLLM throughput as it learns your prompt patterns.
  • Latency-sensitive, always-on traffic. The serverless catalog stays warm, so listed models respond without a cold-start penalty.
  • Multimodal on one bill. Text, image, audio, embeddings, rerank, and a code interpreter through a single unified API.
  • You graduate to dedicated. Reserve dedicated H100/H200 endpoints (around $2.99/hr reserved) or raw HGX clusters once you cross roughly 130k tokens per minute.
  • Regulated workloads. SOC 2 Type 2 plus HIPAA with encryption in transit and at rest, and Business Associate Agreements for healthcare.

When to Use Replicate

  • Widest model selection. Thousands of community models, especially image and video, including experimental and niche variants you will not find on a curated catalog.
  • Ship a custom model fast. Cog packages any architecture or pipeline into an API on managed GPUs, with no model-family restrictions.
  • Spiky or one-off jobs. Per-GPU-second billing with scale-to-zero means you pay nothing when idle, ideal for occasional batch image or video runs.
  • Fast custom image models. FLUX LoRA fine-tunes in under 2 minutes for under $2, with downloadable weights and a warm runnable model.
  • Prototyping across many models. One API key to try thousands of models without provisioning anything.

Frequently Asked Questions

Is Together AI or Replicate cheaper for LLM inference?

For high-volume text generation, Together is usually far cheaper because it bills per token on warm endpoints, roughly $0.88 to $1.04 per 1M tokens for Llama 3.3 70B as of early 2026. Replicate bills per GPU-second, so a single 70B request can cost $0.01 to $0.03 once cold-start time is included, an order of magnitude more on steady traffic. Replicate only wins on cost for spiky or one-off jobs where a warm endpoint would sit idle.

What is the main difference between Together AI and Replicate?

Together is a per-token serverless inference cloud for a curated catalog of 200+ models, kept warm for low latency. Replicate is a per-GPU-second platform for running any Cog-packaged model, thousands of them including image and video, but with cold starts on idle models. Together optimizes for high-volume warm serving; Replicate optimizes for breadth and running custom or niche models on demand.

Does Replicate have cold starts?

Yes. Public models scale to zero when idle, so a request after an idle period pays a cold start, typically 30 to 120 seconds for large image or 70B models depending on container size and GPU availability. You can avoid cold starts with an always-on private deployment, but then you pay for idle time. Together keeps its serverless catalog warm, so it does not have this tradeoff on listed models.

Which platform has more models?

Replicate has the larger raw count: thousands of community-contributed models and the widest selection of experimental and niche variants anywhere, especially for image and video. Together curates a smaller set of 200+ production models across text, image, audio, code, and embeddings, with a unified per-token API and higher reliability guarantees. Replicate wins on breadth; Together wins on production-grade serving.

Should I use Together or Replicate to fine-tune and deploy a custom model?

It depends on the model type. For a text LLM, Together is the stronger choice: it fine-tunes supported base models with per-training-token pricing across LoRA and full tunes, serves the result serverless or dedicated, and lets you download the weights to self-host later. For an image model or any non-standard architecture, Replicate wins: its FLUX fine-tuner trains a LoRA in under 2 minutes for under $2 and hands back downloadable weights, and Cog packages any container, including custom pipelines and preprocessing, into an API on managed GPUs. Choose Together when you are training and exporting text-LLM weights; choose Replicate when you are packaging an arbitrary model or fine-tuning for image and video.

Together for Text-LLM Throughput, Replicate for Multimodal Breadth

Pick Together when you serve high-volume text or train and export LLM weights; pick Replicate for image, video, and one-line Cog deploys. If applying model-generated code edits is your bottleneck, that is a separate job Morph handles at ~10,500 tok/s.