Pick Together AI for serious text-LLM throughput and training, and pick Replicate for multimodal breadth and frictionless model packaging. Together is an open-model API and training cloud tuned for production LLMs: per-token serving, ATLAS speculative decoding, fine-tune-and-export, and raw GPU clusters when serverless stops paying off. Replicate, now owned by Cloudflare, is a run-any-container catalog that shines for image, video, and audio and for one-line Cog deploys of arbitrary models, billed per GPU-second with cold starts on idle models.
That split, an open-LLM cloud versus a universal model catalog, decides cost at scale, latency on the first request, and which workloads each platform fits. Together is built for high-volume text and warm-endpoint economics. Replicate is built for running anything anyone has packaged, on demand.
TL;DR
- Pick Together if you serve high-volume text or multimodal traffic and want per-token pricing on warm endpoints. You get 200+ curated models, ATLAS speculative decoding (claimed up to 4x vLLM throughput once warmed), dedicated GPUs and raw HGX clusters when serverless stops being economical, and SOC 2 Type 2 plus HIPAA compliance.
- Pick Replicate if you need the widest catalog of community models, especially image and video, or you want to package and ship your own model with Cog. Per-GPU-second billing with scale-to-zero fits spiky and one-off jobs, and Flux fine-tunes train in under 2 minutes for under $2.
Who Wins Per Workload
The decision is rarely "which is better" and almost always "which fits this workload." Here is the call by the decision a developer actually faces.
| Workload / decision | Together AI | Replicate |
|---|---|---|
| Sustained high-volume text | Together: per-token warm endpoints | Loses: billed per second on every generation |
| Bursty / one-off jobs | Loses: warm endpoint sits idle | Replicate: scale-to-zero, pay nothing idle |
| Fastest first call (no cold start) | Together: warm catalog, no load delay | Loses: 30-120s on idle models |
| Multimodal (image/video/audio) | Curated set only | Replicate: thousands, widest breadth |
| Fine-tune & export text-LLM weights | Together: per-token tunes, download weights | Image LoRA only, per-second |
| Ship an arbitrary custom model | Bring-your-own within families | Replicate: Cog packages any container |
| Raw GPU clusters at scale | Together: reserved H100/H200 + HGX | On-demand GPUs, no HGX clusters |
| Cheapest per request at volume | Together: roughly an order of magnitude lower | Loses on steady traffic |
| Strictest compliance | Together: SOC 2 Type 2 + HIPAA + BAA | Varies by model |
Quick Comparison
The headline split is the billing axis. Together bills per token on warm models; Replicate bills per GPU-second across a far larger but colder catalog. Morph is added here as the recommended pick for the coding-agent slice only.
| Spec | Together AI | Replicate | Morph |
|---|---|---|---|
| Focus | High-volume text + multimodal serving | Run any community model on demand | Coding-agent inner loop |
| Billing axis | Per token (warm) | Per GPU-second (cold starts included) | Per token (code-tuned) |
| Llama 70B serverless | ~$0.88-$1.04 per 1M tokens | ~$0.01-$0.03 per request (GPU-second billed) | N/A (not a general host) |
| Dedicated GPU (H100/hr) | ~$2.99 reserved | $5.49 on-demand | N/A |
| Cold starts | No (warm catalog) | 30-120s on idle models | No |
| Model catalog | 200+ curated | Thousands (community) | Code apply, search, compact |
| Code-specific apply endpoint | No | No | Yes (/v1/code/apply) |
| Semantic code search | No | No | WarpGrep ($0/100k) |
| Apply throughput | General token-by-token | General + cold start | ~10,500 tok/s |
| First-pass apply accuracy | N/A | N/A | 98% |
| Custom model deploy | Bring-your-own + fine-tune | Cog packaging, any model | N/A |
| Best for | Per-token serving at scale | Image/video + custom models | Coding agents (apply/search/compact) |
Numbers are list prices as of early 2026 and change often. Verify on each platform's pricing page before committing volume.
Billing Model: Per Token vs Per GPU-Second
This is the decision that drives everything else. Together meters tokens, both input and output; Replicate meters wall-clock GPU time.
Together: per-token on warm endpoints
Together runs a curated catalog of 200+ models on always-on serverless endpoints and charges per input and output token. You never pay for the GPU sitting idle between your requests, and you never pay for model load time, because the endpoint stays warm. The Together Inference Engine adds ATLAS, an adaptive speculator that retrains on your live traffic and claims up to 4x vLLM throughput once warmed, so steady workloads get faster the longer they run.
Replicate: per-GPU-second across any model
Replicate charges for the time your model spends running on its hardware, including setup and cold-start time on public models. H100 runs at $0.001525 per second ($5.49/hr), A100 80GB at $0.0014 per second ($5.04/hr), L40S at $0.000975 per second ($3.51/hr), and T4 at $0.000225 per second ($0.81/hr). For an LLM, you pay for every second of generation rather than per token, which means a slow or cold model costs more for the same output.
The practical consequence: on steady, high-volume text traffic, Together's per-token model wins because GPU-seconds on Replicate include idle and cold-start overhead. On bursty or one-off jobs, Replicate's scale-to-zero wins because you are not paying to keep an endpoint warm. Some hosted LLMs on Replicate (Claude, DeepSeek) are billed per token instead, so check the model page.
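The per-second prices quoted above convert to hourly rates, and to a per-run cost, with simple multiplication. The sketch below is illustrative only: the function names are made up for this example, and the 8-second generation and 60-second cold start are hypothetical values chosen to show how billed time drives the bill.

```python
# Per-second list prices quoted in the text above (USD per GPU-second).
RATE_PER_SECOND = {
    "H100": 0.001525,
    "A100-80GB": 0.0014,
    "L40S": 0.000975,
    "T4": 0.000225,
}


def hourly_rate(gpu: str) -> float:
    """Convert a per-second list price into a per-hour price."""
    return RATE_PER_SECOND[gpu] * 3600


def run_cost(gpu: str, billed_seconds: float) -> float:
    """Cost of one run billed for `billed_seconds` of wall-clock GPU time."""
    return RATE_PER_SECOND[gpu] * billed_seconds


for gpu in RATE_PER_SECOND:
    print(f"{gpu}: ${hourly_rate(gpu):.2f}/hr")

# Hypothetical run: 8 s of generation, with and without a 60 s cold start.
print(f"warm run: ${run_cost('H100', 8):.4f}")
print(f"cold run: ${run_cost('H100', 8 + 60):.4f}")
```

The same output costs several times more when a cold start is billed, which is the effect the paragraph above describes.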
Pricing in Practice
For mainstream LLM serving at volume, Together is cheaper per request by roughly an order of magnitude. Replicate's pricing shines on image, video, and infrequent custom jobs.
| Workload | Together AI | Replicate |
|---|---|---|
| Llama 3.3 70B | ~$0.88-$1.04 per 1M tokens | ~$0.01-$0.03 per request |
| Small model (under 16B) | From ~$0.03-$0.20 / 1M | GPU-second (model dependent) |
| Image generation | ~$0.0006-$0.134 / image | GPU-second (e.g. Flux) |
| H100 GPU-hour | ~$2.99 reserved | $5.49 on-demand |
| A100 80GB GPU-hour | Custom | $5.04 |
| Scale to zero | Dedicated only | Yes (public models) |
| Idle cost | None on serverless | Paid on private deployments |
One worked example: a Llama 3.3 70B call with 1K input and 500 output tokens is 1,500 billed tokens, which at the listed $0.88 to $1.04 per 1M costs Together roughly $0.0013 to $0.0016 (assuming input and output tokens bill at the same rate). The same call on Replicate runs $0.01 to $0.03 once GPU-second and cold-start time are included, roughly an order of magnitude more. That gap only matters at volume, but at volume it dominates the bill.
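To see what the $0.01 to $0.03 Replicate figure implies, the per-request cost can be inverted into billed GPU time. This is a rough sketch that assumes the request runs on an H100 at the list price quoted earlier; the article does not say which GPU backs the figure, and the function name is illustrative.

```python
H100_USD_PER_SECOND = 0.001525  # Replicate H100 list price quoted earlier


def implied_billed_seconds(request_cost_usd: float,
                           usd_per_second: float = H100_USD_PER_SECOND) -> float:
    """Billed GPU-seconds that would produce a given per-request cost."""
    return request_cost_usd / usd_per_second


low = implied_billed_seconds(0.01)
high = implied_billed_seconds(0.03)
print(f"$0.01 to $0.03 per request is about {low:.1f} to {high:.1f} billed H100 seconds")
```

Under that assumption, the range corresponds to a handful of seconds up to roughly twenty seconds of billed time per request.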
Where each platform wins on cost
Cheaper depends on volume, not preference. Together wins when a warm endpoint is busy enough to amortize: high-volume chat, RAG, batch generation. Replicate wins when an endpoint would otherwise sit idle: occasional image jobs, a niche model you call a few times a day, a custom pipeline you do not want to keep online. Together also estimates dedicated H100s beat its own serverless above roughly 130,000 tokens per minute of sustained load, so heavy users graduate to dedicated either way.
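The 130,000 tokens-per-minute threshold is easier to apply once daily volume is converted to a per-minute average. The helper below is a minimal sketch: the function names and the two sample volumes are hypothetical, and it uses a daily average, so a workload with sharp peaks would need its sustained rate checked rather than its mean.

```python
DEDICATED_BREAK_EVEN_TPM = 130_000  # approximate threshold cited above


def tokens_per_minute(tokens_per_day: float) -> float:
    """Average tokens per minute implied by a daily token volume."""
    return tokens_per_day / (24 * 60)


def suggested_tier(tokens_per_day: float,
                   threshold_tpm: float = DEDICATED_BREAK_EVEN_TPM) -> str:
    """Return 'dedicated' at or above the threshold, otherwise 'serverless'."""
    if tokens_per_minute(tokens_per_day) >= threshold_tpm:
        return "dedicated"
    return "serverless"


# Hypothetical daily volumes, for illustration only.
for daily in (20_000_000, 250_000_000):
    print(f"{daily:,} tokens/day -> {tokens_per_minute(daily):,.0f} tok/min "
          f"-> {suggested_tier(daily)}")
```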
Cost on a Real Workload
Figures below are computed from list prices as of June 2026.
Serving Llama 70B at 50M output tokens per day. On Together serverless at the low end of its listed range (~$0.88 per 1M tokens): 50 × $0.88 = $44/day = ~$1,320/month. That 50M/day works out to an average of ~34,700 tokens/minute, well under Together's own ~130k tok/min break-even, so serverless is the right tier and dedicated would not pay off at this load.
Replicate is GPU-priced, not token-priced, so the comparison comes down to utilization. One H100 at $5.49/hr held warm to absorb steady traffic costs $5.49 × 24 × 30 = ~$3,953/month per GPU, and scale-to-zero saves nothing because steady traffic never lets the GPU idle. Assuming a single H100 can carry this load, Together's per-token serverless is roughly 3x cheaper here (~$1,320 versus ~$3,953). If the load needs more than one GPU, the gap widens, since each extra Replicate GPU adds another ~$3,953/month.
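The monthly figures above can be reproduced in a few lines. This sketch uses a 30-day month and assumes a single held-warm H100 is enough for the load, as the comparison above does; the function names are illustrative.

```python
def serverless_monthly(tokens_per_day: float, usd_per_million: float,
                       days: int = 30) -> float:
    """Monthly bill for per-token serving at a flat rate per 1M tokens."""
    return tokens_per_day / 1_000_000 * usd_per_million * days


def warm_gpu_monthly(usd_per_hour: float, gpus: int = 1, days: int = 30) -> float:
    """Monthly bill for GPUs billed around the clock."""
    return usd_per_hour * 24 * days * gpus


together = serverless_monthly(50_000_000, 0.88)  # 50M tokens/day at $0.88 per 1M
replicate = warm_gpu_monthly(5.49)               # one H100 at $5.49/hr
print(f"Together serverless: ${together:,.0f}/month")
print(f"Replicate warm H100: ${replicate:,.0f}/month")
print(f"ratio: {replicate / together:.1f}x")
```

Raising the `gpus` argument shows how quickly the held-warm bill grows if one GPU turns out not to be enough.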
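The comparison flips only when utilization drops: if that same workload ran a few minutes an hour instead of continuously, Replicate's scale-to-zero would cost a fraction of a held-warm GPU while Together's per-token bill would shrink proportionally too. The crossover is the ratio of the two monthly bills above, $1,320 / $3,953 ≈ 0.33, which is the fraction of the month a pay-per-second H100 can be billed before it costs as much as the per-token bill. Below roughly one-third GPU utilization, Replicate's pay-per-second wins; above it, Together's warm per-token endpoint wins.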
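The one-third crossover falls out of the two monthly bills computed for this workload. The sketch below is a simplification under stated assumptions: it holds the per-token bill fixed at the 50M tokens/day figure, treats the Replicate bill as utilization times the cost of a fully billed H100, and ignores cold-start seconds. The names are illustrative.

```python
TOGETHER_MONTHLY_USD = 50 * 0.88 * 30    # 50M tokens/day at $0.88 per 1M tokens
WARM_H100_MONTHLY_USD = 5.49 * 24 * 30   # one H100 billed around the clock


def break_even_utilization(per_token_monthly: float,
                           warm_gpu_monthly: float) -> float:
    """Fraction of the month a per-second GPU can be billed before it
    costs as much as the per-token bill."""
    return per_token_monthly / warm_gpu_monthly


def cheaper_platform(utilization: float) -> str:
    """Compare a per-second bill at the given utilization with the per-token bill."""
    per_second_bill = utilization * WARM_H100_MONTHLY_USD
    return "Replicate" if per_second_bill < TOGETHER_MONTHLY_USD else "Together"


crossover = break_even_utilization(TOGETHER_MONTHLY_USD, WARM_H100_MONTHLY_USD)
print(f"break-even utilization: {crossover:.0%}")
for u in (0.10, 0.50):
    print(f"utilization {u:.0%}: {cheaper_platform(u)} is cheaper")
```

A real comparison would also need the throughput of the specific model on the GPU, since that determines how many billed seconds a given token volume actually takes.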
Cold Starts & Latency
Together avoids cold starts on its catalog; Replicate trades latency for scale-to-zero economics on idle models.
On Replicate, public models scale to zero when idle. A request after an idle period pays a cold start, 30 to 120 seconds for large image or 70B models depending on container image size and GPU availability at that moment. That is fine for an async image job and unacceptable for an interactive chat. You can pin a private deployment to always-on instances to remove cold starts, but then you pay for the idle time you were trying to avoid.
Together keeps its serverless catalog warm, so listed models respond without a load penalty, and ATLAS speculative decoding pushes steady-state throughput higher the longer a workload runs. For latency-sensitive, always-on traffic, this is the cleaner default.
Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).
Model Catalog: Replicate Is Broader, Together Is Production-Grade
Replicate has the larger raw count; Together has the more reliable serving guarantees.
| Capability | Together AI | Replicate |
|---|---|---|
| Total models | 200+ curated | Thousands (community) |
| Text LLMs | Yes (Llama, Qwen, DeepSeek, GLM) | Yes (incl. GPT, Claude proxied) |
| Image generation | Yes (FLUX, others) | Yes (FLUX, SDXL, thousands) |
| Video generation | Yes | Yes (large selection) |
| Audio / transcription | Yes (~$0.0015/min) | Yes (Whisper, others) |
| Embeddings | Yes (~$0.02/1M) | Yes (community) |
| Rerank endpoint | Yes (~$0.20/1M) | Limited |
| Code interpreter | Yes (~$0.03/session) | No |
Replicate built its name on accessibility: upload a model, get an API endpoint. That created the widest selection of experimental, fine-tuned, and niche variants anywhere, especially for image and video. The tradeoff is that it is a community-model platform first, so production reliability varies by model. Together curates fewer models but runs each as a hardened serverless endpoint with a unified per-token API.
Custom Deployment: Cog vs Bring-Your-Own
Replicate is the easier path to ship an arbitrary custom model; Together favors fine-tuning within its supported set.
Replicate's Cog packages any model, including custom pipelines, preprocessing, and non-standard architectures, into a container with a defined input and output schema, then runs it on managed GPUs. If you have a research model or a multi-step pipeline that does not fit a standard LLM or diffusion template, Replicate is the most direct way to get an API in front of it. Private deployments give you dedicated hardware with scale-to-zero on fast-booting fine-tunes.
Together is built around its curated catalog plus bring-your-own fine-tuning. You can fine-tune supported base models and serve the result serverless or dedicated, with full data control and the option to download your weights. It is less of a general container host and more of a managed inference and fine-tuning cloud for known model families.
Fine-Tuning: Different Targets
Both fine-tune, but they aim at different model types. Together leads on text LoRA and full fine-tunes; Replicate leads on image fine-tunes.
| Aspect | Together AI | Replicate |
|---|---|---|
| Text LoRA (per 1M tokens) | ~$0.48-$2.90 by size | Per GPU-second |
| Full fine-tuning | Yes (per training token) | Per GPU-second |
| Image fine-tuning | Limited | Yes (FLUX, under 2 min / under $2) |
| Download weights | Yes | Yes (LoRA weights) |
| Serve fine-tune | Serverless or dedicated | Warm runnable model |
Replicate's FLUX fine-tuner trains a LoRA in under 2 minutes for under $2 on 8x H100, and hands back a warm, runnable model plus downloadable LoRA weights. That is the fastest path to a custom image model. Together is the stronger choice for text fine-tuning, with per-training-token pricing across LoRA and full tunes and downloadable weights if you want to self-host later.
When to Use Together AI
- High-volume text serving. Per-token billing on warm endpoints amortizes across steady traffic, and ATLAS speculative decoding can reach up to 4x vLLM throughput as it learns your prompt patterns.
- Latency-sensitive, always-on traffic. The serverless catalog stays warm, so listed models respond without a cold-start penalty.
- Multimodal on one bill. Text, image, audio, embeddings, rerank, and a code interpreter through a single unified API.
- You graduate to dedicated. Reserve dedicated H100/H200 endpoints (around $2.99/hr reserved) or raw HGX clusters once you cross roughly 130k tokens per minute.
- Regulated workloads. SOC 2 Type 2 plus HIPAA with encryption in transit and at rest, and Business Associate Agreements for healthcare.
When to Use Replicate
- Widest model selection. Thousands of community models, especially image and video, including experimental and niche variants you will not find on a curated catalog.
- Ship a custom model fast. Cog packages any architecture or pipeline into an API on managed GPUs, with no restriction to supported model families.
- Spiky or one-off jobs. Per-GPU-second billing with scale-to-zero means you pay nothing when idle, ideal for occasional batch image or video runs.
- Fast custom image models. FLUX LoRA fine-tunes in under 2 minutes for under $2, with downloadable weights and a warm runnable model.
- Prototyping across many models. One API key to try thousands of models without provisioning anything.
Frequently Asked Questions
Is Together AI or Replicate cheaper for LLM inference?
For high-volume text generation, Together is usually far cheaper because it bills per token on warm endpoints, roughly $0.88 to $1.04 per 1M tokens for Llama 3.3 70B as of early 2026. Replicate bills per GPU-second, so a single 70B request can cost $0.01 to $0.03 once cold-start time is included, an order of magnitude more on steady traffic. Replicate only wins on cost for spiky or one-off jobs where a warm endpoint would sit idle.
What is the main difference between Together AI and Replicate?
Together is a per-token serverless inference cloud for a curated catalog of 200+ models, kept warm for low latency. Replicate is a per-GPU-second platform for running any Cog-packaged model, thousands of them including image and video, but with cold starts on idle models. Together optimizes for high-volume warm serving; Replicate optimizes for breadth and running custom or niche models on demand.
Does Replicate have cold starts?
Yes. Public models scale to zero when idle, so a request after an idle period pays a cold start, typically 30 to 120 seconds for large image or 70B models depending on container size and GPU availability. You can avoid cold starts with an always-on private deployment, but then you pay for idle time. Together keeps its serverless catalog warm, so it does not have this tradeoff on listed models.
Which platform has more models?
Replicate has the larger raw count: thousands of community-contributed models, with the widest selection of experimental and niche variants anywhere, especially for image and video. Together curates a smaller set of 200+ production models across text, image, audio, code, and embeddings, with a unified per-token API and higher reliability guarantees. Replicate wins on breadth; Together wins on production-grade serving.
Should I use Together or Replicate to fine-tune and deploy a custom model?
It depends on the model type. For a text LLM, Together is the stronger choice: it fine-tunes supported base models with per-training-token pricing across LoRA and full tunes, serves the result serverless or dedicated, and lets you download the weights to self-host later. For an image model or any non-standard architecture, Replicate wins: its FLUX fine-tuner trains a LoRA in under 2 minutes for under $2 and hands back downloadable weights, and Cog packages any container, including custom pipelines and preprocessing, into an API on managed GPUs. Choose Together when you are training and exporting text-LLM weights; choose Replicate when you are packaging an arbitrary model or fine-tuning for image and video.
Together for Text-LLM Throughput, Replicate for Multimodal Breadth
Pick Together when you serve high-volume text or train and export LLM weights; pick Replicate for image, video, and one-line Cog deploys. If applying model-generated code edits is your bottleneck, that is a separate job Morph handles at ~10,500 tok/s.