Together AI runs a broad open-model menu on an OpenAI-compatible API, but its serverless per-token rates sit above most of the field, and it publishes no fixed per-model rate limits. If you are shopping for a Together AI alternative, the right one depends on the exact model you run, whether you need dedicated GPUs, and how much you care about a free tier or a known RPM ceiling. Below: nine alternatives compared on exact pricing, with sources. Figures are as of June 2026.
Why Teams Look Past Together AI
Together is solid infrastructure: a wide catalog, an OpenAI-compatible API, SOC 2 Type 2 certification, and a full GPU-cluster offering. Three things push teams to compare alternatives.
Serverless prices run above the field. On Together, DeepSeek V4 Pro is $2.10 input / $4.40 output per million tokens. DeepInfra lists the same model at $1.30/$2.60, Novita at $1.60/$3.20, and Baseten and Fireworks at $1.74/$3.48. Kimi K2.6 is $1.20/$4.50 on Together vs $0.75/$3.50 on DeepInfra. The exception is GPT-OSS-120B, where Together matches Fireworks and Groq at $0.15/$0.60 and only Baseten is lower at $0.10/$0.50.
No fixed per-model rate limits. Together's limits are dynamic per model and scale with sustained traffic; there are no published per-model ceilings. For a known fixed limit, Together tells you to provision a dedicated endpoint. That is fine for steady traffic and awkward for bursty agent fan-out where you want a predictable ceiling.
Dedicated GPU rates are mid-pack. A Together dedicated H100 80GB endpoint is $6.49/hr; the same card is $1.79/hr on DeepInfra and about $3.95/hr on Modal. Together's on-demand GPU clusters are cheaper at $5.49/hr for an HGX H100, but that is a cluster product, not a per-model serverless endpoint.
Serverless Per-Token Pricing, Same Models
The cleanest comparison is the same open model across providers. Prices are per million tokens, input / output, with cached-input rate noted where published.
DeepSeek V4 Pro

| Provider | Input | Output |
|---|---|---|
| DeepInfra | $1.30 ($0.10 cached) | $2.60 |
| Novita | $1.60 | $3.20 |
| Baseten | $1.74 ($0.145 cached) | $3.48 |
| Fireworks | $1.74 ($0.145 cached) | $3.48 |
| Together AI | $2.10 ($0.20 cached) | $4.40 |
| Morph (morph-dsv4flash, i.e. DeepSeek V4 Flash rather than V4 Pro; 16-bit activations) | $0.139 | $0.278 |

Kimi K2.6

| Provider | Input | Output |
|---|---|---|
| DeepInfra | $0.75 ($0.15 cached) | $3.50 |
| Novita | $0.80 | $3.40 |
| Baseten | $0.95 ($0.16 cached) | $4.00 |
| Fireworks | $0.95 ($0.16 cached) | $4.00 |
| Together AI | $1.20 ($0.20 cached) | $4.50 |

GLM-5.1

| Provider | Input | Output |
|---|---|---|
| DeepInfra | $1.05 ($0.205 cached) | $3.50 |
| Baseten | $1.30 ($0.26 cached) | $4.30 |
| Novita | $1.38 | $4.40 |
| Fireworks | $1.40 ($0.26 cached) | $4.40 |
| Together AI | $1.40 ($0.26 cached) | $4.40 |

GPT-OSS-120B

| Provider | Input | Output |
|---|---|---|
| Baseten | $0.10 | $0.50 |
| Fireworks | $0.15 ($0.015 cached) | $0.60 |
| Together AI | $0.15 | $0.60 |
| Groq (500 tok/s) | $0.15 ($0.075 cached) | $0.60 |
Cached input changes the math
Cached-input discounts are steep enough that a prompt-heavy workload with repeated context can land well below the headline rate. In the tables above, published cached rates run from 50 percent off list (Groq on GPT-OSS-120B) to roughly 80 to 92 percent off (DeepInfra, Baseten, Fireworks, and Together). OpenRouter passes provider rates through with no markup, and DeepInfra publishes cached rates per model. Compare on your real input/output ratio and cache-hit rate, not list price.
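To make the cache-hit arithmetic concrete, here is a minimal sketch that blends fresh input, cached input, and output tokens into one monthly figure. The `Rates` and `Workload` shapes and the example volumes are assumptions for illustration; the per-token prices are the DeepSeek V4 Pro rates quoted in this article.

```typescript
// Per-million-token prices in USD. `cachedInput` is optional because not
// every provider publishes a cached rate.
interface Rates {
  input: number;
  output: number;
  cachedInput?: number;
}

// Hypothetical workload description; replace with your own measurements.
interface Workload {
  inputTokens: number;  // total prompt tokens per month
  outputTokens: number; // total completion tokens per month
  cacheHitRate: number; // fraction of input tokens served from cache, 0..1
}

// Monthly cost in USD. Falls back to the full input rate when no cached
// rate is published.
function monthlyCost(rates: Rates, w: Workload): number {
  const cachedRate = rates.cachedInput ?? rates.input;
  const cachedTokens = w.inputTokens * w.cacheHitRate;
  const freshTokens = w.inputTokens - cachedTokens;
  const total =
    freshTokens * rates.input +
    cachedTokens * cachedRate +
    w.outputTokens * rates.output;
  return total / 1_000_000;
}

// DeepSeek V4 Pro rates from the table above.
const deepinfra: Rates = { input: 1.3, output: 2.6, cachedInput: 0.1 };
const together: Rates = { input: 2.1, output: 4.4, cachedInput: 0.2 };

// Assumed example: 1B input tokens, 100M output tokens, 70% cache hits.
const workload: Workload = { inputTokens: 1e9, outputTokens: 1e8, cacheHitRate: 0.7 };

console.log(monthlyCost(deepinfra, workload).toFixed(2)); // 720.00
console.log(monthlyCost(together, workload).toFixed(2));  // 1210.00
```

At list price with no caching, the same workload would cost $1,560 on DeepInfra and $2,540 on Together, so the cache-hit rate moves the total more than the choice between neighboring providers does.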
Dedicated GPU Pricing ($/hr)
If you want a model pinned to your own GPUs instead of serverless, the hourly rate is the number that matters. Published on-demand rates for an H100 80GB and a B200 180GB:
H100 80GB

| Provider | $/hr | Notes |
|---|---|---|
| DeepInfra | $1.79 | dedicated |
| Modal | ~$3.95 | $0.001097/s, per-second billing |
| Together AI (cluster) | $5.49 | HGX H100 GPU cluster |
| Replicate | $5.49 | $0.001525/s |
| Together AI (endpoint) | $6.49 | dedicated endpoint |
| Baseten | $6.50 | $0.10833/min, dedicated |
| Fireworks | $7.00 | on-demand |

B200 180GB

| Provider | $/hr | Notes |
|---|---|---|
| DeepInfra | $2.79 | dedicated |
| Modal | ~$6.25 | $0.001736/s |
| Together AI (cluster) | $9.95 | HGX B200 cluster ($11.95/hr dedicated) |
| Baseten | $9.98 | $0.16633/min |
| Fireworks | $10.00 | on-demand |
Replicate publishes no B200 rate. Modal and Replicate bill per second and scale to zero, so an endpoint with bursty traffic pays only for active compute. Together's reserved GPU clusters run $3.99 to $9.65/hr depending on hardware and a 7 to 180+ day commitment.
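The per-second and per-minute rates in the tables convert to hourly figures by simple multiplication, and the same numbers give a rough break-even point between an always-on GPU and a scale-to-zero one. The sketch below is a simplification: the function names are illustrative, it assumes the scale-to-zero endpoint bills nothing while idle, and it ignores cold-start time.

```typescript
const SECONDS_PER_HOUR = 3600;
const MINUTES_PER_HOUR = 60;

function hourlyFromPerSecond(usdPerSecond: number): number {
  return usdPerSecond * SECONDS_PER_HOUR;
}

function hourlyFromPerMinute(usdPerMinute: number): number {
  return usdPerMinute * MINUTES_PER_HOUR;
}

// Fraction of each hour a scale-to-zero GPU can be busy before an always-on
// GPU at `alwaysOnHourly` becomes the cheaper option. A result of 1 means
// the scale-to-zero GPU is cheaper at any utilization.
function breakEvenUtilization(alwaysOnHourly: number, scaleToZeroHourly: number): number {
  return Math.min(1, alwaysOnHourly / scaleToZeroHourly);
}

// H100 rates quoted in the table above.
const modalH100 = hourlyFromPerSecond(0.001097);     // about 3.95
const replicateH100 = hourlyFromPerSecond(0.001525); // about 5.49
const basetenH100 = hourlyFromPerMinute(0.10833);    // about 6.50

// Modal per-second billing vs an always-on DeepInfra H100 at $1.79/hr.
const vsDeepInfra = breakEvenUtilization(1.79, modalH100); // about 0.45
// Modal vs an always-on Together dedicated endpoint at $6.49/hr.
const vsTogether = breakEvenUtilization(6.49, modalH100);  // 1

console.log({ modalH100, replicateH100, basetenH100, vsDeepInfra, vsTogether });
```

Under these assumptions, a Modal H100 that is busy less than about 45 percent of the time costs less than an always-on DeepInfra H100, and it costs less than Together's dedicated endpoint at any utilization.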
Rate Limits and Batch Discounts
Rate-limit design is where these providers diverge most. Together is dynamic with no published per-model ceiling. Fireworks publishes a hard 6,000 RPM cap with a payment method (10 RPM without one) and gates monthly budget by spending tier: $50/mo Tier 1, $500 Tier 2, $5,000 Tier 3, $50,000 Tier 4. DeepInfra caps at 200 concurrent requests per account. Groq publishes per-model RPM, RPD, TPM, and TPD on the free plan and lifts them on Developer.
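When a provider publishes a fixed RPM cap, you can enforce it client-side so that bursty fan-out never hits a 429 in the first place. The class below is a minimal sliding-window sketch, not any provider's SDK feature; the class name and the injected-clock parameter are illustrative choices that keep it easy to test.

```typescript
// Sliding-window limiter: admits at most `limit` requests in any span of
// `windowMs` milliseconds.
class SlidingWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly sent: number[] = []; // timestamps (ms) of admitted requests

  constructor(limit: number, windowMs: number = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if it fits under the cap at
  // `nowMs`; returns false without recording anything otherwise.
  tryAcquire(nowMs: number = Date.now()): boolean {
    // Drop timestamps that have aged out of the window.
    while (true) {
      const oldest = this.sent[0];
      if (oldest === undefined || nowMs - oldest < this.windowMs) break;
      this.sent.shift();
    }
    if (this.sent.length >= this.limit) return false;
    this.sent.push(nowMs);
    return true;
  }
}

// 10 RPM is the no-card Fireworks cap quoted above; use 6_000 with a card.
const limiter = new SlidingWindowLimiter(10);
const admitted: boolean[] = [];
for (let i = 0; i < 11; i++) admitted.push(limiter.tryAcquire(0));
console.log(admitted.filter(Boolean).length); // 10
console.log(limiter.tryAcquire(60_000));      // true
```

A caller that gets `false` can queue the request and retry after a short delay. For a concurrency cap such as DeepInfra's 200 concurrent requests, a counting semaphore is the closer fit than a time window.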
| Provider | Rate limit | Batch discount |
|---|---|---|
| Together AI | Dynamic per model, no fixed ceiling; dedicated endpoint for a known limit | Up to 50% off serverless, 24h window, 50k req/batch |
| Fireworks | 6,000 RPM with card (10 RPM without) | 50% of serverless |
| DeepInfra | 200 concurrent requests/account | Not published |
| Groq | Per-model RPM/RPD/TPM/TPD on free; higher on Developer | 50% lower, 24h-7d window |
| OpenRouter | 50 free model req/day (1,000 after $10 credit); :free variants 20 RPM | Pass-through provider rates |
Batch is the cheapest path for non-interactive jobs. Fireworks bills batch at 50 percent of serverless, Groq cuts cost by half with a 24h-to-7-day window, and Together discounts selected models up to 50 percent on a 24h best-effort window (up to 50,000 requests per batch, 100 MB per input file).
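A large offline job usually has to be split to fit the per-batch request cap before submission. The sketch below shows that split and the discount arithmetic under simple assumptions: the helper names are illustrative, the request objects are placeholders, and it does not check the 100 MB per-file limit.

```typescript
// Split an array into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Cost after a fractional discount, e.g. 0.5 for 50 percent off.
function discounted(serverlessUsd: number, discount: number): number {
  return serverlessUsd * (1 - discount);
}

// Together's per-batch request cap quoted above.
const MAX_REQUESTS_PER_BATCH = 50_000;

// Placeholder requests; a real job would carry prompts and parameters.
const requests = Array.from({ length: 120_000 }, (_, i) => ({ id: i }));
const batches = chunk(requests, MAX_REQUESTS_PER_BATCH);

console.log(batches.map((b) => b.length)); // [ 50000, 50000, 20000 ]
console.log(discounted(1000, 0.5));        // 500
```

Because Together's discount is "up to" 50 percent on selected models, treat the `discount` argument as a per-model input rather than a constant.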
Compliance and Data Retention
For regulated workloads, the certification and retention posture often decides the provider before price does.
| Provider | Certifications | Retention default |
|---|---|---|
| Together AI | SOC 2 Type 2 | Standard logging |
| Fireworks | SOC 2 Type II, HIPAA | Zero retention on open models (opt-in to log) |
| DeepInfra | SOC 2, ISO 27001; GDPR/HIPAA measures | Zero retention; only metadata logged |
| Baseten | SOC 2 Type II, HIPAA | Standard |
| Groq | SOC 2 Type II; HIPAA BAA with exclusions | Standard |
| Modal | SOC 2 from Starter; HIPAA on Enterprise | Standard |
| OpenRouter | Pass-through | No prompt/completion logging by default |
The 9 Alternatives, Ranked by Use Case
DeepInfra: cheapest per-token and per-GPU on most models
Lowest serverless rates in this set on DeepSeek V4 Pro ($1.30/$2.60), Kimi K2.6 ($0.75/$3.50), and GLM-5.1 ($1.05/$3.50), plus the cheapest dedicated H100 at $1.79/hr. SOC 2 and ISO 27001, zero retention with metadata-only logging. No free tier, 200 concurrent requests per account, and postpaid billing with mid-month invoicing at $20/$100/$500/$2,000/$10,000 thresholds. Best when price per token is the deciding factor.
Fireworks AI: production serverless with HIPAA and 50% batch
DeepSeek V4 Pro at $1.74/$3.48, Kimi K2.6 at $0.95/$4.00, a 6,000 RPM ceiling with a card, 50 percent batch pricing, and a published zero-retention policy on open models. SOC 2 Type II and HIPAA. Fine-tuning runs $0.50 to $20 per million training tokens by model size. Best for teams that want a known RPM ceiling and compliance docs.
Baseten: dedicated deploys with autoscaling control
Per-minute dedicated GPUs (H100 $6.50/hr, B200 $9.98/hr) plus a model API. Autoscaling defaults to scale-to-zero (min_replica=0), but docs warn cold starts can take minutes for large models and recommend min_replica >= 2 for production. SOC 2 Type II and HIPAA. Best when you want fine autoscaling control on dedicated capacity.
Groq: fastest tok/s on supported models
Custom LPU hardware: GPT-OSS 120B at 500 tok/s, GPT-OSS 20B at 1,000 tok/s, Llama 3.1 8B at 840 tok/s, all at competitive per-token rates ($0.15/$0.60 on GPT-OSS 120B). Free plan with published per-model limits, 50 percent batch. SOC 2 Type II; HIPAA BAA with exclusions for preview features. Best when raw latency is the priority and your model is on Groq's list.
Modal: per-second billing and ~1s cold starts
GPUs billed per second (H100 ~$3.95/hr, B200 ~$6.25/hr) with no idle charge, ~1 second container boots, and memory snapshotting to cut cold-start penalty. Starter is free with $30/month in credits; Team is $250/month plus compute. SOC 2 from Starter, HIPAA on Enterprise. Best for spiky, scale-to-zero serving where you control the container.
Replicate: run-and-pay model hosting
Per-second hardware (H100 $5.49/hr, A100 80GB $5.04/hr) plus per-token rates on some LLMs. Scale-to-zero by default; cold boots can take several minutes for large models but you are only billed for active prediction time. Deployments expose min instances (keep warm) and max instances (cap spend). Best for image/video models and fine-tunes you only pay for while running.
Novita: low-cost serverless with a wide menu
DeepSeek V4 Pro $1.60/$3.20, Kimi K2.6 $0.80/$3.40, Llama 3.3 70B $0.135/$0.40, GLM-4.7-Flash $0.07/$0.40. Sits between DeepInfra and Fireworks on price across most models. Best when you want low rates on a broad catalog.
OpenRouter: one key, 400+ models, no markup
Pass-through pricing with no inference markup (you pay the provider's rate) plus a 5.5 percent fee on credit purchases. 400+ models across 60+ providers, zero prompt logging by default, and a free tier (50 req/day, 1,000 after a $10 credit). BYOK is free for the first 1M requests/month. Best for routing across many models behind one API.
Morph: tuned for the code path
Covered next: the same open models, served on a stack tuned to the code token distribution.
Morph: Tuned for the Code Path
If the workload is a coding agent or IDE assistant, most tokens are code: diffs, file rewrites, tool-call payloads. Code has a different token distribution than prose, heavy on brackets, identifiers, and indentation. A general serving stack treats every token type the same. Morph tunes its stack to that distribution with custom GPU kernels and speculative decoding, reaching about 255 tokens per second on the same open models, and runs apply edits on morph-v3-fast at roughly 10,500 tokens per second.
For DeepSeek specifically, the deciding factor is precision. Many serverless providers quantize activations to fp8 to cut cost, which can degrade output relative to the reference weights. Morph serves DeepSeek with 16-bit (bf16) activations and does not quantize them, so output matches the reference model. That makes Morph the best place to run DeepSeek when fidelity matters, and morph-dsv4flash (DeepSeek V4 Flash) lists at $0.139 input / $0.278 output per million tokens. See Morph Models and pricing.
Morph is OpenAI-compatible at https://api.morphllm.com/v1, so switching from Together is a one-string base-URL change. Point your client at Morph and pick a model: morph-qwen35-397b (397B MoE, 262k context), morph-minimax27-230b (230B MoE, agentic), morph-qwen36-27b (dense, 131k context), or deepseek-v4-flash (1M context).
Switching from Together to Morph
```typescript
import OpenAI from "openai";

// Together
const together = new OpenAI({
  baseURL: "https://api.together.xyz/v1",
  apiKey: process.env.TOGETHER_API_KEY,
});

// Morph: same client, one string changes
const morph = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY,
});

// Top-level await requires an ES module context.
const out = await morph.chat.completions.create({
  model: "morph-qwen35-397b",
  messages: [{ role: "user", content: "Write a TypeScript LRU cache with tests." }],
});
console.log(out.choices[0].message.content);
```

Which One to Pick
- Lowest per-token cost: DeepInfra (cheapest on DeepSeek V4 Pro, Kimi K2.6, GLM-5.1) or Novita.
- Known RPM ceiling and HIPAA: Fireworks (6,000 RPM, SOC 2 Type II + HIPAA, 50% batch).
- Cheapest dedicated GPUs: DeepInfra ($1.79/hr H100) or Modal (per-second, scale-to-zero).
- Fastest tok/s: Groq, when your model is on its list.
- One key across 400+ models: OpenRouter, no inference markup.
- Coding agents and apply edits: Morph, tuned to the code token distribution.
- Stay on Together: if you already run dedicated endpoints or GPU clusters there and value the single-vendor catalog; the serverless premium matters most at high token volume.
Frequently Asked Questions
What is the cheapest Together AI alternative for DeepSeek V4 Pro?
DeepInfra at $1.30/$2.60 per million input/output, against Together's $2.10/$4.40. Novita is $1.60/$3.20; Baseten and Fireworks list $1.74/$3.48.
Is DeepInfra cheaper than Together AI?
On serverless per-token rates, yes, on most shared models. Kimi K2.6 is $0.75/$3.50 on DeepInfra vs $1.20/$4.50 on Together; GLM-5.1 is $1.05/$3.50 vs $1.40/$4.40. DeepInfra also lists dedicated H100 at $1.79/hr against Together's $6.49/hr dedicated endpoint. DeepInfra has no free tier and caps at 200 concurrent requests.
What are Together AI's rate limits?
Together publishes no fixed per-model rate limits. Limits are dynamic and scale with sustained traffic; for a known ceiling, Together recommends a dedicated endpoint.
Together AI vs Fireworks: which is cheaper?
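On serverless, Fireworks is cheaper or tied on every model compared here. DeepSeek V4 Pro is $1.74/$3.48 on Fireworks vs $2.10/$4.40 on Together; Kimi K2.6 is $0.95/$4.00 on Fireworks vs $1.20/$4.50 on Together; GLM-5.1 is $1.40/$4.40 on both. On dedicated H100s the edge flips: Together's endpoint is $6.49/hr against $7.00/hr on Fireworks. Fireworks adds a 6,000 RPM ceiling and 50 percent batch.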
Which alternatives offer a free tier?
Modal Starter ($30/mo credits), Groq (free plan with per-model limits), OpenRouter (50 free req/day, 1,000 after a $10 credit), Fireworks ($1 new-account credit), and Baseten (free experimentation credits). DeepInfra has no free tier.
What is the cheapest H100 hourly rate?
DeepInfra at $1.79/hr dedicated. Modal is ~$3.95/hr, Replicate $5.49/hr, Together $6.49/hr dedicated endpoint ($5.49/hr cluster), Baseten $6.50/hr, Fireworks $7.00/hr.
Which alternative is fastest for coding agents?
Morph reaches about 255 tok/s on the same open models by tuning to the code token distribution, with morph-v3-fast at roughly 10,500 tok/s on apply edits. OpenAI-compatible at api.morphllm.com/v1, so switching is a one-string change.
Run the Code Path on a Codegen-Tuned Endpoint
Same open models the alternatives serve, generated at ~255 tok/s on code, with apply edits at ~10,500 tok/s. OpenAI-compatible, so switching from Together is a one-string change.
