Together AI Alternatives (2026): Pricing, Rate Limits, and 9 Compared

Together AI runs a broad open-model menu on an OpenAI-compatible API, but its serverless per-token rates sit above most of the field, and it publishes no fixed per-model rate limits. If you are shopping for a Together AI alternative, the right one depends on the exact model you run, whether you need dedicated GPUs, and how much you care about a free tier or a known RPM ceiling. Below: nine alternatives compared on exact pricing, with sources. Figures are as of June 2026.

$1.30/$2.60: DeepInfra DeepSeek V4 Pro vs Together $2.10/$4.40
$1.79/hr: DeepInfra H100 vs Together $6.49/hr dedicated
No fixed RPM: Together rate limits are dynamic per model
9 providers: Compared on exact per-token pricing

Why Teams Look Past Together AI

Together is solid infrastructure: a wide catalog, an OpenAI-compatible API, SOC 2 Type 2 certification, and a full GPU-cluster offering. Three things push teams to compare alternatives.

Serverless prices run above the field. On Together, DeepSeek V4 Pro is $2.10 input / $4.40 output per million tokens. DeepInfra lists the same model at $1.30/$2.60, Novita at $1.60/$3.20, and Baseten and Fireworks at $1.74/$3.48. Kimi K2.6 is $1.20/$4.50 on Together vs $0.75/$3.50 on DeepInfra. The exception is GPT-OSS 120B, where Together's $0.15/$0.60 matches Fireworks and Groq; only Baseten lists it lower, at $0.10/$0.50.

No fixed per-model rate limits. Together's limits are dynamic per model and scale with sustained traffic; there are no published per-model ceilings. For a known fixed limit, Together tells you to provision a dedicated endpoint. That is fine for steady traffic and awkward for bursty agent fan-out where you want a predictable ceiling.
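With no published ceiling to plan around, a client usually has to react to throttling rather than pre-empt it. The sketch below shows one common pattern, exponential backoff with jitter. It assumes throttled requests surface as a thrown error carrying a numeric `status` of 429; that error shape and the helper names (`backoffDelayMs`, `isRateLimited`, `withBackoff`) are illustrative assumptions, not part of any provider's SDK.

```typescript
// Hypothetical helpers for illustration; verify the error shape against your client library.

/** Exponential backoff delay: baseMs * 2^attempt, clamped to capMs. */
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

/** Assumes a throttled call throws an object with `status === 429`. */
function isRateLimited(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: number }).status === 429;
}

/**
 * Retries `call` only on 429 responses, sleeping between attempts.
 * `sleep` is injectable so the retry logic can be exercised without real timers.
 */
async function withBackoff<T>(
  call: () => Promise<T>,
  maxRetries = 5,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Non-throttling errors and exhausted retries propagate to the caller.
      if (!isRateLimited(err) || attempt >= maxRetries) throw err;
      // Full jitter: a random delay up to the backoff bound keeps parallel agents from retrying in lockstep.
      await sleep(Math.random() * backoffDelayMs(attempt));
    }
  }
}
```

Wrapping each request of an agent fan-out in `withBackoff` smooths bursts, but it does not give you the predictable ceiling that a published limit or a dedicated endpoint would.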

Dedicated GPU rates sit toward the high end. A Together dedicated H100 80GB endpoint is $6.49/hr, third-highest of the seven H100 rates compared below; the same card is $1.79/hr on DeepInfra and about $3.95/hr on Modal. Together's on-demand GPU clusters are cheaper at $5.49/hr for an HGX H100, but that is a cluster product, not a per-model serverless endpoint.

Serverless Per-Token Pricing, Same Models

The cleanest comparison is the same open model across providers. Prices are per million tokens, input / output, with cached-input rate noted where published.

DeepSeek V4 Pro: serverless $/M (input / output)

Provider	Input	Output
DeepInfra	$1.30 ($0.10 cached)	$2.60
Novita	$1.60	$3.20
Baseten	$1.74 ($0.145 cached)	$3.48
Fireworks	$1.74 ($0.145 cached)	$3.48
Together AI	$2.10 ($0.20 cached)	$4.40
Morph (morph-dsv4flash: DeepSeek V4 Flash, not V4 Pro; 16-bit activations)	$0.139	$0.278

Kimi K2.6: serverless $/M (input / output)

Provider	Input	Output
DeepInfra	$0.75 ($0.15 cached)	$3.50
Novita	$0.80	$3.40
Baseten	$0.95 ($0.16 cached)	$4.00
Fireworks	$0.95 ($0.16 cached)	$4.00
Together AI	$1.20 ($0.20 cached)	$4.50

GLM-5.1: serverless $/M (input / output)

Provider	Input	Output
DeepInfra	$1.05 ($0.205 cached)	$3.50
Baseten	$1.30 ($0.26 cached)	$4.30
Novita	$1.38	$4.40
Fireworks	$1.40 ($0.26 cached)	$4.40
Together AI	$1.40 ($0.26 cached)	$4.40

GPT-OSS 120B: serverless $/M (input / output)

Provider	Input	Output
Baseten	$0.10	$0.50
Fireworks	$0.15 ($0.015 cached)	$0.60
Together AI	$0.15	$0.60
Groq (500 tok/s)	$0.15 ($0.075 cached)	$0.60

Cached input changes the math

Several providers discount cached input tokens, so a prompt-heavy workload with repeated context can land well below the headline rate. In the tables above, the published cached rate ranges from half the input rate (Groq on GPT-OSS 120B, $0.075 vs $0.15) to under a tenth of it (DeepInfra on DeepSeek V4 Pro, $0.10 vs $1.30). Fireworks states a 50 percent cached-input discount, OpenRouter passes provider rates through with no markup, and DeepInfra publishes cached rates per model. Compare on your real input/output ratio and cache-hit rate, not list price.
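To make the "real ratio, not list price" point concrete, here is a minimal cost sketch. The rates are copied from the DeepSeek V4 Pro table above; the `Rate` shape and the `workloadCost` function are illustrative names, and the model assumes a cache hit is billed entirely at the cached rate, which may not match every provider's billing rules.

```typescript
interface Rate {
  inputPerM: number;        // $ per million uncached input tokens
  outputPerM: number;       // $ per million output tokens
  cachedInputPerM?: number; // $ per million cached input tokens, where published
}

// DeepSeek V4 Pro rates from the table above (June 2026 figures).
const deepseekV4Pro: Record<string, Rate> = {
  DeepInfra: { inputPerM: 1.3, outputPerM: 2.6, cachedInputPerM: 0.1 },
  Fireworks: { inputPerM: 1.74, outputPerM: 3.48, cachedInputPerM: 0.145 },
  "Together AI": { inputPerM: 2.1, outputPerM: 4.4, cachedInputPerM: 0.2 },
};

/** Dollar cost of a workload, given token counts and the fraction of input tokens served from cache. */
function workloadCost(rate: Rate, inputTokens: number, outputTokens: number, cacheHitRate = 0): number {
  // With no published cached rate, cached tokens are billed at the normal input rate.
  const cachedRate = rate.cachedInputPerM ?? rate.inputPerM;
  const cachedTokens = inputTokens * cacheHitRate;
  const uncachedTokens = inputTokens - cachedTokens;
  const dollars =
    uncachedTokens * rate.inputPerM + cachedTokens * cachedRate + outputTokens * rate.outputPerM;
  return dollars / 1_000_000;
}

for (const [provider, rate] of Object.entries(deepseekV4Pro)) {
  console.log(provider, workloadCost(rate, 10_000_000, 1_000_000, 0.8).toFixed(2));
}
```

At 10M input tokens, 1M output tokens, and an 80 percent cache-hit rate, this works out to about $6.00 on DeepInfra and $10.20 on Together, against $15.60 and $25.40 with no cache hits.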

Dedicated GPU Pricing ($/hr)

If you want a model pinned to your own GPUs instead of serverless, the hourly rate is the number that matters. Published on-demand rates for an H100 80GB and a B200 180GB:

H100 80GB: on-demand $/hr

Provider	$/hr	Notes
DeepInfra	$1.79	dedicated
Modal	~$3.95	$0.001097/s, per-second billing
Together AI (cluster)	$5.49	HGX H100 GPU cluster
Replicate	$5.49	$0.001525/s
Together AI (endpoint)	$6.49	dedicated endpoint
Baseten	$6.50	$0.10833/min, dedicated
Fireworks	$7.00	on-demand

B200 180GB: on-demand $/hr

Provider	$/hr	Notes
DeepInfra	$2.79	dedicated
Modal	~$6.25	$0.001736/s
Together AI (cluster)	$9.95	HGX B200 cluster ($11.95/hr dedicated)
Baseten	$9.98	$0.16633/min
Fireworks	$10.00	on-demand

Replicate publishes no B200 rate. Modal and Replicate bill per second and scale to zero, so an endpoint with bursty traffic pays only for active compute. Together's reserved GPU clusters run $3.99 to $9.65/hr depending on hardware and a 7 to 180+ day commitment.
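The hourly figures in the tables follow directly from the per-second and per-minute rates, and the same arithmetic gives a rough break-even between scale-to-zero and always-on capacity. The sketch below assumes an always-on GPU is billed for every hour whether busy or idle, and it ignores cold-start time; the function names are illustrative.

```typescript
/** Converts a per-second GPU rate to dollars per hour. */
function hourlyFromPerSecond(perSecond: number): number {
  return perSecond * 3600;
}

/** Converts a per-minute GPU rate to dollars per hour. */
function hourlyFromPerMinute(perMinute: number): number {
  return perMinute * 60;
}

/**
 * Fraction of each hour (0..1) a GPU must be busy before an always-on hourly rate
 * becomes cheaper than per-second, scale-to-zero billing. Clamped to 1, meaning
 * the per-second option is cheaper at any utilization.
 */
function breakEvenUtilization(alwaysOnPerHour: number, perSecondRate: number): number {
  return Math.min(1, alwaysOnPerHour / hourlyFromPerSecond(perSecondRate));
}

console.log(hourlyFromPerSecond(0.001097).toFixed(2));        // Modal H100
console.log(hourlyFromPerMinute(0.10833).toFixed(2));         // Baseten H100
console.log(breakEvenUtilization(1.79, 0.001097).toFixed(2)); // DeepInfra always-on vs Modal
```

With the figures above, Modal's $0.001097/s H100 undercuts an always-on $1.79/hr DeepInfra H100 only when the GPU is busy less than roughly 45 percent of the time. Against Together's $6.49/hr endpoint the result clamps to 1: the per-second rate is lower at any utilization.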

Rate Limits and Batch Discounts

Rate-limit design is where these providers diverge most. Together is dynamic with no published per-model ceiling. Fireworks publishes a hard 6,000 RPM cap with a payment method (10 RPM without one) and gates monthly budget by spending tier: $50/mo Tier 1, $500 Tier 2, $5,000 Tier 3, $50,000 Tier 4. DeepInfra caps at 200 concurrent requests per account. Groq publishes per-model RPM, RPD, TPM, and TPD on the free plan and lifts them on Developer.
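When the limit is a concurrency cap rather than an RPM figure, as with DeepInfra's 200 concurrent requests per account, a client-side limiter keeps a fan-out under the cap instead of discovering it through errors. The class below is a minimal sketch of a counting semaphore; `ConcurrencyLimiter` is an illustrative name, not a library API, and the cap only holds if a single process sends all of the account's traffic.

```typescript
/** Minimal counting semaphore: at most `limit` tasks run at once, the rest wait in FIFO order. */
class ConcurrencyLimiter {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // All slots taken: park until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // pass the slot straight to the next waiter; `active` stays unchanged
      } else {
        this.active--;
      }
    }
  }
}

// Usage: stay just under a 200-concurrent-request cap.
const limiter = new ConcurrencyLimiter(190);
```

Each request then goes through `limiter.run(() => sendRequest(...))`, where `sendRequest` stands in for whatever call your client makes. Leaving some headroom below the published cap is a design choice for accounts that also serve other traffic.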

Rate limits and batch pricing

Provider	Rate limit	Batch discount
Together AI	Dynamic per model, no fixed ceiling; dedicated endpoint for a known limit	Up to 50% off serverless, 24h window, 50k req/batch
Fireworks	6,000 RPM with card (10 RPM without)	50% of serverless
DeepInfra	200 concurrent requests/account	Not published
Groq	Per-model RPM/RPD/TPM/TPD on free; higher on Developer	50% lower, 24h-7d window
OpenRouter	50 free model req/day (1,000 after $10 credit); :free variants 20 RPM	Pass-through provider rates

Batch is the cheapest path for non-interactive jobs. Fireworks bills batch at 50 percent of serverless, Groq cuts cost by half with a 24h-to-7-day window, and Together discounts selected models up to 50 percent on a 24h best-effort window (up to 50,000 requests per batch, 100 MB per input file).
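A large offline job has to be split to fit those per-batch limits. The helper below is a minimal sketch that packs requests into batches bounded by both a request count and a byte budget. The defaults mirror Together's stated limits (50,000 requests, 100 MB per input file); reading "100 MB" as 100,000,000 bytes is a conservative assumption, and `splitIntoBatches` is an illustrative name rather than part of any SDK.

```typescript
/**
 * Greedily packs items into batches so that each batch has at most `maxItems`
 * entries and at most `maxBytes` total size, as reported by `sizeOf`.
 */
function splitIntoBatches<T>(
  items: T[],
  sizeOf: (item: T) => number,
  maxItems = 50_000,
  maxBytes = 100_000_000, // assumption: decimal megabytes
): T[][] {
  const batches: T[][] = [];
  let current: T[] = [];
  let currentBytes = 0;

  for (const item of items) {
    const size = sizeOf(item);
    if (size > maxBytes) {
      throw new Error("a single request exceeds the per-file size limit");
    }
    // Start a new batch when adding this item would break either limit.
    if (current.length >= maxItems || currentBytes + size > maxBytes) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(item);
    currentBytes += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

In practice `sizeOf` would return the UTF-8 byte length of each serialized request line, including its trailing newline, so the byte budget tracks the actual upload size.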

Compliance and Data Retention

For regulated workloads, the certification and retention posture often decides the provider before price does.

Compliance and data retention

Provider	Certifications	Retention default
Together AI	SOC 2 Type 2	Standard logging
Fireworks	SOC 2 Type II, HIPAA	Zero retention on open models (opt-in to log)
DeepInfra	SOC 2, ISO 27001; GDPR/HIPAA measures	Zero retention; only metadata logged
Baseten	SOC 2 Type II, HIPAA	Standard
Groq	SOC 2 Type II; HIPAA BAA with exclusions	Standard
Modal	SOC 2 from Starter; HIPAA on Enterprise	Standard
OpenRouter	Pass-through	No prompt/completion logging by default

The 9 Alternatives, Ranked by Use Case

DeepInfra: cheapest per-token and per-GPU on most models

Lowest serverless rates in this set on DeepSeek V4 Pro ($1.30/$2.60), Kimi K2.6 ($0.75/$3.50), and GLM-5.1 ($1.05/$3.50), plus the cheapest dedicated H100 at $1.79/hr. SOC 2 and ISO 27001, zero retention with metadata-only logging. No free tier, 200 concurrent requests per account, and postpaid billing with mid-month invoicing at $20/$100/$500/$2,000/$10,000 thresholds. Best when price per token is the deciding factor.

Fireworks AI: production serverless with HIPAA and 50% batch

DeepSeek V4 Pro at $1.74/$3.48, Kimi K2.6 at $0.95/$4.00, a 6,000 RPM ceiling with a card, 50 percent batch pricing, and a published zero-retention policy on open models. SOC 2 Type II and HIPAA. Fine-tuning runs $0.50 to $20 per million training tokens by model size. Best for teams that want a known RPM ceiling and compliance docs.

Baseten: dedicated deploys with autoscaling control

Per-minute dedicated GPUs (H100 $6.50/hr, B200 $9.98/hr) plus a model API. Autoscaling defaults to scale-to-zero (min_replica=0), but docs warn cold starts can take minutes for large models and recommend min_replica >= 2 for production. SOC 2 Type II and HIPAA. Best when you want fine autoscaling control on dedicated capacity.

Groq: fastest tok/s on supported models

Custom LPU hardware: GPT-OSS 120B at 500 tok/s, GPT-OSS 20B at 1,000 tok/s, Llama 3.1 8B at 840 tok/s, all at competitive per-token rates ($0.15/$0.60 on GPT-OSS 120B). Free plan with published per-model limits, 50 percent batch. SOC 2 Type II; HIPAA BAA with exclusions for preview features. Best when raw latency is the priority and your model is on Groq's list.

Modal: per-second billing and ~1s cold starts

GPUs billed per second (H100 ~$3.95/hr, B200 ~$6.25/hr) with no idle charge, ~1 second container boots, and memory snapshotting to cut cold-start penalty. Starter is free with $30/month in credits; Team is $250/month plus compute. SOC 2 from Starter, HIPAA on Enterprise. Best for spiky, scale-to-zero serving where you control the container.

Replicate: run-and-pay model hosting

Per-second hardware (H100 $5.49/hr, A100 80GB $5.04/hr) plus per-token rates on some LLMs. Scale-to-zero by default; cold boots can take several minutes for large models but you are only billed for active prediction time. Deployments expose min instances (keep warm) and max instances (cap spend). Best for image/video models and fine-tunes you only pay for while running.

Novita: low-cost serverless with a wide menu

DeepSeek V4 Pro $1.60/$3.20, Kimi K2.6 $0.80/$3.40, Llama 3.3 70B $0.135/$0.40, GLM-4.7-Flash $0.07/$0.40. Sits between DeepInfra and Fireworks on price across most models. Best when you want low rates on a broad catalog.

OpenRouter: one key, 400+ models, no markup

Pass-through pricing with no inference markup (you pay the provider's rate) plus a 5.5 percent fee on credit purchases. 400+ models across 60+ providers, zero prompt logging by default, and a free tier (50 req/day, 1,000 after a $10 credit). BYOK is free for the first 1M requests/month. Best for routing across many models behind one API.

Morph: tuned for the code path

Covered next: the same open models, served on a stack tuned to the code token distribution.

Morph: Tuned for the Code Path

If the workload is a coding agent or IDE assistant, most tokens are code: diffs, file rewrites, tool-call payloads. Code has a different token distribution than prose, heavy on brackets, identifiers, and indentation. A general serving stack treats every token type the same. Morph tunes its stack to that distribution with custom GPU kernels and speculative decoding, reaching about 255 tokens per second on the same open models, and runs apply edits on morph-v3-fast at roughly 10,500 tokens per second.

For DeepSeek specifically, the deciding factor is precision. Most serverless providers quantize activations to fp8 to cut cost, which degrades output against the reference weights. Morph serves DeepSeek with 16-bit (bf16) activations and does not quantize them, so output matches the reference model. That makes Morph the best place to run DeepSeek when fidelity matters, and morph-dsv4flash (DeepSeek V4 Flash) lists at $0.139 input / $0.278 output per million tokens. See Morph Models and pricing.

Morph is OpenAI-compatible at https://api.morphllm.com/v1, so switching from Together is a one-string base-URL change. Point your client at Morph and pick a model: morph-glm52-744b (744B MoE, 1M context), morph-minimax27-230b (230B MoE, agentic), morph-qwen36-27b (dense, 131k context), or deepseek-v4-flash (1M context).

Switching from Together to Morph

import OpenAI from "openai";

// Together
const together = new OpenAI({
  baseURL: "https://api.together.xyz/v1",
  apiKey: process.env.TOGETHER_API_KEY,
});

// Morph: same client, one string changes
const morph = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY,
});

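// Annotate the array so "user" is checked as a message role instead of widening to string.
const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
  { role: "user", content: "Write a TypeScript LRU cache with tests." },
];
const out = await morph.chat.completions.create({ model: "morph-glm52-744b", messages });
console.log(out.choices[0].message.content);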
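Because throughput is the headline claim, it is worth measuring on your own prompts after the switch. The sketch below computes end-to-end tokens per second from a completion-token count and wall-clock time. `tokensPerSecond` and `measureThroughput` are illustrative names, and the request itself is injected as a callback, so nothing here depends on a particular client.

```typescript
/** Output tokens per second for one request, from a token count and elapsed wall-clock time. */
function tokensPerSecond(completionTokens: number, elapsedMs: number): number {
  if (elapsedMs <= 0) throw new Error("elapsedMs must be positive");
  return completionTokens / (elapsedMs / 1000);
}

/**
 * Times one request and returns its end-to-end tokens per second.
 * `complete` performs the request and reports how many tokens were generated;
 * `now` is injectable so the timing logic can be checked with a fake clock.
 */
async function measureThroughput(
  complete: () => Promise<{ completionTokens: number }>,
  now: () => number = () => Date.now(),
): Promise<number> {
  const start = now();
  const { completionTokens } = await complete();
  return tokensPerSecond(completionTokens, now() - start);
}
```

With the OpenAI client, `complete` can wrap `morph.chat.completions.create(...)` and return `{ completionTokens: response.usage?.completion_tokens ?? 0 }`, assuming the endpoint populates `usage`. This figure includes network latency and time to first token, so it will read lower than a pure decode-speed number; run the same prompt against each provider for a fair comparison.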

WarpGrep for agent code search

For the search half of an agent loop, Morph's WarpGrep is $0 for the first 100k requests and $1 per million after. See WarpGrep and pricing.

Which One to Pick

Lowest per-token cost: DeepInfra (cheapest on DeepSeek V4 Pro, Kimi K2.6, GLM-5.1) or Novita.
Known RPM ceiling and HIPAA: Fireworks (6,000 RPM, SOC 2 Type II + HIPAA, 50% batch).
Cheapest dedicated GPUs: DeepInfra ($1.79/hr H100) or Modal (per-second, scale-to-zero).
Fastest tok/s: Groq, when your model is on its list.
One key across 400+ models: OpenRouter, no inference markup.
Coding agents and apply edits: Morph, tuned to the code token distribution.
Stay on Together: if you already run dedicated endpoints or GPU clusters there and value the single-vendor catalog; the serverless premium matters most at high token volume.

Frequently Asked Questions

What is the cheapest Together AI alternative for DeepSeek V4 Pro?

DeepInfra at $1.30/$2.60 per million input/output, against Together's $2.10/$4.40. Novita is $1.60/$3.20; Baseten and Fireworks list $1.74/$3.48.

Is DeepInfra cheaper than Together AI?

On serverless per-token rates, yes, on most shared models. Kimi K2.6 is $0.75/$3.50 on DeepInfra vs $1.20/$4.50 on Together; GLM-5.1 is $1.05/$3.50 vs $1.40/$4.40. DeepInfra also lists dedicated H100 at $1.79/hr against Together's $6.49/hr dedicated endpoint. DeepInfra has no free tier and caps at 200 concurrent requests.

What are Together AI's rate limits?

Together publishes no fixed per-model rate limits. Limits are dynamic and scale with sustained traffic; for a known ceiling, Together recommends a dedicated endpoint.

Together AI vs Fireworks: which is cheaper?

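On serverless, Fireworks is cheaper or equal on every model compared here. DeepSeek V4 Pro is $1.74/$3.48 on Fireworks vs $2.10/$4.40 on Together; Kimi K2.6 is $0.95/$4.00 vs $1.20/$4.50; GLM-5.1 is $1.40/$4.40 on both, and GPT-OSS 120B is $0.15/$0.60 on both. Together's edge is dedicated hardware: an H100 endpoint is $6.49/hr against $7.00/hr on Fireworks. Fireworks adds a 6,000 RPM ceiling and 50 percent batch.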

Which alternatives offer a free tier?

Modal Starter ($30/mo credits), Groq (free plan with per-model limits), OpenRouter (50 free req/day, 1,000 after a $10 credit), Fireworks ($1 new-account credit), and Baseten (free experimentation credits). DeepInfra has no free tier.

What is the cheapest H100 hourly rate?

DeepInfra at $1.79/hr dedicated. Modal is ~$3.95/hr, Replicate $5.49/hr, Together $6.49/hr dedicated endpoint ($5.49/hr cluster), Baseten $6.50/hr, Fireworks $7.00/hr.

Which alternative is fastest for coding agents?

Morph reaches about 255 tok/s on the same open models by tuning to the code token distribution, with morph-v3-fast at roughly 10,500 tok/s on apply edits. OpenAI-compatible at api.morphllm.com/v1, so switching is a one-string change.

The fastest endpoints are private deployments

Morph's top speeds come from dedicated deployments, not shared public endpoints: speculators trained on your traffic, caching tuned to your workload, and volume discounts over public per-token rates. Over 100 billion tokens per day run this way.

Run the Code Path on a Codegen-Tuned Endpoint

Same open models the alternatives serve, generated at ~255 tok/s on code, with apply edits at ~10,500 tok/s. OpenAI-compatible, so switching from Together is a one-string change.
