Together AI vs Modal Pricing (2026): $1.04/M Tokens vs $0.001097/GPU-sec, and the ~91M-Token Crossover

You are comparing Together AI and Modal because both call themselves serverless and both run on the same Nvidia GPUs. Their bills are not comparable line for line, because they price on different axes.

Together AI sells finished tokens. You call an OpenAI-compatible endpoint, pick an open model, and pay per million tokens. Llama 3.3 70B is $1.04/M in and out. You never see a GPU.

Modal sells GPU seconds. You bring your own code and weights, Modal runs them in an isolated container, and you pay per GPU-second. An H100 is $0.001097/sec, about $3.95/hr base, before region and non-preemptible multipliers. There is no managed chat model behind a URL; if you want tokens you deploy vLLM or SGLang yourself.

Below: the exact prices, the token volume where Modal's per-second rate undercuts Together's per-token bill, rate limits, cold starts, and the workloads each one wins.

$1.04/M

Together Llama 3.3 70B

$0.001097/s

Modal H100 base

~91M tok/day

Crossover (one H100)

up to 3x

Modal non-preempt mult.

TL;DR

Pick Together AI for bursty or low-volume traffic on a stock open model. Per-token billing, $0 idle, 200+ models behind one OpenAI-compatible base URL, the ATLAS speculator (up to 400% faster on dedicated endpoints), serverless multi-LoRA, and a batch API at up to 50% off.
Pick Modal for raw serverless GPUs and isolated sandboxes to run your own code or model. H100 at $0.001097/sec base, scale-to-zero, ~1 second container boot, memory snapshotting, and 50,000+ concurrent sessions. You build the serving stack.
Crossover: at $1.04/M, one warm Modal H100 (~$2,845/mo if held 24/7) only beats Together once it serves more than about 91M tokens/day, and only if the card can push that volume. Modal's region and non-preemptible multipliers push the crossover higher.

Pricing Side by Side: Per Token vs Per GPU-Second

These two price on different axes, so a head-to-head only makes sense at a fixed workload. Together's number is a token rate; Modal's is a time rate.

Together AI vs Modal pricing (as of June 2026)

Item	Together AI	Modal
Llama 3.3 70B	$1.04 / 1M tokens	Run it yourself on a GPU
GPT-OSS-120B	$0.15 / $0.60 per 1M in/out	Run it yourself on a GPU
DeepSeek V4 Pro	$2.10 / $4.40 per 1M in/out	Run it yourself on a GPU
H100 80GB	$6.49/hr dedicated, $5.49/hr cluster	$0.001097/sec (~$3.95/hr) base
H200	$6.79/hr cluster	$0.001261/sec (~$4.54/hr) base
A100 80GB	Cluster / quote	$0.000694/sec (~$2.50/hr) base
B200 180GB	$11.95/hr dedicated, $9.95/hr cluster	$0.001736/sec (~$6.25/hr) base
GPU price multipliers	None (token rate is final)	1.5-1.75x region, up to 3x non-preemptible
Idle cost	$0 (per token)	$0 (scale-to-zero)
Batch discount	Up to 50% off (24h window)	N/A
Free credits	Trial credits	$30/mo (Starter), $100/mo (Team)

Together AI bills consumption only: no subscriptions, no setup fees, no minimums. The per-token rate already bundles their serving stack, so you do not separately pay for utilization. The batch API cuts serverless rates up to 50% on selected models with a 24h best-effort window (up to 50,000 requests per batch, 30B tokens enqueued per model).

Modal bills per GPU-second of actual execution, which is cheaper than Together's dedicated H100 rate on the sticker ($3.95/hr base vs $6.49/hr). But the headline price is the base. Region selection adds a 1.5 to 1.75x multiplier, and non-preemptible execution adds up to 3x. A non-preemptible US H100 lands well above the sticker. You also pay for every second the GPU is up, busy or not, plus CPU at $0.0000131 per physical core-second and memory at $0.00000222 per GiB-second.

$1.04

Together: Llama 3.3 70B / 1M tokens

$0.001097

Modal: H100 base / second

up to 50%

Together batch API discount

Which is actually cheaper

Cheaper depends on utilization, not on a sticker price. For bursty or low-volume traffic, Together wins: you pay per token and nothing when idle. Above the crossover, where you keep an H100 saturated with a model you have tuned, Modal's per-second rate undercuts a per-token bill. The worked scenario below puts a number on that crossover.

Cost on a Real Workload

Take one concrete job: serving Llama 3.3 70B at 50M output tokens per day. Here is the arithmetic, using only the list prices already on this page.

Together AI, serverless tokens: 50M tokens/day at $1.04 per 1M = 50 × $1.04 = $52/day, about $1,560/mo. Zero ops, scales to zero, no GPU to keep warm.
Modal, dedicated H100: one H100 at $0.001097/sec = $3.95/hr = $94.80/day = about $2,845/mo if you hold it warm 24/7. That is before Modal's region (1.5 to 1.75x) and non-preemptible (up to 3x) multipliers, which push an always-on US H100 well past $3,000/mo. You also still build and run the vLLM or SGLang server yourself.

So at 50M tokens/day, Together is cheaper outright (~$1,560 vs ~$2,845+) and involves no serving stack. Modal only wins once one warm H100 serves more than ~$2,845/mo worth of tokens, which is roughly 91M tokens/day at Together's $1.04/M rate, assuming the card can actually push that volume. Below that throughput, Together's per-token bill is lower; above it, owning the GPU on Modal pays off. The break-even moves further in Modal's favor if you batch (Together's up to 50% batch discount roughly doubles the crossover) or further against it once you add Modal's region and non-preemptible multipliers.

Together AI vs Modal H100 Price

The H100 line is the most-searched number in this comparison, so here it is across the wider market for context. Modal's base H100 rate is the cheapest of the two, but several other providers are cheaper still on dedicated H100s.

H100 80GB on-demand / dedicated rates (June 2026)

Provider	H100 rate	Note
DeepInfra	$1.79/hr	Dedicated
Modal	~$3.95/hr ($0.001097/sec)	Base, before region / non-preemptible multipliers
Together AI (cluster)	$5.49/hr	HGX H100 on-demand cluster
Replicate	$5.49/hr	$0.001525/sec
Together AI (dedicated)	$6.49/hr	Dedicated endpoint
Baseten	$6.50/hr	Dedicated, $0.10833/min
Fireworks AI	$7.00/hr	On-demand

For B200 180GB the order shifts: DeepInfra $2.79/hr, Modal ~$6.25/hr ($0.001736/sec base), Baseten $9.98/hr, Together $9.95/hr cluster ($11.95/hr dedicated), Fireworks $10.00/hr. If a raw H100 or B200 hourly rate is your only metric, neither Together nor Modal is the floor; DeepInfra undercuts both. The reason to pick Together or Modal is the layer they sit at, not the GPU sticker.

Who Wins Per Workload

The choice is rarely "which is better" and almost always "which fits this job." Map your workload to a row.

Who wins, by developer decision

Workload / decision	Together AI	Modal
Call a stock open model fast	Together (one base URL, 200+ models)	Modal (you build the server)
Bursty / low-volume traffic	Together (per token, $0 idle)	Modal (pays for warm GPUs)
Sustained, saturated throughput	Together (per-token markup)	Modal (per-second, you saturate)
Run your own model or engine	Together (catalog only)	Modal (BYO vLLM / SGLang)
Fine-tune and export weights	Together (managed LoRA + full)	Modal (run your own trainer)
Execute untrusted / agent code	Together (no sandbox)	Modal (isolated container per session)
Cheapest raw H100 hour	$5.49/hr cluster	~$3.95/hr base
Arbitrary non-LLM GPU jobs	Together (inference only)	Modal (any Python on a GPU)

What Each One Actually Is

The single most important decision here is whether you want to see a GPU or not.

Together AI: a token factory

Together AI is an inference API. You hit an OpenAI-compatible endpoint, name a model, and get tokens back. The catalog covers 200+ open models (Llama, DeepSeek, Qwen, Kimi, GLM, MiniMax, and more) plus embeddings, image, and audio. Together also runs dedicated endpoints where a model is pinned to reserved GPUs, GPU clusters (HGX H100 at $5.49/hr, H200 at $6.79/hr, B200 at $9.95/hr on-demand, reserved from $3.99/hr), and a managed fine-tuning service. The point of Together is that you never provision, patch, or saturate a GPU yourself.

Modal: a serverless GPU runtime

Modal is a compute platform built for secure sandboxed execution at scale, with GPU support on top. You define infrastructure in Python and Modal runs it in isolated containers with a custom filesystem and scheduler. Containers boot in about 1 second on Modal's custom stack, and functions scale to zero when idle (default 60s idle window via scaledown_window, configurable 2s to 20min). It is not a model catalog. If you want to serve Llama, you deploy your own vLLM or SGLang server on Modal GPUs and expose the port.

The clean dividing line

Together AI answers "give me tokens from model X." Modal answers "run my container on a GPU, securely, and scale it." If your app is "call an LLM," Together is less work. If your app is "run arbitrary code or a custom model," Modal is the right primitive.

Rate Limits and Throughput

These platforms gate scale differently. Together throttles by request rate; Modal caps by container and GPU concurrency.

Scaling limits

Limit	Together AI	Modal
Request rate	Dynamic per model, scales with sustained traffic	Not rate-gated; gated by containers
Fixed published limit	None (use a dedicated endpoint)	100 containers / 10 GPU concurrency (Starter)
Higher tier	Dedicated endpoint for fixed capacity	1000 containers / 50 GPU concurrency (Team)
Max concurrency	Absorbed by multi-tenant API	50,000+ concurrent sessions
Batch ceiling	50,000 requests/batch, 30B tokens enqueued	N/A (your code, your concurrency)

Together's serverless rate limits are not published as fixed per-model numbers; they scale with sustained traffic, and for a guaranteed fixed ceiling Together points you to a dedicated endpoint. Modal does not throttle request rate at all. It scales containers up to your plan's GPU-concurrency limit, so a traffic spike turns into more containers rather than a 429.

Speed and Cold Starts

Together optimizes tokens-per-second; Modal optimizes container-start time. Different metrics for different jobs.

Together AI: ATLAS speculator

Together's headline speed feature is ATLAS, a runtime-learning speculative decoder. A small draft model proposes tokens, the main model verifies them in parallel, and ATLAS keeps retraining the draft on live production traffic so it stays aligned as your workload shifts. Together reports up to a 400% speedup over the FP8 baseline, lifting DeepSeek-V3.1 from 105 to 501 tok/s for batch size 1 on 4 B200 GPUs. ATLAS ships on dedicated endpoints at no extra cost.

Modal: cold starts and snapshots

Modal's speed story is cold start. Containers boot in about 1 second on Modal's custom container stack. Memory snapshotting captures a container's memory state at user-controlled points and reuses it across boots to cut the cold-start penalty on initialization-heavy workloads. To eliminate cold starts entirely, min_containers keeps warm capacity and buffer_containers over-provisions during spikes. Since Modal scales to zero, cold start is the cost you pay for not holding idle GPUs.

~400%

Together ATLAS speedup vs FP8 baseline

~1 sec

Modal container boot time

Autoscaling and Scale-to-Zero

Both scale to zero, but the unit they scale is different.

Together AI's serverless tier has no scaling for you to manage: it is a shared, multi-tenant token API that absorbs your traffic. Its dedicated endpoints autoscale reserved GPU capacity for a single model with predictable latency. Serverless multi-LoRA lets you deploy fine-tuned adapters behind one base model at the base model's price.

Modal autoscales containers. It spins GPU containers up on demand and back down to zero between requests, scaling to 50,000+ concurrent sessions. You control concurrency, GPU type, and warm-pool size (min_containers, buffer_containers) in code. This is the right model when each request needs an isolated environment, like running untrusted agent code, rather than sharing one hot model.

Fine-Tuning and Customization

Together has a managed fine-tuning product; Modal gives you the GPUs to run your own.

Together AI offers LoRA and full fine-tuning as a service, priced per 1M training tokens by model size. LoRA SFT runs $0.48 (up to 16B), $1.50 (17B to 69B), and $2.90 (70B to 100B); LoRA DPO runs $0.54, $1.65, and $3.20 across the same tiers. Trained adapters deploy straight onto serverless multi-LoRA at the base model's inference price.

Modal does not ship a fine-tuning API. You run your own training script (Axolotl, TRL, Unsloth, whatever) on Modal GPUs and pay per second. That is more control and more setup. You own the training loop, checkpointing, and serving.

Security and Compliance

Both clear the enterprise bar. Modal's isolation story is built for untrusted code.

Security and Compliance

Capability	Together AI	Modal
SOC 2 Type II	Yes	Yes (from Starter tier)
HIPAA	Supported	Yes (Enterprise)
Sandbox isolation	Multi-tenant API	Isolated container per session
OpenAI-compatible API	Yes	DIY (your server)
Audit logs	Yes	Enterprise
Self-host / reserved capacity	Dedicated endpoints	Your code on their GPUs

Together AI is SOC 2 Type 2 certified, with an independent audit validating access management, encryption, incident response, and change management. Modal is SOC 2 compliant from the Starter tier and supports HIPAA compatibility and audit logs on Enterprise. Modal's differentiator is per-session container isolation, built specifically to run untrusted, model-generated code safely. If your agent executes code it just wrote, that isolation is the feature.

Best Place to Run DeepSeek and Codegen

If your workload is DeepSeek output fidelity or code generation specifically, neither Together nor Modal is the answer to optimize for. Most serverless providers, including Together's baseline, quantize activations to fp8 to cut cost, which degrades output quality. Morph Open Source Models serve DeepSeek with 16-bit (bf16) activations and no fp8/int8 quantization, so responses match the reference weights. That makes Morph the best place to run DeepSeek when output fidelity matters.

For coding agents, Morph runs codegen-specific speculative decoding plus custom low-level inference kernels built for code generation. morph-dsv4flash (DeepSeek V4 Flash) is $0.139/M input and $0.278/M output, and morph-v3-fast applies model-generated edits at ~10,500 tok/s. See full pricing.

Running DeepSeek and codegen

Need	Together AI	Modal	Morph
DeepSeek activations	fp8 baseline	Your config	16-bit (bf16), no quant
DeepSeek V4 Pro in/out	$2.10 / $4.40 per 1M	DIY on GPU	DeepSeek V4 Flash $0.139 / $0.278
Codegen-tuned decoding	General ATLAS	Your own	Code-tuned spec decoding + kernels
Apply throughput	General serving	Your own server	~10,500 tok/s (morph-v3-fast)

When to Use Together AI

You want an OpenAI-compatible model API. 200+ open models behind one base URL, per-token billing, no GPU provisioning.
Your traffic is bursty or low-volume. Per-token pricing with zero idle cost beats paying for a GPU you do not keep busy. Below ~91M tokens/day on a single model, Together is cheaper than a warm Modal H100.
You need fine-tuned adapters at scale. Managed LoRA fine-tuning from $0.48/1M training tokens, served on serverless multi-LoRA at the base model's price.
You want top tokens-per-second on open models. ATLAS delivers up to a 400% speedup on dedicated endpoints at no extra cost.
You need batch throughput cheaply. The batch API runs up to 50,000 requests per batch at up to 50% off serverless rates.

Frequently Asked Questions

Is Modal cheaper than Together AI for inference?

It depends on utilization. Together charges per token, so you pay nothing when idle (Llama 3.3 70B is $1.04 per million tokens). Modal charges per GPU-second whether the card is busy or not (H100 at $0.001097/sec, about $3.95/hr base). For 50M tokens/day Together costs about $52/day (~$1,560/mo) versus a warm Modal H100 at about $94.80/day (~$2,845/mo) before Modal's 1.5 to 1.75x region and up to 3x non-preemptible multipliers. Modal only wins once one warm H100 serves more than the equivalent of about 91M tokens/day, and only if the card can push that volume.

What is the Modal H100 price per second?

Modal lists the H100 at $0.001097 per second, about $3.95 per hour at the base rate. H200 is $0.001261/sec, B200 is $0.001736/sec (~$6.25/hr), and A100 80GB is $0.000694/sec (~$2.50/hr). These are base prices before Modal's 1.5 to 1.75x region selection multiplier and its up to 3x non-preemptible execution multiplier.

Does Modal serve LLMs as an API?

Not as a managed model API. Modal gives you serverless GPUs and a sandbox runtime. To serve an LLM you deploy your own vLLM or SGLang server on Modal and expose it. There is no catalog of hosted chat models behind a single base URL the way Together AI offers 200+ models. Modal is infrastructure; you bring the model.

What are Together AI and Modal's rate limits?

Together AI's serverless rate limits are dynamic per model and scale with sustained traffic; there are no fixed per-model limits published, and for a known fixed limit Together recommends a dedicated endpoint. Modal's scaling is on containers, not request rate: Starter allows 100 containers and 10 GPU concurrency, Team allows 1000 containers and 50 GPU concurrency, and it scales to 50,000+ concurrent sessions.

What is Together AI's ATLAS speculator?

ATLAS (AdapTive-LeArning Speculator System) is Together's runtime-learning speculative decoder. A small draft model proposes tokens that the main model verifies in parallel, and ATLAS retrains the draft on live traffic so it stays aligned with your workload. Together reports up to a 400% speedup over the FP8 baseline, lifting DeepSeek-V3.1 from 105 to 501 tok/s on 4 B200 GPUs. It runs on dedicated endpoints at no extra cost.

Can I use Together AI and Modal together?

Yes, and many teams do. A common split is to call Together AI for hosted token generation on a stock open model, and use Modal to run anything Together does not host: a custom model on your own vLLM or SGLang server, or an isolated container to execute code the model just wrote. You can also fine-tune on Together, export the weights, and serve them on your own Modal GPUs once you outgrow serverless pricing. They sit at different layers, so they compose rather than compete.

Related Comparisons

Together for Tokens, Modal for GPUs, Morph for Codegen

Call a model on Together or run your own stack on Modal. For DeepSeek at 16-bit fidelity and code-tuned inference, Morph serves morph-dsv4flash at $0.139/$0.278 per 1M and applies edits at ~10,500 tok/s.

See Morph Models

Fast Apply benchmarks

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Together AI vs Modal Pricing (2026): $1.04/M Tokens vs $0.001097/GPU-sec, and the ~91M-Token Crossover

Pricing Side by Side: Per Token vs Per GPU-Second

Cost on a Real Workload

Together AI vs Modal H100 Price

Who Wins Per Workload

What Each One Actually Is

Together AI: a token factory

Modal: a serverless GPU runtime

Rate Limits and Throughput

Speed and Cold Starts

Together AI: ATLAS speculator

Modal: cold starts and snapshots

Autoscaling and Scale-to-Zero

Fine-Tuning and Customization

Security and Compliance

Best Place to Run DeepSeek and Codegen

When to Use Together AI

Frequently Asked Questions

Is Modal cheaper than Together AI for inference?

What is the Modal H100 price per second?

Does Modal serve LLMs as an API?

What are Together AI and Modal's rate limits?

What is Together AI's ATLAS speculator?

Can I use Together AI and Modal together?

Related Comparisons

Together for Tokens, Modal for GPUs, Morph for Codegen

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Together AI vs Modal Pricing (2026): $1.04/M Tokens vs $0.001097/GPU-sec, and the ~91M-Token Crossover

Pricing Side by Side: Per Token vs Per GPU-Second

Cost on a Real Workload

Together AI vs Modal H100 Price

Who Wins Per Workload

What Each One Actually Is

Together AI: a token factory

Modal: a serverless GPU runtime

Rate Limits and Throughput

Speed and Cold Starts

Together AI: ATLAS speculator

Modal: cold starts and snapshots

Autoscaling and Scale-to-Zero

Fine-Tuning and Customization

Security and Compliance

Best Place to Run DeepSeek and Codegen

When to Use Together AI

When to Use Modal

Frequently Asked Questions

Is Modal cheaper than Together AI for inference?

What is the Modal H100 price per second?

Does Modal serve LLMs as an API?

What are Together AI and Modal's rate limits?

What is Together AI's ATLAS speculator?

Can I use Together AI and Modal together?

Related Comparisons

Together for Tokens, Modal for GPUs, Morph for Codegen