Together AI and Modal both call themselves serverless, both run on the same Nvidia GPUs, and both end up on the same shortlist. They solve different problems.
Together AI is a managed token API. You send messages to an OpenAI-compatible endpoint, pick from 200+ open models, and pay per token. Llama 3.3 70B costs $1.04 per million tokens (input and output, as of early 2026). You never touch a GPU.
Modal is a serverless compute platform. You bring your own code and model weights, Modal runs them in a gVisor sandbox, and you pay per GPU-second. An H100 is $0.001097/sec. There is no managed chat model behind a URL: if you want token generation, you deploy vLLM or SGLang yourself.
Pick Together to call or fine-tune a model and get tokens back. Pick Modal to run your own serving stack or arbitrary GPU jobs. Comparing their per-token price is a category error, because only one of them prices in tokens at all.
TL;DR
- Pick Together AI if you want to call an open model through an OpenAI-compatible API and pay per token. 200+ models, per-token billing with zero idle cost, the ATLAS speculator for up to 400% faster inference, and serverless multi-LoRA.
- Pick Modal if you want raw serverless GPUs and a secure sandbox to run your own code or models. H100 at $0.001097/sec, gVisor isolation, scale-to-zero, memory snapshots that cut cold starts up to 10x, and 50,000+ concurrent sessions.
Who Wins Per Workload
The choice is rarely "which is better" and almost always "which fits this job." Map your workload to a row.
| Workload / decision | Together AI | Modal |
|---|---|---|
| Call a stock open model fast | Together (one base URL, 200+ models) | Modal (you build the server) |
| Bursty / low-volume traffic | Together (per token, $0 idle) | Modal (pays for warm GPUs) |
| Sustained, saturated throughput | Together (per-token markup) | Modal (per-second, you saturate) |
| Run your own model or engine | Together (catalog only) | Modal (BYO vLLM / SGLang) |
| Fine-tune and export weights | Together (managed LoRA + full) | Modal (run your own trainer) |
| Execute untrusted / agent code | Together (no sandbox) | Modal (gVisor per session) |
| Multi-LoRA at the base price | Together (serverless multi-LoRA) | Modal (DIY adapter loading) |
| Arbitrary non-LLM GPU jobs | Together (inference only) | Modal (any Python on a GPU) |
Quick Comparison
The fastest way to see the split: Together sells finished tokens, Modal sells GPU time, and Morph sells a code-edit loop.
| Spec | Together AI | Modal | Morph |
|---|---|---|---|
| What you buy | Tokens | GPU seconds | Code-edit loop |
| Primary unit | Per 1M tokens | Per GPU-second | Per request / token |
| Managed model API | Yes, 200+ models | No (bring your own) | Yes, code-specific |
| Code-specific apply | No | No | Yes (/v1/code/apply) |
| Semantic code search | No | No | WarpGrep |
| Apply throughput | General serving | Your own server | 10,500 tok/s |
| First-pass apply accuracy | N/A | N/A | 98% |
| Llama 3.3 70B price | $1.04 / 1M | DIY on GPU | N/A |
| H100 price | $6.49 / hr (dedicated) | $0.001097 / sec | N/A |
| Sandbox / code execution | No | gVisor, 50k+ sessions | Built for sandbox loop |
| Best for | Token inference | BYO compute + sandbox | Coding agents |
What Each One Actually Is
The single most important decision here is whether you want to see a GPU or not.
Together AI: a token factory
Together AI is an inference API. You hit an OpenAI-compatible endpoint, name a model, and get tokens back. The catalog covers 200+ open models (Llama, DeepSeek, Qwen, Kimi, and more) plus embeddings, image, and audio models. Together also runs dedicated endpoints where a model is pinned to reserved GPUs, and a fine-tuning service. The point of Together is that you never provision, patch, or saturate a GPU yourself.
Modal: a serverless GPU runtime
Modal is a compute platform built for secure sandboxed execution at scale, with GPU support on top. You define infrastructure in Python, TypeScript, or Go (no YAML), and Modal runs it in gVisor-isolated containers with a custom filesystem and scheduler. Modal serves over 10,000 teams and scales to 50,000+ concurrent sessions. It is not a model catalog. If you want to serve Llama, you deploy your own vLLM or SGLang server on Modal GPUs and expose the port.
The clean dividing line
Together AI answers "give me tokens from model X." Modal answers "run my container on a GPU, securely, and scale it." If your app is "call an LLM," Together is less work. If your app is "run arbitrary code or a custom model," Modal is the right primitive.
Pricing: Per Token vs Per GPU-Second
These two price on different axes, so a head-to-head only makes sense at a fixed workload.
| Item | Together AI | Modal |
|---|---|---|
| Llama 3.3 70B | $1.04 / 1M tokens | Run it yourself on GPU |
| Llama 3 8B Lite | $0.14 / 1M tokens | Run it yourself on GPU |
| H100 80GB | $6.49 / hr (dedicated) | $0.001097 / sec ($3.95/hr) |
| H200 | Dedicated quote | $0.001261 / sec |
| A100 80GB | Dedicated quote | $0.000694 / sec ($2.50/hr) |
| B200 180GB | $11.95 / hr (dedicated) | $0.001736 / sec |
| Idle cost | $0 (per token) | $0 (scale-to-zero) |
| Batch discount | 50% off serverless | N/A |
| Free credits | Trial credits | $30 / month (Starter) |
Together AI bills consumption only: no subscriptions, no setup fees, no minimums. The per-token rate already bundles their serving stack, so you do not separately pay for utilization. The batch API cuts serverless rates 50% when you do not need real-time responses.
Modal bills per GPU-second of actual execution, which is cheaper than Together's dedicated H100 rate on paper ($3.95/hr vs $6.49/hr). But Modal's headline price is the base. Region selection adds a 1.5 to 1.75x multiplier, and non-preemptible execution adds up to 3x. A non-preemptible US H100 lands well above the sticker. You also pay for every second the GPU is up, busy or not, so cost efficiency depends entirely on how well you saturate the card.
Which is actually cheaper
Cheaper depends on utilization, not on a sticker price. For bursty or low-volume traffic, Together wins: you pay per token and nothing when idle. Above the crossover, where you keep an H100 saturated with a model you have tuned, Modal's per-second rate undercuts a per-token bill. The worked scenario below puts a number on that crossover.
Cost on a Real Workload
Take one concrete job: serving Llama 3.3 70B at 50M output tokens per day. Here is the arithmetic, using only the list prices already on this page (computed from list prices, June 2026).
- Together AI, serverless tokens: 50M tokens/day at $1.04 per 1M = 50 × $1.04 = $52/day, about $1,560/mo. Zero ops, scales to zero, no GPU to keep warm.
- Modal, dedicated H100: one H100 at $0.001097/sec = $3.95/hr = $94.80/day = about $2,845/mo if you hold it warm 24/7. That before Modal's region (1.5 to 1.75x) and non-preemptible (up to 3x) multipliers, which push a always-on US H100 well past $3,000/mo. You also still build and run the vLLM or SGLang server yourself.
So at 50M tokens/day, Together is cheaper outright (~$1,560 vs ~$2,845+) and involves no serving stack. Modal only wins once one warm H100 serves more than ~$2,845/mo worth of tokens, which is roughly 91M tokens/day at Together's $1.04/M rate, assuming the card can actually push that volume. Below that throughput, Together's per-token bill is lower; above it, owning the GPU on Modal pays off. The break-even moves further in Modal's favor if you batch (Together's 50% batch discount roughly doubles the crossover) or further against it once you add Modal's region and non-preemptible multipliers.
Speed and Cold Starts
Together optimizes tokens-per-second; Modal optimizes container-start time. Different metrics for different jobs.
Together AI: ATLAS speculator
Together's headline speed feature is ATLAS, a runtime-learning speculative decoder. A small draft model proposes tokens, the main model verifies them in parallel, and ATLAS keeps retraining the draft model on live production traffic so it stays aligned as your workload shifts. Together reports up to 400% faster inference versus baseline serving, around 500 tok/s on DeepSeek-V3.1 and 460 tok/s on Kimi-K2 fully adapted, a 2.65x speedup over standard decoding. ATLAS ships on dedicated endpoints at no extra cost.
Modal: cold starts and snapshots
Modal's speed story is cold start. Its custom filesystem and scheduler target sub-second container starts for CPU and a few seconds for typical GPU models. Memory snapshots restore a pre-initialized process state and cut cold starts up to 10x for initialization-heavy workloads. Typical numbers: 2 to 5 seconds for small models, 15 to 30 seconds for 7B+ models without snapshots. Since Modal scales to zero, cold start is the cost you pay for not holding idle GPUs.
Autoscaling and Scale-to-Zero
Both scale to zero, but the unit they scale is different.
Together AI's serverless tier has no scaling for you to manage: it is a shared, multi-tenant token API that absorbs your traffic. Its dedicated endpoints autoscale reserved GPU capacity for a single model with predictable latency. Serverless multi-LoRA lets you deploy hundreds of fine-tuned adapters behind one base model, with cross-LoRA continuous batching and adapter prefetching so adapters load without blowing GPU memory, all at the base model's price.
Modal autoscales containers. It spins GPU containers up on demand and back down to zero between requests, scaling to 50,000+ concurrent sessions. You control concurrency, GPU type, and warm-pool size in code. This is the right model when each request needs an isolated environment, like running untrusted agent code, rather than sharing one hot model.
Fine-Tuning and Customization
Together has a managed fine-tuning product; Modal gives you the GPUs to run your own.
Together AI offers LoRA and full fine-tuning as a service, with per-token training pricing tiered by model size (roughly $0.48 to $8.00 per 1M tokens for standard models up to 100B). Trained adapters deploy straight onto serverless multi-LoRA at the base model's inference price, keeping up to 90% of base performance. Models with the -Reference suffix can be fine-tuned but not served as dedicated endpoints.
Modal does not ship a fine-tuning API. You run your own training script (Axolotl, TRL, Unsloth, whatever) on Modal GPUs and pay per second. That is more control and more setup. You own the training loop, checkpointing, and serving.
Security and Compliance
Both clear the enterprise bar; Modal's isolation story is built for untrusted code.
| Capability | Together AI | Modal |
|---|---|---|
| SOC 2 Type II | Yes | Yes |
| HIPAA | Yes (BAA available) | Yes (Enterprise) |
| Sandbox isolation | Multi-tenant API | gVisor per-session |
| OpenAI-compatible API | Yes | DIY (your server) |
| Structured outputs / JSON schema | Yes | Depends on your model |
| Self-host / VPC | Dedicated endpoints | Your code on their GPUs |
Together AI is SOC 2 Type 2 compliant and supports HIPAA with BAAs, encryption in transit and at rest, and audit logging. It also exposes structured outputs with JSON-schema constrained decoding for reliable agent and extraction workflows.
Modal is SOC 2 Type II audited with HIPAA support for eligible Enterprise workloads. Its differentiator is gVisor container isolation per session, which is built specifically to run untrusted, model-generated code safely. If your agent executes code it just wrote, that isolation is the feature.
When to Use Together AI
- You want an OpenAI-compatible model API. 200+ open models behind one base URL, per-token billing, no GPU provisioning. The least work to get an LLM responding.
- Your traffic is bursty or low-volume. Per-token pricing with zero idle cost beats paying for a GPU you do not keep busy.
- You need fine-tuned adapters at scale. Serverless multi-LoRA serves hundreds of adapters at the base model's price with cross-LoRA continuous batching.
- You want top tokens-per-second on open models. ATLAS delivers up to 400% faster inference on dedicated endpoints at no extra cost.
- You need batch throughput cheaply. The batch API runs billions of tokens at 50% off serverless rates.
When to Use Modal
- You need raw serverless GPUs. Per-second billing on T4 through B200, scale-to-zero, no minimums. You control the runtime end to end.
- You run a custom or unsupported model. Bring your own weights and serving stack (vLLM, SGLang, TensorRT-LLM) and deploy in Python.
- You execute untrusted code. gVisor isolation and 50,000+ concurrent sessions make Modal a strong sandbox for agent-generated code.
- Cold start matters. Memory snapshots cut initialization-heavy cold starts up to 10x, so scale-to-zero stays responsive.
- You want infra-as-code, not YAML. Define GPUs, concurrency, and scaling directly in Python, TypeScript, or Go.
Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).
Frequently Asked Questions
What is the difference between Together AI and Modal?
Together AI is a managed token API: call an OpenAI-compatible endpoint, choose from 200+ open models, pay per token (Llama 3.3 70B is $1.04 per million tokens as of early 2026). Modal is a serverless compute platform: bring your own code and model weights, run them in a gVisor sandbox, pay per GPU-second (H100 at $0.001097/sec). Together abstracts the GPU away. Modal hands you the GPU.
Is Modal cheaper than Together AI for inference?
It depends on utilization. Together charges per token, so you pay nothing when idle. Modal charges per GPU-second whether the card is busy or not, so it only wins when you keep the GPU saturated. For bursty traffic, Together is usually cheaper. Note Modal's multipliers: 1.5 to 1.75x for region and up to 3x for non-preemptible execution.
Does Modal serve LLMs as an API?
Not as a managed model API. Modal gives you serverless GPUs and a sandbox runtime. To serve an LLM you deploy your own vLLM or SGLang server on Modal and expose it. There is no catalog of hosted chat models behind a single base URL. Modal is infrastructure; you bring the model.
What is Together AI's ATLAS speculator?
ATLAS (AdapTive-LeArning Speculator System) is Together's runtime-learning speculative decoder. A small draft model proposes tokens that the main model verifies in parallel, and ATLAS retrains the draft on live traffic so it stays aligned with your workload. Together reports up to 400% faster inference, around 500 tok/s on DeepSeek-V3.1. It runs on dedicated endpoints at no extra cost.
Can I use Together AI and Modal together?
Yes, and many teams do. A common split is to call Together AI for hosted token generation on a stock open model, and use Modal to run anything Together does not host: a custom or unsupported model on your own vLLM or SGLang server, or a gVisor sandbox to execute code the model just wrote. You can also fine-tune on Together, export the weights, and serve them on your own Modal GPUs once you outgrow serverless pricing. They sit at different layers, so they compose rather than compete.
Related Comparisons
Together for Tokens, Modal for GPUs
Call a model on Together or run your own stack on Modal. If applying model-generated edits is your bottleneck, Morph Fast Apply merges them at ~10,500 tok/s.