Together AI vs Modal: Call a Model vs Run Your Own GPU Stack

Together AI sells finished tokens (Llama 3.3 70B at $1.04/M, OpenAI-compatible API, ATLAS speculator, fine-tune and export). Modal sells GPU seconds (H100 at $0.001097/sec, gVisor sandboxes, bring your own model and engine). Pick Together to call or fine-tune a model; pick Modal to run your own serving stack or arbitrary GPU jobs. Comparing their per-token price is a category error.

June 3, 2026 · 1 min read

Together AI and Modal both call themselves serverless, both run on the same Nvidia GPUs, and both end up on the same shortlist. They solve different problems.

Together AI is a managed token API. You send messages to an OpenAI-compatible endpoint, pick from 200+ open models, and pay per token. Llama 3.3 70B costs $1.04 per million tokens (input and output, as of early 2026). You never touch a GPU.

Modal is a serverless compute platform. You bring your own code and model weights, Modal runs them in a gVisor sandbox, and you pay per GPU-second. An H100 is $0.001097/sec. There is no managed chat model behind a URL: if you want token generation, you deploy vLLM or SGLang yourself.

Pick Together to call or fine-tune a model and get tokens back. Pick Modal to run your own serving stack or arbitrary GPU jobs. Comparing their per-token price is a category error, because only one of them prices in tokens at all.

TL;DR

  • Pick Together AI if you want to call an open model through an OpenAI-compatible API and pay per token. 200+ models, per-token billing with zero idle cost, the ATLAS speculator for up to 400% faster inference, and serverless multi-LoRA.
  • Pick Modal if you want raw serverless GPUs and a secure sandbox to run your own code or models. H100 at $0.001097/sec, gVisor isolation, scale-to-zero, memory snapshots that cut cold starts up to 10x, and 50,000+ concurrent sessions.

Who Wins Per Workload

The choice is rarely "which is better" and almost always "which fits this job." Map your workload to a row.

Workload / decisionTogether AIModal
Call a stock open model fastTogether (one base URL, 200+ models)Modal (you build the server)
Bursty / low-volume trafficTogether (per token, $0 idle)Modal (pays for warm GPUs)
Sustained, saturated throughputTogether (per-token markup)Modal (per-second, you saturate)
Run your own model or engineTogether (catalog only)Modal (BYO vLLM / SGLang)
Fine-tune and export weightsTogether (managed LoRA + full)Modal (run your own trainer)
Execute untrusted / agent codeTogether (no sandbox)Modal (gVisor per session)
Multi-LoRA at the base priceTogether (serverless multi-LoRA)Modal (DIY adapter loading)
Arbitrary non-LLM GPU jobsTogether (inference only)Modal (any Python on a GPU)

Quick Comparison

The fastest way to see the split: Together sells finished tokens, Modal sells GPU time, and Morph sells a code-edit loop.

SpecTogether AIModalMorph
What you buyTokensGPU secondsCode-edit loop
Primary unitPer 1M tokensPer GPU-secondPer request / token
Managed model APIYes, 200+ modelsNo (bring your own)Yes, code-specific
Code-specific applyNoNoYes (/v1/code/apply)
Semantic code searchNoNoWarpGrep
Apply throughputGeneral servingYour own server10,500 tok/s
First-pass apply accuracyN/AN/A98%
Llama 3.3 70B price$1.04 / 1MDIY on GPUN/A
H100 price$6.49 / hr (dedicated)$0.001097 / secN/A
Sandbox / code executionNogVisor, 50k+ sessionsBuilt for sandbox loop
Best forToken inferenceBYO compute + sandboxCoding agents

What Each One Actually Is

The single most important decision here is whether you want to see a GPU or not.

Together AI: a token factory

Together AI is an inference API. You hit an OpenAI-compatible endpoint, name a model, and get tokens back. The catalog covers 200+ open models (Llama, DeepSeek, Qwen, Kimi, and more) plus embeddings, image, and audio models. Together also runs dedicated endpoints where a model is pinned to reserved GPUs, and a fine-tuning service. The point of Together is that you never provision, patch, or saturate a GPU yourself.

Modal: a serverless GPU runtime

Modal is a compute platform built for secure sandboxed execution at scale, with GPU support on top. You define infrastructure in Python, TypeScript, or Go (no YAML), and Modal runs it in gVisor-isolated containers with a custom filesystem and scheduler. Modal serves over 10,000 teams and scales to 50,000+ concurrent sessions. It is not a model catalog. If you want to serve Llama, you deploy your own vLLM or SGLang server on Modal GPUs and expose the port.

The clean dividing line

Together AI answers "give me tokens from model X." Modal answers "run my container on a GPU, securely, and scale it." If your app is "call an LLM," Together is less work. If your app is "run arbitrary code or a custom model," Modal is the right primitive.

Pricing: Per Token vs Per GPU-Second

These two price on different axes, so a head-to-head only makes sense at a fixed workload.

ItemTogether AIModal
Llama 3.3 70B$1.04 / 1M tokensRun it yourself on GPU
Llama 3 8B Lite$0.14 / 1M tokensRun it yourself on GPU
H100 80GB$6.49 / hr (dedicated)$0.001097 / sec ($3.95/hr)
H200Dedicated quote$0.001261 / sec
A100 80GBDedicated quote$0.000694 / sec ($2.50/hr)
B200 180GB$11.95 / hr (dedicated)$0.001736 / sec
Idle cost$0 (per token)$0 (scale-to-zero)
Batch discount50% off serverlessN/A
Free creditsTrial credits$30 / month (Starter)

Together AI bills consumption only: no subscriptions, no setup fees, no minimums. The per-token rate already bundles their serving stack, so you do not separately pay for utilization. The batch API cuts serverless rates 50% when you do not need real-time responses.

Modal bills per GPU-second of actual execution, which is cheaper than Together's dedicated H100 rate on paper ($3.95/hr vs $6.49/hr). But Modal's headline price is the base. Region selection adds a 1.5 to 1.75x multiplier, and non-preemptible execution adds up to 3x. A non-preemptible US H100 lands well above the sticker. You also pay for every second the GPU is up, busy or not, so cost efficiency depends entirely on how well you saturate the card.

$1.04
Together: Llama 3.3 70B / 1M tokens
$0.001097
Modal: H100 base / second
50%
Together batch API discount

Which is actually cheaper

Cheaper depends on utilization, not on a sticker price. For bursty or low-volume traffic, Together wins: you pay per token and nothing when idle. Above the crossover, where you keep an H100 saturated with a model you have tuned, Modal's per-second rate undercuts a per-token bill. The worked scenario below puts a number on that crossover.

Cost on a Real Workload

Take one concrete job: serving Llama 3.3 70B at 50M output tokens per day. Here is the arithmetic, using only the list prices already on this page (computed from list prices, June 2026).

  • Together AI, serverless tokens: 50M tokens/day at $1.04 per 1M = 50 × $1.04 = $52/day, about $1,560/mo. Zero ops, scales to zero, no GPU to keep warm.
  • Modal, dedicated H100: one H100 at $0.001097/sec = $3.95/hr = $94.80/day = about $2,845/mo if you hold it warm 24/7. That before Modal's region (1.5 to 1.75x) and non-preemptible (up to 3x) multipliers, which push a always-on US H100 well past $3,000/mo. You also still build and run the vLLM or SGLang server yourself.

So at 50M tokens/day, Together is cheaper outright (~$1,560 vs ~$2,845+) and involves no serving stack. Modal only wins once one warm H100 serves more than ~$2,845/mo worth of tokens, which is roughly 91M tokens/day at Together's $1.04/M rate, assuming the card can actually push that volume. Below that throughput, Together's per-token bill is lower; above it, owning the GPU on Modal pays off. The break-even moves further in Modal's favor if you batch (Together's 50% batch discount roughly doubles the crossover) or further against it once you add Modal's region and non-preemptible multipliers.

Speed and Cold Starts

Together optimizes tokens-per-second; Modal optimizes container-start time. Different metrics for different jobs.

Together AI: ATLAS speculator

Together's headline speed feature is ATLAS, a runtime-learning speculative decoder. A small draft model proposes tokens, the main model verifies them in parallel, and ATLAS keeps retraining the draft model on live production traffic so it stays aligned as your workload shifts. Together reports up to 400% faster inference versus baseline serving, around 500 tok/s on DeepSeek-V3.1 and 460 tok/s on Kimi-K2 fully adapted, a 2.65x speedup over standard decoding. ATLAS ships on dedicated endpoints at no extra cost.

Modal: cold starts and snapshots

Modal's speed story is cold start. Its custom filesystem and scheduler target sub-second container starts for CPU and a few seconds for typical GPU models. Memory snapshots restore a pre-initialized process state and cut cold starts up to 10x for initialization-heavy workloads. Typical numbers: 2 to 5 seconds for small models, 15 to 30 seconds for 7B+ models without snapshots. Since Modal scales to zero, cold start is the cost you pay for not holding idle GPUs.

~400%
Together ATLAS speedup vs baseline
~10x
Modal cold-start cut via snapshots

Autoscaling and Scale-to-Zero

Both scale to zero, but the unit they scale is different.

Together AI's serverless tier has no scaling for you to manage: it is a shared, multi-tenant token API that absorbs your traffic. Its dedicated endpoints autoscale reserved GPU capacity for a single model with predictable latency. Serverless multi-LoRA lets you deploy hundreds of fine-tuned adapters behind one base model, with cross-LoRA continuous batching and adapter prefetching so adapters load without blowing GPU memory, all at the base model's price.

Modal autoscales containers. It spins GPU containers up on demand and back down to zero between requests, scaling to 50,000+ concurrent sessions. You control concurrency, GPU type, and warm-pool size in code. This is the right model when each request needs an isolated environment, like running untrusted agent code, rather than sharing one hot model.

Fine-Tuning and Customization

Together has a managed fine-tuning product; Modal gives you the GPUs to run your own.

Together AI offers LoRA and full fine-tuning as a service, with per-token training pricing tiered by model size (roughly $0.48 to $8.00 per 1M tokens for standard models up to 100B). Trained adapters deploy straight onto serverless multi-LoRA at the base model's inference price, keeping up to 90% of base performance. Models with the -Reference suffix can be fine-tuned but not served as dedicated endpoints.

Modal does not ship a fine-tuning API. You run your own training script (Axolotl, TRL, Unsloth, whatever) on Modal GPUs and pay per second. That is more control and more setup. You own the training loop, checkpointing, and serving.

Security and Compliance

Both clear the enterprise bar; Modal's isolation story is built for untrusted code.

CapabilityTogether AIModal
SOC 2 Type IIYesYes
HIPAAYes (BAA available)Yes (Enterprise)
Sandbox isolationMulti-tenant APIgVisor per-session
OpenAI-compatible APIYesDIY (your server)
Structured outputs / JSON schemaYesDepends on your model
Self-host / VPCDedicated endpointsYour code on their GPUs

Together AI is SOC 2 Type 2 compliant and supports HIPAA with BAAs, encryption in transit and at rest, and audit logging. It also exposes structured outputs with JSON-schema constrained decoding for reliable agent and extraction workflows.

Modal is SOC 2 Type II audited with HIPAA support for eligible Enterprise workloads. Its differentiator is gVisor container isolation per session, which is built specifically to run untrusted, model-generated code safely. If your agent executes code it just wrote, that isolation is the feature.

When to Use Together AI

  • You want an OpenAI-compatible model API. 200+ open models behind one base URL, per-token billing, no GPU provisioning. The least work to get an LLM responding.
  • Your traffic is bursty or low-volume. Per-token pricing with zero idle cost beats paying for a GPU you do not keep busy.
  • You need fine-tuned adapters at scale. Serverless multi-LoRA serves hundreds of adapters at the base model's price with cross-LoRA continuous batching.
  • You want top tokens-per-second on open models. ATLAS delivers up to 400% faster inference on dedicated endpoints at no extra cost.
  • You need batch throughput cheaply. The batch API runs billions of tokens at 50% off serverless rates.

When to Use Modal

  • You need raw serverless GPUs. Per-second billing on T4 through B200, scale-to-zero, no minimums. You control the runtime end to end.
  • You run a custom or unsupported model. Bring your own weights and serving stack (vLLM, SGLang, TensorRT-LLM) and deploy in Python.
  • You execute untrusted code. gVisor isolation and 50,000+ concurrent sessions make Modal a strong sandbox for agent-generated code.
  • Cold start matters. Memory snapshots cut initialization-heavy cold starts up to 10x, so scale-to-zero stays responsive.
  • You want infra-as-code, not YAML. Define GPUs, concurrency, and scaling directly in Python, TypeScript, or Go.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Frequently Asked Questions

What is the difference between Together AI and Modal?

Together AI is a managed token API: call an OpenAI-compatible endpoint, choose from 200+ open models, pay per token (Llama 3.3 70B is $1.04 per million tokens as of early 2026). Modal is a serverless compute platform: bring your own code and model weights, run them in a gVisor sandbox, pay per GPU-second (H100 at $0.001097/sec). Together abstracts the GPU away. Modal hands you the GPU.

Is Modal cheaper than Together AI for inference?

It depends on utilization. Together charges per token, so you pay nothing when idle. Modal charges per GPU-second whether the card is busy or not, so it only wins when you keep the GPU saturated. For bursty traffic, Together is usually cheaper. Note Modal's multipliers: 1.5 to 1.75x for region and up to 3x for non-preemptible execution.

Does Modal serve LLMs as an API?

Not as a managed model API. Modal gives you serverless GPUs and a sandbox runtime. To serve an LLM you deploy your own vLLM or SGLang server on Modal and expose it. There is no catalog of hosted chat models behind a single base URL. Modal is infrastructure; you bring the model.

What is Together AI's ATLAS speculator?

ATLAS (AdapTive-LeArning Speculator System) is Together's runtime-learning speculative decoder. A small draft model proposes tokens that the main model verifies in parallel, and ATLAS retrains the draft on live traffic so it stays aligned with your workload. Together reports up to 400% faster inference, around 500 tok/s on DeepSeek-V3.1. It runs on dedicated endpoints at no extra cost.

Can I use Together AI and Modal together?

Yes, and many teams do. A common split is to call Together AI for hosted token generation on a stock open model, and use Modal to run anything Together does not host: a custom or unsupported model on your own vLLM or SGLang server, or a gVisor sandbox to execute code the model just wrote. You can also fine-tune on Together, export the weights, and serve them on your own Modal GPUs once you outgrow serverless pricing. They sit at different layers, so they compose rather than compete.

Related Comparisons

Together for Tokens, Modal for GPUs

Call a model on Together or run your own stack on Modal. If applying model-generated edits is your bottleneck, Morph Fast Apply merges them at ~10,500 tok/s.