Baseten vs Modal: Managed Inference vs Raw GPU Compute (Pick by Stack Ownership)

Baseten auto-tunes the serving engine and ships a model fast with production guardrails. Modal hands you per-second Python-native GPUs and gVisor sandboxes to own the stack. The right pick is decided by who owns the runtime, not by price.

June 3, 2026 · 1 min read

Pick Baseten to deploy a model fast with production guardrails: it auto-selects and tunes the serving engine for you (TensorRT-LLM, SGLang, or vLLM), ships per-token Model APIs and Chains, and offers VPC and HIPAA. Pick Modal to own the runtime or run arbitrary GPU and code workloads: it gives you lower-level Python-native compute billed per second, memory snapshots, and gVisor sandboxes for untrusted code. The decision is who owns the stack, not who is cheaper.

Baseten and Modal both run AI workloads on serverless GPUs, but they sit at different layers. Baseten is a managed inference platform: you point your OpenAI SDK at a Model API, or deploy your own model to a dedicated GPU with autoscaling, and Baseten tunes the serving stack for you. Modal is general serverless compute: you write Python, decorate a function with a GPU, and Modal runs it per second with sub-second cold starts and gVisor sandboxes.

The split matters. Baseten optimizes the model-serving path for you and bills per token or per GPU-minute. Modal hands you the raw compute and lets you build anything, from a fine-tuning job to a coding-agent sandbox, billed per second of execution. All prices below are as of early 2026 and change often, so confirm on each provider's pricing page before you commit.

TL;DR

  • Pick Baseten if you want a managed endpoint without running the server. Per-token Model APIs (OpenAI-compatible), dedicated deployments with TensorRT-LLM and speculative decoding, LoRA-aware routing, SOC 2 Type II, HIPAA, and VPC deployment.
  • Pick Modal if you want raw per-second compute you script in Python. H100 at roughly $3.95/hr, Memory Snapshots for 3-10x faster cold starts, and gVisor Sandboxes with official SWE-bench integration for code execution.

Who Wins Per Workload

The same two layers win different jobs. Match the row to your workload before you compare list prices.

Workload / decisionBasetenModal
Ship a hosted endpoint fastBaseten: OpenAI-compatible Model APIModal: you build the server
Own / swap the serving engineBaseten: auto-selected for youModal: full control, you tune it
Lowest raw GPU rateBaseten: H100 $6.50/hrModal: H100 ~$3.95/hr
Bursty / spiky trafficBaseten: scale-to-zero per minuteModal: per-second, pay active only
Run untrusted / agent codeBaseten: not offeredModal: gVisor Sandboxes
Strictest compliance / VPCBaseten: HIPAA + VPC + hybridModal: cloud-only today
Tuned LLM throughput, no teamBaseten: TensorRT-LLM + spec decodeModal: bring your own tuning
Multi-LoRA servingBaseten: LoRA-aware routingModal: you wire the routing
Arbitrary GPU compute (ETL, jobs)Baseten: serving-focusedModal: any Python function

Quick Comparison

Baseten serves models for you. Modal gives you the compute. Morph optimizes the coding-agent loop.

SpecBasetenModalMorph
Primary focusManaged inferenceServerless computeCoding-agent inner loop
Per-token Model APIsYes (OpenAI-compat)No (BYO code)Apply / search / compact
Dedicated GPU $/hrH100 $6.50, A100 $4.00H100 ~$3.95, A100 ~$2.50Managed, no GPU rental
Billing granularityPer token / per minutePer second / per msPer request / per token
Code-specific apply endpointNoNo/v1/code/apply
Semantic code searchNoNoWarpGrep
Apply throughputGeneral servingGeneral serving~10,500 tok/s
First-pass apply accuracyN/AN/A98%
Code sandboxesNogVisor SandboxesN/A (inference layer)
Best forHosted model endpointsCustom compute & sandboxesCoding agents

Platform Model: Managed Serving vs Raw Compute

The core difference is who owns the serving stack.

Baseten is the inference cloud. You can call a Model API with an OpenAI-compatible request and never touch a server, or you can package a model with Truss and deploy it to a dedicated GPU. Either way, Baseten benchmarks TensorRT-LLM, SGLang, and vLLM per workload and picks the fastest path, and it can scale a single model across regions and clouds with a multi-cloud capacity manager. Chains lets you wire multiple models into one workflow with independent autoscaling per step.

Modal is serverless compute. You write a Python function, attach gpu="H100", and Modal containerizes and runs it. There is no per-token API and no opinion about how you serve a model. That generality is the point: the same primitive runs a fine-tuning job, a batch ETL pipeline, a web endpoint, or a coding-agent sandbox.

60%+
Baseten TensorRT-LLM throughput gain (reported)
Per-sec
Modal billing granularity

Practically: if you want a model endpoint without managing infrastructure, Baseten does more for you. If you want a programmable compute primitive to build on, Modal gets out of your way.

Pricing: Per-Token and Per-Minute vs Per-Second

Modal is cheaper on raw GPU time. Baseten bundles a tuned serving stack into a higher GPU rate and adds per-token Model APIs.

GPUBaseten ($/hr)Modal ($/hr approx)
T4 (16GB)$0.63~$0.59
A10 / A10G$1.21~$1.10
L40SAvailable~$1.95
A100 80GB$4.00~$2.50
H100 80GB$6.50~$3.95
H200Available~$4.54
B200$9.98~$6.25

Modal lists GPUs per second (H100 at $0.001097/sec, A100 80GB at $0.000694/sec) and bills CPU ($0.0000131 per core/sec) and memory ($0.00000222 per GiB/sec) separately, so a GPU job's true cost is GPU plus CPU plus RAM. Baseten bills dedicated deployments per minute with no idle charge when scaled to zero.

Baseten Model APIs (per 1M tokens)

ModelInputOutput
GPT-OSS 120B$0.10$0.50
DeepSeek V3.1$0.50$1.50
DeepSeek V4$1.74$3.48
GLM 5.1$1.30$4.30
Nemotron 3 Super$0.30$0.75

Modal has no per-token equivalent. To serve those same models on Modal, you run vLLM or SGLang yourself and pay for GPU seconds. That can be cheaper at high steady utilization and more expensive at low utilization, where Baseten's scale-to-zero plus per-token billing wins.

Free credits

Modal gives every account $30/mo in free compute credits ($100/mo on the Team plan at $250/mo). Baseten's Basic plan has no monthly minimum, with volume discounts negotiated on Pro and Enterprise. New Baseten accounts typically start with promotional credits as well.

Cost on a Real Workload

Cost on a real workload (computed from list prices, early 2026)

Take one H100 serving an open-weight LLM around the clock. On Baseten's dedicated rate of $6.50/hr that is 6.50 x 24 = $156/day, or about $4,680/mo (730 hrs x $6.50 = $4,745 at a full month). On Modal's ~$3.95/hr that same H100 is 3.95 x 24 = $94.80/day, about $2,884/mo (730 hrs x $3.95 = $2,884) before Modal's separate CPU and RAM charges. So the raw-GPU gap is roughly $1,800/mo per H100 in Modal's favor at full utilization.

That gap is the price of the managed engine. Baseten reports TensorRT-LLM throughput gains above 60% on some models. If tuning lets one Baseten H100 do the work of 1.6 untuned Modal H100s, Modal's 1.6 H100s cost 1.6 x $2,884 = $4,614/mo, which lands right at Baseten's $4,680. Break-even: below about a 60% throughput edge from Baseten's stack, Modal wins on cost per token; above it, Baseten's tuning closes the gap. The crossover assumes you have no in-house serving expert. If you do, you can tune Modal yourself and keep the lower rate.

Cold Starts & Autoscaling

Modal's Memory Snapshots are the most concrete cold-start lever between the two.

Modal checkpoints a container's memory with CRIU and restores it on later boots, skipping initialization. Modal reports initialization-heavy functions starting 3-10x faster from snapshots, with GPU memory snapshots (alpha) hitting up to 10x. Containers also benefit from an optimized filesystem so large images do not stall startup. Autoscaling is automatic from zero, with GPU concurrency capped by plan (10 on Starter, 50 on Team).

Baseten advertises sub-second cold starts and 99.99% uptime on its managed stack, with scale-to-zero and configurable autoscaling policies per deployment. Because Baseten owns the serving path, cold-start tuning happens inside the platform rather than something you script. LoRA-aware routing keeps adapters warm by sending requests to replicas that already hold a given LoRA in memory.

3-10x
Modal cold start speedup (Memory Snapshots)
99.99%
Baseten advertised uptime

Serving Performance: Where Baseten Differentiates

For LLM throughput, Baseten ships more optimization out of the box.

Baseten built production speculative decoding on TensorRT-LLM and reports throughput gains above 60% for some customer models. Its Embeddings Inference (BEI) targets high-throughput RAG and search, and its Whisper implementation runs over 12x faster than the reference OpenAI implementation. The platform auto-selects between TensorRT-LLM, SGLang, and vLLM per workload, so you inherit kernel-level tuning without writing it.

Modal gives you the same frameworks, but you install and tune them. You can absolutely run vLLM with speculative decoding on Modal, but the optimization work is yours. Modal's advantage is flexibility: any framework, any model, any custom pre/post-processing, all in one Python file.

Who tunes the kernels?

Baseten tunes the serving stack for you and prices that in. Modal expects you to bring tuned serving code. If you have a serving expert on staff, Modal's lower GPU rate plus your own TensorRT-LLM setup can beat Baseten on cost per token. If you do not, Baseten's managed stack closes that gap without the engineering.

Sandboxes & Code Execution: Modal Wins

For running untrusted code, Modal is the clear pick.

Modal Sandboxes spin up an isolated container in one line of Python, then exec arbitrary commands inside. Isolation uses gVisor (runsc), a userspace kernel, so agent-generated code cannot touch the host kernel directly. Modal upstreamed official SWE-bench support: a --modal flag runs the full 500-task Verified benchmark in about 7 minutes in the cloud. Filesystem Snapshots persist sandbox state past the 24-hour limit and enable large fan-out experiments from a saved checkpoint.

Baseten does not offer arbitrary code-execution sandboxes. It is built to serve models, not to run an agent's shell. If your agent needs a filesystem and a CLI to execute generated code, Modal is the infrastructure layer for that step.

7 min
Modal: 500-task SWE-bench Verified run
gVisor
Modal sandbox isolation (runsc)

One caveat for security-sensitive teams: Modal uses gVisor isolation rather than Firecracker microVMs, and on-prem deployment is not offered today. For multi-tenant untrusted code at the highest isolation bar, evaluate whether gVisor meets your threat model.

Fine-Tuning & Training

Both run training jobs; Baseten makes the promote-to-production path shorter.

Baseten Training runs multi-node fine-tuning jobs that promote to production endpoints in the same platform, billed at the same GPU rates as inference. Combined with LoRA-aware routing, you can host many adapters behind one deployment and route by adapter. Modal runs training as just another GPU function: full control, any framework, but you wire serving and routing yourself afterward.

If your loop is fine-tune then serve then iterate, Baseten removes glue work. If training is a one-off batch job feeding a model you serve elsewhere, Modal's per-second compute is the cheaper place to run it.

Compliance & Deployment

Baseten goes further on enterprise deployment options.

CapabilityBasetenModal
SOC 2 Type IIYesYes
HIPAAYesAvailable
VPC / self-hostedCloud, VPC, or hybridNot offered today
Multi-cloud / multi-regionCapacity manager across cloudsMulti-region
OpenAI-compatible APIYes (fully)You build the endpoint
On-premHybrid optionsNo

Baseten supports deployment in its cloud, in a customer's VPC, or hybrid for data-residency needs, and spreads capacity across clouds. For regulated workloads that must stay inside your own network, Baseten's VPC and hybrid options are the differentiator. Modal is cloud-only today, with no BYOC or on-prem path.

When to Use Baseten

  • Hosted model endpoints without managing servers. OpenAI-compatible Model APIs let you swap a base URL and start calling. No deployment for popular open models.
  • High-throughput LLM serving. TensorRT-LLM with production speculative decoding, reported 60%+ throughput gains, and auto-selection across TensorRT-LLM, SGLang, and vLLM.
  • Multi-LoRA products. LoRA-aware routing keeps adapters warm and routes requests to replicas that already hold the right adapter.
  • Regulated or data-residency workloads. SOC 2 Type II, HIPAA, and VPC or hybrid deployment inside your own network.
  • Embeddings and audio at scale. BEI for high-throughput embeddings and reranking, plus Whisper transcription over 12x faster than the OpenAI reference.

When to Use Modal

  • Custom Python compute, not just model serving. Attach a GPU to any function. Run training, batch ETL, web endpoints, and inference from one primitive.
  • Bursty or spiky traffic. Per-second billing and automatic scale-from-zero mean you pay only for active execution, often the cleanest economics for uneven load.
  • Coding-agent and code-execution sandboxes. gVisor Sandboxes, filesystem snapshots, and official SWE-bench integration (500 tasks in about 7 minutes).
  • Initialization-heavy cold starts. Memory Snapshots restore container state for 3-10x faster boots, with GPU snapshots in alpha.
  • Lowest raw GPU rate. H100 at roughly $3.95/hr and A100 80GB at about $2.50/hr beat Baseten's dedicated list prices when you bring your own tuned serving code.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Frequently Asked Questions

What is the difference between Baseten and Modal?

Baseten is a managed inference platform. You call OpenAI-compatible Model APIs priced per token, or deploy your own model to a dedicated GPU and Baseten tunes the serving stack with TensorRT-LLM and speculative decoding. Modal is general serverless compute: you write Python functions, attach a GPU, and Modal runs them billed per second. Baseten optimizes serving for you. Modal gives you the raw compute to build anything.

Is Baseten or Modal cheaper for GPU inference?

It depends on the workload. As of early 2026, Modal bills an H100 at roughly $3.95/hr and an A100 80GB at about $2.50/hr. Baseten dedicated deployments list H100 at $6.50/hr and A100 80GB at $4.00/hr but bundle a tuned serving stack. For steady high-throughput LLM serving, Baseten's TensorRT-LLM optimizations can lower cost per token. For bursty or custom compute, Modal's per-second billing is often cheaper because you only pay for active execution.

Does Modal serve open models per token like Baseten?

No. Modal does not publish per-token Model APIs. You bring your own code and model and pay for the GPU seconds it runs. Baseten offers per-token Model APIs (for example GPT-OSS 120B at about $0.10 input / $0.50 output per 1M tokens, DeepSeek V3.1 at $0.50 / $1.50 as of early 2026) plus dedicated deployments. If you want a hosted endpoint without managing the server, Baseten is the closer fit.

Which has faster cold starts, Baseten or Modal?

Both are fast. Modal uses Memory Snapshots (checkpoint/restore via CRIU) to skip initialization, reporting 3-10x faster cold starts, with GPU memory snapshots in alpha hitting up to 10x. Baseten advertises sub-second cold starts and 99.99% uptime on its managed inference stack with scale-to-zero. For initialization-heavy custom code, Modal's snapshots are the bigger lever. For hosted model endpoints, Baseten handles cold starts inside its platform.

Should I pick Baseten or Modal to serve an open-weight LLM in production?

Pick Baseten if you want a production endpoint without owning the serving stack: it auto-selects between TensorRT-LLM, SGLang, and vLLM, ships a per-token Model API or a tuned dedicated deployment, and gives you SOC 2 Type II, HIPAA, and VPC deployment. Pick Modal if you have serving expertise and want to own the runtime: its lower GPU rate (H100 ~$3.95/hr vs Baseten $6.50/hr as of early 2026) beats Baseten on raw compute once you bring your own tuned vLLM or SGLang setup, but the kernel-level optimization work and the OpenAI-compatible endpoint are yours to build. Stack ownership decides it, not list price.

Related Comparisons

Baseten Ships the Model, Modal Owns the Stack

Pick by who owns the runtime, not by list price. If applying model-generated code edits is your bottleneck, Morph Fast Apply is the purpose-built layer.