Baseten vs Modal (2026): Modal H100 at $3.95/hr vs Baseten $6.50/hr

Modal bills an H100 at about $3.95/hr ($0.001097/sec) and an A100 80GB at about $2.50/hr, charged per second. Baseten lists a dedicated H100 at $6.50/hr ($0.10833/min) and an A100 80GB at $4.00/hr. The roughly $1,800/mo gap per H100 at full utilization is not a discount you are leaving on the table. It is the price of a managed serving stack and per-token Model APIs you do not get on Modal.

Baseten is a managed inference platform: call an OpenAI-compatible Model API priced per token, or deploy your own model to a dedicated GPU billed per minute. Modal is general serverless compute: write a Python function, attach a GPU, and pay per second of execution. There is no per-token API on Modal. You bring the model and the serving code.

This page leads with the numbers so you can pick on cost and workload. All prices are as of June 2026 and change often, so confirm on each provider's pricing page before you commit.

~$3.95/hr

Modal H100

$6.50/hr

Baseten H100

Per second

Modal billing

Per token

Baseten Model APIs

TL;DR

Pick Modal if you want the lowest raw GPU rate and per-second billing. H100 at ~$3.95/hr, A100 80GB at ~$2.50/hr, B200 at ~$6.25/hr, plus separate CPU ($0.0000131/core/sec) and memory ($0.00000222/GiB/sec) charges. You write Python and own the serving stack.
Pick Baseten if you want a hosted endpoint without running the server. Per-token Model APIs (GPT-OSS 120B at $0.10/$0.50 per 1M tokens, DeepSeek V4 at $1.74/$3.48), dedicated deployments billed per minute, SOC 2 Type II, and HIPAA.

GPU Pricing Side by Side

Modal is cheaper on every GPU tier because it bills raw compute per second with no managed serving layer. Baseten bundles a tuned serving stack into a higher per-minute rate and adds per-token Model APIs on top.

Dedicated GPU Pricing (June 2026)

GPU	Baseten ($/hr)	Modal ($/hr approx)
T4 (16GB)	$0.63	~$0.59
L4 (24GB)	$0.85	~$0.80
A10 / A10G (24GB)	$1.21	~$1.10
L40S	Available	~$1.95
A100 80GB	$4.00	~$2.50
H100 80GB	$6.50	~$3.95
H200	Available	~$4.54
B200 180GB	$9.98	~$6.25

Modal lists GPUs per second (H100 at $0.001097/sec, H200 at $0.001261/sec, B200 at $0.001736/sec, A100 80GB at $0.000694/sec) and bills CPU ($0.0000131 per physical core/sec) and memory ($0.00000222 per GiB/sec) separately, so a GPU job's true cost is GPU plus CPU plus RAM. Baseten bills dedicated deployments per minute (H100 at $0.10833/min, A100 80GB at $0.06667/min, B200 180GB at $0.16633/min) with no charge when a deployment is scaled to zero.

Baseten also publishes an H100 MIG 40GB partition at $0.0625/min ($3.75/hr) for workloads that fit in 40GB, which undercuts a full H100 on either platform.

Cost of One H100 for a Month

One H100 running around the clock (computed from June 2026 list prices)

Take one H100 serving an open-weight LLM 24/7. On Baseten's dedicated rate of $6.50/hr that is 6.50 x 24 = $156/day, about $4,745/mo over a 730-hour month. On Modal's ~$3.95/hr that same H100 is 3.95 x 24 = $94.80/day, about $2,884/mo before Modal's separate CPU and RAM charges. The raw-GPU gap is roughly $1,800/mo per H100 in Modal's favor at full utilization.

That gap buys the managed engine. Baseten auto-selects and tunes the serving stack (TensorRT-LLM, SGLang, or vLLM) so you do not write kernel-level optimization. If you have a serving expert who can match that throughput on Modal's lower rate, Modal wins on cost per token. If you do not, Baseten's tuning narrows or closes the gap without the engineering hire.

For low or bursty utilization the math flips. Modal bills per second of active execution and scales to zero, so an endpoint that serves traffic a few hours a day costs a fraction of a reserved month. Baseten's scale-to-zero plus per-token Model APIs win the same low-utilization case if you do not want to run a server at all.

Baseten Per-Token Model APIs (Modal Has None)

This is the sharpest split. Baseten ships OpenAI-compatible Model APIs priced per token, so you swap a base URL and start calling. Modal has no per-token equivalent: to serve these models you run vLLM or SGLang yourself and pay for GPU seconds.

Baseten Model API Pricing (per 1M tokens, June 2026)

Model	Input	Output
GPT-OSS 120B	$0.10	$0.50
Nemotron 3 Super	$0.30	$0.75
Nemotron 3 Ultra	$0.60	$2.40
Kimi K2.6	$0.95 ($0.16 cached)	$4.00
GLM 5.1	$1.30 ($0.26 cached)	$4.30
DeepSeek V4	$1.74 ($0.145 cached)	$3.48

Per-token Model APIs win at low or spiky utilization: you pay only for tokens, with no GPU to keep warm. Running the same model on Modal can be cheaper at high steady utilization, where amortizing a per-second H100 across heavy traffic beats per-token rates, but you own the serving stack and the OpenAI-compatible endpoint.

Plan Tiers & Free Credits

Plans (June 2026)

Tier	Baseten	Modal
Entry plan	Basic $0 pay-as-you-go	Starter free
Free credits	Promotional credits for new accounts	$30/mo (Starter), $100/mo (Team)
Mid tier	Pro, volume discounts	Team $250/mo + compute
Concurrency cap	Autoscaling per deployment	10 GPU (Starter), 50 GPU (Team)
Container cap	Per deployment	100 (Starter), 1000 (Team)
Top tier	Enterprise custom	Enterprise custom

Modal's Starter plan is free and includes $30/mo in compute credits (100 containers and 10 GPU concurrency). The Team plan adds $100/mo in credits, 1000 containers, and 50 GPU concurrency for $250/mo plus compute. Baseten's Basic plan has no monthly minimum and is pure pay-as-you-go, with volume discounts negotiated on Pro and Enterprise; new accounts come with free credits to experiment.

Cold Starts & Scale-to-Zero

Both platforms scale to zero by default. The cold-start behavior differs in who pays for the wake-up and how fast it is.

On Modal, functions scale to zero when idle (default scaledown_window of 60s, configurable 2s to 20min) and containers boot in about one second on Modal's container stack. Memory snapshotting captures a container's memory state at user-controlled points and reuses it across boots to cut the cold-start penalty. min_containers keeps warm capacity and buffer_containers over-provisions during traffic spikes.

On Baseten, min_replica=0 is the default, and a deployment scaled to zero incurs no charges. The next request triggers a cold start that can take minutes for large models, and during wake-up billing is per minute even though the replica is not yet serving. Baseten's docs recommend min_replica >= 2 for production to eliminate cold starts. Autoscaling defaults to min 0 / max 1 replicas with a concurrency target of 1 request per replica and a 60s scaling window (configurable 10 to 3600s).

~1 sec

Modal container boot time

min_replica >= 2

Baseten production recommendation

The cost trap is the same on both: a model serving real traffic should not sit at zero replicas with cold starts on the request path. Keep warm capacity (Modal min_containers, Baseten min_replica) for production endpoints and reserve scale-to-zero for dev and low-traffic services.

Who Wins Per Workload

The two platforms win different jobs. Match the row to your workload before comparing list prices.

Who wins per workload

Workload / decision	Baseten	Modal
Lowest raw GPU rate	H100 $6.50/hr	H100 ~$3.95/hr
Hosted per-token endpoint	OpenAI-compatible Model API	Not offered (BYO code)
Own / swap the serving engine	Auto-selected for you	Full control, you tune it
Bursty / spiky traffic	Scale-to-zero, per-minute	Per-second, pay active only
Arbitrary GPU compute (ETL, jobs)	Serving-focused	Any Python function
Tuned LLM throughput, no team	Managed serving stack	Bring your own tuning
HIPAA / regulated workloads	HIPAA + SOC 2 Type II	HIPAA on Enterprise
Fastest cold start	Cold start can take minutes	~1s container boot

Quick Comparison

Baseten serves models for you. Modal gives you the compute. Morph optimizes the coding-agent loop.

Baseten vs Modal vs Morph at a Glance

Spec	Baseten	Modal	Morph
Primary focus	Managed inference	Serverless compute	Coding-agent inner loop
Per-token Model APIs	Yes (OpenAI-compat)	No (BYO code)	Apply / search / compact
Dedicated GPU $/hr	H100 $6.50, A100 $4.00	H100 ~$3.95, A100 ~$2.50	Managed, no GPU rental
Billing granularity	Per token / per minute	Per second	Per request / per token
Scale to zero	Yes (min_replica=0)	Yes (scaledown 60s)	Managed
Code-specific apply endpoint	No	No	/v1/code/apply
Semantic code search	No	No	WarpGrep
Apply throughput	General serving	General serving	~10,500 tok/s
Best for	Hosted model endpoints	Custom compute & cheap GPUs	Coding agents

If you are running DeepSeek for output quality, where you serve it matters. Most serverless providers quantize activations to fp8 to cut cost, which degrades output. Morph serves DeepSeek with 16-bit (bf16) activations and does not quantize activations to fp8, so output matches the reference weights. morph-dsv4flash (DeepSeek V4 Flash) is $0.139 per 1M input tokens and $0.278 per 1M output tokens. For coding agents specifically, Morph runs codegen-tuned speculative decoding plus custom low-level inference kernels, making it the fastest and highest-fidelity option for code generation. See Morph Open Source Models and pricing.

Compliance & Deployment

Both carry SOC 2. Baseten ships HIPAA on its standard plans; Modal gates HIPAA compatibility and audit logs to Enterprise.

Compliance & Deployment (June 2026)

Capability	Baseten	Modal
SOC 2 Type II	Yes	Yes (from Starter)
HIPAA	Yes	Enterprise only
Audit logs	Available	Enterprise
Per-token Model APIs	Yes	You build the endpoint
Billing model	Per token / per minute	Per second

Baseten states SOC 2 Type II and HIPAA compliance on its pricing page. Modal offers SOC 2 from the Starter tier, with HIPAA compatibility and audit logs on Enterprise. For a regulated workload that needs HIPAA without an Enterprise contract, Baseten is the shorter path.

When to Use Baseten

Hosted model endpoints without managing servers. OpenAI-compatible Model APIs (GPT-OSS 120B at $0.10/$0.50, DeepSeek V4 at $1.74/$3.48 per 1M tokens) mean you swap a base URL and start calling.
You do not have a serving expert. Baseten auto-selects and tunes TensorRT-LLM, SGLang, or vLLM, so you inherit serving optimization instead of building it.
HIPAA without an Enterprise contract. SOC 2 Type II and HIPAA are stated on the standard plans.
Low or spiky utilization. Per-token billing plus scale-to-zero means no warm GPU to pay for between requests.

Frequently Asked Questions

Is Baseten or Modal cheaper for an H100?

Modal is cheaper on raw GPU time. As of June 2026, Modal bills an H100 at $0.001097/sec (about $3.95/hr) and an A100 80GB at $0.000694/sec (about $2.50/hr). Baseten lists a dedicated H100 at $0.10833/min ($6.50/hr) and an A100 80GB at $0.06667/min ($4.00/hr). At full utilization that is roughly $2,884/mo per H100 on Modal versus about $4,745/mo on Baseten. Modal bills CPU and memory separately, so add those in.

What is the difference between Baseten and Modal?

Baseten is a managed inference platform. You call OpenAI-compatible Model APIs priced per token, or deploy your own model to a dedicated GPU billed per minute. Modal is general serverless compute: you write Python functions, attach a GPU, and Modal runs them billed per second. Modal has no per-token Model API. Baseten optimizes serving for you; Modal gives you the raw compute to build anything.

Does Modal serve open models per token like Baseten?

No. Modal does not publish per-token Model APIs. You bring your own code and model and pay for the GPU seconds it runs. Baseten offers per-token Model APIs: GPT-OSS 120B at $0.10 input / $0.50 output per 1M tokens, DeepSeek V4 at $1.74 / $3.48, GLM 5.1 at $1.30 / $4.30, Kimi K2.6 at $0.95 / $4.00, and Nemotron 3 Super at $0.30 / $0.75. If you want a hosted endpoint without managing the server, Baseten is the closer fit.

Do Baseten and Modal both scale to zero?

Yes. On Baseten, min_replica=0 is the default and a deployment scaled to zero incurs no charges, but the next request triggers a cold start that can take minutes for large models, and billing is per minute during wake-up before the replica serves. Baseten recommends min_replica >= 2 in production. On Modal, functions scale to zero when idle (default 60s scaledown, configurable 2s to 20min), containers boot in about one second, and min_containers keeps warm capacity.

Should I pick Baseten or Modal to serve an open-weight LLM in production?

Pick Baseten if you want a hosted endpoint without owning the serving stack: per-token Model APIs, dedicated deployments with autoscaling, SOC 2 Type II, and HIPAA. Pick Modal if you have serving expertise and want the lower rate: an H100 at about $3.95/hr versus Baseten's $6.50/hr beats Baseten on raw compute once you bring your own tuned vLLM or SGLang setup, but the serving optimization and the OpenAI-compatible endpoint are yours to build.

Related Comparisons

Modal Wins on GPU Price, Baseten on Managed Serving

Pick Modal for the lowest H100 rate when you own the stack. Pick Baseten for per-token endpoints with no server to run. If applying model-generated code edits is your bottleneck, Morph Fast Apply is the purpose-built layer.

Try Morph Free

Fast Apply benchmarks

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Baseten vs Modal (2026): Modal H100 at $3.95/hr vs Baseten $6.50/hr, and Why That Gap Exists

GPU Pricing Side by Side

Cost of One H100 for a Month

Baseten Per-Token Model APIs (Modal Has None)

Plan Tiers & Free Credits

Cold Starts & Scale-to-Zero

Who Wins Per Workload

Quick Comparison

Compliance & Deployment

When to Use Baseten

Frequently Asked Questions

Is Baseten or Modal cheaper for an H100?

What is the difference between Baseten and Modal?

Does Modal serve open models per token like Baseten?

Do Baseten and Modal both scale to zero?

Should I pick Baseten or Modal to serve an open-weight LLM in production?

Related Comparisons

Modal Wins on GPU Price, Baseten on Managed Serving

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Baseten vs Modal (2026): Modal H100 at $3.95/hr vs Baseten $6.50/hr, and Why That Gap Exists

GPU Pricing Side by Side

Cost of One H100 for a Month

Baseten Per-Token Model APIs (Modal Has None)

Plan Tiers & Free Credits

Cold Starts & Scale-to-Zero

Who Wins Per Workload

Quick Comparison

Compliance & Deployment

When to Use Baseten

When to Use Modal

Frequently Asked Questions

Is Baseten or Modal cheaper for an H100?

What is the difference between Baseten and Modal?

Does Modal serve open models per token like Baseten?

Do Baseten and Modal both scale to zero?

Should I pick Baseten or Modal to serve an open-weight LLM in production?

Related Comparisons

Modal Wins on GPU Price, Baseten on Managed Serving