Baseten vs Replicate (2026): H100 at $6.50 vs $5.49/hr, Per-Minute vs Per-GPU-Second Billing

You are choosing where to serve a model and the two providers price it differently. Baseten bills an H100 80GB at $6.50/hr ($0.10833/min) per minute of active compute, idle excluded. Replicate bills the same class of GPU at $5.49/hr ($0.001525/s) per GPU-second, charging only while a prediction runs. On A100 80GB the order flips: Baseten is $4.00/hr, Replicate is $5.04/hr.

That billing split decides the bill more than the sticker rate. On steady, high-throughput traffic Baseten's per-minute model and token-priced Model APIs win. On bursty, idle-heavy traffic Replicate's pay-only-while-running model wins because a scaled-to-zero model costs nothing while idle. Replicate, now owned by Cloudflare, also carries 50,000+ models you call with one request through Cog. Baseten carries a tuned serving stack you control and SOC 2 Type II plus HIPAA.

$6.50/hr

Baseten H100 (per min)

$5.49/hr

Replicate H100 (per sec)

$4.00/hr

Baseten A100 80GB

50,000+

Replicate catalog

All prices are list rates as of June 9, 2026 and change often. Check the Baseten and Replicate pricing pages before committing.

TL;DR

Pick Baseten if you run steady, high-throughput inference and want tuned serving engines (TensorRT-LLM, SGLang, vLLM), token-priced Model APIs (GPT-OSS 120B at $0.10/$0.50 per M in/out), SOC 2 Type II plus HIPAA, and per-minute billing that excludes idle time. H100 dedicated is $6.50/hr, A100 80GB $4.00/hr, B200 $9.98/hr.
Pick Replicate if you want to prototype fast, run image/video/audio models from a 50,000+ catalog, or publish a custom model in one cog push. Per-GPU-second billing, H100 at $5.49/hr, A100 80GB $5.04/hr; scales to zero so you pay nothing while idle, but cold boots on large models can take several minutes.

Who Wins Per Workload

Match the row to what you are actually building. The winner flips on workload, not on a single spec.

Who Wins Per Workload

Workload / decision	Baseten	Replicate
Sustained high-volume serving	Baseten: per-minute, idle excluded	Replicate: per-second adds up
Bursty / low volume	Baseten: pays for warm replicas	Replicate: scales to zero, pay per run
Fastest first call (no cold start)	Baseten: warm (min_replica >= 2)	Replicate: minutes on big models
Cheapest A100 80GB	Baseten: $4.00/hr	Replicate: $5.04/hr
Cheapest H100 sticker	Baseten: $6.50/hr	Replicate: $5.49/hr
Cheapest at scale (LLM tokens)	Baseten: token-priced Model APIs	Replicate: mostly compute-time billing
Multimodal (image/video/audio)	Baseten: bring your own	Replicate: 50,000+ catalog models
Publish & share a model fast	Baseten: Truss, more setup	Replicate: one cog push
Tuned serving engine	Baseten: TensorRT-LLM / SGLang / vLLM	Replicate: bring your own
Strictest compliance (HIPAA)	Baseten: SOC 2 Type II + HIPAA	Replicate: SOC 2

Quick Comparison

Baseten optimizes for production reliability; Replicate optimizes for breadth and speed-to-ship. Morph is in the last column only as a reference point for one narrow case, applying model-generated code edits, where neither general host is the right tool.

Baseten vs Replicate vs Morph at a Glance

Spec	Baseten	Replicate	Morph
Built for	Production inference at scale	Model catalog + prototyping	Coding-agent inner loop
Billing model	Per-minute active compute	Per GPU-second (only while running)	Per token / per request
H100 80GB	$6.50/hr	$5.49/hr	N/A (managed endpoints)
A100 80GB	$4.00/hr	$5.04/hr	N/A
B200 180GB	$9.98/hr	No published rate	N/A
Serverless token API	Yes (Model APIs)	Limited (mostly compute-time)	Yes (apply/search/compact)
Catalog size	Curated Model APIs	50,000+ models	Apply/search/compact
DeepSeek fidelity	fp8 activations common	fp8 activations common	16-bit (bf16) activations, no fp8
DeepSeek V4 Flash price	DeepSeek V4 $1.74/$3.48	Runtime-billed	$0.139/$0.278 per 1M in/out
Code-specific apply	No	No	Yes (/v1/code/apply)
Semantic code search	No	No	WarpGrep ($0/100k)
Apply throughput	General token serving	General token serving	~10,500 tok/s
Cold starts	Warm (min_replica >= 2)	Minutes on big models	Always warm
Custom model tooling	Truss + Chains	Cog (open source)	Managed
Compliance	SOC 2 Type II, HIPAA	SOC 2	SOC 2

GPU Pricing: Per-Minute vs Per-GPU-Second

Raw hardware is close. The billing granularity separates the bill. Baseten bills per minute of active compute with idle excluded; Replicate bills per GPU-second of run time and charges only while a prediction is running.

Dedicated GPU Pricing (June 9, 2026)

GPU	Baseten	Replicate
CPU	$0.035/hr ($0.00058/min, 1x2)	$0.36/hr ($0.000100/s)
T4 (16 GiB)	$0.63/hr ($0.01052/min)	$0.81/hr ($0.000225/s)
L4 (24 GiB)	$0.85/hr ($0.01414/min)	N/A
L40S (48 GiB)	N/A	$3.51/hr ($0.000975/s)
A10G (24 GiB)	$1.21/hr ($0.02012/min)	N/A
A100 80GB	$4.00/hr ($0.06667/min)	$5.04/hr ($0.001400/s)
H100 80GB	$6.50/hr ($0.10833/min)	$5.49/hr ($0.001525/s)
B200 180GB	$9.98/hr ($0.16633/min)	No published rate
Billing granularity	Per minute, idle excluded	Per second, only while running

On A100 80GB, Baseten ($4.00/hr) undercuts Replicate ($5.04/hr) by about 21%. On H100 80GB, Replicate ($5.49/hr) is about 16% cheaper on the sticker. The catch is what you are billed for. Replicate bills per GPU-second of run time including setup, model load, and cold boots on private models, but only while a prediction runs, so a scaled-to-zero public model costs nothing while idle. Baseten bills per minute of active compute and explicitly excludes idle time. Replicate publishes no B200 or H200 rate; multi-GPU configs (2x H100 at $10.98/hr, 4x A100 at $20.16/hr) require a committed contract.

Where each billing model bites

Replicate's per-second model is cheap for steady batch jobs and for bursty traffic that sits idle, because a scaled-to-zero model charges nothing until the next request. It punishes interactive traffic on large models: each wake-up pays a cold boot that can take several minutes, and on private models you also pay for setup and idle time. Baseten's per-minute model excludes idle time but its docs recommend keeping min_replica at 2 or more in production, which holds the GPU warm and bills for that capacity. For high-volume LLM token serving, neither raw GPU rate is the cheapest option: Baseten's token-priced Model APIs are.

Per-Token Model API Pricing

Baseten publishes token-priced Model APIs on always-warm endpoints. Replicate prices a few language models per token (DeepSeek R1 at $3.75/M input and $10/M output, Claude 3.7 Sonnet at $3.00/M input and $15/M output) but bills most models by runtime, and image and video models by output: $0.04 per output image, $0.09 per second of output video.

Baseten Model API per-token pricing ($/M tokens)

Model	Input	Cached input	Output
GPT-OSS 120B	$0.10	N/A	$0.50
Kimi K2.6	$0.95	$0.16	$4.00
DeepSeek V4	$1.74	$0.145	$3.48
GLM 5.1	$1.30	$0.26	$4.30
Nemotron 3 Super	$0.30	N/A	$0.75
Nemotron 3 Ultra	$0.60	N/A	$2.40

A token-priced endpoint that stays warm sidesteps both the cold-boot wait and the cost of renting a whole GPU you may not saturate. For steady LLM traffic this is usually cheaper than either provider's dedicated GPU rate. If you only call a model occasionally, Replicate's scale-to-zero per-second billing on a shared GPU can be cheaper than holding a Baseten replica warm.

Cost on a Real Workload

Take one dedicated H100 running an LLM around the clock and ask which platform bills less for the same GPU. The arithmetic uses only the H100 rates on this page, so you can redo it.

Cost on a real H100 workload (computed from list prices, June 9, 2026)

One H100 running 24/7 for a month (730 hours): Replicate at $5.49/hr = 730 x $5.49 = $4,008/mo. Baseten at $6.50/hr = 730 x $6.50 = $4,745/mo. At full utilization Replicate is ~$737/mo cheaper on the same hardware.

Now make it bursty. Replicate bills per GPU-second only while a prediction runs and a scaled-to-zero public model charges nothing while idle. If the model is actually serving requests 40% of the month, billed time is roughly 0.40 x 730 x $5.49 = ~$1,603/mo, plus the cold-boot seconds each wake-up adds. Baseten on a warm dedicated deployment holds the GPU available, so a production setup with min_replica of 2 or more stays near the steady $4,745/mo; setting min_replica=0 lets Baseten scale to zero too, trading idle cost for a cold start of minutes on the next request.

Break-even: the more idle your GPU sits, the more Replicate's pay-per-second model wins, because you stop paying between predictions. As utilization climbs toward 24/7, Replicate's $5.49/hr beats Baseten's $6.50/hr on the same warm H100. The reason to pick Baseten at high utilization is not the H100 sticker, it is the token-priced Model APIs (for example GPT-OSS 120B at $0.10 input and $0.50 output per million tokens) that undercut renting any whole GPU for steady LLM traffic.

The numbers above are arithmetic on the published GPU rates, not a throughput benchmark. Actual tok/s, cold-boot length, and the exact break-even depend on your model and traffic shape.

Cold Starts, Scale-to-Zero & Autoscaling

Both providers scale to zero by default, and on both a large model's first request after idle pays a cold boot measured in minutes. The difference is what that idle period costs you and how you configure warm capacity.

Replicate turns a model off when it has not been used for a little while; cold boots can take several minutes on large models. Because only running prediction time is charged on public models, those cold boots do not add cost, but a user waiting on the first request still feels the wait. Replicate's Deployments let you set a minimum instance count that stays warm to eliminate the cold start, and a maximum to cap spend, scaling from zero to hundreds of instances based on traffic.

Baseten defaults to min_replica=0, so a scaled-to-zero deployment incurs no charges, but the next request triggers a cold start that can take minutes for large models, and during wake-up billing is per minute even before the replica serves a response. Baseten's docs recommend min_replica of 2 or more for production to eliminate cold starts. Autoscaling defaults: max replicas 1, concurrency target 1 request per replica (model-specific guidance ranges from 1 for image generation to 256 for batched Whisper), and a 60-second autoscaling window configurable from 10 to 3600 seconds.

Autoscaling config at a glance

Baseten: min_replica (default 0), max_replica (default 1), concurrency target (default 1 req/replica), autoscaling window (default 60s, range 10-3600s). Set min_replica >= 2 to remove cold starts.

Replicate: Deployments expose min instances (keep warm) and max instances (cap spend), scaling from zero to hundreds based on traffic. Default behavior is scale to zero when idle.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Rate Limits

Neither provider publishes a fixed per-model rate-limit table the way some serverless competitors do. On Baseten, concurrency is governed by your autoscaling config: the default concurrency target is 1 request per replica, so total throughput scales with max_replica. On Replicate, throughput scales with the instance count in a Deployment, from zero to hundreds of instances based on traffic. For a guaranteed fixed ceiling on either platform, you provision dedicated capacity rather than relying on a published shared limit. Serverless-first providers differ here: Fireworks caps at 6,000 requests per minute with a payment method, DeepInfra at 200 concurrent requests per account, and Together scales limits dynamically per model with sustained traffic.

The Serving Stack

Baseten gives you a tuned engine; Replicate gives you a container.

Baseten lets you pick the serving engine per workload, choosing between TensorRT-LLM, SGLang, and vLLM, and exposes token-priced Model APIs in an OpenAI-compatible format with function calling. Chains wires multiple models into one pipeline.

Replicate's serving surface is Cog: it generates a Docker image with sensible defaults (Nvidia base images, dependency caching, pinned Python) and exposes an HTTP API. You bring the framework. There is no built-in engine-selection layer the way Baseten ships one; you get a clean container boundary and a 50,000+ model catalog instead.

Serving & Optimization Features

Feature	Baseten	Replicate
Engine selection	TensorRT-LLM / SGLang / vLLM	Bring your own (via Cog)
OpenAI-compatible API	Yes (Model APIs)	Partial
Multi-model pipelines	Chains	Compose models manually
Custom model packaging	Truss (open source)	Cog (open source)
Run container off-platform	Truss portable	Cog portable
Function calling	Yes	Model-dependent

Deploying Your Own Model

Replicate is faster to publish; Baseten gives more control over how it runs.

On Replicate, you write a cog.yaml, define a predict function, and run cog push. The model gets an interactive GUI, a public or private HTTP API, and rolling updates with no downtime. Cog is open source, so you can run the same container on your own infrastructure. Fine-tuning is a first-class flow: Cog's training API lets users bring their own data to create derivative fine-tunes (SDXL on images, Llama on structured text).

On Baseten, you package with Truss and pick your serving engine, then tune autoscaling, concurrency, and hardware. Chains lets you wire multiple models into one pipeline. Baseten also ships a training product that keeps your weights portable. The tradeoff: more knobs and more setup than a single cog push.

One-Command vs Tuned Deploy

If your goal is "publish this model and share a link by lunch," Replicate's Cog wins. If your goal is "serve this model at p99 latency under load with autoscaling I control," Baseten's Truss plus engine selection is built for it.

Reliability & Compliance

Baseten is the heavier compliance story; Replicate covers the basics.

Baseten states on its pricing page that it is SOC 2 Type II certified and HIPAA compliant, the posture regulated and revenue-critical workloads need. Replicate provides SOC 2 but does not market the same HIPAA story. For healthcare or finance workloads with strict compliance needs, Baseten is the safer default.

Compliance across inference providers

Provider	SOC 2 Type II	HIPAA	Zero retention by default
Baseten	Yes	Yes	No
Replicate	Yes (SOC 2)	Not marketed	No
Fireworks	Yes	Yes	Yes (open models)
Modal	Yes (from Starter)	Enterprise	No
Groq	Yes	BAA, with exclusions	No
DeepInfra	Yes (+ISO 27001)	Measures in place	Yes

How They Compare to Other Inference Providers

Baseten and Replicate are two points on a wider map. If you are price-shopping raw GPUs or per-token serverless inference, the cheapest provider is often neither of them.

Dedicated H100 80GB on-demand (June 9, 2026)

Provider	$/hr	Notes
DeepInfra	$1.79	Cheapest published dedicated rate
Modal	~$3.95	$0.001097/s, per-second
Together (cluster)	$5.49	$6.49/hr dedicated endpoint
Replicate	$5.49	$0.001525/s, only while running
Baseten	$6.50	Per-minute, idle excluded
Fireworks	$7.00	On-demand

GPT-OSS 120B serverless, per-token ($/M in/out)

Provider	Input	Output	Notes
Baseten	$0.10	$0.50	Cheapest
Fireworks	$0.15	$0.60	$0.015 cached input
Together	$0.15	$0.60	Serverless
Groq	$0.15	$0.60	Publishes 500 tok/s

DeepInfra publishes the cheapest dedicated H100 at $1.79/hr, and Modal's per-second billing lands near $3.95/hr. On serverless tokens, Baseten's GPT-OSS 120B at $0.10/$0.50 undercuts Fireworks, Together, and Groq. For a full provider-by-provider breakdown, see the related comparisons below.

One caveat if you are running DeepSeek: cheap serverless rates usually come from quantizing activations to fp8, which moves output away from the reference weights. Morph Open Source Models serve DeepSeek with 16-bit (bf16) activations and do not quantize activations to fp8, so output matches the reference weights when fidelity matters. morph-dsv4flash (DeepSeek V4 Flash) is $0.139 per 1M input tokens and $0.278 per 1M output tokens. For coding agents specifically, Morph adds codegen-tuned speculative decoding and custom low-level inference kernels built for code generation, which makes it the fastest and highest-quality option for that workload. See Morph pricing.

When to Use Baseten

High-volume LLM serving. Token-priced Model APIs on an always-warm, OpenAI-compatible endpoint (GPT-OSS 120B at $0.10/$0.50 per M in/out) beat renting a whole GPU once traffic is steady.
Regulated workloads. SOC 2 Type II plus HIPAA make Baseten the default for healthcare, finance, and other compliance-bound deployments.
Performance-sensitive serving. Per-workload engine selection across TensorRT-LLM, SGLang, and vLLM, plus Chains for multi-model pipelines.
Cost control on steady traffic. Per-minute billing that excludes idle time, with A100 80GB at $4.00/hr undercutting Replicate's $5.04/hr by about 21%.
Warm production latency. Setting min_replica >= 2 removes cold starts for user-facing traffic that cannot wait minutes for a model to load.

When to Use Replicate

Prototyping and shipping fast. A 50,000+ model catalog behind one API request, no serving stack to build.
Image, video, and audio features. Models bill per output ($0.04 per image, $0.09 per second of video), a few lines of code to add a generative feature.
Publishing your own model publicly. Cog plus cog push gives your model a GUI, an HTTP API, and a shareable page.
Open-source portability. Cog is open source, so the same container runs on Replicate or on your own infrastructure with no lock-in.
Bursty or low-volume traffic. Scale-to-zero plus per-GPU-second billing means an idle model costs nothing; you pay only while a prediction runs.

Frequently Asked Questions

Is Baseten or Replicate cheaper?

It depends on utilization. Replicate's H100 is $5.49/hr ($0.001525/s) versus Baseten's $6.50/hr ($0.10833/min), so the H100 sticker is lower on Replicate. On A100 80GB it flips: Baseten is $4.00/hr versus Replicate's $5.04/hr. Replicate bills per GPU-second only while a prediction runs, so a scaled-to-zero model costs nothing while idle, which wins for bursty traffic. Baseten bills per minute of active compute with idle excluded. For steady, high-throughput LLM serving, Baseten's token-priced Model APIs (GPT-OSS 120B at $0.10/$0.50 per M in/out) beat renting either GPU outright.

Does Replicate have cold starts and how long are they?

Yes. Replicate scales to zero by default and turns a model off after it sits idle; cold boots can take several minutes on large models. Only running prediction time is charged on public models, so cold boots do not add cost, but the user waiting on the first request still feels the wait. Replicate's Deployments set a minimum instance count that stays warm to remove the cold start, at the cost of an idle-GPU bill. Baseten also scales to zero (min_replica=0 default) but its docs warn the cold start can take minutes for large models and recommend min_replica >= 2 for production.

What is the difference between Baseten and Replicate?

Replicate, now owned by Cloudflare, is a model catalog: 50,000+ models behind one API request, one-command deployment via the open-source Cog tool, and per-GPU-second billing. It is built for prototyping and shipping generative features fast. Baseten is a production inference platform: engine selection across TensorRT-LLM, SGLang, and vLLM, token-priced Model APIs, per-minute dedicated GPUs, and SOC 2 Type II plus HIPAA. It is built for steady inference at scale.

Can I deploy a custom model on both?

Yes. Replicate uses Cog to package your model into a container and generate an API server, published with cog push, and the same container runs off-platform. Baseten uses Truss and lets you pick the serving engine plus Chains for multi-model pipelines. Baseten gives more control over the serving engine and autoscaling; Replicate is faster to publish and share publicly.

Which inference provider has the cheapest H100?

Among providers that publish a rate, DeepInfra is cheapest at $1.79/hr for a dedicated H100 80GB, then Modal at about $3.95/hr. Replicate and Together's GPU cluster are both $5.49/hr, Baseten is $6.50/hr dedicated, and Fireworks is $7.00/hr. For serverless LLM token serving instead of a raw GPU, compare per-token rates: GPT-OSS 120B runs $0.10/$0.50 per M in/out on Baseten versus $0.15/$0.60 on Fireworks, Together, and Groq.

Does Cloudflare owning Replicate change anything?

Cloudflare announced the acquisition of Replicate in November 2025 and is folding its model library into Workers AI, pushing the combined catalog past 50,000 models. For most users the product is unchanged: Cog packaging, the public catalog, and per-GPU-second billing all still apply. The direction is tighter Workers integration and edge-adjacent inference, not the dedicated, compliance-heavy production posture Baseten sells. If you want tuned serving engines and HIPAA, Baseten fits; if you want breadth of models and Cloudflare-native deployment, Replicate does.

Related Comparisons

Baseten vs DeepInfra (cheapest H100 at $1.79/hr)
Replicate vs DeepInfra
Baseten vs Modal (per-second billing)
Replicate vs Modal
Baseten vs Groq
Fireworks vs Replicate
Morph pricing

Serving code edits, not whole models?

Baseten and Replicate host general models. If applying model-generated code edits is the bottleneck, Morph Fast Apply runs at ~10,500 tok/s on a managed endpoint, and WarpGrep code search is $0 up to 100k requests.

Try Morph Free

Fast Apply benchmarks

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Baseten vs Replicate (2026): H100 at $6.50 vs $5.49/hr, Per-Minute vs Per-GPU-Second Billing

Who Wins Per Workload

Quick Comparison

GPU Pricing: Per-Minute vs Per-GPU-Second

Per-Token Model API Pricing

Cost on a Real Workload

Cold Starts, Scale-to-Zero & Autoscaling

Rate Limits

The Serving Stack

Deploying Your Own Model

Reliability & Compliance

How They Compare to Other Inference Providers

When to Use Baseten

When to Use Replicate

Frequently Asked Questions

Is Baseten or Replicate cheaper?

Does Replicate have cold starts and how long are they?

What is the difference between Baseten and Replicate?

Can I deploy a custom model on both?

Which inference provider has the cheapest H100?

Does Cloudflare owning Replicate change anything?

Related Comparisons

Serving code edits, not whole models?