Replicate vs Modal (2026): H100 $3.95/hr vs $5.49/hr, Cold Starts ~5s vs 60s+

You are choosing where to run a model, and the decision comes down to one question: do you want to deploy an existing model in minutes, or run your own code on GPUs as cheaply as possible? Replicate answers the first, Modal answers the second.

Replicate is a model catalog: package a model with Cog, push it, and call it through one API, with thousands of community models already deployed. Modal is a Python-native serverless cloud: write a function, decorate it with a GPU, and Modal containerizes and autoscales it.

The split shows up in two numbers. Modal's H100 runs $0.001097/sec ($3.95/hr); Replicate's H100 runs $0.001525/sec ($5.49/hr), about 39% more. And Modal's containers boot in about 1 second, with GPU memory snapshots cutting a vLLM cold start from 45s to roughly 5s, while Replicate private deployments commonly see 60s+ cold boots.

Pick Replicate when you want a model running as an endpoint today, especially image and video, with a ready GUI to test and share. Pick Modal when you want cheaper, more flexible compute and are willing to write the serving code. Prices below are list rates as of June 9, 2026; confirm on each provider's pricing page.

TL;DR

Pick Replicate if you want to deploy an existing model in minutes. A huge Cog catalog, Official Models with output-based pricing, one-line LoRA fine-tunes, and a clean API make it the fastest path from idea to a working endpoint.
Pick Modal if you run your own code and care about cost and cold starts. Lower per-second GPU rates, GPU memory snapshots for ~10x faster cold starts, sandboxes, and a Python-native SDK make it the better production serverless platform.

Who Wins Per Workload

The split is package-and-run a model versus general serverless GPU compute. Here is who wins for the decisions developers actually make.

Workload / decision	Replicate	Modal
Deploy an off-the-shelf model now	Replicate: one API call, no code	Modal: write serving code
Image / audio / video generation	Replicate: huge catalog + GUI	Modal: DIY
Run your own arbitrary code	Replicate: must fit Cog	Modal: any Python function
Cheapest per GPU-hour	Replicate: H100 $5.49/hr	Modal: H100 $3.95/hr
Fastest cold start (large model)	Replicate: 60s+	Modal: ~5s (snapshots)
Spiky, low-volume calls	Replicate: per-token Official Models	Modal: per-second GPU
Sandboxes for agent code	Replicate: none	Modal: isolated sandboxes
Share a model via web GUI	Replicate: built-in	Modal: build it yourself
Batch / parallel fan-out jobs	Replicate: prediction queues	Modal: per-second fan-out

Quick Comparison

Replicate wins on catalog breadth and time-to-first-call. Modal wins on price, cold starts, and control. Morph is in the third column for reference: it is a managed inference layer tuned for code generation and for running DeepSeek at full fidelity, not a general GPU host.

Replicate vs Modal vs Morph at a Glance

Spec	Replicate	Modal	Morph
What it is	Model catalog + API	Python serverless GPU	Managed model inference for codegen + DeepSeek
Primary unit	Cog model	Python function	Per-token model API
Billing model	Per GPU-second / per-token	Per GPU-second	Per token
H100 ($/hr)	$5.49	$3.95	N/A, token-priced
A100 80GB ($/hr)	$5.04	$2.50	N/A, token-priced
Cold start (large model)	60s+	~5s (snapshots)	Always warm
Bring your own arbitrary model	Yes (Cog)	Yes (any Python)	No, fixed model menu
DeepSeek activations	Depends on model	Depends on your code	16-bit (bf16), no fp8/int8
Best for	Deploying existing models	Custom code at scale	Codegen agents + DeepSeek fidelity

Pricing: Modal Undercuts Replicate on Raw GPU Rates

On per-second GPU rates, Modal is the cheaper host: between 27% and 50% cheaper on the four GPUs both providers list (T4, L40S, A100 80GB, H100). Both bill per second and scale to zero, so you pay for active compute only.

GPU Pricing (per hour, list rates June 9, 2026)

GPU	Replicate	Modal
Nvidia T4	$0.81	$0.59
Nvidia L4	N/A	$0.80
Nvidia A10 / A10G	N/A	$1.10
Nvidia L40S	$3.51	$1.95
Nvidia A100 80GB	$5.04	$2.50
Nvidia H100	$5.49	$3.95
Nvidia H200	N/A	$4.54
Nvidia B200	N/A	$6.25
Free credits	No	$30 / month

Modal converts directly from its per-second rates: $0.001097/sec for H100, $0.001261/sec for H200, $0.001736/sec for B200, $0.000694/sec for A100 80GB, $0.000164/sec for T4. CPU is $0.0000131 per physical core per second and memory is $0.00000222 per GiB per second, with no charge for idle resources. The Starter tier is free with $30/month in credits (100 containers, 10 GPU concurrency); Team is $250/month plus compute with $100/month in credits (1000 containers, 50 GPU concurrency).
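The per-second to per-hour conversion above is easy to check by hand. The sketch below multiplies each quoted per-second rate by 3,600 and rounds to cents; the dictionary and function names are illustrative, and the rates are the ones quoted in this article, not values fetched from either provider.

```python
# Per-second list rates quoted in this article (USD per GPU-second).
MODAL_PER_SECOND = {
    "T4": 0.000164,
    "A100 80GB": 0.000694,
    "H100": 0.001097,
    "H200": 0.001261,
    "B200": 0.001736,
}

SECONDS_PER_HOUR = 3600


def hourly_rate(per_second: float) -> float:
    """Convert a per-second GPU rate to dollars per hour, rounded to cents."""
    return round(per_second * SECONDS_PER_HOUR, 2)


def percent_more(higher: float, lower: float) -> float:
    """How much more `higher` costs than `lower`, as a percentage."""
    return (higher / lower - 1) * 100


if __name__ == "__main__":
    for gpu, rate in MODAL_PER_SECOND.items():
        print(f"{gpu}: ${hourly_rate(rate):.2f}/hr")
    # Replicate's H100 per-second rate versus Modal's, as quoted above.
    print(f"Replicate H100 premium: {percent_more(0.001525, 0.001097):.0f}%")
```

Run against the quoted rates, this reproduces the hourly column of the table and the roughly 39% H100 premium cited in the introduction.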

Replicate has a second pricing track. Some language models bill per token instead of per second: DeepSeek R1 lists at $3.75 per million input tokens and $0.01 per thousand output tokens (so $10 per million output), and Claude 3.7 Sonnet at $3.00 per million input and $0.015 per thousand output (so $15 per million output). Image and video models bill per output, around $0.04 per output image or $0.09 per second of output video. For spiky, low-volume calls against a popular model, that output-based pricing can beat renting a whole GPU on either host.
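One rough way to see when output-based pricing beats renting a GPU is to compute the break-even call volume. The sketch below is a simplification: it assumes the GPU is billed for the whole hour and ignores whether one GPU could actually produce that many outputs, and the function names are illustrative.

```python
def per_million(per_thousand_usd: float) -> float:
    """Convert a per-thousand-token price to a per-million-token price."""
    return per_thousand_usd * 1000


def breakeven_outputs_per_hour(gpu_hourly_usd: float, price_per_output_usd: float) -> float:
    """Outputs per hour at which per-output billing costs as much as one GPU-hour."""
    if price_per_output_usd <= 0:
        raise ValueError("price_per_output_usd must be positive")
    return gpu_hourly_usd / price_per_output_usd


if __name__ == "__main__":
    # List prices quoted in this article: ~$0.04 per image, H100 hourly rates.
    for host, hourly in {"Replicate H100": 5.49, "Modal H100": 3.95}.items():
        n = breakeven_outputs_per_hour(hourly, 0.04)
        print(f"{host}: per-image pricing is cheaper below ~{n:.0f} images/hour")
```

Below the break-even volume, paying per output is the cheaper option under these assumptions; above it, a rented GPU starts to win, provided it can sustain that throughput.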

The hidden cost is idle time

On Replicate private models you pay for setup, idle, and active time (fast-booting fine-tunes are the exception, billing active time only). Modal's pitch is the inverse: pay by the CPU cycle, never for idle resources. The cheaper sticker rate plus faster wakes is why bursty production traffic tends to land cheaper on Modal.
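To make the idle-time point concrete, here is a simplified model of the two billing shapes described above: one that bills setup, idle, and active seconds, and one that bills active seconds only. The durations in the example are hypothetical, chosen only to show the effect; they are not measurements from either platform.

```python
def effective_hourly_rate(
    list_hourly_usd: float,
    active_s: float,
    setup_s: float = 0.0,
    idle_s: float = 0.0,
) -> float:
    """List rate scaled by billed time over useful (active) time."""
    if active_s <= 0:
        raise ValueError("active_s must be positive")
    billed_s = setup_s + idle_s + active_s
    return list_hourly_usd * billed_s / active_s


if __name__ == "__main__":
    # Hypothetical burst: 60s setup, 120s idle tail, 30s of real work.
    all_phases = effective_hourly_rate(5.04, active_s=30, setup_s=60, idle_s=120)
    active_only = effective_hourly_rate(2.50, active_s=30)
    print(f"setup+idle+active billing: ${all_phases:.2f} per active GPU-hour")
    print(f"active-only billing:       ${active_only:.2f} per active GPU-hour")
```

The shorter each burst is relative to its setup and idle tail, the further the effective rate drifts above the sticker rate.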

Cost on a Real Workload

Both hosts bill per GPU-second, so the question is not price-per-token but how many GPU-hours your traffic actually burns. Take one workload: a 13B model that fits on a single A100 80GB, serving a feature that draws steady traffic 8 hours a day on weekdays and scales to zero otherwise (about 176 active GPU-hours per month).

Cost on a real workload (computed from list prices, June 2026)

176 active GPU-hours/mo on one A100 80GB: Modal at $2.50/hr = ~$440/mo; Replicate at $5.04/hr = ~$887/mo. Modal is ~$447/mo cheaper for the same active compute.
Always-on, 24/7 (720 GPU-hours/mo): Modal = 720 x $2.50 = ~$1,800/mo; Replicate = 720 x $5.04 = ~$3,629/mo.
Break-even on Modal: a pinned, always-on A100 costs $1,800/mo (720 x $2.50). Scale-to-zero at the same $2.50/hr costs less in any month with fewer than 720 active GPU-hours, so the two only converge at effectively 100% utilization.
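The figures above can be reproduced with a few lines of arithmetic. This sketch assumes 22 weekdays in the month (which gives the 176 hours used above) and a 30-day month for the always-on case; the names are illustrative.

```python
# A100 80GB list rates quoted in this article (USD per GPU-hour).
A100_80GB_HOURLY = {"Modal": 2.50, "Replicate": 5.04}

WEEKDAY_HOURS = 8 * 22    # 8 hours a day, assuming 22 weekdays per month
ALWAYS_ON_HOURS = 24 * 30  # a pinned instance over a 30-day month


def monthly_cost(hourly_usd: float, active_hours: float) -> float:
    """Cost of the billed GPU-hours at a flat hourly rate."""
    return hourly_usd * active_hours


def utilization(active_hours: float, month_hours: float = ALWAYS_ON_HOURS) -> float:
    """Fraction of the month the GPU is actually busy."""
    return active_hours / month_hours


if __name__ == "__main__":
    for host, rate in A100_80GB_HOURLY.items():
        weekday = monthly_cost(rate, WEEKDAY_HOURS)
        pinned = monthly_cost(rate, ALWAYS_ON_HOURS)
        print(f"{host}: ${weekday:,.0f}/mo scale-to-zero, ${pinned:,.0f}/mo always-on")
    print(f"Weekday workload utilization: {utilization(WEEKDAY_HOURS):.0%}")
```

At roughly a quarter of the month busy, scale-to-zero costs about a quarter of a pinned instance on either host, which is why the comparison hinges on the hourly rate rather than the billing model.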

At every utilization level Modal's lower per-hour rate carries through, so the dollar gap is proportional to active hours. The case for Replicate here is not price; it is that you skip writing and maintaining the serving code, and you get a model URL and GUI on day one. Modal has the lower rate at any volume, but the savings only matter once active hours are high enough to outweigh the engineering time you spend wiring up Modal yourself.

Running DeepSeek or Codegen Models

Neither Replicate nor Modal is the right answer when the workload is a DeepSeek endpoint or a coding agent and you care about output quality, not just GPU rental. On Replicate you rent or call a model whose serving config you do not control; on Modal you write that serving code yourself and own every quantization and kernel decision. For most teams that means accepting whatever fp8 activation path the catalog or your vLLM config defaults to.

Morph serves DeepSeek with 16-bit (bf16) activations, with no fp8 or int8 quantization, so responses match the reference weights. Most serverless providers quantize activations to fp8 to cut cost, which degrades output quality. When DeepSeek output fidelity matters, that is the difference between the reference model and a cheaper approximation of it.

For code generation specifically, Morph runs codegen-tuned speculative decoding plus custom low-level inference kernels built for code output. That makes it the fastest and highest-quality option for coding agents, not a general-purpose GPU menu. morph-dsv4flash (DeepSeek V4 Flash) is priced at $0.139 per 1M input tokens and $0.278 per 1M output tokens, billed per token with no GPU to keep warm. See the full menu on Morph Models and rates on pricing.
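Per-token billing is simple to estimate once you know your token volumes. The sketch below uses the morph-dsv4flash rates quoted above; the token counts in the example are hypothetical.

```python
# Rates quoted in this article for morph-dsv4flash (USD per 1M tokens).
INPUT_USD_PER_MILLION = 0.139
OUTPUT_USD_PER_MILLION = 0.278


def token_cost(input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for a given number of input and output tokens."""
    return (
        input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    )


if __name__ == "__main__":
    # Hypothetical month: 2M input tokens and 0.5M output tokens.
    print(f"${token_cost(2_000_000, 500_000):.3f}")
```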

Cold Starts: Modal's Snapshots Change the Math

This is Modal's biggest technical edge. Cold starts decide whether scale-to-zero is usable in production or just a billing trap.

Modal's containers boot in about 1 second on its custom container stack, and functions scale to zero when idle (default max idle 60s, configurable from 2s to 20 minutes via scaledown_window). Memory snapshotting captures the state of a container's memory at user-controlled points and reuses it across boots, cutting the cold-start penalty further: a vLLM server that previously took 45s to boot restores in about 5s. min_containers keeps warm capacity and buffer_containers over-provisions during spikes.
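The three autoscaling knobs named above can be collected in one place and range-checked before use. This is a plain-Python sketch, not Modal SDK code: the class and method names are hypothetical, the bounds come from the 2s to 20 minute range stated above, and passing the resulting dictionary as keyword arguments to a Modal function decorator is an assumption to confirm against Modal's documentation.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AutoscalerSettings:
    """Hypothetical holder for the scaling parameters named in the text."""

    scaledown_window: int = 60   # seconds a container may idle before scaling down
    min_containers: int = 0      # warm containers kept running at all times
    buffer_containers: int = 0   # extra containers provisioned during spikes

    def __post_init__(self) -> None:
        # The article gives 2 seconds to 20 minutes as the configurable range.
        if not 2 <= self.scaledown_window <= 20 * 60:
            raise ValueError("scaledown_window must be between 2 and 1200 seconds")
        if self.min_containers < 0 or self.buffer_containers < 0:
            raise ValueError("container counts cannot be negative")

    def as_kwargs(self) -> dict:
        """Return the settings as a dict of keyword arguments."""
        return asdict(self)


if __name__ == "__main__":
    # Scale to zero after 5 idle minutes, with one spare container during spikes.
    settings = AutoscalerSettings(scaledown_window=300, buffer_containers=1)
    print(settings.as_kwargs())
```

Setting min_containers above zero trades the scale-to-zero savings for guaranteed warm capacity, the same tradeoff a warm minimum makes on any host.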

Replicate's answer is operational, not architectural: a deployment can keep a minimum number of warm instances so requests never hit a cold boot, scaling from zero to hundreds of instances based on traffic. That works, but a warm minimum erases the scale-to-zero savings, because you now pay for a GPU 24/7. Left at zero, Replicate cold boots run from 60s+ to several minutes for large models. On catalog models Replicate charges only for running prediction time, so a cold boot adds latency but not cost; private models bill for setup, idle, and active time, except fast-booting fine-tunes, which bill active processing only.

Modal container boot: ~1s
Modal vLLM cold start (snapshots): ~5s
Replicate cold boot (large model): minutes

For a model that wakes from zero on every burst, this gap is the difference between a usable serverless endpoint and one you have to keep warm by hand.

Developer Experience: Catalog vs Code

The two tools want you to do different things, and the SDKs reflect that.

Replicate is built around Cog, an open-source tool that packages a model and its dependencies into a container with a standard prediction interface. Push the container, get a private API endpoint, call it from any language. For thousands of community and Official Models you skip packaging entirely and call an existing model in one HTTP request. This is the fastest route from "I want to run Flux" to a working URL.
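To illustrate what "one HTTP request" looks like, here is a minimal sketch that builds such a request with the Python standard library, without sending it. The endpoint path, the bearer-token header, and the model slug are assumptions based on Replicate's public HTTP API as commonly documented; verify them against the current API reference before relying on them.

```python
import json
import os
import urllib.request

API_BASE = "https://api.replicate.com/v1"  # assumed base URL


def build_prediction_request(model: str, model_input: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks a hosted model for a prediction."""
    url = f"{API_BASE}/models/{model}/predictions"
    body = json.dumps({"input": model_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_prediction_request(
        "black-forest-labs/flux-schnell",  # example slug, treat as a placeholder
        {"prompt": "a lighthouse at dusk"},
        os.environ.get("REPLICATE_API_TOKEN", "<your-token>"),
    )
    print(request.get_method(), request.full_url)
    # Sending is left out on purpose: urllib.request.urlopen(request)
```

The point of the sketch is the shape of the call: no container, no serving code, just a model identifier and an input payload.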

Modal is Python-first. You write ordinary Python, decorate a function with a GPU and an image, and Modal containerizes, deploys, and autoscales it. No Dockerfile to hand-write, no separate packaging step. The tradeoff: there is no pre-built catalog to click, so you are responsible for wiring up the serving code (vLLM, a model loader, your own logic). For teams that want arbitrary code on GPUs rather than a fixed model interface, Modal's DX is the draw.

Rule of thumb

If the thing you want to run already exists as a model, Replicate gets you there faster. If the thing you want to run is your own code, batch job, or custom pipeline, Modal gets you there cleaner.

Fine-Tuning & Deployments

Replicate leans into managed fine-tuning; Modal gives you the GPUs to do it yourself.

Replicate offers one-line LoRA fine-tuning for image models like FLUX.1 and SDXL: point its API at your images and it trains and hosts the result. Fine-tuned LoRAs run with no cold boot on top of the loaded base model, so a tuned variant is instantly callable. Deployments give you a private, dedicated endpoint with min and max instance controls: set the minimum to 1 for always-on, or to 0 for scale-to-zero.

Modal does not ship a managed fine-tuning product. Instead you run your own training job as a Python function on its GPUs, with full control over framework, dataset, and checkpoints, billed per second like any other workload. That is more setup, but there is no ceiling on what you can train.

Sandboxes & Batch: Modal's Wider Surface

Modal does more than inference. It also runs batch jobs, training, and sandboxes: secure containers for running untrusted or agent-generated code.

Modal's sandboxes give AI agents disposable, isolated compute on demand, with buffer_containers to over-provision during spikes. Replicate stays focused on the model-prediction surface and does not offer a general sandbox primitive.

For batch inference and scheduled jobs, Modal's function model maps cleanly: fan out across many containers, pay per second, scale to zero when done. Replicate's batch story runs through prediction queues against deployed models rather than arbitrary parallel compute.
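The appeal of per-second fan-out is that splitting a job across more containers shortens wall-clock time without raising the bill. The sketch below shows that arithmetic under a simplifying assumption: work divides evenly, and `startup_seconds` is a hypothetical per-container overhead that defaults to zero.

```python
def fan_out(
    total_gpu_seconds: float,
    containers: int,
    per_second_usd: float,
    startup_seconds: float = 0.0,
) -> tuple[float, float]:
    """Return (wall_clock_seconds, total_cost_usd) for an evenly split job."""
    if containers < 1:
        raise ValueError("containers must be at least 1")
    # Each container handles an equal share, plus any per-container overhead.
    per_container = total_gpu_seconds / containers + startup_seconds
    cost = per_container * containers * per_second_usd
    return per_container, cost


if __name__ == "__main__":
    h100_per_second = 0.001097  # Modal H100 rate quoted in this article
    for n in (1, 10, 100):
        wall, cost = fan_out(3600, n, h100_per_second)
        print(f"{n:>3} containers: {wall:7.1f}s wall clock, ${cost:.2f}")
```

With zero overhead the cost is identical at every width; any real per-container startup cost grows with the number of containers, which is where fast boots matter for batch jobs.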

Compliance & Enterprise

Both have an enterprise path, with different ownership stories behind them.

Enterprise & Compliance

Feature	Replicate	Modal
SOC 2	Confirm directly	From Starter tier
HIPAA	Confirm directly	Enterprise plan
Audit logs	Confirm directly	Enterprise plan
Cloud marketplace	Cloudflare ecosystem	AWS & GCP Marketplace
Owner	Cloudflare (Dec 2025)	Independent

Cloudflare acquired Replicate in late 2025, with the deal closing December 1, 2025. That pulls Replicate into Cloudflare's network and compliance footprint over time, but enterprise terms still vary, so confirm current attestation directly. Modal lists SOC 2 compliance available from the Starter tier, with HIPAA compatibility and audit logs on Enterprise, plus AWS and GCP Marketplace billing so committed cloud spend can apply toward Modal usage.

When to Use Replicate

Deploying an existing model fast. Thousands of community and Official Models are a single API call away. No packaging, no serving code.
Image and video generation. Output-based pricing on FLUX, SDXL, and video models, plus one-line LoRA fine-tuning, make it a strong fit for generative media pipelines.
Spiky, low-volume traffic. Per-token and per-output pricing on Official Models means you do not rent a whole GPU for occasional calls.
Cog-packaged custom models. If your model is already a Cog container, pushing it to a private dedicated endpoint is straightforward.
Cloudflare-adjacent stacks. Post-acquisition, Replicate fits teams already building on Cloudflare's network.

Frequently Asked Questions

Is Modal cheaper than Replicate?

On raw GPU rates, yes. Modal bills an H100 at $0.001097/sec ($3.95/hr) versus Replicate's $0.001525/sec ($5.49/hr), and an A100 80GB at $0.000694/sec versus Replicate's $0.001400/sec, roughly half. Both bill per second and scale to zero. Replicate also bills some language models per token (DeepSeek R1 at $3.75/M input, $10/M output) and image/video per output, which can be cheaper for spiky low-volume use. Prices are list rates as of June 9, 2026.

Which has faster cold starts, Replicate or Modal?

Modal. Its containers boot in about 1 second, and memory snapshotting cuts a vLLM cold start from about 45s to 5s. Replicate cold boots can take several minutes for large models; you avoid that only by keeping a warm minimum of instances, which removes the scale-to-zero savings.

What is the difference between Replicate and Modal?

Replicate is a model catalog: you package a model with Cog and call it through one API, alongside thousands of pre-built community and Official Models. Modal is a Python-native serverless cloud: you write a Python function, decorate it with a GPU, and Modal containerizes and autoscales it. Replicate optimizes for clicking deploy on an existing model; Modal optimizes for running your own arbitrary code on GPUs.

Does Modal or Replicate support SOC 2 and HIPAA?

Modal offers SOC 2 compliance from its Starter tier, with HIPAA compatibility and audit logs on Enterprise. Replicate was acquired by Cloudflare in late 2025, which brings Cloudflare's compliance posture, though enterprise terms vary. For regulated workloads, confirm current attestation directly with each vendor.

Can I deploy an image or video model faster on Replicate or Modal?

Replicate. Thousands of community and Official Models, including FLUX, SDXL, and video models, are a single API call away with no packaging, plus a ready web GUI to test and share. Replicate also offers one-line LoRA fine-tuning for FLUX.1 and SDXL. On Modal you write the serving code yourself (load the model, wire up a vLLM or diffusers function, expose an endpoint), which is more flexible but slower to a first working URL for an off-the-shelf model.

Replicate for Instant Model Endpoints, Modal for Cheaper Custom Compute

Pick Replicate to run a packaged model with a GUI today; pick Modal for lower GPU rates and faster cold starts on your own code. If applying model-generated code edits is the bottleneck, that is a different tool.
