Replicate vs DeepInfra (2026): $5.49/hr GPU Seconds vs $0.10/$0.32 per 1M Tokens

Pick DeepInfra for cheap high-volume text inference behind one OpenAI-compatible API; pick Replicate for image, video, and arbitrary model deploys. Replicate, now Cloudflare-owned, runs any model packaged as a Cog container and bills GPU seconds, so it handles diffusion, video, audio, and custom checkpoints that no managed catalog will host, at the cost of cold starts. DeepInfra serves a curated set of open-weight text LLMs per token on its own US data centers down to B200 hardware.

That billing split decides almost everything. A custom diffusion model or a one-off research checkpoint belongs on Replicate, where you pay $5.49/hr for an H100 and scale to zero when idle. A high-volume Llama or DeepSeek chat workload belongs on DeepInfra, where Llama 3.3 70B Turbo runs at $0.10 input and $0.32 output per million tokens with no per-second hardware math.

Prices are as of June 2026 and change often, so check each provider's pricing page before you commit.

TL;DR

Pick Replicate if you need to run a custom, image, video, or audio model packaged as a Cog container, pay per GPU second, and scale to zero between requests. It is the most flexible GPU host of the two.
Pick DeepInfra if you run a popular open-weight text LLM at volume and want per-token pricing, an OpenAI-compatible endpoint, and SOC 2 / HIPAA compliance. Llama 3.3 70B Turbo is $0.10/$0.32 per 1M tokens.

Who Wins Per Workload

The right choice is decided by the shape of the workload, not by an overall winner. This table maps the real decisions to the provider that wins each one.

Who Wins Per Workload

Workload / decision	Replicate	DeepInfra
Cheap high-volume text LLM	Loses on per-second math	DeepInfra ($0.10/$0.32 per 1M)
Image / video / audio generation	Replicate (Cog catalog)	Limited model set
Run an arbitrary custom model	Replicate (any Cog container)	Catalog + LoRA only
Bursty / low-volume traffic	Replicate (scale to zero)	Warm pools, no $0 idle
Fastest first call (no cold start)	Cold boot 30-120s	DeepInfra (warm pools)
Drop-in OpenAI-compatible API	Per-model schema	DeepInfra (uniform)
Embeddings & reranking	Via catalog models	DeepInfra (Qwen3 series)
Strictest compliance	Standard	DeepInfra (SOC 2, ISO, HIPAA)
Lowest dedicated GPU cost	$5.49/hr H100 instance	DeepInfra ($1.79/hr GPU-hr)

Quick Comparison

Replicate is a flexible per-second GPU host; DeepInfra is a per-token API for popular open models. Morph is the code-specific layer for the agent inner loop.

Replicate vs DeepInfra vs Morph at a Glance

Spec	Replicate	DeepInfra	Morph
Primary focus	Any model as a Cog container	Open-weight model token API	Coding-agent inner loop
Billing model	Per GPU second	Per token	Per token / per request
H100 dedicated	$5.49/hr	$1.79/hr (GPU-hr)	Managed fleet
Llama 3.3 70B (per 1M)	Per-second GPU time	$0.10 / $0.32	N/A
Code-specific apply	No	No	Yes (/v1/code/apply)
Semantic code search	No	No	Yes (WarpGrep)
Apply throughput	General serving	General serving	~10,500 tok/s
First-pass apply accuracy	N/A	N/A	98%
Image / video / audio	Yes (catalog strength)	Yes (some models)	No
Scale to zero	Yes (cold boot on wake)	Warm serverless pools	Always-on fleet
OpenAI-compatible API	Partial (per model)	Yes (drop-in)	Yes
Compliance	Standard	SOC 2, ISO 27001, HIPAA, GDPR	Standard
Best for	Custom & multimodal models	High-volume open LLM serving	Coding agents

Billing Model: Per Second vs Per Token

This is the core difference. Replicate bills the GPU clock; DeepInfra bills the tokens.

Replicate charges for hardware by the second. A public model is billed for the time it takes to run, and a private deployment is billed for setup time, idle time, and active processing time. You pick the GPU and pay its per-second rate whether the model is generating tokens or waiting. Some official models on Replicate use output-based per-token pricing instead, which removes cold-start charges, but the default for custom work is per-second hardware.

DeepInfra charges per input and output token on its catalog models, with no setup fee and no per-second hardware accounting. You send a request, you pay for the tokens, and the warm shared pool absorbs idle time. For dedicated capacity, DeepInfra also rents GPUs by the hour ($1.79/hr for an H100), but the default path is per-token serverless.

The Utilization Break-Even

Per-second hardware only beats per-token pricing when the GPU stays near 100% busy. A model that sits idle between requests still bills full GPU seconds on Replicate's per-second track. DeepInfra's per-token model charges nothing for idle time, which is why bursty or low-volume chat traffic is almost always cheaper there.

Pricing: The Numbers

For standard open LLMs, DeepInfra's per-token rates are hard to beat. For raw GPU time, the two are priced for different purposes.

GPU Hardware Pricing (per hour, June 2026)

GPU	Replicate	DeepInfra (dedicated)
T4	$0.81/hr	N/A
L40S	$3.51/hr	N/A
A100 80GB	$5.04/hr	$0.89/hr (GPU-hr)
H100	$5.49/hr	$1.79/hr (GPU-hr)
H200 141GB	N/A	$2.19/hr (GPU-hr)
B200 180GB	N/A	$2.79/hr (GPU-hr)
B300 270GB	N/A	$4.20/hr (GPU-hr)

Replicate's rates are for a full dedicated instance you control end to end. DeepInfra's dedicated rate is a per-GPU-hour figure inside its managed fleet, which is why the headline numbers look much lower. They are not measuring the same thing, but for raw access to a chip, DeepInfra is cheaper per GPU.

Per-Token Pricing on DeepInfra (per 1M tokens, June 2026)

Model	Input	Output
Llama 3.3 70B Turbo	$0.10	$0.32
Meta-Llama-3.1-70B-Instruct	$0.40	$0.40
Meta-Llama-3.1-8B-Instruct	$0.02	$0.05
Mistral-Nemo-Instruct	$0.02	$0.04
DeepSeek-V4-Flash	$0.10	$0.20
Embeddings (per 1M input)	$0.005 to $0.01

Replicate does run some LLMs on per-token pricing too. Its hosted Claude 3.7 Sonnet is $3.00 per million input tokens, and DeepSeek R1 lists at $3.75 per million input tokens. But Replicate is not where you go to serve a cheap open Llama at scale; that is DeepInfra's home turf.

Cost on a Real Workload

Here is the break-even worked out on the list prices above, so you can redo the arithmetic.

Cost on a real workload (computed from list prices, June 2026)

Serving Llama 3.3 70B at 50M output tokens/day on DeepInfra serverless: 50 x $0.32 per 1M output = $16/day for output, plus input. Even doubling that for input, call it roughly $30/day, or about $900/mo, with $0 paid for idle time.

The same on a dedicated DeepInfra H100 at $1.79/hr: $1.79 x 24 = ~$43/day = ~$1,290/mo per GPU, billed whether or not it is busy. The same on a Replicate dedicated H100 at $5.49/hr: $5.49 x 24 = ~$132/day = ~$3,950/mo per GPU.

So for steady 70B text traffic, DeepInfra serverless per-token (~$900/mo) beats a dedicated H100 on either host until utilization is high enough that the token bill would exceed the GPU rental. A dedicated DeepInfra H100 (~$1,290/mo) only wins once your token spend would clear that figure, roughly above ~50M output tokens/day sustained at near-full utilization. Replicate's full-instance H100 is priced for control and custom containers, not for undercutting DeepInfra on cheap text.

Cold Starts: The Tax on Scale-to-Zero

Replicate trades latency for the right to pay $0 while idle; DeepInfra keeps popular models warm so most requests skip the boot entirely.

On Replicate, a deployment scales to zero after a few idle minutes, and the next request triggers a cold boot. Custom Cog models can take 30 to 120 seconds to come back from fully cold. Replicate has attacked this for fine-tunes specifically: fine-tuned models on supported bases now boot in under one second, billed only for active time. But a generic custom container still pays the full cold-boot tax.

DeepInfra runs popular catalog models in warm shared serverless pools, so a Llama or DeepSeek request usually hits an already-loaded model. There is no scale-to-zero on those, which is the point: you trade the possibility of $0 idle for consistently low first-token latency.

30-120s

Replicate cold custom model boot

<1s

Replicate fast fine-tune boot

Warm

DeepInfra catalog pools

The decision is workload-shaped. Spiky traffic with long idle gaps favors Replicate's scale-to-zero on the per-second track, even with cold boots. Steady, latency-sensitive traffic on a popular model favors DeepInfra's warm pools.

Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).

Model Catalog: Open-Ended vs Curated

Replicate runs anything you can containerize; DeepInfra runs a curated set of open-weight models very well.

Replicate's catalog is open-ended by design. Because every model is a Cog container, the platform hosts the widest variety of open-source models including image, video, audio, and custom checkpoints, plus whatever you publish yourself. If you need FLUX, a Stable Diffusion variant, a speech model, and an LLM behind one account, Replicate covers all of them.

DeepInfra lists 190+ open-source models across text generation, vision and OCR, embeddings, rerankers, image, video, and speech, and it also resells Anthropic models (Claude Haiku 4.5 at $1.00/$5.00, Sonnet 4.6 at $3.00/$15.00, Opus 4.8 at $5.00/$25.00 per 1M in/out). It is curated rather than open-ended, but each open model runs on DeepInfra's own hardware, and the catalog includes Qwen3 embeddings and rerankers in 0.6B, 4B, and 8B sizes.

Replicate's Real Strength

Replicate is strongest when the model is not a plain text LLM. Image and video generation, audio, and bespoke research models all package cleanly as Cog containers and bill per GPU second. DeepInfra has some image and video models, but its center of gravity is high-throughput text and embeddings.

Fine-Tuning and Custom Models

Both support fine-tuning, but the deployment shape differs.

Replicate lets you fine-tune language and image models with your own data and serve the result as a private deployment with a dedicated endpoint. Fast-booting fine-tunes are billed only for active time, which sidesteps the idle-GPU cost that normally comes with dedicated hardware.

DeepInfra serves LoRA adapters on top of supported base models through the same OpenAI-compatible API, so you point your adapter at a hosted base and get standard per-token serving. For teams that need isolation, DeepInfra also offers private dedicated deployments of custom Hugging Face models and LoRA adapters on A100, H100, H200, B200, or B300 GPUs with autoscaling.

Fine-Tuning and Deployment

Capability	Replicate	DeepInfra
Fine-tune LLMs	Yes	LoRA adapters
Fine-tune image models	Yes	LoRA image adapters
Fast fine-tune boot	Under 1s (supported bases)	Adapter on warm base
Private dedicated deploy	Yes (Cog)	Yes (A100-B300)
Idle billing on dedicated	Setup + idle + active	Per GPU-hour

Running DeepSeek: Quantization Decides Output Quality

DeepInfra serves DeepSeek V4-Flash at $0.10 input and $0.20 output per 1M tokens, and DeepSeek V4-Pro at $1.30 input and $2.60 output. Those are among the cheapest DeepSeek rates published, and for high-volume traffic that is the point. But cheap serverless DeepSeek almost always means fp8 activations.

Most serverless providers quantize activations to fp8 to fit more requests on each GPU and cut cost. fp8 lowers the precision of the math the model runs at inference time, so outputs drift from the reference weights. For chat that drift is often invisible; for code generation and structured output it shows up as worse edits and more retries.

Morph serves DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization, so responses match the reference weights. That makes Morph the best place to run DeepSeek when output fidelity matters. morph-dsv4flash (DeepSeek V4 Flash) is $0.139 per 1M input tokens and $0.278 per 1M output tokens, close to DeepInfra's fp8 rate but at full 16-bit precision.

For coding agents specifically, Morph also runs codegen-tuned speculative decoding plus custom low-level inference kernels built for code generation. That is what makes it the fastest and highest-quality option for the agent inner loop, not a general-purpose menu. See Morph Open Source Models and pricing.

Running DeepSeek V4 Flash: Where to Serve It

Provider	Input / Output (per 1M)	Activations	Best for
Morph (morph-dsv4flash)	$0.139 / $0.278	16-bit (bf16)	Codegen, full-fidelity output
DeepInfra	$0.10 / $0.20	fp8 (typical)	Cheapest high-volume text
Replicate	Per GPU second	Your container	Custom DeepSeek deploys

Features and Compliance

DeepInfra leads on the API primitives a production text app needs; Replicate leads on packaging flexibility.

DeepInfra exposes a drop-in OpenAI-compatible endpoint, so switching providers is usually a base-URL and API-key change. It supports streaming, function calling, JSON mode and structured output, webhooks for async callbacks, embeddings, and reranking. The default rate limit is 200 concurrent requests per account, with postpaid billing and no free tier. On compliance it holds SOC 2, ISO 27001, GDPR, and HIPAA, and it runs its own US infrastructure including NVIDIA Blackwell B200 systems. On data handling, DeepInfra deletes prompts and completions from disk and memory after a short retention period and logs only metadata, except for Google models where Google logs prompts for abuse detection.

Replicate's API surface is per model rather than a single uniform chat schema, since each Cog container defines its own inputs and outputs. That flexibility is the trade: you can run literally any model, but you do not get one consistent OpenAI-compatible contract across the whole catalog the way DeepInfra gives you for text.

Feature Matrix

Feature	Replicate	DeepInfra
OpenAI-compatible chat	Per model	Yes (uniform)
JSON mode / structured output	Per model	Yes
Function calling	Per model	Yes
Embeddings & rerank	Via catalog models	Yes (Qwen3 series)
Webhooks / async	Yes	Yes
Run any custom container	Yes (Cog)	Limited to catalog + LoRA
SOC 2 / HIPAA	Standard	Yes

When to Use Replicate

Custom or research models. Anything you can package as a Cog container runs on Replicate, including checkpoints no managed catalog will ever host. Per-second billing fits unpredictable, bespoke workloads.
Image, video, and audio generation. The catalog's real strength is non-text models. FLUX, Stable Diffusion variants, and speech models all run cleanly with per-GPU-second billing.
Spiky, low-volume traffic. Scale-to-zero means you pay $0 while idle. If your model sees a few requests an hour, accepting a cold boot to avoid paying for idle GPU time is the right trade.
Fast-booting fine-tunes. Fine-tunes on supported bases boot in under one second and bill only for active time, giving you private endpoints without the usual idle-GPU cost.
Full control of the container. You own the Cog image, the hardware choice, and the scaling policy on dedicated deployments.

When to Use DeepInfra

High-volume open LLM serving. Per-token pricing on Llama 3.3 70B Turbo ($0.10/$0.32 per 1M) and DeepSeek-V4-Flash ($0.10/$0.20) is built for steady chat and agent traffic where idle billing would hurt.
Drop-in OpenAI compatibility. One uniform endpoint, base-URL swap, and your existing OpenAI SDK code works against 190+ open models with streaming, JSON mode, and function calling.
Embeddings and reranking. Qwen3 embedding and rerank models at $0.005 to $0.01 per million input tokens cover retrieval pipelines without a second vendor.
Compliance requirements. SOC 2, ISO 27001, HIPAA, and GDPR on US infrastructure (including B200) make it a fit for regulated workloads.
Lowest per-GPU dedicated cost. If you do need reserved hardware, $1.79/hr for an H100 GPU-hour undercuts Replicate's full-instance pricing.

Frequently Asked Questions

Is Replicate or DeepInfra cheaper?

For standard open-weight chat models, DeepInfra is cheaper because it bills per token. Llama 3.3 70B Turbo is $0.10 per million input and $0.32 per million output tokens. On Replicate you pay for GPU seconds (about $5.49/hr for an H100), which only beats per-token pricing when the GPU stays near 100% busy. For bursty or low-volume traffic, DeepInfra usually wins. For custom or non-text models, Replicate is often the only fit.

What is the difference between Replicate and DeepInfra?

Replicate runs any model you package as a Cog container and bills GPU time per second, so it handles image, video, audio, and custom checkpoints alongside LLMs. DeepInfra serves a curated catalog of 190+ open-weight models through a per-token OpenAI-compatible API on its own US infrastructure. Replicate is a flexible GPU host; DeepInfra is a managed token API for popular open models.

Does Replicate have cold starts?

Yes. Replicate deployments scale to zero when idle, so the next request triggers a cold boot. Custom Cog models can take 30 to 120 seconds from fully cold, while fine-tunes on supported bases boot in under one second. DeepInfra keeps popular models warm in shared pools, so its first-token latency on cataloged models is generally lower than a cold custom Replicate model.

Can I fine-tune models on both?

Yes. Replicate lets you fine-tune language and image models and serves the result as a private deployment, with fast-booting fine-tunes billed only for active time. DeepInfra serves LoRA adapters on supported base models through its OpenAI-compatible API and offers private dedicated deployments of custom Hugging Face models on A100 through B300 GPUs.

Can I run image or video models on DeepInfra, or do I need Replicate?

DeepInfra carries some image and video models, but its center of gravity is high-throughput text and embeddings. For arbitrary diffusion checkpoints, custom video models, audio, or any research model that is not in a managed catalog, Replicate is the better fit, because every model is a Cog container you can publish yourself. Replicate hosts the widest variety of non-text and custom models and bills per GPU second; DeepInfra is built for serving popular open-weight text LLMs cheaply per token.

What is the best provider to run DeepSeek models?

For cheapest high-volume DeepSeek text, DeepInfra serves DeepSeek V4-Flash at $0.10 input and $0.20 output per 1M tokens, typically on fp8 activations. For output fidelity, Morph serves DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization, so responses match the reference weights. morph-dsv4flash is $0.139 input and $0.278 output per 1M tokens. For code generation, Morph also runs codegen-tuned speculative decoding and custom inference kernels, which makes it the fastest, highest-quality option for coding agents. Replicate is the choice only if you want to deploy your own DeepSeek container per GPU second.

Related Comparisons

DeepInfra for Cheap Text, Replicate for Anything You Can Containerize

If applying model-generated code edits is your bottleneck, that is a separate problem. Morph Fast Apply lands edits at ~10,500 tok/sec with published benchmarks.

Try Morph Free

Fast Apply benchmarks

GLM-5.2

Qwen

MiniMax

DeepSeek

Reflex

Fast Apply

WarpGrep

Compact

Model Router

Blog

Startup Credits

Contact Us

About

Careers

Replicate vs DeepInfra (2026): $5.49/hr GPU Seconds vs $0.10/$0.32 per 1M Tokens

Who Wins Per Workload

Quick Comparison

Billing Model: Per Second vs Per Token

Pricing: The Numbers

Cost on a Real Workload

Cold Starts: The Tax on Scale-to-Zero

Model Catalog: Open-Ended vs Curated

Fine-Tuning and Custom Models

Running DeepSeek: Quantization Decides Output Quality

Features and Compliance

When to Use Replicate

When to Use DeepInfra

Frequently Asked Questions

Is Replicate or DeepInfra cheaper?

What is the difference between Replicate and DeepInfra?

Does Replicate have cold starts?

Can I fine-tune models on both?

Can I run image or video models on DeepInfra, or do I need Replicate?

What is the best provider to run DeepSeek models?

Related Comparisons

DeepInfra for Cheap Text, Replicate for Anything You Can Containerize