Pick DeepInfra for cheap high-volume text inference behind one OpenAI-compatible API; pick Replicate for image, video, and arbitrary model deploys. Replicate, now Cloudflare-owned, runs any model packaged as a Cog container and bills GPU seconds, so it handles diffusion, video, audio, and custom checkpoints that no managed catalog will host, at the cost of cold starts. DeepInfra serves a curated set of open-weight text LLMs per token on its own US data centers down to B200 hardware.
That billing split decides almost everything. A custom diffusion model or a one-off research checkpoint belongs on Replicate, where you pay $5.49/hr for an H100 and scale to zero when idle. A high-volume Llama or DeepSeek chat workload belongs on DeepInfra, where Llama 3.3 70B Turbo runs at $0.10 input and $0.32 output per million tokens with no per-second hardware math.
Prices are as of early 2026 and change often, so check each provider's pricing page before you commit.
TL;DR
- Pick Replicate if you need to run a custom, image, video, or audio model packaged as a Cog container, pay per GPU second, and scale to zero between requests. It is the most flexible GPU host of the two.
- Pick DeepInfra if you run a popular open-weight text LLM at volume and want per-token pricing, an OpenAI-compatible endpoint, and SOC 2 / HIPAA compliance. Llama 3.3 70B Turbo is $0.10/$0.32 per 1M tokens.
Who Wins Per Workload
The right choice is decided by the shape of the workload, not by an overall winner. This table maps the real decisions to the provider that wins each one.
| Workload / decision | Replicate | DeepInfra |
|---|---|---|
| Cheap high-volume text LLM | Loses on per-second math | DeepInfra ($0.10/$0.32 per 1M) |
| Image / video / audio generation | Replicate (Cog catalog) | Limited model set |
| Run an arbitrary custom model | Replicate (any Cog container) | Catalog + LoRA only |
| Bursty / low-volume traffic | Replicate (scale to zero) | Warm pools, no $0 idle |
| Fastest first call (no cold start) | Cold boot 30-120s | DeepInfra (warm pools) |
| Drop-in OpenAI-compatible API | Per-model schema | DeepInfra (uniform) |
| Embeddings & reranking | Via catalog models | DeepInfra (Qwen3 series) |
| Strictest compliance | Standard | DeepInfra (SOC 2, ISO, HIPAA) |
| Lowest dedicated GPU cost | $5.49/hr H100 instance | DeepInfra ($1.79/hr GPU-hr) |
Quick Comparison
Replicate is a flexible per-second GPU host; DeepInfra is a per-token API for popular open models. Morph is the code-specific layer for the agent inner loop.
| Spec | Replicate | DeepInfra | Morph |
|---|---|---|---|
| Primary focus | Any model as a Cog container | Open-weight model token API | Coding-agent inner loop |
| Billing model | Per GPU second | Per token | Per token / per request |
| H100 dedicated | $5.49/hr | $1.79/hr (GPU-hr) | Managed fleet |
| Llama 3.3 70B (per 1M) | Per-second GPU time | $0.10 / $0.32 | N/A |
| Code-specific apply | No | No | Yes (/v1/code/apply) |
| Semantic code search | No | No | Yes (WarpGrep) |
| Apply throughput | General serving | General serving | ~10,500 tok/s |
| First-pass apply accuracy | N/A | N/A | 98% |
| Image / video / audio | Yes (catalog strength) | Yes (some models) | No |
| Scale to zero | Yes (cold boot on wake) | Warm serverless pools | Always-on fleet |
| OpenAI-compatible API | Partial (per model) | Yes (drop-in) | Yes |
| Compliance | Standard | SOC 2, ISO 27001, HIPAA, GDPR | Standard |
| Best for | Custom & multimodal models | High-volume open LLM serving | Coding agents |
Billing Model: Per Second vs Per Token
This is the core difference. Replicate bills the GPU clock; DeepInfra bills the tokens.
Replicate charges for hardware by the second. A public model is billed for the time it takes to run, and a private deployment is billed for setup time, idle time, and active processing time. You pick the GPU and pay its per-second rate whether the model is generating tokens or waiting. Some official models on Replicate use output-based per-token pricing instead, which removes cold-start charges, but the default for custom work is per-second hardware.
DeepInfra charges per input and output token on its catalog models, with no setup fee and no per-second hardware accounting. You send a request, you pay for the tokens, and the warm shared pool absorbs idle time. For dedicated capacity, DeepInfra also rents GPUs by the hour ($1.79/hr for an H100), but the default path is per-token serverless.
The Utilization Break-Even
Per-second hardware only beats per-token pricing when the GPU stays near 100% busy. A model that sits idle between requests still bills full GPU seconds on Replicate's per-second track. DeepInfra's per-token model charges nothing for idle time, which is why bursty or low-volume chat traffic is almost always cheaper there.
Pricing: The Numbers
For standard open LLMs, DeepInfra's per-token rates are hard to beat. For raw GPU time, the two are priced for different purposes.
| GPU | Replicate | DeepInfra (dedicated) |
|---|---|---|
| T4 | $0.81/hr | N/A |
| L40S | $3.51/hr | N/A |
| A100 80GB | $5.04/hr | $0.89/hr (GPU-hr) |
| H100 | $5.49/hr | $1.79/hr (GPU-hr) |
| H200 141GB | N/A | $2.19/hr (GPU-hr) |
| B200 180GB | N/A | $2.79/hr (GPU-hr) |
| B300 270GB | N/A | $4.20/hr (GPU-hr) |
Replicate's rates are for a full dedicated instance you control end to end. DeepInfra's dedicated rate is a per-GPU-hour figure inside its managed fleet, which is why the headline numbers look much lower. They are not measuring the same thing, but for raw access to a chip, DeepInfra is cheaper per GPU.
| Model | Input | Output |
|---|---|---|
| Llama 3.3 70B Turbo | $0.10 | $0.32 |
| Meta-Llama-3.1-70B-Instruct | $0.40 | $0.40 |
| Meta-Llama-3.1-8B-Instruct | $0.02 | $0.05 |
| Mistral-Nemo-Instruct | $0.02 | $0.04 |
| DeepSeek-V4-Flash | $0.10 | $0.20 |
| Embeddings (per 1M input) | $0.005 to $0.01 |
Replicate does run some LLMs on per-token pricing too. Its hosted Claude 3.7 Sonnet is $3.00 per million input tokens, and DeepSeek R1 lists at $3.75 per million input tokens. But Replicate is not where you go to serve a cheap open Llama at scale; that is DeepInfra's home turf.
Cost on a Real Workload
Here is the break-even worked out on the list prices above, so you can redo the arithmetic.
Cost on a real workload (computed from list prices, June 2026)
Serving Llama 3.3 70B at 50M output tokens/day on DeepInfra serverless: 50 x $0.32 per 1M output = $16/day for output, plus input. Even doubling that for input, call it roughly $30/day, or about $900/mo, with $0 paid for idle time.
The same on a dedicated DeepInfra H100 at $1.79/hr: $1.79 x 24 = ~$43/day = ~$1,290/mo per GPU, billed whether or not it is busy. The same on a Replicate dedicated H100 at $5.49/hr: $5.49 x 24 = ~$132/day = ~$3,950/mo per GPU.
So for steady 70B text traffic, DeepInfra serverless per-token (~$900/mo) beats a dedicated H100 on either host until utilization is high enough that the token bill would exceed the GPU rental. A dedicated DeepInfra H100 (~$1,290/mo) only wins once your token spend would clear that figure, roughly above ~50M output tokens/day sustained at near-full utilization. Replicate's full-instance H100 is priced for control and custom containers, not for undercutting DeepInfra on cheap text.
Cold Starts: The Tax on Scale-to-Zero
Replicate trades latency for the right to pay $0 while idle; DeepInfra keeps popular models warm so most requests skip the boot entirely.
On Replicate, a deployment scales to zero after a few idle minutes, and the next request triggers a cold boot. Custom Cog models can take 30 to 120 seconds to come back from fully cold. Replicate has attacked this for fine-tunes specifically: fine-tuned models on supported bases now boot in under one second, billed only for active time. But a generic custom container still pays the full cold-boot tax.
DeepInfra runs popular catalog models in warm shared serverless pools, so a Llama or DeepSeek request usually hits an already-loaded model. There is no scale-to-zero on those, which is the point: you trade the possibility of $0 idle for consistently low first-token latency.
The decision is workload-shaped. Spiky traffic with long idle gaps favors Replicate's scale-to-zero on the per-second track, even with cold boots. Steady, latency-sensitive traffic on a popular model favors DeepInfra's warm pools.
Neither is built for the coding-agent apply loop; if applying model-generated code edits is the bottleneck, that is a different tool (Morph Fast Apply, ~10,500 tok/s, with published benchmarks).
Model Catalog: Open-Ended vs Curated
Replicate runs anything you can containerize; DeepInfra runs a curated set of open-weight models very well.
Replicate's catalog is open-ended by design. Because every model is a Cog container, the platform hosts the widest variety of open-source models including image, video, audio, and custom checkpoints, plus whatever you publish yourself. If you need FLUX, a Stable Diffusion variant, a speech model, and an LLM behind one account, Replicate covers all of them.
DeepInfra tracks 84 models on Artificial Analysis and lists 190+ open-source models across text generation, vision and OCR, embeddings, rerankers, image, video, and speech. It is curated rather than open-ended, but each model is tuned for throughput on DeepInfra's own hardware. Output speed on small models reaches roughly 316 tok/s (Qwen3.5 2B FP8), and the catalog includes Qwen3 embeddings and rerankers in 0.6B, 4B, and 8B sizes.
Replicate's Real Strength
Replicate is strongest when the model is not a plain text LLM. Image and video generation, audio, and bespoke research models all package cleanly as Cog containers and bill per GPU second. DeepInfra has some image and video models, but its center of gravity is high-throughput text and embeddings.
Fine-Tuning and Custom Models
Both support fine-tuning, but the deployment shape differs.
Replicate lets you fine-tune language and image models with your own data and serve the result as a private deployment with a dedicated endpoint. Fast-booting fine-tunes are billed only for active time, which sidesteps the idle-GPU cost that normally comes with dedicated hardware.
DeepInfra serves LoRA adapters on top of supported base models through the same OpenAI-compatible API, so you point your adapter at a hosted base and get standard per-token serving. For teams that need isolation, DeepInfra also offers private dedicated deployments of custom Hugging Face models and LoRA adapters on A100, H100, H200, B200, or B300 GPUs with autoscaling.
| Capability | Replicate | DeepInfra |
|---|---|---|
| Fine-tune LLMs | Yes | LoRA adapters |
| Fine-tune image models | Yes | LoRA image adapters |
| Fast fine-tune boot | Under 1s (supported bases) | Adapter on warm base |
| Private dedicated deploy | Yes (Cog) | Yes (A100-B300) |
| Idle billing on dedicated | Setup + idle + active | Per GPU-hour |
Features and Compliance
DeepInfra leads on the API primitives a production text app needs; Replicate leads on packaging flexibility.
DeepInfra exposes a drop-in OpenAI-compatible endpoint, so switching providers is usually a base-URL and API-key change. It supports streaming, function calling, JSON mode and structured output, webhooks for async callbacks, embeddings, and reranking. On compliance it holds SOC 2, ISO 27001, GDPR, and HIPAA, and it runs its own US infrastructure including NVIDIA Blackwell B200 systems.
Replicate's API surface is per model rather than a single uniform chat schema, since each Cog container defines its own inputs and outputs. That flexibility is the trade: you can run literally any model, but you do not get one consistent OpenAI-compatible contract across the whole catalog the way DeepInfra gives you for text.
| Feature | Replicate | DeepInfra |
|---|---|---|
| OpenAI-compatible chat | Per model | Yes (uniform) |
| JSON mode / structured output | Per model | Yes |
| Function calling | Per model | Yes |
| Embeddings & rerank | Via catalog models | Yes (Qwen3 series) |
| Webhooks / async | Yes | Yes |
| Run any custom container | Yes (Cog) | Limited to catalog + LoRA |
| SOC 2 / HIPAA | Standard | Yes |
When to Use Replicate
- Custom or research models. Anything you can package as a Cog container runs on Replicate, including checkpoints no managed catalog will ever host. Per-second billing fits unpredictable, bespoke workloads.
- Image, video, and audio generation. The catalog's real strength is non-text models. FLUX, Stable Diffusion variants, and speech models all run cleanly with per-GPU-second billing.
- Spiky, low-volume traffic. Scale-to-zero means you pay $0 while idle. If your model sees a few requests an hour, accepting a cold boot to avoid paying for idle GPU time is the right trade.
- Fast-booting fine-tunes. Fine-tunes on supported bases boot in under one second and bill only for active time, giving you private endpoints without the usual idle-GPU cost.
- Full control of the container. You own the Cog image, the hardware choice, and the scaling policy on dedicated deployments.
When to Use DeepInfra
- High-volume open LLM serving. Per-token pricing on Llama 3.3 70B Turbo ($0.10/$0.32 per 1M) and DeepSeek-V4-Flash ($0.10/$0.20) is built for steady chat and agent traffic where idle billing would hurt.
- Drop-in OpenAI compatibility. One uniform endpoint, base-URL swap, and your existing OpenAI SDK code works against 190+ open models with streaming, JSON mode, and function calling.
- Embeddings and reranking. Qwen3 embedding and rerank models at $0.005 to $0.01 per million input tokens cover retrieval pipelines without a second vendor.
- Compliance requirements. SOC 2, ISO 27001, HIPAA, and GDPR on US infrastructure (including B200) make it a fit for regulated workloads.
- Lowest per-GPU dedicated cost. If you do need reserved hardware, $1.79/hr for an H100 GPU-hour undercuts Replicate's full-instance pricing.
Frequently Asked Questions
Is Replicate or DeepInfra cheaper?
For standard open-weight chat models, DeepInfra is cheaper because it bills per token. Llama 3.3 70B Turbo is $0.10 per million input and $0.32 per million output tokens. On Replicate you pay for GPU seconds (about $5.49/hr for an H100), which only beats per-token pricing when the GPU stays near 100% busy. For bursty or low-volume traffic, DeepInfra usually wins. For custom or non-text models, Replicate is often the only fit.
What is the difference between Replicate and DeepInfra?
Replicate runs any model you package as a Cog container and bills GPU time per second, so it handles image, video, audio, and custom checkpoints alongside LLMs. DeepInfra serves a curated catalog of 190+ open-weight models through a per-token OpenAI-compatible API on its own US infrastructure. Replicate is a flexible GPU host; DeepInfra is a managed token API for popular open models.
Does Replicate have cold starts?
Yes. Replicate deployments scale to zero when idle, so the next request triggers a cold boot. Custom Cog models can take 30 to 120 seconds from fully cold, while fine-tunes on supported bases boot in under one second. DeepInfra keeps popular models warm in shared pools, so its first-token latency on cataloged models is generally lower than a cold custom Replicate model.
Can I fine-tune models on both?
Yes. Replicate lets you fine-tune language and image models and serves the result as a private deployment, with fast-booting fine-tunes billed only for active time. DeepInfra serves LoRA adapters on supported base models through its OpenAI-compatible API and offers private dedicated deployments of custom Hugging Face models on A100 through B300 GPUs.
Can I run image or video models on DeepInfra, or do I need Replicate?
DeepInfra carries some image and video models, but its center of gravity is high-throughput text and embeddings. For arbitrary diffusion checkpoints, custom video models, audio, or any research model that is not in a managed catalog, Replicate is the better fit, because every model is a Cog container you can publish yourself. Replicate hosts the widest variety of non-text and custom models and bills per GPU second; DeepInfra is built for serving popular open-weight text LLMs cheaply per token.
Related Comparisons
DeepInfra for Cheap Text, Replicate for Anything You Can Containerize
If applying model-generated code edits is your bottleneck, that is a separate problem. Morph Fast Apply lands edits at ~10,500 tok/sec with published benchmarks.