Together AI vs Baseten (2026): Serverless Tokens vs Dedicated GPUs, Priced Side by Side

You want to serve an open model in production and the two names on the shortlist are Together AI and Baseten. They bill differently, so the cheaper one flips with your traffic shape. Together AI prices per token across a broad serverless catalog: DeepSeek V4 Pro at $2.10 in / $4.40 out per million, GLM-5.1 at $1.40 / $4.40, Llama 3.3 70B at $1.04 flat, with batch jobs at half rate. Baseten prices per GPU-hour on dedicated replicas: H100 at $6.50/hr, B200 at $9.98/hr, A100 at $4.00/hr, with scale-to-zero that drops idle cost to nothing but adds a cold start.

The split is serverless tokens versus dedicated GPUs. Together optimizes calling and fine-tuning many models with no provisioning and portable weights. Baseten optimizes running one model in production where you control hardware, scaling, placement, and data residency, including inside your own VPC.

One question decides it: a pay-per-token API across many models with exportable weights, or a production home for a specific model on GPUs you control? Together wins the first. Baseten wins the second.

At a glance: Together H100 dedicated $6.49/hr; Baseten H100 dedicated $6.50/hr; Baseten bills per minute; Together batch discount about 50%.

TL;DR

Pick Together AI if traffic is bursty, multi-model, or low volume. Serverless tokens with $0 idle: Llama 3.3 70B $1.04/1M, gpt-oss-120B $0.15/$0.60, DeepSeek V4 Pro $2.10/$4.40, batch jobs at half rate, downloadable fine-tuned weights, ATLAS adaptive speculative decoding. Rent raw GPU clusters from $5.49/hr.
Pick Baseten if one model carries steady high load or must run in your own infra. Dedicated H100 $6.50/hr (per-minute billing) with scale-to-zero, Chains for compound pipelines with a reported 6x better GPU usage, BEI embeddings at a reported 2x throughput, and self-hosted or hybrid VPC deployment with HIPAA.
Running DeepSeek? Morph Open Source Models serve it in 16-bit (bf16) activations with no fp8 quantization, so output matches the reference weights, at $0.139/$0.278 per 1M on DeepSeek V4 Flash.

Who Wins Per Workload

The verdict changes with the job. This is the fast lookup.

Workload / decision	Together AI	Baseten
Bursty / low volume	Together AI: serverless, $0 idle	Loses: pays for provisioned GPUs
Sustained high volume, one model	Loses: per-token adds up	Baseten: dedicated GPU amortizes
Call many models fast	Together AI: broad serverless catalog	Loses: curated token-priced set
Serve in your own VPC	Loses: managed cloud only	Baseten: self-hosted + hybrid
Fine-tune and export weights	Together AI: priced API, downloadable	Loses: serves, not exports
Rent a raw GPU cluster	Together AI: HGX H100/H200/B200 clusters	Loses: model-ops, not bare GPUs
Compound multi-model pipeline	Loses: no per-step graph	Baseten: Chains, reported 6x better GPU usage
Embedding / reranker throughput	Standard serving	Baseten: BEI, 2x throughput
HIPAA / strict data residency	Managed cloud	Baseten: HIPAA in-VPC
DeepSeek at full quality	Serverless, fp8 typical	Dedicated, you control precision

Quick Comparison

Together AI optimizes serverless breadth and portability. Baseten optimizes dedicated control in your own infra.

Together AI vs Baseten at a Glance

Spec	Together AI	Baseten	Morph
Primary Focus	Serverless API + training cloud	Dedicated production inference	Open models + code apply
Billing	Per token / per GPU-hour	Per minute (GPU-hour)	Per token
Llama 3.3 70B (per 1M)	$1.04 in / $1.04 out	Run on dedicated GPU	N/A
DeepSeek V4 Flash (per 1M)	Not listed	Token-priced set	$0.139 / $0.278
Dedicated H100 ($/hr)	$6.49	$6.50	N/A
Dedicated B200 ($/hr)	$11.95	$9.98	N/A
Speculative Decoding	ATLAS adaptive (learns live)	TensorRT-LLM based	Tuned to code
Export Fine-Tuned Weights	Yes, downloadable	Serves, not exports	N/A
Self-Hosted / VPC	Managed cloud only	Cloud, self-hosted, hybrid	Managed API
Best For	Variable, multi-model, portable	Steady single-model in your infra	DeepSeek + coding agents

Serverless vs Dedicated: The Core Split

The two providers sit on opposite sides of the same tradeoff: provisioning.

Together AI is serverless-first. You send a request to a shared, already-warm pool and pay per token. There is no GPU to reserve, no autoscaling to configure, and no idle cost. The catalog is broad: Llama, Qwen3.5, gpt-oss, DeepSeek, GLM, MiniMax, plus image, audio, and embedding models. Together also offers on-demand dedicated endpoints, reserved capacity, and rentable GPU clusters (HGX H100 $5.49/hr, HGX H200 $6.79/hr, HGX B200 $9.95/hr) when you outgrow shared serverless.

Baseten is dedicated-first. You deploy a model onto your own autoscaling replicas and pay per GPU-minute. Baseten also sells token-priced Model APIs for a curated set (DeepSeek V4 Pro $1.74/$3.48, Kimi K2.6 $0.95/$4.00, GLM-5.1 $1.30/$4.30, GPT-OSS 120B $0.10/$0.50), but the platform is built around running your model with full control over hardware, scaling, and placement. Multi-Cloud Capacity Management (MCM) treats GPUs across regions and clouds as one pool, packing replicas wherever capacity exists.

The practical read: Together gets you to first token in seconds with zero setup. Baseten gives you a dedicated, tunable home for a model you plan to run at scale, including inside your own cloud.

Serverless Token Pricing, Same Models

Both providers publish per-token prices for popular open models. Together's serverless rates sit at the higher end of the market; Baseten's Model APIs come in lower on most of the same models. Prices below are per 1M tokens, input / output, as of June 2026.

Serverless Per-Token Pricing (per 1M in / out)

Model	Together AI	Baseten	Morph
DeepSeek V4 Pro	$2.10 / $4.40	$1.74 / $3.48	DeepSeek-tuned
DeepSeek V4 Flash	Catalog	Token-priced	$0.139 / $0.278
GLM-5.1	$1.40 / $4.40	$1.30 / $4.30	Open models
Kimi K2.6	$1.20 / $4.50	$0.95 / $4.00	Open models
Qwen3.6 Plus	$0.50 / $3.00	Curated set	Open models
GPT-OSS 120B	$0.15 / $0.60	$0.10 / $0.50	Open models
MiniMax M2.7	$0.30 / $1.20	Curated set	Open models
Llama 3.3 70B	$1.04 / $1.04	Dedicated GPU	N/A

On the open models both providers list, Baseten's Model API undercuts Together's serverless rate (DeepSeek V4 Pro $1.74 vs $2.10 input, GPT-OSS 120B $0.10 vs $0.15 input). Together's advantage is catalog breadth and exportable fine-tunes, not per-token price. For DeepSeek specifically, see the Morph rate under Running DeepSeek at Full Quality.
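To see how the per-token gap plays out on a monthly bill, the sketch below multiplies the list prices from the table by a token mix. The function name `monthly_cost` and the 300M input / 100M output workload are hypothetical, chosen only for illustration; substitute your own volumes.

```python
# Per-1M-token list prices (input, output) in USD, copied from the table above.
PRICES = {
    "DeepSeek V4 Pro": {"together": (2.10, 4.40), "baseten": (1.74, 3.48)},
    "GLM-5.1": {"together": (1.40, 4.40), "baseten": (1.30, 4.30)},
    "Kimi K2.6": {"together": (1.20, 4.50), "baseten": (0.95, 4.00)},
    "GPT-OSS 120B": {"together": (0.15, 0.60), "baseten": (0.10, 0.50)},
}


def monthly_cost(model, provider, input_tokens_m, output_tokens_m):
    """Cost in USD for a month's tokens, with volumes given in millions."""
    price_in, price_out = PRICES[model][provider]
    return input_tokens_m * price_in + output_tokens_m * price_out


# Hypothetical workload: 300M input and 100M output tokens per month.
for model in PRICES:
    together = monthly_cost(model, "together", 300, 100)
    baseten = monthly_cost(model, "baseten", 300, 100)
    saving = (together - baseten) / together
    print(f"{model}: Together ${together:,.2f} vs Baseten ${baseten:,.2f} "
          f"({saving:.0%} lower on Baseten)")
```

The comparison covers only the token-priced rows both providers publish; it says nothing about latency, rate limits, or precision.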

Dedicated GPU Pricing, Per Hour

When you commit a GPU, Together and Baseten land within a cent of each other on H100 ($6.49/hr vs $6.50/hr), and Baseten is cheaper on B200 ($9.98/hr vs $11.95/hr for a dedicated endpoint).

Dedicated GPU Pricing

GPU	Together AI	Baseten
H100 80GB	$6.49/hr endpoint ($5.49/hr cluster)	$6.50/hr ($0.10833/min)
B200 180GB	$11.95/hr endpoint ($9.95/hr cluster)	$9.98/hr ($0.16633/min)
A100 80GB	Cluster reserved	$4.00/hr ($0.06667/min)
H100 MIG 40GB	Not published	$3.75/hr ($0.0625/min)
L4 24GB	Not published	$0.85/hr ($0.01414/min)
T4 16GB	Not published	$0.63/hr ($0.01052/min)
Billing granularity	Per hour	Per minute
Idle cost	$0 serverless / billed on dedicated	$0 with scale-to-zero

Baseten publishes a deeper GPU menu (down to T4 at $0.63/hr) and bills per minute, which is finer-grained than Together's per-hour dedicated endpoints. Together's rentable GPU clusters undercut its own dedicated endpoints ($5.49/hr HGX H100) but require renting whole HGX nodes rather than a single replica.
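Billing granularity matters most for short or uneven jobs. The sketch below converts an hourly rate to a per-minute rate and compares what a job costs under per-minute and per-hour billing. It assumes usage is rounded up to the next billing unit, which is a modeling assumption rather than either provider's documented rule, and `billed_cost` is a hypothetical helper.

```python
import math


def per_minute_rate(hourly_rate):
    """Hourly list price expressed per minute."""
    return hourly_rate / 60


def billed_cost(hourly_rate, runtime_minutes, granularity_minutes):
    """Cost when runtime is rounded up to the billing granularity (assumed)."""
    billed_units = math.ceil(runtime_minutes / granularity_minutes)
    return billed_units * granularity_minutes * hourly_rate / 60


# A 75-minute job on one H100 at the list prices in the table above.
per_minute = billed_cost(6.50, runtime_minutes=75, granularity_minutes=1)
per_hour = billed_cost(6.49, runtime_minutes=75, granularity_minutes=60)
print(f"H100 per minute: ${per_minute_rate(6.50):.5f}")
print(f"75 min billed per minute: ${per_minute:.2f}")
print(f"75 min billed per hour:   ${per_hour:.2f}")
```

Under that rounding assumption, the near-identical hourly prices diverge as soon as a job ends partway through an hour.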

Cost on a Real Workload

Take one concrete scenario and compute it from the list prices on this page. Serve Llama 3.3 70B at 50M output tokens per day, every day.

Together AI serverless: 50M output tokens/day at $1.04/1M is 50 x $1.04 = $52/day on output, about $1,560 over a 30-day month (output only; input adds at the same $1.04/1M). You pay only for tokens actually served, so idle hours cost nothing, and there is nothing to provision.

Baseten dedicated H100: one H100 at $6.50/hr running 24/7 is $6.50 x 24 = $156/day, about $4,680/month per GPU. 50M output tokens/day averages roughly 579 output tok/s over 24 hours (50,000,000 / 86,400), well within one H100. So at steady 50M/day, one dedicated H100 costs about $4,680/mo against Together's about $1,560/mo. Serverless wins here.

Break-even: dedicated only pulls ahead once that one $4,680/mo GPU serves enough tokens to beat per-token pricing. At $1.04/1M, $4,680 buys about 4.5 billion output tokens per month, roughly 150M output tokens/day (about 1,736 sustained output tok/s) before the dedicated GPU is cheaper. Below that rate, Together serverless wins; above it, Baseten dedicated wins. This holds only if a single H100 can actually sustain that throughput for your model; each additional GPU you need raises the break-even by another 150M tokens/day.
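The arithmetic above can be reproduced in a few lines, which also makes it easy to swap in your own prices or GPU count. This is a minimal sketch using a 30-day month and output tokens only, as the worked example does; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # same month length the worked example uses


def serverless_monthly(tokens_per_day_m, price_per_m):
    """Monthly per-token cost; volume in millions of tokens per day."""
    return tokens_per_day_m * price_per_m * DAYS_PER_MONTH


def dedicated_monthly(hourly_rate, gpus=1):
    """Monthly cost of GPUs running 24/7."""
    return hourly_rate * 24 * DAYS_PER_MONTH * gpus


def break_even_tokens_per_day_m(hourly_rate, price_per_m, gpus=1):
    """Daily volume (millions of tokens) where both options cost the same."""
    return dedicated_monthly(hourly_rate, gpus) / (price_per_m * DAYS_PER_MONTH)


def sustained_tok_per_s(tokens_per_day_m):
    """Average tokens per second implied by a daily volume."""
    return tokens_per_day_m * 1_000_000 / SECONDS_PER_DAY


print(f"Serverless at 50M/day: ${serverless_monthly(50, 1.04):,.0f}/mo")
print(f"One dedicated H100:    ${dedicated_monthly(6.50):,.0f}/mo")
even = break_even_tokens_per_day_m(6.50, 1.04)
print(f"Break-even: {even:.0f}M tokens/day, {sustained_tok_per_s(even):,.0f} tok/s")
```

Passing `gpus=2` shows how quickly the break-even moves if one H100 cannot carry the load on its own.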

Speed: ATLAS vs TensorRT-LLM

Both providers ship production speculative decoding, but Together's is the more novel design.

Together AI's ATLAS (AdapTive-LeArning Speculator System) is a runtime-learning accelerator. Instead of a fixed draft model, it adapts the speculator to live traffic. Once adapted to Arena-Hard traffic, it took DeepSeek-V3.1 from 105 to 501 output tok/s on 4 B200 GPUs, about 4.8x the FP8 baseline (quoted as a 400% speedup), and reached about 460 tok/s on Kimi-K2. It runs on dedicated endpoints at no extra cost. The headline number is workload-dependent: speculative decoding pays off most on predictable, repetitive output.
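A throughput gain can be stated as a multiple or as a percentage increase, and the two are easy to confuse. The snippet below is plain arithmetic on the 105 and 501 tok/s figures quoted above, not a benchmark; `speedup` is a hypothetical helper.

```python
def speedup(baseline_tok_s, accelerated_tok_s):
    """Return (multiple of baseline, percentage increase over baseline)."""
    factor = accelerated_tok_s / baseline_tok_s
    return factor, (factor - 1) * 100


factor, pct_increase = speedup(105, 501)
print(f"{factor:.2f}x the baseline, a {pct_increase:.0f}% increase")
```

The result lands in the neighbourhood of the quoted 400% figure, which reads as a rounded headline rather than an exact ratio.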

Baseten builds speculative decoding on TensorRT-LLM with output-preserving guarantees, meaning speculated tokens produce identical results to standard decoding. It also ships Baseten Embeddings Inference (BEI), which it reports at over 2x throughput and 10% lower latency versus prior solutions (on B200s, it reports 3.3x the throughput of vLLM and 3.6x that of TEI on H100s). For embedding-heavy retrieval pipelines, BEI is a real differentiator.

Neither speedup is code-specific. Both accelerate general token generation. The coding-agent apply loop calls for a different tool: Morph Fast Apply runs at about 10,500 tok/s with code-tuned speculative decoding, with published benchmarks.

Cold Starts & Autoscaling

Serverless hides cold starts; dedicated exposes them, then optimizes them.

On Together AI serverless, you call a pre-warmed shared pool, so cold starts are not your problem. The provider absorbs scaling. The tradeoff is less control over the exact hardware and placement of your requests.

On Baseten dedicated, scale-to-zero is the default (min_replica = 0): a deployment scaled to zero replicas incurs no charges, but the next request triggers a cold start that can take minutes for large models, and billing is per minute during wake-up even before the replica serves a response. Baseten's docs recommend min_replica of 2 or more for production to eliminate cold starts. Autoscaling defaults are min replicas 0, max replicas 1, a concurrency target of 1 request per replica, and a 60-second scaling window (configurable 10 to 3600 seconds). You tune these to trade idle cost against latency.
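How the settings above interact is easier to see in a toy model. The function below is a simplified sketch, not Baseten's actual autoscaling algorithm: it ignores the scaling window and cold-start time, and `desired_replicas` is a hypothetical name. It only shows how a concurrency target and min/max replica bounds combine into a replica count.

```python
import math


def desired_replicas(in_flight_requests, concurrency_target=1,
                     min_replicas=0, max_replicas=1):
    """Replicas needed to keep per-replica concurrency at or under the target,
    clamped to the configured bounds (simplified model)."""
    if concurrency_target < 1:
        raise ValueError("concurrency_target must be at least 1")
    needed = math.ceil(in_flight_requests / concurrency_target)
    return max(min_replicas, min(needed, max_replicas))


# With the defaults described above, idle traffic scales to zero
# and any burst is capped at a single replica.
print(desired_replicas(0))   # no traffic: scale to zero
print(desired_replicas(5))   # burst, but max_replicas=1 caps it

# A production-style setting: keep two replicas warm, allow up to four.
print(desired_replicas(5, concurrency_target=2, min_replicas=2, max_replicas=4))
```

Raising `min_replicas` trades idle cost for the absence of cold starts; raising `max_replicas` trades bill size for burst latency.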

Autoscaling Cost Warning

Baseten's autoscaling fires up new GPU replicas under traffic bursts, which protects latency but can spike your bill just as fast. Set max-replica caps and watch utilization. Together AI serverless avoids this by abstracting scaling entirely, at the cost of per-token pricing that can exceed dedicated economics at high steady volume.
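One way to pick a max-replica cap is to work backward from a budget. The sketch below computes the worst-case monthly bill if every replica runs around the clock, and the largest cap a given budget allows. It assumes a 30-day month and the H100 list price from this page; both helper names are hypothetical.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month, as elsewhere on this page


def worst_case_monthly(hourly_rate, max_replicas, hours=HOURS_PER_MONTH):
    """Bill if every allowed replica runs for the whole month."""
    return hourly_rate * max_replicas * hours


def max_replicas_for_budget(monthly_budget, hourly_rate, hours=HOURS_PER_MONTH):
    """Largest replica cap whose worst case still fits the budget."""
    return int(monthly_budget // (hourly_rate * hours))


print(f"4 x H100 worst case: ${worst_case_monthly(6.50, 4):,.0f}/mo")
print(f"Cap for a $10,000 budget: {max_replicas_for_budget(10_000, 6.50)} replicas")
```

Real spend should come in under the worst case because replicas scale down between bursts, but the cap is what bounds the bill.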

Running DeepSeek at Full Quality

If you are choosing between these two to serve DeepSeek, the variable that matters most is activation precision. Most serverless providers quantize activations to fp8 to cut cost. That moves output away from the reference weights, and on reasoning-heavy or code tasks the drift shows up in quality. Together's and Baseten's public DeepSeek prices ($2.10/$4.40 and $1.74/$3.48 per 1M) are competitive, but neither commits to 16-bit activations on its shared serverless path.
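The precision gap is easy to illustrate on single values. The sketch below rounds a number to a given count of mantissa bits, using 7 bits for bf16 and 3 bits for the E4M3 flavor of fp8. It is a simplified model: it ignores exponent range, subnormals, and saturation, and it does not measure end-to-end model quality. `round_to_mantissa_bits` is a hypothetical helper, not a library function.

```python
import math

BF16_MANTISSA_BITS = 7      # bfloat16: 8 exponent bits, 7 mantissa bits
FP8_E4M3_MANTISSA_BITS = 3  # one common fp8 layout: 4 exponent, 3 mantissa


def round_to_mantissa_bits(x, mantissa_bits):
    """Round x to the nearest value with the given explicit mantissa bits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)  # +1 for the implicit leading bit
    return math.ldexp(round(m * scale) / scale, e)


def relative_error(x, mantissa_bits):
    """Relative rounding error of x at the given mantissa width."""
    return abs(round_to_mantissa_bits(x, mantissa_bits) - x) / abs(x)


for value in (0.1, 3.14159, 1234.5):
    bf16 = relative_error(value, BF16_MANTISSA_BITS)
    fp8 = relative_error(value, FP8_E4M3_MANTISSA_BITS)
    print(f"{value}: bf16 error {bf16:.5f}, fp8 error {fp8:.5f}")
```

Per-value rounding error is bounded by 2^-8 at 7 mantissa bits and 2^-4 at 3, which is the mechanism behind the drift described above; how much of it surfaces in output quality depends on the model and the task.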

Morph Open Source Models serve DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization, so responses match the reference model. Morph prices DeepSeek V4 Flash at $0.139 per 1M input and $0.278 per 1M output, below Together, Baseten, and Fireworks on the same model. For coding agents, Morph adds code-tuned speculative decoding and low-level kernels built for code generation, which makes it the fastest and highest-fidelity option for that specific workload rather than a general-purpose menu. See Morph pricing for the full rate card.

Fine-Tuning

Together AI exposes fine-tuning as a priced API. Baseten treats it as infrastructure.

Together AI publishes a fine-tuning price list per 1M training tokens: LoRA SFT up to 16B at $0.48, 17B to 69B at $1.50, and 70B to 100B at $2.90; LoRA DPO at $0.54, $1.65, and $3.20 across the same sizes. You fine-tune through an API call, deploy the result behind an endpoint, and can download the weights to run elsewhere. This is the cleaner path for managed fine-tuning with predictable per-token cost and no platform lock-in on the artifact.
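With a per-token price list, a fine-tuning budget reduces to a lookup and a multiplication. The sketch below encodes the LoRA tiers quoted above. How billable training tokens are counted (here, dataset tokens times epochs) is an assumption to check against the provider's terms, and `finetune_cost` is a hypothetical helper.

```python
# USD per 1M training tokens, by upper model-size bound in billions of
# parameters, taken from the price list quoted above.
LORA_PRICES = {
    "sft": [(16, 0.48), (69, 1.50), (100, 2.90)],
    "dpo": [(16, 0.54), (69, 1.65), (100, 3.20)],
}


def finetune_cost(params_b, training_tokens_m, method="sft"):
    """Cost in USD for a LoRA job; tokens in millions, size in billions."""
    for max_params_b, price_per_m in LORA_PRICES[method]:
        if params_b <= max_params_b:
            return training_tokens_m * price_per_m
    raise ValueError("model size is above the published tiers")


# Assumed job: a 20M-token dataset trained for 3 epochs = 60M training tokens.
tokens_m = 20 * 3
print(f"8B  LoRA SFT: ${finetune_cost(8, tokens_m):.2f}")
print(f"70B LoRA SFT: ${finetune_cost(70, tokens_m):.2f}")
print(f"70B LoRA DPO: ${finetune_cost(70, tokens_m, method='dpo'):.2f}")
```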

Baseten runs fine-tuning on dedicated GPU instances rather than a fixed per-token menu, which fits teams that already own a training pipeline and want full control over hardware and process. It is more flexible and less prescriptive, but you manage more of the loop yourself and the trained model is served on Baseten rather than exported.

Compliance & Deployment Modes

Both clear enterprise compliance bars; Baseten goes further on where the model runs.

Compliance & Deployment

Capability	Together AI	Baseten
SOC 2 Type II	Yes	Yes
HIPAA	Enterprise	Yes
Managed Cloud	Yes	Yes
Self-Hosted / In-VPC	Limited	Yes (full stack)
Hybrid (burst to cloud)	No	Yes
Multi-Cloud Pooling	No	MCM across clouds/regions

Together AI is SOC 2 Type II certified, which satisfies most procurement, but it is a managed cloud: your inference runs on Together's infrastructure.

Baseten states SOC 2 Type II certification and HIPAA compliance, and it splits into three modes on one inference stack: fully managed cloud, self-hosted inside your own VPC, and hybrid that runs core workloads in your cloud and bursts to Baseten on demand. For data-residency-sensitive or air-gapped deployments, Baseten is the clearer fit.

When to Use Together AI

Serverless, multi-model traffic. One endpoint, broad catalog, $1.04/1M on Llama 3.3 70B and far less on small models. Zero provisioning, zero idle cost.
Bursty or unpredictable load. Pay-per-token means you never over-provision GPUs for a spike that may not come.
Managed fine-tuning with exportable weights. A clean per-token price list for LoRA SFT and DPO, deployed behind an API, and downloadable to run elsewhere.
Batch jobs. The batch API cuts cost up to 50% for non-interactive workloads (24h window, up to 50,000 requests per batch).
Renting raw GPU clusters. HGX H100 at $5.49/hr, H200 $6.79/hr, B200 $9.95/hr on-demand, with reserved rates from $3.99/hr.
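For the batch path, the 50,000-request ceiling means large jobs have to be split before submission. The sketch below chunks a request list to that limit and applies the half-rate discount to an estimated cost. It does not call any provider API; `split_into_batches` and `batch_cost` are hypothetical helpers, and the discount is treated as a flat 50% although the text above says "up to".

```python
MAX_REQUESTS_PER_BATCH = 50_000  # limit quoted in the list above


def split_into_batches(requests, max_per_batch=MAX_REQUESTS_PER_BATCH):
    """Split a list of requests into consecutive chunks within the limit."""
    return [requests[i:i + max_per_batch]
            for i in range(0, len(requests), max_per_batch)]


def batch_cost(standard_cost, discount=0.5):
    """Estimated cost after the batch discount (assumed flat)."""
    return standard_cost * (1 - discount)


# 120,000 placeholder requests become three batches.
requests = [{"id": n} for n in range(120_000)]
batches = split_into_batches(requests)
print([len(batch) for batch in batches])
print(f"$1,560 of serverless usage as batch: ${batch_cost(1560):,.2f}")
```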

When to Use Baseten

Steady, high-volume single model. Dedicated H100 at $6.50/hr with per-minute billing and scale-to-zero amortizes better than per-token past about 150M output tokens/day for a 70B model.
In-VPC or hybrid deployment. Self-hosted and hybrid modes run the full inference stack inside your own cloud for strict data residency.
Multi-cloud capacity. MCM pools GPUs across clouds and regions, useful when single-region capacity is tight.
Embedding and reranker pipelines. BEI reports over 2x throughput and 10% lower latency for retrieval-heavy systems.
Compound, multi-model workflows. Chains gives granular per-step hardware and autoscaling, reporting 6x better GPU usage.

Frequently Asked Questions

Is Together AI or Baseten cheaper?

It flips with traffic. For bursty or low volume, Together AI serverless is cheaper because you pay per token with no idle cost: Llama 3.3 70B is $1.04/1M, gpt-oss-20B is $0.05 in / $0.20 out. For one model under steady high load, Baseten dedicated is cheaper because an H100 at $6.50/hr (about $4,680/mo run 24/7) amortizes across the tokens it serves, and scale-to-zero charges nothing when idle. The crossover for a 70B model sits near 150M output tokens/day, roughly 1,736 sustained output tok/s. Below that, Together wins; above it, Baseten wins.

What does Baseten charge per GPU hour?

Baseten dedicated GPUs bill per minute: H100 80GB $6.50/hr ($0.10833/min), B200 180GB $9.98/hr, A100 80GB $4.00/hr, H100 MIG 40GB $3.75/hr, plus L4 $0.85, A10G $1.21, and T4 $0.63 per hour. Together AI dedicated endpoints are $6.49/hr for one H100 80GB and $11.95/hr for one B200 180GB, with cheaper GPU clusters at $5.49/hr for HGX H100.

Does Together AI or Baseten have faster speculative decoding?

Both ship production speculative decoding. Together AI's ATLAS is an adaptive speculator that learns from live traffic; once adapted to Arena-Hard traffic it took DeepSeek-V3.1 from 105 to 501 tok/s on 4 B200 GPUs, about 4.8x the FP8 baseline (quoted as a 400% speedup), and reaches about 460 tok/s on Kimi-K2. Baseten builds speculative decoding on TensorRT-LLM with output-preserving guarantees. For coding workloads, Morph runs code-tuned speculative decoding at about 10,500 tok/s on its apply path.

Can I run inference in my own VPC?

Baseten has the stronger story. It offers managed Baseten Cloud, self-hosted inside your own VPC, and hybrid that bursts from your cloud to Baseten on demand, with Multi-Cloud Capacity Management pooling GPUs across regions. Together AI focuses on its managed cloud with serverless, on-demand dedicated, and reserved endpoints. For strict data residency or in-VPC requirements, Baseten fits better.

Are Together AI and Baseten SOC 2 and HIPAA compliant?

Both are SOC 2 Type II certified. On HIPAA they differ: Baseten states HIPAA compliance on its pricing page, which opens regulated healthcare and financial-services use, while Together AI lists HIPAA under its enterprise offering. Either clears the bar for most enterprise procurement.

Where should I run DeepSeek for the best output quality?

Output quality depends on activation precision. Most serverless providers quantize activations to fp8 to cut cost, which moves output away from the reference weights. Morph Open Source Models serve DeepSeek with 16-bit (bf16) activations and no fp8 or int8 quantization, so responses match the reference model, at $0.139/$0.278 per 1M on DeepSeek V4 Flash. If output fidelity on DeepSeek matters, Morph is the place to run it.

Run DeepSeek and Open Models at Full Quality

Together prices per token; Baseten prices per GPU-hour. If you are serving DeepSeek or open-source models for a coding agent, Morph serves them in 16-bit activations (no fp8 quantization) with code-tuned speculative decoding, DeepSeek V4 Flash at $0.139/$0.278 per 1M.
