Once a team gets serious about AI, the spreadsheet question arrives: is it cheaper to self-host than to keep paying a hosted API? The honest answer is that it depends — on volume, utilization, and a handful of costs most comparisons leave out. Here's how to model it without fooling yourself.
The API cost model
Hosted APIs charge per token. That's wonderfully simple: near-zero cost to start, no infrastructure to run, and you pay only for what you use. The catch is that the bill scales linearly with usage, so success makes it grow. The per-token price is predictable, but the total isn't, and at high volume it can dwarf the cost of running the model yourself.
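To make the linear scaling concrete, here is a minimal sketch of the API side of the spreadsheet. The split between input and output prices is an assumption (many providers price them separately, but check your own contract), and the prices and volumes are placeholders, not real quotes.

```python
def api_monthly_cost(input_tokens, output_tokens,
                     price_in_per_million, price_out_per_million):
    """Return the monthly API bill in dollars for the given token volume."""
    return (input_tokens / 1_000_000 * price_in_per_million
            + output_tokens / 1_000_000 * price_out_per_million)


# Placeholder prices in dollars per million tokens (hypothetical, not real quotes).
PRICE_IN = 1.00
PRICE_OUT = 3.00

# Placeholder baseline: 100M input and 20M output tokens per month.
for scale in (1, 10, 100):
    cost = api_monthly_cost(100_000_000 * scale, 20_000_000 * scale,
                            PRICE_IN, PRICE_OUT)
    print(f"{scale:>3}x traffic: ${cost:,.2f} per month")
```

With these placeholder numbers the bill goes from $160 to $1,600 to $16,000 as traffic grows 1x, 10x, 100x. Nothing about the unit price changed; only the volume did.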
The self-hosted cost model
Self-hosting trades a variable bill for a mostly fixed one, with several components:
- GPU compute — rented by the hour or owned; the single biggest line item.
- Utilization — a GPU only saves money if it's busy; idle accelerators are pure waste.
- Serving and platform — the inference stack, autoscaling, and observability around the model.
- Engineering time — the people who build and operate it, which is the cost most often ignored.
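The components above can be sketched as a small model. Everything here is illustrative: the class and field names are hypothetical, the 730-hour month is an approximation, and the rates, throughput, and staffing figures are placeholders you would replace with your own benchmarks and invoices.

```python
from dataclasses import dataclass


@dataclass
class SelfHostedCosts:
    gpu_count: int
    gpu_hourly_rate: float        # dollars per GPU-hour (rented, or amortized if owned)
    platform_monthly: float       # serving stack, autoscaling, observability
    engineer_fte: float           # full-time-equivalent engineers running it
    engineer_monthly_cost: float  # fully loaded dollars per engineer per month
    hours_per_month: float = 730.0  # approximate hours in a month

    def monthly_total(self) -> float:
        """Fixed monthly cost: you pay this whether or not the GPUs are busy."""
        gpu = self.gpu_count * self.gpu_hourly_rate * self.hours_per_month
        people = self.engineer_fte * self.engineer_monthly_cost
        return gpu + self.platform_monthly + people


def cost_per_million_tokens(costs: SelfHostedCosts,
                            peak_tokens_per_second_per_gpu: float,
                            utilization: float) -> float:
    """Effective dollars per million tokens at a given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    seconds = costs.hours_per_month * 3600
    tokens = (costs.gpu_count * peak_tokens_per_second_per_gpu
              * utilization * seconds)
    return costs.monthly_total() / (tokens / 1_000_000)


# Placeholder inputs (hypothetical): 2 GPUs at $2/hour, $500 platform,
# half an engineer at $16,000/month, 1,000 tokens/s per GPU at peak.
costs = SelfHostedCosts(gpu_count=2, gpu_hourly_rate=2.0, platform_monthly=500.0,
                        engineer_fte=0.5, engineer_monthly_cost=16_000.0)
for u in (0.1, 0.5, 0.9):
    print(f"utilization {u:.0%}: "
          f"${cost_per_million_tokens(costs, 1000.0, u):.2f} per million tokens")
```

Because the monthly total is fixed, halving utilization doubles the effective price per token. That is why utilization sits in the list next to the line items even though it never shows up on an invoice.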
Where the break-even sits
The math turns on utilization and volume. Steady, high-throughput workloads keep GPUs busy, so their effective cost per token can fall well below API pricing. Low or spiky workloads leave expensive hardware idle, and the API wins easily. Somewhere between the two is a crossover point, and it moves as model efficiency and GPU prices change.
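One way to locate the crossover is to ask how many tokens per month the fixed self-hosted bill would buy from the API, then check whether your fleet could actually serve that many. The sketch below assumes a single blended API price and a fixed monthly self-hosted cost; the function names and all figures are hypothetical.

```python
def break_even_tokens(fixed_monthly_cost: float,
                      api_price_per_million: float) -> float:
    """Monthly token volume at which the API bill equals the fixed self-hosted cost."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return fixed_monthly_cost / api_price_per_million * 1_000_000


def break_even_utilization(fixed_monthly_cost: float,
                           api_price_per_million: float,
                           capacity_tokens_per_month: float) -> float:
    """Fraction of fleet capacity you must keep busy to match the API price.

    A result above 1.0 means this fleet cannot reach break-even at this API price.
    """
    needed = break_even_tokens(fixed_monthly_cost, api_price_per_million)
    return needed / capacity_tokens_per_month


# Placeholder inputs (hypothetical): $11,420/month fixed cost and a fleet
# that can serve about 5.256 billion tokens per month at full load.
FIXED = 11_420.0
CAPACITY = 5_256_000_000

for api_price in (2.0, 5.0, 10.0):
    u = break_even_utilization(FIXED, api_price, CAPACITY)
    verdict = "API wins at any load" if u > 1 else f"break-even at {u:.0%} utilization"
    print(f"API at ${api_price:.2f}/M tokens: {verdict}")
```

The useful output is the required utilization, not the raw token count. If the number comes out above what your traffic can realistically sustain, the API is still the cheaper option, and a drop in API prices pushes that threshold higher.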
The costs people forget
Comparisons usually flatter self-hosting because they count only the GPU. Be honest about:
- Engineering and on-call time to run the platform.
- Idle capacity between traffic peaks.
- Model upgrades, evaluation, and re-tuning over time.
- Redundancy and failover for production reliability.
How to actually decide
Benchmark a candidate open-weight model on your real traffic, model both options at your expected volume, and include people cost on both sides. Then make a portfolio call rather than an all-or-nothing one — most teams self-host the steady, sensitive, high-volume work and keep an API for the spiky long tail.
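The portfolio idea can be sketched with a toy traffic profile: self-host enough capacity for the steady baseline and send the overflow to an API. This is a simplification under stated assumptions. It ignores data sensitivity, latency, and people cost, and the capacity, hourly cost, and price figures are hypothetical.

```python
def hybrid_cost(hourly_tokens, self_host_capacity_per_hour,
                self_host_cost_per_hour, api_price_per_million):
    """Total cost of serving a traffic profile with self-hosting plus API overflow.

    Self-hosted capacity is paid for every hour, busy or idle.
    Tokens above that capacity are billed at the API price.
    """
    fixed = self_host_cost_per_hour * len(hourly_tokens)
    overflow = sum(max(0, t - self_host_capacity_per_hour) for t in hourly_tokens)
    return fixed + overflow / 1_000_000 * api_price_per_million


# Placeholder day (hypothetical): 20 quiet hours at 2M tokens, 4 peak hours at 10M.
traffic = [2_000_000] * 20 + [10_000_000] * 4
API_PRICE = 5.0       # dollars per million tokens (placeholder)
UNIT_CAPACITY = 2_000_000  # tokens per hour one self-hosted unit can serve (placeholder)
UNIT_COST = 6.0       # dollars per hour for one unit (placeholder)

options = {
    "all API": hybrid_cost(traffic, 0, 0.0, API_PRICE),
    "all self-hosted (sized for peak)": hybrid_cost(
        traffic, 5 * UNIT_CAPACITY, 5 * UNIT_COST, API_PRICE),
    "hybrid (baseline self-hosted)": hybrid_cost(
        traffic, UNIT_CAPACITY, UNIT_COST, API_PRICE),
}
for name, cost in options.items():
    print(f"{name}: ${cost:,.2f} per day")
```

With these placeholder numbers the hybrid comes out cheapest ($304 per day, against $400 all-API and $720 self-hosted for peak), because the self-hosted unit stays fully busy and only the four peak hours pay the API rate. Different traffic shapes or prices can flip the ranking, which is the reason to run it on your own profile.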
Where Colonypilot fits
We run this TCO analysis with teams — benchmarking on your workload, modeling both paths, and standing up the infrastructure for whatever the numbers favor. If you're trying to decide where your AI should run, we'll bring the real figures, not a sales pitch.