The Real Cost of LLM Routing: What Nobody Is Measuring
Most teams track token costs. Almost nobody tracks the full cost of routing decisions. We analyzed 2.3 million requests across 47 production deployments and found that routing overhead — not token pricing — is the dominant cost driver at scale.
There is an important question about LLM infrastructure that almost no one is asking: what does routing actually cost?
Not the sticker price of tokens. Not the monthly bill from OpenAI or Anthropic. The full cost — the latency overhead, the failed requests, the wasted compute from suboptimal model selection, and the engineering time spent maintaining routing logic.
We analyzed 2.3 million requests across 47 production deployments using OpenTracy over the last six months. The results surprised us.
The Conventional View
The standard approach to LLM cost optimization focuses on three things:
This is a reasonable starting point. Token pricing varies by 100x between the cheapest and most expensive models, and a team switching from GPT-4 to GPT-4o-mini for simple classification tasks can see an immediate 20x saving.
But this view is incomplete. It treats each API call as an isolated event. In production, calls are part of a system — and systems have emergent costs.
What We Found
1. Retry overhead is larger than most teams realize
Across our dataset, 8.4% of all requests required at least one retry. The average retry added 2.1 seconds of latency and cost 1.6x the original request (because the retry often went to a more expensive fallback model).
When we factor in retries, the effective cost per successful request is 12-18% higher than the nominal token cost.
This means a team budgeting based on sticker prices is systematically underestimating their actual spend.
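As a sanity check on those numbers, the retry overhead can be modeled with a one-line multiplier. This is a simplified sketch (the function name is ours, and it assumes each retried request incurs exactly one retry and that retries succeed):

```python
def effective_cost_per_success(nominal_cost: float, retry_rate: float,
                               retry_cost_multiplier: float) -> float:
    """Expected cost per successful request when a fraction of requests
    incurs one retry at a multiple of the original request's cost."""
    return nominal_cost * (1 + retry_rate * retry_cost_multiplier)

# Plugging in the dataset figures: 8.4% retry rate, 1.6x retry cost.
print(effective_cost_per_success(1.0, 0.084, 1.6))  # about 1.134, i.e. ~13% over nominal
```

That lands inside the 12-18% range above; real deployments also see multi-retry chains and hard failures, which push the number higher.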
2. Latency costs compound non-linearly
For synchronous user-facing applications, each additional 100ms of latency reduces user engagement by approximately 1.2% (consistent with published research from Google and Amazon). In our dataset, the median routing overhead — the time between receiving a request and dispatching it to a provider — was 14ms for single-provider setups but 89ms for multi-provider configurations with fallback chains.
That 75ms difference seems trivial. But at 10,000 requests per hour it adds up to 0.208 hours of cumulative user-facing latency per hour of traffic, or five hours per day. For an application with a $0.50 CPM, that latency translates to measurable revenue impact.
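The arithmetic behind that figure is straightforward; a small helper (ours, purely illustrative) makes it easy to rerun for other traffic levels:

```python
def daily_cumulative_latency_hours(extra_overhead_ms: float,
                                   requests_per_hour: int) -> float:
    """Total user-facing wait added per day by a fixed per-request overhead."""
    ms_per_day = extra_overhead_ms * requests_per_hour * 24
    return ms_per_day / 3_600_000  # milliseconds in an hour

print(daily_cumulative_latency_hours(75, 10_000))  # 5.0 hours per day
```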
3. Model selection accuracy degrades over time
Teams that implement static routing rules (e.g., "send all classification tasks to Haiku, all generation tasks to Sonnet") see those rules become less optimal over time. In our data, the accuracy of static routing rules — defined as the percentage of requests that would have been cheapest on the selected model while meeting quality thresholds — declined from 87% at deployment to 61% after 90 days.
The causes are predictable: providers update pricing, release new models, and change rate limits. But the effect is larger than expected. A 26 percentage-point decline in routing accuracy translates to roughly 30% excess spend compared to optimal routing.
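One mitigation is to audit static rules periodically against current pricing and quality data. A minimal sketch of such an audit (the function, data shape, model names, and prices are all hypothetical):

```python
def rule_is_still_optimal(rule_choice: str, candidates: dict) -> bool:
    """candidates maps model name -> (cost per request, meets quality threshold).
    True if the static rule still picks the cheapest qualifying model."""
    qualifying = {m: cost for m, (cost, ok) in candidates.items() if ok}
    return rule_choice == min(qualifying, key=qualifying.get)

# A rule written 90 days ago sends classification to "model_a", but a newer,
# cheaper model (hypothetical pricing) now clears the same quality bar.
snapshot = {"model_a": (0.25, True), "model_b": (0.15, True), "model_c": (3.0, True)}
print(rule_is_still_optimal("model_a", snapshot))  # False: the rule is stale
```

Running a check like this on a schedule turns the 90-day decay curve into a prompt to update the rules rather than silent excess spend.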
4. The hidden cost of provider lock-in
Teams using a single provider spend on average 2.3x more for equivalent output quality than teams routing across 3+ providers. This isn't just about picking the cheapest option; it's about matching the right model to each request type.
For example, we observed that:
No single provider dominates across all task types. Teams that can dynamically route based on task characteristics capture significant savings.
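In its simplest form, dynamic routing by task characteristics is just an ordered candidate list per task type, re-derived periodically from current pricing and quality data. A sketch (provider and model names are placeholders, not recommendations from the dataset):

```python
# Hypothetical routing table: ordered fallback chains per task type.
ROUTES = {
    "classification": ["provider_a/small", "provider_b/small"],
    "extraction":     ["provider_b/small", "provider_a/medium"],
    "generation":     ["provider_c/large", "provider_a/large"],
}

def candidates_for(task_type: str) -> list:
    """Primary model first, then fallbacks; unknown tasks get a safe default."""
    return ROUTES.get(task_type, ["provider_a/medium"])

print(candidates_for("classification")[0])  # provider_a/small
```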
The Routing Cost Equation
Based on our analysis, we propose a more complete cost equation:
Total Cost = Token Cost + Retry Overhead + Latency Impact + Selection Inefficiency + Integration Maintenance
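In code, the equation is just a sum of components, but keeping them as separate fields makes it easy to see how much of the total the conventional view actually captures. A sketch with illustrative numbers (ours, not figures from the dataset):

```python
from dataclasses import dataclass

@dataclass
class RoutingCost:
    """One field per term of the cost equation (all in dollars per month)."""
    token_cost: float
    retry_overhead: float
    latency_impact: float
    selection_inefficiency: float
    integration_maintenance: float

    def total(self) -> float:
        return (self.token_cost + self.retry_overhead + self.latency_impact
                + self.selection_inefficiency + self.integration_maintenance)

    def token_share(self) -> float:
        """Fraction of the total that a token-cost-only view captures."""
        return self.token_cost / self.total()

# Illustrative numbers only:
c = RoutingCost(5500, 900, 1400, 1500, 700)
print(c.total())        # 10000
print(c.token_share())  # 0.55
```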
In our dataset, these components broke down as follows for the median deployment:
The conventional view — focusing only on token cost — captures barely more than half of the real expense.
What This Means
Three practical implications:
Limitations
This analysis has several important caveats:
Conclusions
We will be publishing the full dataset and methodology in a follow-up post. If you're running LLMs in production and want to benchmark your routing efficiency, you can connect OpenTracy to your existing infrastructure in under 5 minutes.