
The Real Cost of LLM Routing: What Nobody Is Measuring

OpenTracy Team
research · cost-analysis

Most teams track token costs. Almost nobody tracks the full cost of routing decisions. We analyzed 2.3 million requests across 47 production deployments and found that routing overhead — not token pricing — is the dominant cost driver at scale.


There is an important question about LLM infrastructure that almost no one is asking: what does routing actually cost?

Not the sticker price of tokens. Not the monthly bill from OpenAI or Anthropic. The full cost — the latency overhead, the failed requests, the wasted compute from suboptimal model selection, and the engineering time spent maintaining routing logic.

We analyzed 2.3 million requests across 47 production deployments using OpenTracy over the last six months. The results surprised us.

The Conventional View

The standard approach to LLM cost optimization focuses on three things:

  • Token pricing: comparing $/1M tokens across providers
  • Prompt engineering: reducing input/output token counts
  • Model selection: using cheaper models where quality allows

This is a reasonable starting point. Token pricing varies by 100x between the cheapest and most expensive models. A team switching from GPT-4 to GPT-4o-mini for simple classification tasks can see immediate 20x savings.

But this view is incomplete. It treats each API call as an isolated event. In production, calls are part of a system — and systems have emergent costs.

What We Found

1. Retry overhead is larger than most teams realize

Across our dataset, 8.4% of all requests required at least one retry. The average retry added 2.1 seconds of latency and cost 1.6x the original request (because the retry often went to a more expensive fallback model).

When we factor in retries, the effective cost per successful request is 12-18% higher than the nominal token cost.

This means a team budgeting based on sticker prices is systematically underestimating its actual spend.
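The retry math above can be sketched in a few lines. This is an illustrative calculation using the article's median figures (an 8.4% retry rate at 1.6x the original request cost), not a billing tool:

```python
def effective_cost_multiplier(retry_rate: float, retry_cost_factor: float) -> float:
    """Expected cost per request relative to the nominal token cost,
    assuming each retried request incurs one extra attempt priced at
    retry_cost_factor times the original request."""
    return 1.0 + retry_rate * retry_cost_factor

# The article's medians: 8.4% retry rate, retries costing 1.6x the original.
m = effective_cost_multiplier(retry_rate=0.084, retry_cost_factor=1.6)
print(f"{(m - 1) * 100:.1f}% above nominal token cost")  # → 13.4% above nominal token cost
```

The 13.4% result sits inside the 12-18% range reported above; the spread in the dataset comes from deployments whose retries chain through more than one fallback model.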

2. Latency costs compound non-linearly

For synchronous user-facing applications, each additional 100ms of latency reduces user engagement by approximately 1.2% (consistent with published research from Google and Amazon). In our dataset, the median routing overhead — the time between receiving a request and dispatching it to a provider — was 14ms for single-provider setups but 89ms for multi-provider configurations with fallback chains.

That 75ms difference seems trivial. But at 10,000 requests per hour, it adds up to roughly five hours of cumulative user-facing latency per day. For an application with a $0.50 CPM, that latency translates to measurable revenue impact.

3. Model selection accuracy degrades over time

Teams that implement static routing rules (e.g., "send all classification tasks to Haiku, all generation tasks to Sonnet") see those rules become less optimal over time. In our data, the accuracy of static routing rules — defined as the percentage of requests that would have been cheapest on the selected model while meeting quality thresholds — declined from 87% at deployment to 61% after 90 days.

The causes are predictable: providers update pricing, release new models, and change rate limits. But the effect is larger than expected. A 26 percentage-point decline in routing accuracy translates to roughly 30% excess spend compared to optimal routing.
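Measuring this decay only requires logging, for each request, which model the rule chose and which model would have been cheapest while meeting the quality threshold. A minimal audit sketch (the pair format and model names here are hypothetical):

```python
def routing_accuracy(decisions):
    """Fraction of requests where the static rule picked the cheapest model
    that still met the quality threshold.

    decisions: iterable of (chosen_model, cheapest_qualifying_model) pairs.
    """
    decisions = list(decisions)
    hits = sum(chosen == best for chosen, best in decisions)
    return hits / len(decisions)

audit = [("haiku", "haiku"), ("sonnet", "haiku"),
         ("haiku", "haiku"), ("sonnet", "sonnet")]
print(routing_accuracy(audit))  # → 0.75
```

Re-running an audit like this monthly is how the 87%-to-61% decline was surfaced in our data.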

4. The hidden cost of provider lock-in

Teams using a single provider spend on average 2.3x more per equivalent quality than teams routing across 3+ providers. This isn't just about picking the cheapest option — it's about matching the right model to each request type.

For example, we observed that:

  • Anthropic's Claude 3.5 Haiku consistently outperforms GPT-4o-mini on structured output tasks while costing roughly the same
  • Gemini 1.5 Flash handles long-context summarization at 1/5th the cost of comparable models
  • GPT-4o remains the best option for complex multi-turn reasoning at its price point

No single provider dominates across all task types. Teams that can dynamically route based on task characteristics capture significant savings.
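The observations above boil down to a task-aware routing table. This is a deliberately simple sketch — the task-type labels and the routing function are hypothetical, and a production router would also weigh load, rate limits, and live pricing:

```python
# Task-aware routing table following the observations above.
ROUTES = {
    "structured_output": "claude-3-5-haiku",
    "long_context_summarization": "gemini-1.5-flash",
    "multi_turn_reasoning": "gpt-4o",
}

def pick_model(task_type: str, default: str = "gpt-4o-mini") -> str:
    """Route by task type, falling back to a cheap general-purpose model."""
    return ROUTES.get(task_type, default)

print(pick_model("structured_output"))      # → claude-3-5-haiku
print(pick_model("simple_classification"))  # → gpt-4o-mini
```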

The Routing Cost Equation

Based on our analysis, we propose a more complete cost equation:

Total Cost = Token Cost + Retry Overhead + Latency Impact + Selection Inefficiency + Integration Maintenance

In our dataset, these components broke down as follows for the median deployment:

  • Token Cost: 58% of total
  • Retry Overhead: 14% of total
  • Latency Impact: 8% of total (for user-facing apps)
  • Selection Inefficiency: 15% of total
  • Integration Maintenance: 5% of total (engineering time)

The conventional view — focusing only on token cost — captures barely more than half of the real expense.
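The breakdown implies a simple scaling rule: if token cost is 58% of the total, a team's real spend is its token bill divided by 0.58. A sketch using the median shares above:

```python
# Median component shares from the dataset above.
SHARES = {
    "token_cost": 0.58,
    "retry_overhead": 0.14,
    "latency_impact": 0.08,
    "selection_inefficiency": 0.15,
    "integration_maintenance": 0.05,
}
assert abs(sum(SHARES.values()) - 1.0) < 1e-9  # shares must cover the whole

def estimated_total_cost(token_bill: float) -> float:
    """Scale a nominal token bill up to the full cost implied by the shares."""
    return token_bill / SHARES["token_cost"]

print(round(estimated_total_cost(10_000), 2))  # → 17241.38
```

In other words, a team seeing a $10k/month token bill is plausibly spending around $17k once the other components are counted — assuming its deployment resembles the median in this dataset.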

What This Means

Three practical implications:

  • Track everything, not just tokens. If you're only monitoring token costs, you're flying blind on 42% of your LLM spend. OpenTracy logs every request with full cost breakdown including retries, latency, and routing decisions.
  • Dynamic routing pays for itself quickly. Even a simple quality-aware router — one that can redirect failed or slow requests — recovers the retry overhead alone, which averages 14% of spend.
  • Revisit your routing rules monthly. Static rules decay faster than most teams expect. The 90-day accuracy decline from 87% to 61% means quarterly reviews are the minimum viable cadence.
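The second implication — a quality-aware router that redirects failed requests — can be as small as a fallback chain. A minimal sketch: `call_model` is a hypothetical stand-in for your provider client, and the chain ordering is illustrative, not a recommendation:

```python
# Try the cheapest model first; fall through to stronger models on failure.
FALLBACK_CHAIN = ["gpt-4o-mini", "claude-3-5-haiku", "gpt-4o"]
TIMEOUT_S = 5.0

def route(prompt, call_model):
    """Return (model, reply) from the first model in the chain that succeeds."""
    for model in FALLBACK_CHAIN:
        try:
            reply = call_model(model, prompt, timeout=TIMEOUT_S)
        except Exception:
            continue  # provider error or timeout: fall through to the next model
        if reply:
            return model, reply
    raise RuntimeError("all models in the fallback chain failed")
```

A real router would also log each attempt's cost and latency — that per-attempt record is exactly what the retry-overhead numbers earlier in this post are computed from.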

Limitations

This analysis has several important caveats:

  • Our dataset skews toward English-language, text-only workloads. Multi-modal and multi-lingual deployments may show different patterns.
  • We measured cost-efficiency against a theoretical optimum (cheapest model meeting quality threshold). In practice, teams may have legitimate reasons to prefer specific providers.
  • The "latency impact" component depends heavily on application type. Background processing tasks have near-zero latency costs.
  • Our sample of 47 deployments is not large enough to claim statistical significance for all findings.

Conclusions

  • The conventional focus on token pricing captures only ~58% of real LLM costs in production
  • Retry overhead (14%), selection inefficiency (15%), latency impact (8%), and maintenance (5%) collectively represent the other 42%
  • Dynamic, multi-provider routing is not a luxury — it addresses the dominant non-token cost drivers
  • Static routing rules lose roughly 30% of their accuracy within 90 days (87% → 61%)
  • Single-provider teams spend 2.3x more per equivalent quality than teams routing across 3+ providers

We will be publishing the full dataset and methodology in a follow-up post. If you're running LLMs in production and want to benchmark your routing efficiency, you can connect OpenTracy to your existing infrastructure in under 5 minutes.