The auto-router is the piece that turns “I have thirteen models I could call” into “call the right one for this specific prompt” — without you writing any rules. It rests on two observations: prompts with similar meaning tend to have similar difficulty, and different models have different strengths on different kinds of prompts. So if you can (a) group prompts by meaning and (b) know each model’s error rate on each group, you can route by minimizing expected_error + λ·cost.

The three moving parts

1. Embedder

Every prompt is run through a sentence embedder (MiniLM-L6-v2, bundled in the wheel) to produce a 384-dimensional vector. This is a pure function: the same prompt always produces the same vector.

2. Cluster assigner

The embedder output is assigned to one of 100 pre-trained semantic clusters. Cluster centroids live in the weights package you downloaded on first run. Examples (cluster names from the default weights):
  • cluster 47 → “mathematical proofs and formal reasoning”
  • cluster 84 → “short factual lookup”
  • cluster 88 → “data-structure code generation”
  • cluster 29 → “creative short-form writing”
Clusters are assigned by nearest centroid (cosine distance). You can opt into soft assignment — a full probability distribution over the 100 clusters — via use_soft_assignment=True when loading the router.
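Under the hood, hard assignment is just a nearest-neighbour lookup over the centroid matrix. A minimal numpy sketch of both modes — the centroids and the prompt vector are random stand-ins for the real weights and the MiniLM embedding, and the softmax form of the soft distribution is an assumption, not OpenTracy’s documented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the pre-trained weights: 100 centroids in the 384-dim embedding space.
centroids = rng.normal(size=(100, 384))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

def assign(embedding, soft=False):
    """Nearest-centroid assignment by cosine similarity; soft mode returns a distribution."""
    e = embedding / np.linalg.norm(embedding)
    sims = centroids @ e                 # cosine similarity to all 100 centroids
    if not soft:
        return int(np.argmax(sims))      # hard assignment: one cluster id
    p = np.exp(sims - sims.max())        # softmax over similarities (assumed form)
    return p / p.sum()

prompt_vec = rng.normal(size=384)        # stand-in for the embedder output
cluster_id = assign(prompt_vec)
probs = assign(prompt_vec, soft=True)
```

Note that argmax of cosine similarity is the same thing as argmin of cosine distance, and the soft distribution always peaks at the hard-assigned cluster.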

3. Per-model error profiles

For every model the router knows about, there’s a vector Ψ of length 100: Ψ[i] is the model’s empirical error rate on cluster i. Error is measured as “fraction of validation examples where this model got it wrong” during profile fitting. A routing decision is then:
score(model) = Ψ[cluster] + λ · cost_per_1k(model)
selected    = argmin(score)
λ is the cost_weight argument you pass to load_router().
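In code, a routing decision over a candidate pool is just the argmin of that score. A self-contained sketch with invented model names, profiles, and prices (three clusters instead of 100 for brevity):

```python
# Invented per-model error profiles: psi[m][i] is model m's error rate on cluster i.
psi = {
    "big-model":   [0.020, 0.010, 0.030],
    "small-model": [0.150, 0.012, 0.250],
}
cost_per_1k = {"big-model": 0.010, "small-model": 0.001}   # invented prices

def route(cluster, lam):
    """score(model) = Ψ[cluster] + λ · cost_per_1k(model); pick the argmin."""
    scores = {m: psi[m][cluster] + lam * cost_per_1k[m] for m in psi}
    selected = min(scores, key=scores.get)
    return selected, scores

# On cluster 1 the two models are nearly tied on error, so λ decides:
# λ = 0 ignores cost and picks the slightly more accurate big model;
# λ = 0.5 makes the 10× price gap outweigh the 0.002 error delta.
```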

Using it

The whole thing collapses to an import and two lines:
import opentracy as ot

router = ot.load_router(cost_weight=0.5)
decision = router.route("Write a Python function that reverses a linked list.")
The returned RoutingDecision:
decision.selected_model           # "gpt-4o"
decision.cluster_id               # 88
decision.expected_error           # 0.000
decision.cost_adjusted_score      # 0.0031
decision.all_scores               # {model_id: score, ...} — every candidate
decision.cluster_probabilities    # np.ndarray(100,) — soft distribution
decision.reasoning                # human-readable explanation

Tuning the cost-quality dial

cost_weight (λ) is the one knob you’ll actually touch:
  • λ = 0.0: pick whichever model has the lowest predicted error; ignore cost.
  • λ = 0.5: balanced, and a common default. A tiny error delta won’t justify a 10× cost.
  • λ = 1.0: strongly prefer cheaper models; only escalate if they’re demonstrably bad.
  • λ = 2.0+: aggressively cheap; escalate only on the worst prompts.
Try a few values on your traffic. The right number depends on how much quality degradation you can tolerate.
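A cheap way to try values offline: replay a sample of clustered traffic at several λ and watch the escalation rate, i.e. how often the expensive model wins. The profiles, prices, and cluster mix below are invented for illustration:

```python
# Invented two-model setup over 10 clusters: psi[m][i] is error rate on cluster i.
psi = {
    "expensive": [0.01] * 5 + [0.02] * 5,
    "cheap":     [0.05, 0.30, 0.08, 0.02, 0.40, 0.03, 0.10, 0.02, 0.25, 0.06],
}
cost = {"expensive": 0.010, "cheap": 0.001}

def escalation_rate(lam, clusters=range(10)):
    """Fraction of clusters where the expensive model wins at this λ."""
    picks = [min(psi, key=lambda m: psi[m][c] + lam * cost[m]) for c in clusters]
    return picks.count("expensive") / len(picks)

for lam in (0.0, 0.5, 1.0, 2.0):
    print(f"λ={lam}: escalates on {escalation_rate(lam):.0%} of clusters")
```

Raising λ only ever moves traffic toward the cheap model; the clusters that keep escalating longest are the ones where the cheap model is worst.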

Restricting the candidate pool

By default the router considers every model in the loaded registry. You can restrict it:
# Only route among these three
router = ot.load_router(
    allowed_models=["gpt-4o-mini", "ministral-3b-latest", "gpt-4o"],
    cost_weight=0.5,
)
Or override per-call:
decision = router.route(prompt, available_models=["gpt-4o-mini", "gpt-4o"])
Useful when you want to A/B test a model subset, or when certain models aren’t available in a tenant.

The two backends

load_router has a single parameter you’ll barely ever touch: engine.
  • engine="go" (default) — spawns the bundled Go engine as a subprocess. Fast (~sub-millisecond routing), production path. The binary is bundled per-platform; if it isn’t present you’ll see a clear error.
  • engine="python" — pure Python implementation, no subprocess. Slower, but useful in environments where process spawning is forbidden or where you want to introspect every internal (e.g. swap the cluster assigner, monkey-patch profiles).
  • engine="auto" — prefer Go, fall back to Python if the binary is missing. Not recommended as a default because the fallback is silent — if something’s wrong with the binary, you want to know, not route 10× slower without noticing.
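The difference between the three settings comes down to what happens when the binary is missing. A sketch of the dispatch logic as described above (the function and its arguments are illustrative, not OpenTracy internals):

```python
def pick_engine(requested: str, go_binary_present: bool) -> str:
    """Illustrative dispatch for engine="go" | "python" | "auto"."""
    if requested == "go":
        if not go_binary_present:
            # Fail loudly: a missing binary is a deployment bug, not a preference.
            raise RuntimeError("bundled Go engine binary not found for this platform")
        return "go"
    if requested == "auto":
        # Silent fallback: the reason "auto" is discouraged as a default.
        return "go" if go_binary_present else "python"
    return "python"
```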

How routing changes over time

The profiles you loaded are from a benchmark the weights were trained on. Your production traffic will be different — maybe your users ask more code questions than the benchmark assumed. Two mechanisms adapt the router:
  1. blend_with_profiles — periodically combine the benchmark’s per-model error profile with the one observed in production: Ψ_new = α · Ψ_prod + (1 - α) · Ψ_benchmark. The feedback module has utilities for this. See the “self-learning” section of the basic_router_to_self_learning notebook.
  2. Alias swapping — when a distilled student is ready for a cluster you’ve worked on, you add it to the registry, point the alias at it, and from that moment the router can select it for prompts in that cluster. See Distillation.
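The blend in step 1 is a plain convex combination of the two profile vectors. A numpy sketch for one model (α and both profiles are invented; the real helpers live in the feedback module):

```python
import numpy as np

psi_benchmark = np.array([0.05, 0.20, 0.10])   # profile shipped with the weights (invented)
psi_prod      = np.array([0.02, 0.35, 0.10])   # profile observed on production traffic (invented)

def blend(psi_prod, psi_benchmark, alpha):
    """Ψ_new = α · Ψ_prod + (1 - α) · Ψ_benchmark."""
    return alpha * psi_prod + (1 - alpha) * psi_benchmark

psi_new = blend(psi_prod, psi_benchmark, alpha=0.5)
# Higher α trusts production evidence more; α = 0 keeps the benchmark profile unchanged.
```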

When auto-routing isn’t enough

For two shapes of problem, auto-routing alone won’t cut it:
  • You have hard policy constraints. “Never route X to Anthropic.” In that case combine with a Router (explicit, rule-based) — the logical alias can still be semantic, but the candidates are constrained.
  • Your prompts don’t cluster well. If everything you do is one narrow domain that doesn’t match any of the pre-trained clusters, you’ll get mediocre routing decisions. Solution: retrain the weights on your traffic (opentracy.training.full_training_pipeline), or fall back to Router with hand-picked deployments.

Next

Distillation

The counterpart — how the student models that auto-routing swaps in get built.

Router reference

load_router parameters, .route() / .route_batch() signatures, full RoutingDecision schema.