Distillation is how you go from "paying $0.02 per call to GPT-4o" to "paying $0.0005 per call to a small model that I fine-tuned on my own traffic to match GPT-4o's output". It's the wedge: the compounding value that a plain gateway can't offer.

The core idea

A teacher is a large, expensive model that already does the task well (GPT-4o, Claude Sonnet, etc.). A student is a small, cheap open model (llama-3.2-1b, qwen3-0.6b, etc.). Distillation trains the student to imitate the teacher’s behavior on a specific dataset — usually the one you built from your own traces (see Datasets). The student won’t be as smart as the teacher in general. It will be roughly as good as the teacher on the narrow slice of prompts you distilled — and 10–100× cheaper to run.

What the pipeline does

1. Data generation: For each prompt in the dataset, the teacher is called N times (default 4) with temperature > 0. This produces N candidate responses per prompt.
2. Curation: A judge model scores each candidate. The top-k (default: best 2) survive; bad candidates are dropped. This is the "best-of-N" part of BOND (Best-Of-N Distillation).
3. Training: The student is fine-tuned on (prompt → curated_response) pairs using the BOND loss, a blend of supervised fine-tuning, preference optimization, and KL regularization. Runs on GPU via Unsloth + TRL.
4. Export: The trained LoRA adapter is saved and optionally converted to GGUF (quantized) for serving on CPU or edge. Output: a directory you can load into any inference engine that speaks GGUF/llama.cpp.
5. Serve: Register the distilled model in OpenTracy's model registry. Point a routing alias at it. Your app keeps calling model="smart" and the requests now flow through your custom student.
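The generate-then-curate loop (steps 1–2) can be sketched in a few lines of plain Python. Here `call_teacher` and `judge_score` are hypothetical stand-ins for your teacher and judge clients, not OpenTracy APIs:

```python
import heapq

def best_of_n(prompt, call_teacher, judge_score, n=4, top_k=2):
    """Sample n teacher candidates, keep the top_k by judge score."""
    candidates = [call_teacher(prompt, temperature=0.8) for _ in range(n)]
    scored = [(judge_score(prompt, c), c) for c in candidates]
    # Keep the highest-scoring candidates; the rest are dropped.
    return [c for _, c in heapq.nlargest(top_k, scored, key=lambda sc: sc[0])]
```

The surviving pairs become the fine-tuning set for step 3.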

Running a distillation job

The Distiller client wraps the REST API:
from opentracy import Distiller

d = Distiller(base_url="http://localhost:8000")

job = d.create(
    name="invoice-extraction v1",
    dataset_id="ds_invoice_extraction_v1",
    teacher_model="openai/gpt-4o",
    student_model="llama-3.2-1b",
    num_prompts=500,          # cap from the dataset
    n_samples=4,              # BOND candidates per prompt
    training_steps=100,
    bond_beta=0.5,
    bond_gamma=0.1,
    export_gguf=True,
    quantization_types=["q4_k_m", "q8_0"],
)

print(f"submitted: {job['id']}, status={job['status']}")
Polling for completion:
job = d.wait(job["id"], on_update=lambda u: print(u["status"], u.get("phase")))
When it finishes:
artifacts = d.artifacts(job["id"])
# {
#   "adapter_path": "/app/data/distillation/job_abc/adapter/",
#   "gguf_paths": {
#     "q4_k_m": "/app/data/distillation/job_abc/gguf/model-q4_k_m.gguf",
#     "q8_0":   "/app/data/distillation/job_abc/gguf/model-q8_0.gguf",
#   },
#   "metrics": {
#     "teacher_cost_total": 2.48,
#     "student_loss_final": 0.31,
#     "training_time_sec":   412,
#   },
# }

Estimating cost before you run

Training costs money (teacher API calls) and time (GPU hours). Use estimate before committing:
est = d.estimate(
    student_model="llama-3.2-1b",
    num_prompts=500,
    n_samples=4,
)
# → {"estimated_cost": 2.45, "is_sandbox": False, "tier": "local",
#    "balance": ..., "sufficient": True}
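A sensible pattern is to gate submission on the estimate. `sufficient` and `estimated_cost` come from the response above; `budget_usd` is a hypothetical cap of your own, not an OpenTracy parameter:

```python
def should_run(est: dict, budget_usd: float) -> bool:
    """Submit only if the balance covers the job and it fits our budget."""
    return est["sufficient"] and est["estimated_cost"] <= budget_usd

# if should_run(est, budget_usd=5.00):
#     job = d.create(...)
```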

Choosing a teacher and a student

Teacher: pick the model you'd use in production if cost weren't an issue. GPT-4o, Claude Sonnet, or Gemini 1.5 Pro are good defaults. The student will learn to match this model's output style and accuracy, on the distilled task only.

Student: the smallest model that can plausibly handle your task's output. Rule of thumb:

| Task | Student floor |
| --- | --- |
| Classification (few labels) | 0.6B (qwen3-0.6b) |
| Structured extraction (JSON) | 1B (llama-3.2-1b) |
| Short-form generation (< 200 tok) | 1–3B |
| Long-form + reasoning | 8B+ (llama-3.1-8b) |

Smaller is cheaper to run but harder to train. If training fails to converge, move up a tier. Discover the full current list:
for t in d.teacher_models():
    print(t["id"], t["provider"])

for s in d.student_models():
    print(s["id"], s.get("params"))

The BOND hyperparameters

The BOND loss has two knobs worth knowing:
  • bond_beta (default 0.5) — how hard to push the student toward preferred responses vs. dispreferred. Higher = more aggressive preference shift; lower = gentler, more SFT-like.
  • bond_gamma (default 0.1) — KL regularization strength. Keeps the student close to its initial weights so you don’t destroy general capability. Raise if your student overfits or starts babbling.
You rarely need to tune these — defaults are good for most tasks. If you’re getting bad results, first look at dataset quality before touching BOND parameters.
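As a mental model, the blended objective weighs three terms; the toy scalar function below shows how `bond_beta` and `bond_gamma` enter, and is a conceptual sketch only, not the actual Unsloth/TRL implementation:

```python
def bond_loss(sft_loss, pref_margin, kl_div, beta=0.5, gamma=0.1):
    """Toy scalar version of the blended BOND objective.

    sft_loss:    cross-entropy on the curated responses
    pref_margin: preferred-minus-dispreferred log-likelihood gap
                 (a larger gap lowers the loss, scaled by beta)
    kl_div:      divergence from the initial student weights,
                 penalized with strength gamma
    """
    return sft_loss - beta * pref_margin + gamma * kl_div
```

Raising `beta` rewards the preference gap more aggressively; raising `gamma` pulls the student back toward its starting point.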

Hardware requirements

Training runs on GPU. The Docker image (opentracy-api) is built on the nvidia/cuda:12.6 base and supports --gpus all. Minimum specs:
| Student size | Min VRAM | Typical training time (500 prompts, 100 steps) |
| --- | --- | --- |
| 0.6B–1B | 8 GB | 10–20 minutes |
| 3B | 16 GB | 30–60 minutes |
| 8B | 24 GB (4-bit) | 2–4 hours |
Without a GPU, training will fail. Use the estimate endpoint first to validate before kicking off a job.
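The VRAM floors from the table can be encoded as a quick pre-flight check before you pick a student. `min_vram_gb` is a hypothetical helper, not part of the Distiller client:

```python
def min_vram_gb(param_billions: float) -> int:
    """Minimum training VRAM in GB, per the tiers above.
    Assumes 4-bit loading for the 8B tier."""
    if param_billions <= 1:
        return 8
    if param_billions <= 3:
        return 16
    return 24
```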

After training: the alias swap

Once a student is trained and registered, re-pointing an alias to it is a one-line change:
# Via the REST API or UI
d.register_model(
    id="smart",
    adapter_path=artifacts["adapter_path"],
    gguf_path=artifacts["gguf_paths"]["q4_k_m"],
)

# Now any request that asks for model="smart" gets routed to the distilled
# student. Your app code didn't change.
This is the closing move of the pipeline — the moment cost savings actually land in your invoice.

Common pitfalls

Distilling a mixed bag instead of a single cluster. One dataset should be one coherent task. If you mix "JSON extraction" and "creative writing" into the same dataset, the student gets confused. Distill each task separately; swap separate aliases.
Training before the teacher is right. If your teacher is giving 70% accurate answers, your student will cap out below that. Fix prompting and model choice first; then distill.
Evaluating the student only on training examples. Always evaluate on held-out traces. OpenTracy’s evaluation framework handles this — pass a dataset with a test split and it will report accuracy on rows the student never saw.
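A minimal version of that held-out check, assuming each trace carries a known-good expected output and scoring by exact match. `student` is any callable, a hypothetical stand-in for your inference client rather than an OpenTracy API:

```python
import random

def holdout_eval(traces, student, test_frac=0.2, seed=0):
    """Score the student only on rows it never trained on."""
    rows = list(traces)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_frac))
    test = rows[:n_test]  # train on the rest, evaluate on this slice
    hits = sum(student(r["prompt"]) == r["expected"] for r in test)
    return hits / len(test)
```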

Next

Distiller reference

Every method of the Distiller client with parameters and return types.

Self-host the full stack

Distillation requires the engine + GPU — this guide sets up Docker Compose.