
## The core idea

A teacher is a large, expensive model that already does the task well (GPT-4o, Claude Sonnet, etc.). A student is a small, cheap open model (llama-3.2-1b, qwen3-0.6b, etc.). Distillation trains the student to imitate the teacher's behavior on a specific dataset, usually the one you built from your own traces (see Datasets). The student won't be as smart as the teacher in general, but it will be roughly as good as the teacher on the narrow slice of prompts you distilled, and 10–100× cheaper to run.

## What the pipeline does
### Data generation

For each prompt in the dataset, the teacher is called N times (default: 4) with temperature > 0. This produces N candidate responses per prompt.
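A minimal sketch of this step, assuming a hypothetical `call_teacher` function in place of the real teacher API:

```python
import random

def call_teacher(prompt: str, temperature: float = 0.8) -> str:
    # Hypothetical stand-in for a real teacher API call (e.g. GPT-4o).
    # With temperature > 0, repeated calls return varied completions;
    # here we fake that variation with a random suffix.
    return f"response to {prompt!r} (variant {random.randint(0, 9999)})"

def generate_candidates(prompts: list[str], n: int = 4) -> dict[str, list[str]]:
    """For each prompt, sample N candidate responses from the teacher."""
    return {p: [call_teacher(p) for _ in range(n)] for p in prompts}

candidates = generate_candidates(["Extract the invoice total."], n=4)
# Each prompt now has 4 candidate responses awaiting curation.
```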
### Curation

A judge model scores each candidate. The top-k (default: best 2) survive; bad candidates are dropped. This is the "best-of-N" part of BOND (Best-Of-N Distillation).
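The curation step reduces to a score-and-sort. A sketch, with a toy length heuristic standing in for the judge model:

```python
def judge_score(prompt: str, response: str) -> float:
    # Hypothetical stand-in for a judge-model call that returns a
    # quality score; a trivial length heuristic for illustration only.
    return float(len(response))

def curate(prompt: str, candidates: list[str], top_k: int = 2) -> list[str]:
    """Keep the top-k candidates by judge score; drop the rest."""
    ranked = sorted(candidates, key=lambda r: judge_score(prompt, r), reverse=True)
    return ranked[:top_k]

kept = curate("Q", ["short", "a much longer answer", "medium one"], top_k=2)
# Only the two highest-scoring candidates move on to training.
```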
### Training

The student is fine-tuned on (prompt → curated_response) pairs using the BOND loss, a blend of supervised fine-tuning, preference optimization, and KL regularization. Runs on GPU via Unsloth + TRL.
### Export

The trained LoRA adapter is saved and optionally converted to GGUF (quantized) for serving on CPU or edge. Output: a directory you can load into any inference engine that speaks GGUF/llama.cpp.
## Running a distillation job

The `Distiller` client wraps the REST API:
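As a hedged sketch of what a job submission might carry, here is a payload builder; the field names below are illustrative, not the documented request schema:

```python
def build_distill_job(teacher: str, student: str, dataset_id: str,
                      n_candidates: int = 4, top_k: int = 2) -> dict:
    # Hypothetical request body for a distillation job. The real
    # Distiller client would POST something like this to the engine;
    # field names here are assumptions for illustration.
    return {
        "teacher": teacher,
        "student": student,
        "dataset_id": dataset_id,
        "n_candidates": n_candidates,  # N teacher samples per prompt
        "top_k": top_k,                # candidates kept by the judge
    }

job = build_distill_job("gpt-4o", "llama-3.2-1b", "my-traces")
```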
## Estimating cost before you run

Training costs money (teacher API calls) and time (GPU hours). Use `estimate` before committing:
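The teacher-side arithmetic behind such an estimate can be sketched as follows; the per-token price and average response length are placeholders, not real numbers:

```python
def estimate_teacher_cost(n_prompts: int, n_candidates: int = 4,
                          avg_tokens: int = 500,
                          price_per_mtok: float = 10.0) -> float:
    """Back-of-envelope teacher API cost in dollars.
    price_per_mtok and avg_tokens are illustrative placeholders."""
    total_tokens = n_prompts * n_candidates * avg_tokens
    return total_tokens / 1_000_000 * price_per_mtok

# 500 prompts x 4 candidates x 500 tokens = 1M teacher tokens
cost = estimate_teacher_cost(500)  # -> 10.0 at the assumed price
```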
## Choosing a teacher and a student

**Teacher:** pick the model you'd use in production if cost weren't an issue. GPT-4o, Claude Sonnet, or Gemini 1.5 Pro are good defaults. The student will learn to match this model's output style and accuracy, on the distilled task only.

**Student:** the smallest model that can plausibly handle your task's output. Rule of thumb:

| Task | Student floor |
|---|---|
| Classification (few labels) | 0.6B (qwen3-0.6b) |
| Structured extraction (JSON) | 1B (llama-3.2-1b) |
| Short-form generation (< 200 tok) | 1–3B |
| Long-form + reasoning | 8B+ (llama-3.1-8b) |
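The rule of thumb above can be encoded as a lookup; the 3B entry uses an assumed model name (the table gives only a size range for that row):

```python
# Heuristic mapping from task type to the smallest plausible student,
# taken from the rule-of-thumb table above. "llama-3.2-3b" is an
# assumed example for the 1-3B row; treat all of this as a starting
# point, not a guarantee.
STUDENT_FLOOR = {
    "classification": "qwen3-0.6b",         # few labels, ~0.6B
    "structured_extraction": "llama-3.2-1b",  # JSON output, ~1B
    "short_generation": "llama-3.2-3b",     # < 200 tok, 1-3B
    "long_form_reasoning": "llama-3.1-8b",  # 8B+
}

def student_floor(task_type: str) -> str:
    return STUDENT_FLOOR[task_type]
```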
## The BOND hyperparameters

The BOND loss has two knobs worth knowing:

- `bond_beta` (default `0.5`): how hard to push the student toward preferred responses vs. dispreferred. Higher means a more aggressive preference shift; lower is gentler and more SFT-like.
- `bond_gamma` (default `0.1`): KL regularization strength. Keeps the student close to its initial weights so you don't destroy general capability. Raise it if your student overfits or starts babbling.
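To build intuition for the two knobs, the blend can be sketched as a weighted sum. This is schematic: real training computes each term from token-level logits, not scalars.

```python
def bond_loss(sft: float, pref: float, kl: float,
              bond_beta: float = 0.5, bond_gamma: float = 0.1) -> float:
    """Schematic BOND objective: SFT term, preference term scaled by
    bond_beta, and a KL penalty scaled by bond_gamma that anchors the
    student to its initial weights."""
    return sft + bond_beta * pref + bond_gamma * kl

# Raising bond_beta weights the preference term more heavily;
# raising bond_gamma pulls harder toward the initial weights.
gentle = bond_loss(1.0, 2.0, 0.5, bond_beta=0.1)
aggressive = bond_loss(1.0, 2.0, 0.5, bond_beta=0.9)
assert aggressive > gentle
```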
## Hardware requirements

Training runs on GPU. The Docker image (`opentracy-api`) is built on the `nvidia/cuda:12.6` base and supports `--gpus all`. Minimum specs:
| Student size | Min VRAM | Typical training time (500 prompts, 100 steps) |
|---|---|---|
| 0.6B–1B | 8 GB | 10–20 minutes |
| 3B | 16 GB | 30–60 minutes |
| 8B | 24 GB (4-bit) | 2–4 hours |
Use the `estimate` endpoint first to validate before kicking off a job.
## After training: the alias swap

Once a student is trained and registered, re-pointing an alias to it is a one-line change:

## Common pitfalls
## Next

- **Distiller reference**: every method of the `Distiller` client, with parameters and return types.
- **Self-host the full stack**: distillation requires the engine + GPU; this guide sets up Docker Compose.

