Distillation is how you go from "paying $0.02 per call to GPT-4o" to "paying $0.0005 per call to a small model that I fine-tuned on my own traffic to match GPT-4o's output". It's the wedge: the compounding value that a plain gateway can't offer.

The core idea

A teacher is a large, expensive model that already does the task well (GPT-4o, Claude Sonnet, etc.). A student is a small, cheap open model (llama-3.2-1b, qwen3-0.6b, etc.). Distillation trains the student to imitate the teacher’s behavior on a specific dataset — usually the one you built from your own traces (see Datasets). The student won’t be as smart as the teacher in general. It will be roughly as good as the teacher on the narrow slice of prompts you distilled — and 10–100× cheaper to run.

What the pipeline does

1. Data generation: For each prompt in the dataset, the teacher is called N times (default 4) with temperature > 0. This produces N candidate responses per prompt.
2. Curation: A judge model scores each candidate. The top-k (default: best 2) survive; bad candidates are dropped. This is the "best-of-N" part of BOND (Best-Of-N Distillation).
3. Training: The student is fine-tuned on (prompt → curated_response) pairs using the BOND loss, a blend of supervised fine-tuning, preference optimization, and KL regularization. Runs on GPU via Unsloth + TRL.
4. Export: The trained LoRA adapter is saved and optionally converted to GGUF (quantized) for serving on CPU or edge. Output: a directory you can load into any inference engine that speaks GGUF/llama.cpp.
5. Serve: Register the distilled model in OpenTracy's model registry. Point a routing alias at it. Your app keeps calling model="smart" and the requests now flow through your custom student.
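The generate-then-curate loop (steps 1–2) can be sketched in a few lines of plain Python. Here `call_teacher` and `judge_score` are hypothetical stand-ins for your teacher and judge clients, not OpenTracy APIs:

```python
import heapq

def best_of_n(prompt, call_teacher, judge_score, n=4, top_k=2):
    """Sample n teacher candidates, keep the top_k by judge score."""
    candidates = [call_teacher(prompt, temperature=0.8) for _ in range(n)]
    scored = [(judge_score(prompt, c), c) for c in candidates]
    # Keep the highest-scoring candidates; the rest are dropped.
    return [c for _, c in heapq.nlargest(top_k, scored, key=lambda sc: sc[0])]
```

The surviving pairs become the fine-tuning set for step 3.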

Running a distillation job

The Distiller client wraps the REST API:
from opentracy import Distiller

d = Distiller(base_url="http://localhost:8000")

job = d.create(
    name="invoice-extraction v1",
    dataset_id="ds_invoice_extraction_v1",
    teacher_model="openai/gpt-4o",
    student_model="llama-3.2-1b",
    num_prompts=500,          # cap from the dataset
    n_samples=4,              # BOND candidates per prompt
    training_steps=100,
    bond_beta=0.5,
    bond_gamma=0.1,
    export_gguf=True,
    quantization_types=["q4_k_m", "q8_0"],
)

print(f"submitted: {job['id']}, status={job['status']}")
Polling for completion:
job = d.wait(job["id"], on_update=lambda u: print(u["status"], u.get("phase")))
When it finishes:
artifacts = d.artifacts(job["id"])
# {
#   "adapter_path": "/app/data/distillation/job_abc/adapter/",
#   "gguf_paths": {
#     "q4_k_m": "/app/data/distillation/job_abc/gguf/model-q4_k_m.gguf",
#     "q8_0":   "/app/data/distillation/job_abc/gguf/model-q8_0.gguf",
#   },
#   "metrics": {
#     "teacher_cost_total": 2.48,
#     "student_loss_final": 0.31,
#     "training_time_sec":   412,
#   },
# }

Estimating cost before you run

Training costs money (teacher API calls) and time (GPU hours). Use estimate before committing:
est = d.estimate(
    student_model="llama-3.2-1b",
    num_prompts=500,
    n_samples=4,
)
# → {"estimated_cost": 2.45, "is_sandbox": False, "tier": "local",
#    "balance": ..., "sufficient": True}
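A sensible pattern is to gate submission on the estimate. `sufficient` and `estimated_cost` come from the response above; `budget_usd` is a hypothetical cap of your own, not an OpenTracy parameter:

```python
def should_run(est: dict, budget_usd: float) -> bool:
    """Submit only if the balance covers the job and it fits our budget."""
    return est["sufficient"] and est["estimated_cost"] <= budget_usd

# if should_run(est, budget_usd=5.00):
#     job = d.create(...)
```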

Choosing a teacher and a student

Teacher: pick the model you'd use in production if cost weren't an issue. GPT-4o, Claude Sonnet, or Gemini 1.5 Pro are good defaults. The student will learn to match this model's output style and accuracy, on the distilled task only.

Student: the smallest model that can plausibly handle your task's output. Rule of thumb:

| Task | Student floor |
| --- | --- |
| Classification (few labels) | 0.6B (qwen3-0.6b) |
| Structured extraction (JSON) | 1B (llama-3.2-1b) |
| Short-form generation (< 200 tok) | 1–3B |
| Long-form + reasoning | 8B+ (llama-3.1-8b) |

Smaller is cheaper to run but harder to train. If training fails to converge, move up a tier. Discover the full current list:
for t in d.teacher_models():
    print(t["id"], t["provider"])

for s in d.student_models():
    print(s["id"], s.get("params"))

The BOND hyperparameters

The BOND loss has two knobs worth knowing:
  • bond_beta (default 0.5) — how hard to push the student toward preferred responses vs. dispreferred. Higher = more aggressive preference shift; lower = gentler, more SFT-like.
  • bond_gamma (default 0.1) — KL regularization strength. Keeps the student close to its initial weights so you don’t destroy general capability. Raise if your student overfits or starts babbling.
You rarely need to tune these — defaults are good for most tasks. If you’re getting bad results, first look at dataset quality before touching BOND parameters.
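As a mental model, the blended objective weighs three terms; the toy scalar function below shows how `bond_beta` and `bond_gamma` enter, and is a conceptual sketch only, not the actual Unsloth/TRL implementation:

```python
def bond_loss(sft_loss, pref_margin, kl_div, beta=0.5, gamma=0.1):
    """Toy scalar version of the blended BOND objective.

    sft_loss:    cross-entropy on the curated responses
    pref_margin: preferred-minus-dispreferred log-likelihood gap
                 (a larger gap lowers the loss, scaled by beta)
    kl_div:      divergence from the initial student weights,
                 penalized with strength gamma
    """
    return sft_loss - beta * pref_margin + gamma * kl_div
```

Raising `beta` rewards the preference gap more aggressively; raising `gamma` pulls the student back toward its starting point.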

Hardware requirements

Training runs on GPU. The Docker image (opentracy-api) is built on the nvidia/cuda:12.6 base and supports --gpus all. Minimum specs:
| Student size | Min VRAM | Typical training time (500 prompts, 100 steps) |
| --- | --- | --- |
| 0.6B–1B | 8 GB | 10–20 minutes |
| 3B | 16 GB | 30–60 minutes |
| 8B | 24 GB (4-bit) | 2–4 hours |
Without a GPU, training will fail. Use the estimate endpoint first to validate before kicking off a job.
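The VRAM floors from the table can be encoded as a quick pre-flight check before you pick a student. `min_vram_gb` is a hypothetical helper, not part of the Distiller client:

```python
def min_vram_gb(param_billions: float) -> int:
    """Minimum training VRAM in GB, per the tiers above.
    Assumes 4-bit loading for the 8B tier."""
    if param_billions <= 1:
        return 8
    if param_billions <= 3:
        return 16
    return 24
```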

After training: the alias swap

Once a student is trained and registered, re-pointing an alias to it is a one-line change:
# Via the REST API or UI
d.register_model(
    id="smart",
    adapter_path=artifacts["adapter_path"],
    gguf_path=artifacts["gguf_paths"]["q4_k_m"],
)

# Now any request that asks for model="smart" gets routed to the distilled
# student. Your app code didn't change.
This is the closing move of the pipeline — the moment cost savings actually land in your invoice.

Common pitfalls

Distilling a mixed bag instead of a single cluster. One dataset should be one coherent task. If you mix "JSON extraction" and "creative writing" into the same dataset, the student gets confused. Distill each task separately; swap separate aliases.
Training before the teacher is right. If your teacher is giving 70% accurate answers, your student will cap out below that. Fix prompting and model choice first; then distill.
Evaluating the student only on training examples. Always evaluate on held-out traces. OpenTracy’s evaluation framework handles this — pass a dataset with a test split and it will report accuracy on rows the student never saw.
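A minimal version of that held-out check, assuming each trace carries a known-good expected output and scoring by exact match. `student` is any callable, a hypothetical stand-in for your inference client rather than an OpenTracy API:

```python
import random

def holdout_eval(traces, student, test_frac=0.2, seed=0):
    """Score the student only on rows it never trained on."""
    rows = list(traces)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_frac))
    test = rows[:n_test]  # train on the rest, evaluate on this slice
    hits = sum(student(r["prompt"]) == r["expected"] for r in test)
    return hits / len(test)
```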

Next

Distiller reference

Every method of the Distiller client with parameters and return types.

Self-host the full stack

Distillation requires the engine + GPU — this guide sets up Docker Compose.