(prompt, response, metadata) tuple. Datasets
are how you go from “a million raw traces” to “500 high-quality examples
of ticket classification I want to distill a model on”.
Datasets are the artifact that makes everything downstream work. You
can’t distill without one. You can’t run a rigorous evaluation without one.
Three ways a dataset is born
1. Auto-clustered from traces (the fast path)
The engine clusters your traces by prompt embedding similarity, names each cluster with an LLM, and offers each named cluster as a candidate dataset:

2. Uploaded from a file
If you already have labeled data — a JSONL, a CSV, a Hugging Face dataset — you can upload it directly:

3. Generated from prompts (synthetic)
Start with a list of prompts you care about. The engine asks a teacher model to generate N responses per prompt, judges them, and keeps the top ones. Useful when you have prompt ideas but no labeled responses yet.

What a dataset row looks like
Every row carries its prompt, its response, its metadata, and a source_trace_id — you can always follow back to the original request.
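As a concrete illustration, here is what one row might look like once loaded, e.g. from a line of an uploaded JSONL file. Only prompt, response, metadata, and source_trace_id come from the text above; the class name, example values, and any other field details are assumptions.

```python
import json
from dataclasses import dataclass, field

# Hypothetical row shape; only the four field names are from the docs.
@dataclass
class DatasetRow:
    prompt: str
    response: str
    metadata: dict = field(default_factory=dict)
    source_trace_id: str = ""  # always points back to the original trace

# e.g. one line of an uploaded JSONL file
line = '{"prompt": "Classify: \'My invoice is wrong.\'", "response": "billing", "metadata": {"judge_score": 0.92}, "source_trace_id": "tr_8f3a"}'
row = DatasetRow(**json.loads(line))
print(row.source_trace_id)  # → tr_8f3a
```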
Curation: filtering the bad rows
Raw traces have noise. A good dataset is curated: you keep the useful rows and drop the rest. The engine ships a curation pipeline with three stages:

Judge
An LLM judge (configurable — defaults to a cheap model like
openai/gpt-4o-mini) scores each row on helpfulness, relevance, and
format. Rows below a threshold are flagged.

Filter
Apply rules: drop rows with errors, drop rows outside the target
cluster, drop rows above a length limit, drop rows with flagged PII.
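The rules above can be written as simple predicates over a row. This is a sketch under stated assumptions — the field names, length limit, and combinator are illustrative, not opentracy's actual filter API:

```python
# Hypothetical predicates mirroring the four filter rules above.
MAX_RESPONSE_CHARS = 2048  # assumed length limit

def no_error(row):         return row.get("error") is None
def in_cluster(row, name): return row.get("cluster") == name
def under_limit(row):      return len(row.get("response", "")) <= MAX_RESPONSE_CHARS
def no_pii(row):           return not row.get("pii_flagged", False)

def keep(row, target_cluster):
    # A row survives only if every rule passes.
    return (no_error(row) and in_cluster(row, target_cluster)
            and under_limit(row) and no_pii(row))

rows = [
    {"response": "billing", "cluster": "tickets", "error": None},
    {"response": "x" * 5000, "cluster": "tickets", "error": None},  # too long
    {"response": "ok", "cluster": "tickets", "error": "timeout"},   # errored
]
kept = [r for r in rows if keep(r, "tickets")]
print(len(kept))  # → 1
```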
The curation stages live in opentracy.distillation.curation and you
can run them standalone if you’re building a pipeline manually. See the
API reference for details.
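Since the exact entry points of opentracy.distillation.curation aren't shown here, the following is a stand-in sketch of how a judge stage and a filter stage might compose when run manually — every function name is hypothetical; consult the API reference for the real ones.

```python
# Stand-in stage functions; the real ones live in
# opentracy.distillation.curation (see the API reference).
def judge_stage(rows, threshold=0.7):
    # Assume an upstream LLM judge already attached a score to each row;
    # rows below the threshold get flagged.
    for r in rows:
        r["flagged"] = r.get("judge_score", 0.0) < threshold
    return rows

def filter_stage(rows):
    # Drop every flagged row.
    return [r for r in rows if not r["flagged"]]

def run_pipeline(rows):
    return filter_stage(judge_stage(rows))

curated = run_pipeline([
    {"judge_score": 0.91},
    {"judge_score": 0.42},
])
print(len(curated))  # → 1
```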
Versioning
Datasets are immutable once frozen. If you curate more rows or change the judge, you get a new version:

Two things you do with a dataset
Distill
Hand the dataset to the distillation pipeline. A student model gets
fine-tuned on the teacher’s labels and comes out as a LoRA adapter
you can serve.
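To make the data flow concrete: before fine-tuning, each dataset row becomes a supervised example pairing the prompt with the teacher's label. The chat-message shape below is an assumption for illustration; the actual conversion, LoRA training, and serving are handled by the distillation pipeline.

```python
# Hypothetical conversion of a dataset row into an SFT training example.
def to_sft_example(row):
    return {
        "messages": [
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": row["response"]},  # teacher's label
        ]
    }

dataset = [
    {"prompt": "Classify: 'Refund not received.'", "response": "billing"},
    {"prompt": "Classify: 'App crashes on login.'", "response": "bug"},
]
sft = [to_sft_example(r) for r in dataset]
print(sft[0]["messages"][1]["content"])  # → billing
```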
Evaluate
Pick a dataset as the benchmark. Run any model against it and compare
accuracy, cost per row, and latency — including models you’re
considering swapping in via an alias.
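A minimal sketch of such an evaluation loop, assuming a model is just a callable from prompt to prediction — the cost figure, metric names, and model function are stand-ins, not opentracy's API:

```python
import time

# Hypothetical benchmark runner: accuracy, cost per row, mean latency.
def evaluate(model_fn, rows, cost_per_call=0.0004):
    correct, latencies = 0, []
    for row in rows:
        start = time.perf_counter()
        prediction = model_fn(row["prompt"])
        latencies.append(time.perf_counter() - start)
        correct += prediction == row["response"]
    n = len(rows)
    return {
        "accuracy": correct / n,
        "cost_per_row": cost_per_call,
        "mean_latency_s": sum(latencies) / n,
    }

benchmark = [
    {"prompt": "Classify: 'Invoice is wrong.'", "response": "billing"},
    {"prompt": "Classify: 'Crash on login.'", "response": "bug"},
]
# A trivial "model" that always answers "billing" gets half of them right.
report = evaluate(lambda p: "billing", benchmark)
print(report["accuracy"])  # → 0.5
```

The same loop works for any candidate behind an alias: swap the callable, keep the benchmark fixed, and the numbers stay comparable.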

