A dataset is a named, versioned bundle of rows shaped for training or evaluation. Each row is a (prompt, response, metadata) tuple. Datasets are how you go from “a million raw traces” to “500 high-quality examples of ticket classification I want to distill a model on”. Datasets are the artifact that makes everything downstream work. You can’t distill without one. You can’t run a rigorous evaluation without one.

Three ways a dataset is born

1. Auto-clustered from traces (the fast path)

The engine clusters your traces by prompt embedding similarity, names each cluster with an LLM, and offers each named cluster as a candidate dataset:
cluster 12 → "SQL generation"           (2,341 traces)
cluster 34 → "Support ticket triage"    (8,902 traces)
cluster 51 → "Code review feedback"     (410 traces)
In the UI you click Promote to dataset, give it a name, and pick the rows to include. This is the shortest path from running traffic to training data.
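The grouping step can be approximated with a minimal sketch: greedy clustering of prompt embeddings by cosine similarity. This is plain Python with toy vectors, an illustration of the idea rather than the engine's actual algorithm.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.8):
    """Assign each embedding to the first cluster whose seed is close enough,
    or start a new cluster if none is."""
    clusters = []  # list of (seed_vector, member_indices)
    for i, emb in enumerate(embeddings):
        for seed, members in clusters:
            if cosine(emb, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((emb, [i]))
    return [members for _, members in clusters]

# Toy prompt embeddings: two SQL-ish vectors and one support-ticket-ish vector.
embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
clusters = greedy_cluster(embs)  # the two SQL-ish vectors group together: [[0, 1], [2]]
```

A production clusterer would use real embedding vectors and a proper algorithm (e.g. HDBSCAN or k-means), but the shape of the output — named groups of trace indices — is the same.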

2. Uploaded from a file

If you already have labeled data — a JSONL, a CSV, a Hugging Face dataset — you can upload it directly:
from opentracy import Distiller

d = Distiller(base_url="http://localhost:8000")
dataset = d.upload_dataset(
    name="invoice-extraction-v1",
    path="./data/invoices.jsonl",
    # jsonl rows: {"prompt": "...", "response": "...", "metadata": {...}}
)
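If you are producing the JSONL yourself, it is worth validating the row shape before uploading. A small sketch using only the standard library (the required keys follow the comment in the example above):

```python
import json
import tempfile

rows = [
    {"prompt": "Extract the invoice total from: ...", "response": "$1,204.50",
     "metadata": {"source": "backfill"}},
    {"prompt": "Extract the invoice date from: ...", "response": "2024-03-01",
     "metadata": {"source": "backfill"}},
]

# Write one JSON object per line — the JSONL shape upload_dataset expects.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    path = f.name
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Sanity-check before uploading: every line parses and carries the expected keys.
with open(path) as f:
    parsed = [json.loads(line) for line in f]
assert all({"prompt", "response", "metadata"} <= row.keys() for row in parsed)
```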

3. Generated from prompts (synthetic)

Start with a list of prompts you care about. The engine asks a teacher model to generate N responses per prompt, judges them, and keeps the top ones. Useful when you have prompt ideas but no labeled responses yet.
d.generate_dataset(
    name="python-docstrings-v1",
    prompts_path="./data/prompt_seeds.txt",
    teacher="openai/gpt-4o",
    n_samples=4,
    judge="openai/gpt-4o-mini",
)
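The generate-judge-keep loop amounts to best-of-N sampling. A minimal sketch with stubbed model calls — the lambdas below stand in for real teacher and judge calls, they are not the opentracy API:

```python
import itertools

def best_of_n(prompts, generate, judge, n_samples=4):
    """For each prompt: sample n candidates, score each, keep the best."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        score, best = max((judge(prompt, c), c) for c in candidates)
        dataset.append({"prompt": prompt, "response": best, "judge_score": score})
    return dataset

# Stubs: the "judge" scores a candidate by its trailing digit,
# so the last sample deterministically wins.
counter = itertools.count()
generate = lambda prompt: f"candidate-{next(counter)}"
judge = lambda prompt, response: int(response.split("-")[-1])

rows = best_of_n(["Write a docstring for sort()"], generate, judge, n_samples=3)
assert rows[0]["response"] == "candidate-2"
```

Swap the stubs for real API calls and the structure is the same: N teacher samples in, one judged winner out per prompt.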

What a dataset row looks like

{
  "row_id": "r_0000123",
  "prompt": "Classify this ticket into one of: billing, technical, feature_request. Ticket: Where can I download my invoice for March?",
  "response": "billing",
  "metadata": {
    "source_trace_id": "t_af91",
    "teacher_model": "openai/gpt-4o",
    "judge_score": 0.92,
    "cluster_id": 34,
    "tags": ["ticket_classifier", "reviewed"]
  }
}
Rows that came from traces retain a source_trace_id, so you can always trace back to the original request.
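If you build rows outside the engine, a lightweight validator keeps them in this shape. The required keys come from the example above; the checker itself is illustrative:

```python
def validate_row(row):
    """Return a list of problems with a dataset row (empty list = valid)."""
    errors = []
    for key in ("row_id", "prompt", "response", "metadata"):
        if key not in row:
            errors.append(f"missing key: {key}")
    if not isinstance(row.get("metadata", {}), dict):
        errors.append("metadata must be an object")
    return errors

good = {"row_id": "r_0000123", "prompt": "...", "response": "billing",
        "metadata": {"cluster_id": 34}}
bad = {"prompt": "...", "response": "billing", "metadata": []}

assert validate_row(good) == []
assert validate_row(bad) == ["missing key: row_id", "metadata must be an object"]
```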

Curation: filtering the bad rows

Raw traces have noise. A good dataset is curated: you keep the useful rows and drop the rest. The engine ships a curation pipeline with three stages:
1. Judge

An LLM judge (configurable — defaults to a cheap model like openai/gpt-4o-mini) scores each row on helpfulness, relevance, and format. Rows below a threshold are flagged.

2. Filter

Apply rules: drop rows with errors, drop rows outside the target cluster, drop rows above a length limit, drop rows with flagged PII.

3. Review

Human review in the UI for the top slice — usually 50–100 borderline rows. Not required but cheap insurance for your first distillation.
Each stage is implemented in opentracy.distillation.curation and you can run them standalone if you’re building a pipeline manually. See the API reference for details.
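Chained together, the first two stages are just successive passes over the rows. A standalone sketch — the functions below are placeholders to show the flow, not the opentracy.distillation.curation API:

```python
def judge_stage(rows, score_fn, threshold=0.7):
    """Score every row and flag the ones below the threshold."""
    for row in rows:
        row["judge_score"] = score_fn(row)
        row["flagged"] = row["judge_score"] < threshold
    return rows

def filter_stage(rows, max_len=2000):
    """Drop flagged rows and rows over the length limit."""
    return [r for r in rows if not r["flagged"] and len(r["response"]) <= max_len]

rows = [
    {"prompt": "p1", "response": "billing"},
    {"prompt": "p2", "response": "I can't help with that."},
]
# Stand-in judge: refusals score low, everything else scores high.
score = lambda row: 0.2 if "can't" in row["response"] else 0.9

kept = filter_stage(judge_stage(rows, score))
assert [r["prompt"] for r in kept] == ["p1"]
```

The review stage has no code equivalent by design: it is a human pass over the borderline slice that the first two stages surface.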

Versioning

Datasets are immutable once frozen. If you curate more rows or change the judge, you get a new version:
invoice-extraction-v1  (frozen, 847 rows)
invoice-extraction-v2  (frozen, 1240 rows, stricter judge threshold)
invoice-extraction-v3  (active, 1240 rows + 312 newly reviewed)
Every distillation job records the dataset version it trained on, so you can always reproduce a result or compare student models trained on different data.
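The freeze-then-derive discipline can be sketched with a small value object — a hypothetical illustration of the semantics, not the engine's internal model:

```python
from dataclasses import dataclass

@dataclass
class DatasetVersion:
    name: str
    version: int
    rows: tuple   # tuple, so a frozen version's rows can't be mutated in place
    frozen: bool = False

    def freeze(self):
        self.frozen = True
        return self

    def derive(self, new_rows):
        """Curating more rows never mutates a frozen version; it creates the next one."""
        if not self.frozen:
            raise ValueError("freeze this version before deriving a new one")
        return DatasetVersion(self.name, self.version + 1, self.rows + tuple(new_rows))

v1 = DatasetVersion("invoice-extraction", 1, ("r1", "r2")).freeze()
v2 = v1.derive(["r3"])

assert (v1.version, len(v1.rows)) == (1, 2)   # v1 untouched
assert (v2.version, len(v2.rows)) == (2, 3)   # new version carries the extra row
```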

Two things you do with a dataset

Distill

Hand the dataset to the distillation pipeline. A student model gets fine-tuned on the teacher’s labels and comes out as a LoRA adapter you can serve.

Evaluate

Pick a dataset as the benchmark. Run any model against it and compare accuracy, cost per row, and latency — including models you’re considering swapping in via an alias.
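The evaluation loop reduces to: run each prompt through a model, compare against the dataset's response, and track the three numbers. A minimal sketch with a stub model (exact-match accuracy and a fixed per-call cost are simplifying assumptions):

```python
import time

def evaluate(dataset, model_fn, cost_per_call):
    """Score a model against a dataset: accuracy, cost per row, average latency."""
    correct, total_latency = 0, 0.0
    for row in dataset:
        start = time.perf_counter()
        prediction = model_fn(row["prompt"])
        total_latency += time.perf_counter() - start
        correct += prediction == row["response"]   # exact match
    n = len(dataset)
    return {
        "accuracy": correct / n,
        "cost_per_row": cost_per_call,
        "avg_latency_s": total_latency / n,
    }

dataset = [
    {"prompt": "Where can I download my March invoice?", "response": "billing"},
    {"prompt": "The app crashes on launch", "response": "technical"},
]
# Stub model that always answers "billing" — right once, wrong once.
always_billing = lambda prompt: "billing"

report = evaluate(dataset, always_billing, cost_per_call=0.0001)
assert report["accuracy"] == 0.5
```

Running the same function over two models — say, the current alias target and a distilled student — gives you directly comparable reports on the same benchmark rows.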

Common mistakes

Don’t distill with < 200 rows. Below that threshold the student tends to overfit and doesn’t generalize beyond the training prompts. 500–2000 is the sweet spot for most tasks.
Don’t mix unrelated clusters in one dataset. If you put “SQL generation” and “customer emails” in the same dataset, the student learns neither well. One dataset = one coherent task.
Don’t skip curation on the first run. Raw traces include failures, refusals, truncated outputs. Let the judge drop those before training — otherwise the student learns the noise.