Verification tiers and provenance for synthetic data

AnyData is a closed-loop dataset factory in which every example records how strongly its correctness was verified. This post explains why the tier you can verify against is the real ceiling on quality, and why a model may never be the sole grader of its own output.

Most synthetic-data pipelines fail in the same quiet way. A model generates plausible-looking examples, nothing independent checks them, and you ship a dataset that is fluent and wrong in ways you cannot see. The output looks like data. It behaves like noise. AnyData is built around the one constraint that prevents this: an example is only as good as the strongest signal you can use to confirm it is correct.

Quality is capped by verifiability, not by the generator

The instinct is to spend effort on the generator: a bigger model, better prompts, more samples. That raises the ceiling on how good an example could be, but it does nothing to tell you which examples actually cleared the bar. AnyData inverts the priority: each domain pack declares a verification tier, and the tier, not the generator, sets the realistic quality you can claim.

There are four tiers, ordered from the strongest correctness signal to the weakest:

  • EXECUTABLE: the example runs and passes tests. Code, SQL, scripts. This is the strongest tier because the world checks the answer for you. Full autonomy, human review only to calibrate.
  • CHECKABLE: the answer matches a spec, a property, or an independently recomputed result. Arithmetic with a recomputed answer lives here. Still full autonomy.
  • COMPARATIVE: no single oracle, but independent methods or models agree. Partial autonomy, and a mandatory percentage of human review.
  • JUDGMENT: no oracle at all, subjective quality. Lowest autonomy, mandatory human review.

The tier is not a label you attach after the fact. It is a declaration about what kind of evidence the domain can produce, and it determines how much the system is allowed to trust itself.
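
One way to picture the tiers as a declaration rather than a label is an ordered enum with a policy table attached. This is a minimal sketch, not AnyData's actual code: the names Tier, Policy, and POLICY, and the numeric values, are assumptions made for illustration. Only the four tier names and their autonomy levels come from the list above.

```python
from enum import IntEnum
from typing import NamedTuple


class Tier(IntEnum):
    """Verification tiers. A lower value means a harder correctness signal.

    The numeric values are illustrative; only the ordering matters.
    """
    EXECUTABLE = 1   # runs and passes tests
    CHECKABLE = 2    # matches a spec, property, or recomputed result
    COMPARATIVE = 3  # independent methods or models agree
    JUDGMENT = 4     # subjective quality, no oracle


class Policy(NamedTuple):
    autonomy: str                 # how much the system may trust itself
    human_review_mandatory: bool  # whether review is required, not optional


# Hypothetical policy table mirroring the four bullets above.
POLICY = {
    Tier.EXECUTABLE: Policy("full", False),    # review only to calibrate
    Tier.CHECKABLE: Policy("full", False),
    Tier.COMPARATIVE: Policy("partial", True),
    Tier.JUDGMENT: Policy("lowest", True),
}

if __name__ == "__main__":
    for tier in sorted(Tier):
        print(tier.name, POLICY[tier])
```

Keeping the policy in a table keyed by tier means the rest of the system asks the tier what it is allowed to do, instead of scattering per-domain special cases.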

The anti-collapse rule

The most important line in the design is a refusal: a model may never be the sole grader of its own output. A pack that declares COMPARATIVE or JUDGMENT is rejected at load time if its configured human-review percentage is zero.

This is the failure that quietly ruins synthetic data. If the same model writes the example, verifies it, and grades it, you have built a closed loop with no external reference, and the loop converges to whatever the model already believes, including its mistakes. The dataset gets internally consistent and detached from reality at the same time. Enforcing the rule at load time, not at runtime, means a misconfigured pack cannot even start a run that would produce self-graded garbage. You catch the design error before you spend a cent.

That last part matters because generation has a cost ledger. Refusing a bad configuration up front is the cheapest possible place to catch it.
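
The load-time check could look roughly like the sketch below. It assumes a pack is represented as a small config object; DomainPack, PackConfigError, and the field name human_review_percent are hypothetical names, not AnyData's real API. The point it illustrates is that validation runs when the object is constructed, so an invalid pack never exists long enough to start a run.

```python
from dataclasses import dataclass

KNOWN_TIERS = {"EXECUTABLE", "CHECKABLE", "COMPARATIVE", "JUDGMENT"}

# Tiers with no hard oracle: a human must be part of the grading loop.
NEEDS_HUMAN_REVIEW = {"COMPARATIVE", "JUDGMENT"}


class PackConfigError(ValueError):
    """Raised when a domain pack declares an unsafe configuration."""


@dataclass(frozen=True)
class DomainPack:
    name: str
    tier: str
    human_review_percent: float  # 0 to 100

    def __post_init__(self) -> None:
        # Runs at construction, i.e. at load time, before any generation.
        if self.tier not in KNOWN_TIERS:
            raise PackConfigError(f"{self.name}: unknown tier {self.tier!r}")
        if not 0 <= self.human_review_percent <= 100:
            raise PackConfigError(
                f"{self.name}: human_review_percent must be between 0 and 100"
            )
        if self.tier in NEEDS_HUMAN_REVIEW and self.human_review_percent == 0:
            raise PackConfigError(
                f"{self.name}: tier {self.tier} requires a non-zero "
                "human-review percentage"
            )


if __name__ == "__main__":
    DomainPack("sql", "EXECUTABLE", 0.0)  # accepted: tests are the oracle
    try:
        DomainPack("essays", "JUDGMENT", 0.0)
    except PackConfigError as err:
        print("rejected at load time:", err)
```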

Provenance and evidence make a dataset auditable

A row in the output is not just an input and a target. It carries provenance: which domain pack produced it, which tier verified it, what evidence backs the correctness claim, and what it cost. That metadata is what lets you answer the question every dataset consumer eventually asks, which is “why should I believe this row?” For an EXECUTABLE example the evidence is a passing test. For a COMPARATIVE one it is the agreement of independent methods plus the review status. The evidence is specific to the tier that produced it, so the strength of the claim travels with the data instead of getting lost.
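
As a rough illustration, a row with its provenance might be modeled like this. The class and field names (Row, Provenance, evidence, cost) and the sample values are invented for the sketch; the source only says that a row carries the pack, the tier, the evidence, and the cost.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    pack: str       # domain pack that produced the row
    tier: str       # tier that verified it
    evidence: dict  # tier-specific proof, e.g. a test result
    cost: float     # what the row cost; the unit is an assumption


@dataclass(frozen=True)
class Row:
    input: str
    target: str
    provenance: Provenance


def to_jsonl(row: Row) -> str:
    """Serialize one row, provenance included, as a single JSON line."""
    return json.dumps(asdict(row), sort_keys=True)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    row = Row(
        input="Count the rows in table t.",
        target="SELECT COUNT(*) FROM t;",
        provenance=Provenance(
            pack="sql",
            tier="EXECUTABLE",
            evidence={"kind": "test", "passed": True},
            cost=0.002,
        ),
    )
    print(to_jsonl(row))
```

Because the evidence travels inside the row, a consumer can filter by tier or by evidence kind without going back to the pipeline that produced the data.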

An architectural detail worth stealing: engine purity

The codebase enforces a hard wall between the headless engine and the GUI: the engine directory may never import Qt. This is not style policing. It is checked by a test that greps every file under the engine for a PySide6 or PyQt import and fails the build if it finds one. The payoff is that the dataset factory runs identically from a CLI, a CI job, or behind a GUI, because the logic has no idea a GUI exists. A test that reads source code to enforce an architectural boundary is a pattern more projects should copy: the rule is only real if something fails when you break it.
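
A source-scanning test of this kind can be very small. The sketch below is an approximation, not the project's actual test: the directory name "engine", the helper find_qt_imports, and the regular expression are assumptions. It only catches imports that begin a line with "import" or "from", so a form such as "import os, PySide6" would slip through.

```python
import re
from pathlib import Path

# Matches lines such as "import PySide6" or "from PyQt5.QtCore import Qt".
QT_IMPORT = re.compile(
    r"^\s*(?:import|from)\s+(?:PySide6|PyQt\d*)\b", re.MULTILINE
)


def find_qt_imports(engine_dir: Path) -> list[Path]:
    """Return every .py file under engine_dir that imports Qt."""
    offenders = []
    for path in sorted(engine_dir.rglob("*.py")):
        if QT_IMPORT.search(path.read_text(encoding="utf-8")):
            offenders.append(path)
    return offenders


def test_engine_never_imports_qt() -> None:
    # "engine" is a hypothetical path; point it at the real engine package.
    offenders = find_qt_imports(Path("engine"))
    assert not offenders, f"Qt imported inside the engine: {offenders}"
```

Run under a test runner such as pytest, a single stray Qt import anywhere under the engine turns the build red, which is what makes the boundary enforceable rather than aspirational.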

The throughline is distrust applied early. Decide what counts as proof before you generate, refuse to grade yourself, and write down the evidence as you go. What comes out the other end is not just data; it is data you can defend.