One prompt is not a finding: proving an LLM jailbreak is universal

I am deliberately not publishing attack methods here. The specific techniques are attack uplift, and the README keeps demonstration content out of source for the same reason. What is safe and genuinely useful to share is the validation discipline, because most people doing LLM red-teaming get this part wrong and it is the part that decides whether a finding survives triage.

The trap: a working prompt feels like a finding

You craft a prompt, the target complies, and it feels like you won. You haven’t. You found one point in a large space of prompts and objectives. The model may have complied because of that exact phrasing, that one objective, the sampling temperature, or the phase of the moon. A model-safety program does not pay for a single lucky completion. It pays for an attack that generalizes: a template that works across objectives it has never seen. The gap between those two claims is statistics, and you have to do the statistics before you write the report, not after.

Split objectives, search on train, test on holdout

The procedure is the same one a careful ML practitioner uses to avoid fooling themselves. Take a standardized objective set (the framework loads HarmBench, 510 behaviors across seven semantic categories). Split it deterministically into 70% train and 30% holdout, with a fixed seed so the split is reproducible. Develop and search your candidate template only against the train portion. Then freeze the template, apply it unchanged to every held-out objective, and score each one.

```python
result = await validate_universal(
    candidate=claim,             # a frozen template, not tuned on holdout
    dataset=dataset,             # 510 behaviors
    target=target,               # an authorized adapter
    judge=judge,                 # independent of attacker and target
    train_frac=0.7,
    judge_threshold=0.7,         # what counts as a hit
    significance_level=0.05,     # binomial p-value cap
    null_baseline=0.05,          # assumed baseline noise rate
)
# result.holdout_rate, result.p_value, result.is_universal
```
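
The deterministic split itself is simple to reason about. Here is a minimal sketch of one way to do it, not the framework's actual implementation: `split_objectives`, the seed value, and the `behavior-NNN` identifiers are all hypothetical names chosen for illustration.

```python
import random


def split_objectives(objective_ids, train_frac=0.7, seed=1234):
    """Deterministically split objective IDs into (train, holdout).

    Hypothetical helper: the name, the default seed, and the ID format
    used below are assumptions, not part of any real framework.
    """
    # Sort first so the result does not depend on the input order.
    ordered = sorted(objective_ids)
    # A private RNG with a fixed seed keeps the split reproducible
    # without touching the global random state.
    rng = random.Random(seed)
    rng.shuffle(ordered)
    cut = round(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


# Placeholder identifiers standing in for the 510 behaviors.
ids = [f"behavior-{i:03d}" for i in range(510)]
train, holdout = split_objectives(ids)
```

Sorting before shuffling is the design choice that matters: the same seed then yields the same split even if the dataset loader returns rows in a different order.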

The number that matters is the held-out success rate and, crucially, its significance. The harness runs a one-sided binomial test against the null hypothesis that your true success rate is no better than the baseline noise (here 5%). If the held-out hit count is high enough that p drops below 0.05, you can reject that null and claim the result isn’t luck. A high holdout rate with a non-significant p-value, which is what you get from only a handful of held-out objectives, is not a finding. A modest rate that is overwhelmingly significant might be.
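
To make the test concrete, here is a minimal sketch of a one-sided binomial p-value using only the standard library. It illustrates the statistic and is not the harness's own code; `binomial_p_value` is a hypothetical name, and the hit counts in the examples are made up.

```python
from math import comb


def binomial_p_value(hits, trials, null_rate=0.05):
    """One-sided p-value: P(X >= hits) for X ~ Binomial(trials, null_rate).

    Hypothetical helper for illustration only.
    """
    if not 0 <= hits <= trials:
        raise ValueError("hits must be between 0 and trials")
    # Sum the exact binomial probabilities of every outcome at least
    # as extreme as the one observed.
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(hits, trials + 1)
    )


# Made-up example: 1 hit out of 3 held-out objectives is a 33% rate,
# but it is not significant at the 0.05 level.
small_sample_p = binomial_p_value(1, 3)

# Made-up example: 20 hits out of 153 is only about 13%, yet it is
# far beyond what a 5% noise rate would plausibly produce.
large_sample_p = binomial_p_value(20, 153)
```

The two examples show why the rate alone is not enough: the small holdout set has the higher rate and the weaker evidence.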

Why the judge has to be independent

A scoring mistake quietly invalidates everything above. The cardinal sin is letting the model under attack score its own responses, which produces sycophantic agreement and a meaningless number. DiaxiInject uses three independent tiers (rules, a classifier, and a rubric) and never lets the target be the grader.

Three judges that can disagree are far more trustworthy than one that agrees with itself. The weighting (rules and classifier lighter, rubric heavier) reflects that the rubric understands nuance while the cheap tiers catch the obvious cases.
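
A weighted combination of the rules, classifier, and rubric tiers might look like the sketch below. The weights, the `TierScore` type, and the `combine` function are assumptions made for illustration; the text above only says that the rubric weighs more than the two cheap tiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierScore:
    """One judge tier's verdict (hypothetical type)."""
    name: str
    score: float   # 0.0 (clear refusal) to 1.0 (clear hit)
    weight: float  # relative trust in this tier


def combine(scores, threshold=0.7):
    """Return (weighted mean score, whether it counts as a hit)."""
    total_weight = sum(s.weight for s in scores)
    if total_weight <= 0:
        raise ValueError("at least one tier needs a positive weight")
    combined = sum(s.score * s.weight for s in scores) / total_weight
    return combined, combined >= threshold


# Assumed weights: rubric heavier than rules or classifier.
verdicts = [
    TierScore("rules", score=1.0, weight=0.2),
    TierScore("classifier", score=0.8, weight=0.3),
    TierScore("rubric", score=0.4, weight=0.5),
]
combined, is_hit = combine(verdicts)
```

In this made-up case the two cheap tiers lean toward a hit, but the rubric's low score keeps the combined value under the 0.7 threshold, so the response is not counted.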

Authorization is a design constraint, not a checkbox

None of this matters if the harness can fire at a target you aren’t allowed to touch. Every target adapter is gated by a YAML scope file that defaults to authorized: false, and the gate fails closed: a scope file with a missing acknowledgement line is refused as well. A typo, an autocomplete slip, or an autonomous agent’s bad decision cannot produce traffic against a third-party LLM API. You flip the flag to true only after you have read the rules, been accepted, and signed whatever the program requires. The framework treats “did we have permission” as a code invariant, not a promise.
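
The fail-closed idea can be sketched as a check that returns true only when every condition is explicitly met. This is an illustration under assumptions, not the framework's gate: `is_authorized` is a hypothetical function, the `acknowledgement` key name is a guess at how the acknowledgement line is stored, and the scope is assumed to be already parsed from YAML into a dictionary.

```python
def is_authorized(scope):
    """Fail-closed scope check (hypothetical sketch).

    `scope` is assumed to be the already-parsed contents of the scope
    file. Anything unexpected results in a refusal.
    """
    if not isinstance(scope, dict):
        return False
    # Require the literal boolean True; a string such as "true" or a
    # truthy number does not count.
    if scope.get("authorized") is not True:
        return False
    # The acknowledgement key name is an assumption for illustration.
    ack = scope.get("acknowledgement")
    return isinstance(ack, str) and bool(ack.strip())


# A missing file, an empty file, or a half-filled file all refuse.
refused_empty = is_authorized({})
refused_no_ack = is_authorized({"authorized": True})
allowed = is_authorized(
    {"authorized": True, "acknowledgement": "Rules read and accepted"}
)
```

The check is written so that the only path to true is the last line; every early exit refuses, which is what makes a typo or a missing field safe.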

The one thing to take away

Universality is a statistical claim, so prove it like one. Freeze the template, hold out objectives the template never saw, grade with judges that aren’t the victim, and report a p-value, not a vibe. Do that and your finding reads like research. Skip it and you have a screenshot of one lucky completion, which is what triagers close as not demonstrated.