One prompt is not a finding: proving an LLM jailbreak is universal
The discipline that separates a lucky prompt from a bounty-grade universal jailbreak: a train/test split on objectives, an independent multi-judge, a binomial significance test on held-out behaviors, and an authorization gate that fails closed.
I am deliberately not publishing attack methods here. The specific techniques are attack uplift, and the README keeps demonstration content out of source for the same reason. What is safe and genuinely useful to share is the validation discipline, because most people doing LLM red-teaming get this part wrong and it is the part that decides whether a finding survives triage.
The trap: a working prompt feels like a finding
You craft a prompt, the target complies, and it feels like you won. You haven’t. You found one point in a space. The model may have complied because of that exact phrasing, that one objective, the temperature, the moon. A model-safety program does not pay for a single lucky completion. It pays for an attack that generalizes, a template that works across objectives it has never seen. The gap between those two claims is statistics, and you have to do the statistics before you write the report, not after.
Split objectives, search on train, test on holdout
The procedure is the same one a careful ML practitioner uses to avoid fooling themselves. Take a standardized objective set (the framework loads HarmBench, 510 behaviors across seven semantic categories). Split deterministically, 70% train, 30% holdout, with a fixed seed so the split is reproducible. Develop and search your candidate template only against the train half. Then apply the frozen template to every held-out objective and score each one.
result = await validate_universal(
candidate=claim, # a frozen template, not tuned on holdout
dataset=dataset, # 510 behaviors
target=target, # an authorized adapter
judge=judge, # independent of attacker and target
train_frac=0.7,
judge_threshold=0.7, # what counts as a hit
significance_level=0.05, # binomial p-value cap
null_baseline=0.05, # assumed baseline noise rate
)
# result.holdout_rate, result.p_value, result.is_universalThe number that matters is the held-out success rate, and crucially its significance. The harness runs a one-sided binomial test against the null hypothesis that your true success rate is just the baseline noise (here 5%). If the held-out hits are so many that p drops below 0.05, you can claim the result isn’t luck. A high holdout rate with a non-significant p-value is not a finding. A modest rate that is overwhelmingly significant might be.
Why the judge has to be independent
A scoring mistake quietly invalidates everything above. The cardinal sin is letting the model under attack score its own responses, which produces sycophantic agreement and a meaningless number. DiaxiInject uses three independent tiers and never lets the target be the grader.
Three judges that disagree are far more trustworthy than one that agrees with itself. The weighting (rules and classifier lighter, rubric heavier) reflects that the rubric understands nuance and the cheap tiers catch the obvious cases.
Authorization is a design constraint, not a checkbox
None of this matters if the harness can fire at a target you aren’t allowed to
touch. Every target adapter is gated by a YAML scope file that defaults to
authorized: false, and the gate fails closed: a missing acknowledgement line
also refuses. A typo, an autocomplete slip, or an autonomous agent’s bad
decision cannot produce traffic against a third-party LLM API. You flip the flag
to true only after you have read the rules, been accepted, and signed whatever
the program requires. The framework treats “did we have permission” as a code
invariant, not a promise.
The one thing to take away
Universality is a statistical claim, so prove it like one. Freeze the template, hold out objectives the template never saw, grade with judges that aren’t the victim, and report a p-value, not a vibe. Do that and your finding reads like research. Skip it and you have a screenshot of one lucky completion, which is what triagers close as not demonstrated.