Lost in Disclosure

RSDA Workshop @ ISSRE 2019 · S. Johnson, J.F. Ferreira, A. Mendes, J. Cordry · Teesside University / University of Lisbon / University of Beira Interior

In Short

Every time a major service gets breached and its password database leaks onto the internet, security researchers rush to analyse it. But there's a fundamental problem almost nobody had addressed: to understand why users chose the passwords they did, you need to know the rules they were given when creating them — the password composition policy. And that information is almost never included in the breach data. The organisation that was hacked is rarely forthcoming, and the criminals who stole the data don't care. The policy is, in effect, lost in disclosure.

This paper introduces a clean, practical technique to recover that missing information purely from the data itself, treating the problem as one of statistical outlier detection. It also ships pol-infer, an open-source tool implementing the method, validated against four of the best-known real-world breached password datasets: RockYou, Yahoo, 000webhost, and LinkedIn, together comprising over 220 million passwords.

The Breakdown

Problem Breached password datasets are one of the primary raw materials for password security research — they represent how real users actually behave under real constraints, at massive scale. But they come with a hidden flaw: they contain "noise." Some passwords in any given dump won't comply with the policy that was in place, due to formatting errors in the exfiltration process, intentional padding by criminals inflating the dataset's resale value, or multiple overlapping policies within a single dump. Worse, the policy itself is typically unknown. Without knowing what rules users were given, it's extremely difficult to draw meaningful conclusions about how those rules shaped password choice. Prior work had largely acknowledged this problem and worked around it; no one had tackled it head-on.

Approach The paper reframes policy inference as an outlier detection problem. For any measurable password attribute — length, digit count, uppercase count, and so on — you can compute, for each possible minimum threshold value, how sharply the frequency of passwords jumps as you cross that threshold. If a policy mandated a minimum length of 6, you'd expect a dramatic spike in the number of 6-character passwords relative to 5-character ones, because users cluster just above the minimum. That spike — captured mathematically as a multiplier between cumulative frequencies — stands out as a statistical outlier. Where no such spike exists above a set cutoff, you can confidently infer there was no constraint on that attribute. The pol-infer tool implements this across multiple password attributes simultaneously, and was tested on both real-world datasets with known ground-truth policies and synthetic datasets engineered to simulate padding and formatting corruption.
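The spike detection described above can be sketched in a few lines. This is an illustrative reimplementation of the idea, not pol-infer's actual code: the cutoff value of 10 is an assumed illustration, and the function name is hypothetical. For a candidate threshold v, it compares the cumulative count of passwords with attribute value at most v against the cumulative count strictly below v; at the true policy minimum, that multiplier spikes, because almost nothing sits below the minimum except noise.

```python
from collections import Counter

def infer_minimum(values, cutoff=10.0):
    """Infer a minimum-threshold constraint on one password attribute
    (length, digit count, uppercase count, ...) by finding an outlier
    spike in the multiplier between adjacent cumulative frequencies.
    Returns the inferred minimum, or None if no multiplier exceeds
    the cutoff (i.e. the attribute appears unconstrained)."""
    counts = Counter(values)
    if not counts:
        return None
    best_v, best_ratio = None, cutoff
    below = 0  # passwords whose attribute value is strictly below v
    for v in range(min(counts), max(counts) + 1):
        here = below + counts.get(v, 0)
        # A multiplier is only defined once some mass sits below v;
        # real dumps always contain a little non-compliant noise.
        if below > 0:
            ratio = here / below
            if ratio > best_ratio:
                best_v, best_ratio = v, ratio
        below = here
    return best_v
```

Run over password lengths, the function returns the length at which the sharpest jump occurs; run over a dataset with no spike above the cutoff, it returns None, matching the paper's "no constraint on that attribute" inference.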

Key Findings The method correctly recovered the known password composition policies for all four real-world datasets tested. For RockYou: minimum length 5. For Yahoo and LinkedIn: minimum length 6, no other constraints. For 000webhost: minimum length 6 plus a mandatory digit — the digit requirement being inferred from a separate outlier analysis on digit counts. The method also proved robust against the synthetic noisy datasets, successfully recovering the original policy even when the data had been padded with tens of thousands of extraneous records from other sources, or corrupted by formatting errors that artificially generated hundreds of thousands of additional short strings.
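The digit-count analysis mentioned for 000webhost works the same way: compute the cumulative-frequency multiplier at every digit count and look for the outlier. The sketch below uses entirely synthetic counts (the helper name and the noise proportions are assumptions, not figures from the paper) to show how a mandatory-digit rule surfaces as a spike at one digit.

```python
from collections import Counter

def spike_ratios(values):
    """Multiplier between the cumulative frequency at each attribute
    value and the cumulative frequency strictly below it. A policy
    minimum shows up as an outlier ratio."""
    counts = Counter(values)
    below, ratios = 0, {}
    for v in range(min(counts), max(counts) + 1):
        here = below + counts.get(v, 0)
        if below > 0:
            ratios[v] = here / below
        below = here
    return ratios

# Synthetic digit counts for a dump with a mandatory-digit policy:
# a handful of non-compliant noise rows, then the compliant mass.
digit_counts = [0]*30 + [1]*40000 + [2]*25000 + [3]*9000
ratios = spike_ratios(digit_counts)
# The ratio at one digit dwarfs the others -> at least one digit required.
```

Repeating this per attribute (length, digits, uppercase, symbols) and keeping every attribute whose spike clears the cutoff assembles the full inferred policy.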

Real-World Implications The direct application is in password security research: pol-infer gives researchers a principled, automated way to clean and contextualise breach datasets before using them for downstream analysis. But it has a secondary, more operationally interesting implication — given only a set of passwords, you can make statistically grounded inferences about the security posture of the organisation that generated them, without any cooperation from that organisation. That has significant implications for forensic investigation, threat intelligence, and competitive security benchmarking.

So, What?

The credential breach landscape has deteriorated dramatically since this paper was published. The exposure of 16 billion login credentials in June 2025 stands as the largest credential compilation in history — an aggregation of around 30 separate datasets, primarily harvested by infostealer malware. In 2024 alone, 2.8 billion passwords were put up for sale on criminal forums. The raw material this paper works with has never been more abundant, or more consequential.

That scale changes what's possible with a technique like pol-infer — and what's at stake. At billions of records, automated policy inference isn't just useful for academic research. It becomes a tool for continuous threat intelligence: analysing the composition of credential sets circulating in underground markets to infer which organisations they came from, what security posture those organisations had, and how that shapes the risk of credential stuffing and password spraying attacks against them.

There's also a forensic angle that has grown sharper with time. In an era of GDPR in Europe and tightening data protection regulation globally, a breached organisation's password policy is a material fact — it speaks directly to whether adequate security measures were in place. The ability to reconstruct that policy from the breach data itself, without relying on the organisation's disclosure, is exactly the kind of independent verification that regulators, insurers, and incident responders need.

For AI-driven offensive security specifically, the implications are pointed. Automated red-teaming and password attack tooling increasingly rely on breach datasets for wordlist generation, policy-aware password mutation, and targeted credential stuffing. A tool that can automatically characterise the policy constraints embedded in any given dataset — and therefore the likely shape of valid credentials — makes that automation smarter and more targeted. Brute force attacks against web applications grew from 21% to 37% of successful incidents in 2025, a trend that makes policy-aware attack generation an increasingly important capability on both sides of the fence.

The core insight of this paper — that the rules governing a system leave a detectable statistical fingerprint in the data that system produces — is a broadly applicable idea. Password policies are one instance. The same logic applies anywhere behaviour is constrained by policy: network traffic, API usage patterns, authentication logs. The methodology here is a small, clean example of a much larger principle: in security, data always tells you more than it appears to at first glance.