A redaction tool scans for patterns. It finds a name and replaces it with [NAME-1]. It finds an account number and replaces it with [ACCOUNT-1]. Anything it doesn’t recognise as a pattern, it leaves intact.
A tokenisation tool replaces known entities with opaque identifiers. Same strength. Same weakness. The category of data it doesn’t know to protect stays fully exposed.
Neither approach sees a woman in London with 101 dogs. The quasi-identifier pattern is invisible to regex and to the named-entity models most DLP tools are built on. A combination of postcode, age, and job title identifies one specific employee without any of those three fields being flagged as sensitive in isolation. A sentence where the event itself is the identifier, like a heart attack during a specific televised match, cannot be anonymised by removing names.
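The blind spot is easy to reproduce. Here is a toy sketch of a pattern-based redactor; the patterns and the sample note are hypothetical illustrations, not the rules of any real DLP product:

```python
import re

# Toy pattern-based redactor. These patterns are illustrative
# assumptions, not drawn from any real tool.
PATTERNS = {
    "NAME": re.compile(r"\b(?:Mr|Ms|Dr)\.? [A-Z][a-z]+\b"),
    "ACCOUNT": re.compile(r"\b\d{8,12}\b"),
}

def redact(text: str) -> str:
    """Replace each recognised entity with a stable token like [NAME-1]."""
    for label, pattern in PATTERNS.items():
        seen: dict[str, str] = {}  # entity text -> assigned token

        def replace(m, label=label, seen=seen):
            if m.group(0) not in seen:
                seen[m.group(0)] = f"[{label}-{len(seen) + 1}]"
            return seen[m.group(0)]

        text = pattern.sub(replace, text)
    return text

note = ("Dr Patel treated a woman in London who keeps 101 dogs; "
        "account 12345678 is on file.")
print(redact(note))
# The name and the account number are tokenised, but 'a woman in
# London who keeps 101 dogs' passes through untouched: no single
# token matches a sensitive pattern, yet the combination is unique.
```

The failure is structural, not a matter of better regexes: each pattern inspects one span at a time, so a combination of individually innocuous spans can never trigger a match.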
These aren’t edge cases. In clinical notes, legal filings, complaint records, and performance reports, the most commercially valuable data is precisely the contextual narrative that conventional tools cannot redact without destroying.