Tech blog post: the practical problems of trimming hashes
During investigations, the Dutch Data Protection Authority (Dutch DPA) regularly deals with organisations that claim to process anonymous data (and therefore not personal data) because the data has been ‘hashed and trimmed’. In practice, the DPA often finds that organisations make mistakes when applying this form of anonymisation, meaning the data is not actually anonymous. In this blog post, DPA technicians Victor Klos and Jonathan Ellen expand on this topic for tech enthusiasts.
Please note: the correct application of the techniques mentioned here is difficult and often context- and case-dependent. Therefore, this blog post cannot be used as technical or legal advice.
k-anonymity
A commonly used method to anonymise data is k-anonymity. With this method you change a dataset so that every combination of attribute values that occurs in it occurs at least k times.
Under the right circumstances and when k is large enough, it becomes impossible to single out an individual. After all, every person is part of a group with at least (k-1) others that share the same attributes.
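As a minimal sketch of what this definition means, the k of a dataset can be computed as the size of the smallest group of records that share the same attribute values. The records and the function name `k_of` below are made up for illustration only.

```python
from collections import Counter

def k_of(records):
    """Return the size of the smallest group of identical attribute combinations."""
    counts = Counter(records)
    return min(counts.values())

# Hypothetical records: (age group, first part of a postcode)
records = [
    ("20s", "10"), ("20s", "10"), ("20s", "10"),
    ("30s", "10"), ("30s", "10"),
]

print(k_of(records))  # 2: the smallest group, ("30s", "10"), has two members
```

A result of 1 means that at least one record is unique in the dataset and can therefore be singled out.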
Trimming
One way to create groups is by rounding attributes. For example, if you round down all ages from a dataset to groups of tens, clusters will arise automatically. A person who is 29 will be part of the same group as a person who is 21 or someone who is 27, namely the 20s age group.
With a bit of imagination, you can achieve the same result by trimming. Take, for example, the age of 26. Trimming off 1 symbol from the right leaves 2.
After trimming, everyone aged 20 to 29 again falls into the same group. (Depending on the application, you can append another symbol after trimming, such as a 0, but that doesn't change the effect.)
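The equivalence between rounding down and trimming can be sketched as follows. The function names and the list of ages are hypothetical. Note that trimming works on the written form of the value, so this sketch assumes two-digit ages.

```python
def round_down_to_tens(age):
    """Round an age down to the start of its group of ten."""
    return (age // 10) * 10

def trim_right(value, n=1):
    """Drop the last n symbols of the written form of a value."""
    return str(value)[:-n]

ages = [21, 26, 27, 29, 34]  # made-up, two-digit ages

print([round_down_to_tens(a) for a in ages])  # [20, 20, 20, 20, 30]
print([trim_right(a) for a in ages])          # ['2', '2', '2', '2', '3']

# Appending a 0 after trimming gives the same labels as rounding down.
print([trim_right(a) + "0" for a in ages])    # ['20', '20', '20', '20', '30']
```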
Hashing First
The situation is different when you have an identifier of a person or a person-related device, such as a phone number, IP address, MAC address, IMSI number or something similar.
For example, a trimmed IP address can still reveal which internet provider a person uses, or sometimes even the area in which they live. To avoid such inferences, this type of data is often hashed first.
Image of hashing phone numbers
Also note that, for the possibility of singling out individuals, it often makes no difference whether identifiers themselves have meaning (IP addresses, IMSI numbers, MAC addresses, etc.) or are meaningless (hash values, random numbers, series of symbols, etc.).
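As a minimal sketch of the hashing step, assuming SHA-256 as the hash function and using made-up phone numbers, hashing an identifier could look like this. The helper name `sha256_hex` is hypothetical.

```python
import hashlib

def sha256_hex(identifier):
    """Return the SHA-256 hash of a text identifier as 64 hexadecimal symbols."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()

# Made-up phone numbers that differ in the last digit only
numbers = ["0612345678", "0612345679"]

for number in numbers:
    print(number, sha256_hex(number))
```

The same input always produces the same hash value, so the hash identifies a person just as well as the original number does, even though it no longer reveals anything about, for example, the provider.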
Trimming Hashes
Hash values are not random, even if they look like they are. What also stands out is that they are made up of many symbols. In other words: there are a lot of possible hash values.
And that’s the pitfall too: although there are many possible outcomes, there is usually only a limited number of inputs. To put it a bit more formally: in practice, the codomain of the hash function is many times larger than the set of inputs that actually occur.
In total, about 54 million mobile phone numbers have been issued in the Netherlands. That sounds like a lot, but this is only a tiny fraction of the possible SHA-256 hashes. And that is where our intuition fails us.
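To make the proportions concrete, the arithmetic can be sketched as follows. The figure of 54 million comes from the text. The assumption that hashes are written as 64 hexadecimal symbols (the usual notation for SHA-256) is ours.

```python
ISSUED_NUMBERS = 54_000_000   # figure taken from the text
POSSIBLE_HASHES = 2 ** 256    # SHA-256 produces 256-bit values

# Fraction of all possible hash values that can ever occur as output
fraction = ISSUED_NUMBERS / POSSIBLE_HASHES
print(fraction)  # on the order of 1e-70

# Smallest number of hexadecimal symbols that offers at least as many
# distinct values as there are issued phone numbers
symbols = 0
while 16 ** symbols < ISSUED_NUMBERS:
    symbols += 1
print(symbols)  # 7
```

In other words, 7 of the 64 symbols already offer more distinct values than there are phone numbers. The remaining symbols add almost nothing to the ability to tell inputs apart, which is why trimming a few of them changes so little.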
By trimming 2 symbols off an unhashed phone number, groups of up to 100 phone numbers can be made. Depending on how many numbers are in the dataset, this might result in k > 1.
Further research will then have to be done to reveal whether trimming off 2 symbols is sufficient, or whether 3, 4 or even more need to be trimmed, possibly depending on the number sequence.
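A minimal sketch of this check, using made-up phone numbers and a hypothetical helper `trim_right`, groups the numbers by what remains after trimming and looks at the size of the smallest group.

```python
from collections import Counter

def trim_right(value, n):
    """Drop the last n symbols of a text value."""
    return value[:-n]

# Made-up, unhashed phone numbers
numbers = ["0612345601", "0612345647", "0612345699", "0698765432"]

groups = Counter(trim_right(number, 2) for number in numbers)
print(groups["06123456"])    # 3
print(groups["06987654"])    # 1
print(min(groups.values()))  # 1: one number is still alone in its group
```

In this small example, trimming 2 symbols is not sufficient: one phone number remains unique. Whether more symbols need to be trimmed depends on the actual dataset.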
The situation after hashing is quite different. In this case each hash value typically remains unique, even after some symbols have been trimmed off.
Image of hashed phone numbers where some symbols have been trimmed off
But how much do you have to trim off to form groups of hashed attributes? The answer depends on the dataset, but in many cases: almost all of it.
For example, take the first phone number from the figure above. Even though the first phone number differs by only 1 digit from the second phone number, the hash values are completely different.
And if the entire dataset resembled the 4 example phone numbers, it would even be impossible to create groups by trimming symbols off the hashed phone numbers.
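The effect can be explored with a small experiment. The sketch below assumes SHA-256, a made-up dataset of 10,000 consecutive phone numbers, and trimming from the right. The helper names are hypothetical. For each number of symbols kept, it prints the size of the smallest group.

```python
import hashlib
from collections import Counter

def sha256_hex(identifier):
    """Return the SHA-256 hash of a text identifier as 64 hexadecimal symbols."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()

def smallest_group(values, keep):
    """Keep the first `keep` symbols of each value and return the smallest group size."""
    groups = Counter(value[:keep] for value in values)
    return min(groups.values())

# Hypothetical dataset: 10,000 consecutive, made-up phone numbers
numbers = [f"06{i:08d}" for i in range(10_000)]
hashes = [sha256_hex(number) for number in numbers]

for keep in (64, 16, 8, 4, 2, 1):
    trimmed = 64 - keep
    print(f"trimmed {trimmed:2d}, kept {keep:2d}: smallest group = {smallest_group(hashes, keep)}")
```

With a dataset of this size, one would expect the smallest group to stay at 1 until only a couple of symbols remain: four symbols still allow 65,536 different values, far more than the 10,000 hashes in the dataset.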
Therefore, the right question is not how much you have to trim off, but how much you can keep.
Because what is the consequence of trimming too few symbols off hashed attributes? A dataset that still contains personal data.
Insufficient trimming of hash values leaves unique identifiers. Such data can therefore not be called anonymised.