Usually, database maintainers try to eliminate any channel that could help an attacker leverage queries to obtain personal or sensitive information about a specific person. Here are a few examples:
Pseudonymization: This method of processing personal data replaces values that contain personal information with pseudorandom strings. The de-identified data is stored separately from the ‘additional information’ (such as the secret key or lookup table) needed to re-identify it, so the data becomes identifiable only when both elements are brought together. In practice, one ‘real’ sensitive value corresponds to exactly one pseudorandom value, so analytical correlations are still possible. To make this one-to-one mapping hard to reverse, cryptographic primitives are usually used, typically keyed hash functions (e.g., HMAC with SHA-512). This means that an attacker who doesn’t have the secret key can’t recover the original values from the pseudonyms.
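A minimal sketch of this kind of keyed-hash pseudonymization, using Python's standard `hmac` and `hashlib` modules (the key name and function name are illustrative, not from the original):

```python
import hmac
import hashlib

# Hypothetical secret key; in practice it would be generated randomly and
# stored separately from the pseudonymized data (the 'additional information').
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(value: str) -> str:
    """Map a sensitive value to a deterministic pseudorandom string.

    The same input always yields the same pseudonym, so joins and
    group-bys on the pseudonymized column still work.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha512).hexdigest()

# The mapping is consistent, so analytical correlations survive,
# but without SECRET_KEY the pseudonyms cannot be reversed.
assert pseudonymize("alice@example.com") == pseudonymize("alice@example.com")
assert pseudonymize("alice@example.com") != pseudonymize("bob@example.com")
```

Because the mapping is deterministic, two records belonging to the same person still share a pseudonym, which is exactly what preserves the data's analytical value.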
Anonymization: The main difference between pseudonymization and anonymization is that with pseudonymization there is still a pathway (the secret key or mapping) from the de-identified data back to the sensitive/personal data. With anonymization, however, there is no way back: it is an irreversible removal of any information that could lead to the individual being identified. Just as with pseudonymization, anonymized data should be stripped of any kind of identifying information.
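One simple form of irreversible anonymization is dropping the identifying fields outright. A toy sketch (the record fields and the set of identifiers are assumed for illustration):

```python
# Toy records mixing direct identifiers with analytical attributes.
records = [
    {"name": "Alice Smith", "ssn": "123-45-6789", "age": 34, "diagnosis": "flu"},
    {"name": "Bob Jones",   "ssn": "987-65-4321", "age": 51, "diagnosis": "asthma"},
]

IDENTIFIERS = {"name", "ssn"}  # assumed set of directly identifying fields

def anonymize(record: dict) -> dict:
    """Drop identifying fields entirely; no key or mapping is kept,
    so the removal is irreversible."""
    return {k: v for k, v in record.items() if k not in IDENTIFIERS}

anonymized = [anonymize(r) for r in records]
# anonymized == [{"age": 34, "diagnosis": "flu"},
#                {"age": 51, "diagnosis": "asthma"}]
```

Unlike the pseudonymization sketch above, there is no secret key: once the identifiers are gone, nothing in the output links a row back to a person.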
Suppression: This technique is quite similar to the previous one, but instead of replacing sensitive/personal data with pseudorandom strings, it replaces them with a hard-coded sequence such as ‘***’. Suppression is also called data masking, and as with anonymization, there is no way to retrieve the original values.
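A masking helper might look like the following sketch; the optional `visible` parameter (an assumption, not from the original) covers the common variant where the last few characters are kept, as with card numbers:

```python
def mask(value: str, visible: int = 0) -> str:
    """Replace a value with a hard-coded masking sequence, optionally
    keeping the last `visible` characters (e.g., for card numbers)."""
    if visible <= 0:
        return "***"
    return "***" + value[-visible:]

print(mask("4111111111111111"))     # ***
print(mask("4111111111111111", 4))  # ***1111
```

Note that, unlike pseudonymization, every masked value collapses to the same string, so correlations across records are lost along with the original data.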
Encryption: Let’s compare encryption to pseudonymization. Both can be built on the same cryptographic primitives and a secret key, but the difference is that encrypted data is meant to be decrypted: anyone holding the key can recover the original values, whereas pseudonymization aims to keep the mapping one-way in everyday use. Encryption is also explicitly named by GDPR as an appropriate safeguard, and the encryption strength should be good enough: controllers are required to implement risk-based measures to protect data security.
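To illustrate the reversibility that distinguishes encryption, here is a deliberately minimal one-time-pad sketch using only the standard library. This is a teaching toy, not a production cipher; in practice an authenticated scheme (e.g., AES-GCM via a vetted library) would be used:

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the corresponding key byte."""
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"alice@example.com"
key = secrets.token_bytes(len(plaintext))  # random key, kept secret

ciphertext = xor_bytes(plaintext, key)   # unreadable without the key
recovered = xor_bytes(ciphertext, key)   # fully reversible with the key

assert recovered == plaintext
```

The key point is the last line: given the key, the original value comes back exactly, which is precisely what a keyed hash used for pseudonymization is designed to prevent.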
No meaningful machine learning can be performed on data that has been anonymized or masked with the techniques described above. The reason is obvious: the features that carried any information gain have been removed along with the identifiers. Pseudonymization, by contrast, preserves the data’s analytical structure. As you can see, there are plenty of ways for data scientists to work with datasets that don’t contain any sensitive information and still perform meaningful computations.
This is the second part of a five-part series about de-identifying and securing personal data by 1touch.io. To go back to part one, click here. To move forward to part three, about machine-learning approaches to de-identifying personal data, click here.