Part 2: Standard Ways to De-Identify Personal Data
Generally, database administrators try to eliminate every channel that could help an attacker leverage queries to learn personal or sensitive information about a specific person. Here are a few examples:
Pseudonymization: This method of processing personal data replaces values that contain personal information with pseudorandom strings. The de-identified data, which on its own no longer contains any personal or sensitive information, is stored separately from the ‘additional information’ needed to re-identify it, so the data becomes identifiable only when both elements are brought together. In practice, each ‘real’ sensitive value corresponds to exactly one pseudorandom value, so analytical correlations remain possible. Because of this one-to-one mapping between ‘real’ and ‘random’ values, keyed cryptographic hash functions are often used (for example, an HMAC built on SHA-512). This ensures that an attacker who does not have access to the secret key cannot recover the original values.
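A minimal sketch of keyed pseudonymization in Python, assuming a hypothetical secret key that would in practice be stored separately from the pseudonymized dataset (for example, in a key management service):

```python
import hmac
import hashlib

# Hypothetical secret key; kept apart from the de-identified data so that
# re-identification requires both elements.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

def pseudonymize(value: str) -> str:
    """Map a personal value to a stable pseudorandom token.

    The same input always yields the same token, so joins and
    aggregations across records remain possible.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha512).hexdigest()

record = {"name": "Alice Smith", "diagnosis": "flu"}
record["name"] = pseudonymize(record["name"])
print(record)  # {'name': '9f3c…', 'diagnosis': 'flu'}
```

Because the mapping is deterministic under the key, two records belonging to the same person still link together, which is exactly what keeps the data useful for analysis.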
Anonymization: The main difference between pseudonymization and anonymization is that with rules or cryptographic algorithms (pseudonymization) there is still a pathway back to the sensitive/personal data hidden inside the de-identified information. With anonymization there is no going back, provided the removal of any information that could identify the individual is truly irreversible. Just as with pseudonymization, anonymized data should be stripped of any kind of identifying information.
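As a toy illustration, anonymization typically means dropping direct identifiers and irreversibly coarsening quasi-identifiers. The column names below are hypothetical; this is a sketch with pandas, not a complete anonymization procedure:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],   # direct identifier
    "age": [34, 57],                        # quasi-identifier
    "zip_code": ["10115", "80331"],         # quasi-identifier
    "diagnosis": ["flu", "asthma"],         # analytical value
})

# Drop the direct identifier and generalize quasi-identifiers; neither step
# can be undone from the resulting table alone.
anonymized = (
    df.drop(columns=["name"])
      .assign(
          age=lambda d: pd.cut(d["age"], bins=[0, 30, 50, 70, 120],
                               labels=["0-30", "31-50", "51-70", "71+"]),
          zip_code=lambda d: d["zip_code"].str[:2] + "***",
      )
)
print(anonymized)
```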
Suppression: This technique is similar to the previous one, but instead of replacing sensitive/personal data with random strings, it replaces them with hard-coded sequences such as ‘***’. Suppression is also called data masking, and as with anonymization, there is no way to retrieve the original values.
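A small sketch of suppression/masking; the partial-masking option is a hypothetical variant sometimes used when the last few characters are needed for verification:

```python
def mask(value: str, keep_last: int = 0) -> str:
    """Replace a value with a fixed masking sequence, optionally keeping
    the last few characters visible."""
    if keep_last <= 0:
        return "***"
    return "***" + value[-keep_last:]

print(mask("alice.smith@example.com"))        # '***'
print(mask("4111111111111111", keep_last=4))  # '***1111'
```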
Encryption: The obvious comparison for encryption is pseudonymization. Both can rely on the same cryptographic building blocks and a secret key, but the difference is that encryption is designed to be reversed: anyone holding the decryption key can recover the original values, whereas pseudonymization keeps the mapping to the original values stored separately. Encryption is also explicitly referenced in the GDPR as an appropriate safeguard, and its strength is expected to be adequate: controllers are required to implement risk-based measures to protect data security.
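A minimal sketch of reversible encryption, assuming the third-party Python ‘cryptography’ package and its Fernet recipe (symmetric, authenticated encryption):

```python
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()   # secret key, kept separate from the data
cipher = Fernet(key)

# Ciphertext that would be stored in the database instead of the raw value.
token = cipher.encrypt(b"Alice Smith")
print(token)

# Anyone holding the key can reverse the transformation.
print(cipher.decrypt(token))  # b'Alice Smith'
```

The ability to call decrypt with the key is precisely what separates encryption from the irreversible techniques above.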
No meaningful machine learning can be performed on data anonymized with the techniques described above. The reason is obvious: the features that carry information gain have been removed from the data. However, this does not mean that data scientists must fall back on datasets containing personal information, because there are a variety of approaches for making meaningful computations on de-identified datasets without revealing sensitive information.