Database maintainers usually try to eliminate every channel through which an attacker could use queries to obtain personal or sensitive information about a specific person.
Here are a few examples:
Pseudonymization: This method of processing personal data replaces values that contain personal information with pseudorandom strings. The de-identified data, which on its own no longer contains any personal/sensitive information, is stored separately from the ‘additional information’ needed for re-identification, so the data becomes identifiable only when both elements are brought together. In practice, each ‘real’ sensitive value corresponds to exactly one pseudorandom value, which keeps analytical correlations possible. Because of this one-to-one dependency between the ‘real’ value and the ‘random’ value, cryptographic methods are often used, such as keyed hash functions built on SHA-512. An attacker who doesn’t have access to the secret key then can’t recover the original values from the pseudonymized ones.
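The one-to-one mapping described above can be sketched with a keyed hash. This is a minimal illustration using Python’s standard library, assuming HMAC with SHA-512 as the keyed construction; the `pseudonymize` helper, the sample records, and the key value are hypothetical and not taken from any particular product.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a deterministic pseudonym for `value` using HMAC-SHA-512."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha512)
    return digest.hexdigest()


# Hypothetical key for illustration only; a real key would be generated
# randomly and stored apart from the de-identified data.
SECRET_KEY = b"example-key-stored-separately"

# Hypothetical sample records.
records = [
    {"email": "alice@example.com", "purchase": 30},
    {"email": "bob@example.com", "purchase": 12},
    {"email": "alice@example.com", "purchase": 7},
]

# Replace the sensitive field and keep the analytical field untouched.
pseudonymized = [
    {"email": pseudonymize(r["email"], SECRET_KEY), "purchase": r["purchase"]}
    for r in records
]
```

Because the same input and key always give the same pseudonym, the two purchases made by the same person can still be grouped together, while the email address itself no longer appears in the dataset.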
Anonymization: The main difference between pseudonymization and anonymization is that with pseudonymization, which relies on rules or cryptographic algorithms, there is still a pathway to retrieve sensitive/personal data from the de-identified information. With anonymization, however, there’s no going back, as long as every piece of information that could lead to the individual being identified has been irreversibly removed. Just as with pseudonymization, anonymized data should be stripped of any kind of identifying information.
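As a rough sketch of the “no going back” property, the snippet below replaces each sensitive value with a fresh random token and keeps no mapping or key. The `anonymize` helper and the field names are hypothetical, and the sketch assumes that the listed fields are the only identifying ones; any remaining column that could single out a person would have to be handled as well.

```python
import secrets


def anonymize(records, sensitive_fields):
    """Replace sensitive values with fresh random tokens; no mapping is kept."""
    anonymized = []
    for record in records:
        clean = dict(record)  # copy, so the input records are not modified
        for field in sensitive_fields:
            if field in clean:
                # A new token on every call: nothing links it to the original
                # value, and equal inputs no longer produce equal outputs.
                clean[field] = secrets.token_hex(8)
        anonymized.append(clean)
    return anonymized


# Hypothetical sample records.
records = [
    {"email": "alice@example.com", "purchase": 30},
    {"email": "alice@example.com", "purchase": 7},
]

result = anonymize(records, {"email"})
```

Unlike the pseudonymized case, the two rows that belonged to the same person now carry unrelated tokens, so per-person correlations are lost along with the identity.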
Suppression: This technique is quite similar to anonymization, but instead of replacing sensitive/personal data with random strings, it replaces them with hard-coded sequences, such as ‘***’. Suppression is also called data masking, and as with anonymization, there’s no way to retrieve the original values.
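Suppression is simple enough to show in a few lines. The sketch below is only an illustration: the `suppress` helper, the field names, and the default mask are hypothetical.

```python
def suppress(record, sensitive_fields, mask="***"):
    """Return a copy of `record` with every sensitive field set to `mask`."""
    return {
        key: (mask if key in sensitive_fields else value)
        for key, value in record.items()
    }


# Hypothetical sample record.
row = {"name": "Alice", "email": "alice@example.com", "age": 30}
masked = suppress(row, {"name", "email"})
# masked == {"name": "***", "email": "***", "age": 30}
```

Every masked field holds the same constant, so the output carries no trace of the original values.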
Encryption: Let’s compare encryption to pseudonymization. Both can rely on the same cryptographic algorithms; the difference is that pseudonymization uses a secret key to produce pseudorandom values that stand in for the originals, whereas encrypted data is meant to be decrypted by whoever holds the key. GDPR also bears on encryption: controllers are required to implement risk-based measures to protect data security, so the encryption strength is expected to be adequate for the risk.
No kind of machine learning can be performed on data that has been anonymized using the techniques described above. The reason is straightforward: every feature that carries some information gain has been removed from the data. However, this doesn’t mean that data scientists should fall back on datasets containing personal information, because there are a variety of approaches for making meaningful computations on de-identified datasets without revealing sensitive information.
This is the second part of a five-part series about de-identifying and securing personal data by 1touch.io. Part three covers the machine-learning ways to de-identify personal data.