Part 3: Machine Learning Ways to De-Identify Personal Data (Homomorphic Encryption)
homomorphic encryption, privacy, security, machine deidentify, personal data

Homomorphic Encryption: The main idea behind homomorphic encryption is that computations performed on encrypted data should, once decrypted, give the same results as the same computations performed on the unencrypted data, so the inferences we make are just as accurate. Homomorphic encryption is an evolving field and currently has certain limitations. For example, only polynomial functions can be computed, because the only operations allowed are additions and multiplications of integers modulo n. Many mathematical operations that are used even in the simplest neural networks are therefore not available when training a model on homomorphically encrypted data. The methodology is still under active development.
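To make the idea of computing on encrypted values concrete, here is a minimal sketch of a toy Paillier cryptosystem in Python. Paillier is swapped in here purely for illustration: it is only additively homomorphic (it supports adding ciphertexts and multiplying a ciphertext by a plaintext constant), so it is simpler than the schemes that allow both additions and multiplications. The tiny primes and the function names (`encrypt`, `decrypt`, `add_encrypted`, `mul_by_constant`) are assumptions chosen for readability, and the code is not secure.

```python
import math
import secrets

# Toy parameters: tiny primes chosen for readability, NOT for security.
P, Q = 1009, 1013
N = P * Q                # public modulus
N_SQ = N * N             # ciphertexts live modulo N^2
G = N + 1                # common simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1), private
MU = pow(LAM, -1, N)     # modular inverse of LAM mod N (Python 3.8+), private


def encrypt(m: int) -> int:
    """Encrypt an integer 0 <= m < N with fresh randomness."""
    while True:
        r = secrets.randbelow(N - 1) + 1   # random r in [1, N-1]
        if math.gcd(r, N) == 1:            # r must be invertible mod N
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(c: int) -> int:
    """Recover the plaintext using the private values LAM and MU."""
    x = pow(c, LAM, N_SQ)
    return ((x - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts (mod N)."""
    return (c1 * c2) % N_SQ


def mul_by_constant(c: int, k: int) -> int:
    """Raising a ciphertext to k multiplies the plaintext by k (mod N)."""
    return pow(c, k, N_SQ)


# The party doing the arithmetic never sees 20 or 22, only ciphertexts.
total = add_encrypted(encrypt(20), encrypt(22))
print(decrypt(total))  # 42
```

Note that all arithmetic is on integers modulo N, which mirrors the limitation described above: anything that is not an addition or a multiplication has to be expressed, or approximated, in those terms.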

The main appeal of homomorphic encryption is that we don’t need to remove any values from the dataset, or mask or anonymize personal data in any way. However, as of the time of writing, there is not enough practical evidence to say that it is ready for production-level use; furthermore, few functional homomorphic encryption pipelines exist.

Let’s imagine a situation where we’ve removed all personal data from the dataset (or anonymized it and stored it separately from the other values). Most likely, even after the personal data has been removed, quasi-identifiers (QIs) are still left in the database.

The biggest problem with storing quasi-identifiers is that, if the database is breached, it isn’t all that difficult to combine QI values with other open data sources and reveal a person’s identity together with their personal or sensitive information. A good example is the Netflix Prize competition: when its open data was combined with IMDb’s movie ratings dataset, the entire movie-watching history of individuals was compromised.
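The kind of linkage described above can be sketched in a few lines of Python. Everything here is hypothetical: the records, the names, and the choice of ZIP code, birth year, and gender as quasi-identifiers are assumptions made up for illustration, not data from any real incident.

```python
# Hypothetical "de-identified" records: names removed, quasi-identifiers kept.
deidentified = [
    {"zip": "10001", "birth_year": 1985, "gender": "F", "diagnosis": "asthma"},
    {"zip": "10002", "birth_year": 1990, "gender": "M", "diagnosis": "diabetes"},
    {"zip": "10001", "birth_year": 1985, "gender": "M", "diagnosis": "flu"},
]

# Hypothetical public dataset that lists names next to the same QIs.
public = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1985, "gender": "F"},
    {"name": "Bob Example", "zip": "10002", "birth_year": 1990, "gender": "M"},
    {"name": "Carol Example", "zip": "10003", "birth_year": 1970, "gender": "F"},
]

QI_FIELDS = ("zip", "birth_year", "gender")  # assumed quasi-identifiers


def qi_key(record):
    """Build a tuple of the record's quasi-identifier values."""
    return tuple(record[field] for field in QI_FIELDS)


def link(deidentified_rows, public_rows):
    """Return (name, diagnosis) pairs where the QIs match exactly one person."""
    names_by_qi = {}
    for person in public_rows:
        names_by_qi.setdefault(qi_key(person), []).append(person["name"])

    matches = []
    for row in deidentified_rows:
        names = names_by_qi.get(qi_key(row), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(deidentified, public)
print(reidentified)
```

In this sketch two of the three "anonymous" rows are tied back to a name, even though no direct identifier was ever stored alongside the diagnosis.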

As a result, insecure datasets and data science pipelines that make predictions from data containing QIs can reveal potentially sensitive or personal information even after the personal or sensitive data itself has been removed. We need to make sure that no query can be leveraged to reveal an individual’s personal information. Furthermore, we must make sure that no inference about the data subject can be made by running multiple predictions using machine learning algorithms.
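One way a seemingly harmless query interface can leak individual information is a differencing attack, where two aggregate results are subtracted to isolate one person. The sketch below is a hypothetical illustration: the records, the `count_with_condition` helper, and the assumption that the attacker knows the target is the only 52-year-old in the dataset are all invented for the example.

```python
# Hypothetical records; no names are stored, only an age and a sensitive flag.
records = [
    {"age": 34, "has_condition": True},
    {"age": 41, "has_condition": False},
    {"age": 52, "has_condition": True},   # the target, the only 52-year-old
    {"age": 29, "has_condition": False},
    {"age": 47, "has_condition": True},
]


def count_with_condition(rows, predicate):
    """Aggregate query: how many rows matching `predicate` have the condition?"""
    return sum(1 for row in rows if predicate(row) and row["has_condition"])


# Query 1: everyone in the dataset.
everyone = count_with_condition(records, lambda row: True)

# Query 2: everyone except 52-year-olds.
everyone_but_target = count_with_condition(records, lambda row: row["age"] != 52)

# The difference between the two aggregates is the target's private value.
target_has_condition = (everyone - everyone_but_target) == 1
print(target_has_condition)  # True
```

Neither query returns an individual row, yet together they disclose one person’s sensitive attribute, which is why the set of allowed queries has to be considered as a whole.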

This is the third post in our Deidentifying and Securing Personal Data Series.
