Part 3: Machine Learning Approaches to De-Identifying Personal Data (Homomorphic Encryption)
Homomorphic Encryption: The main idea behind homomorphic encryption is that inferences made by computing on encrypted data should be as accurate as if the computation had been performed on the unencrypted data. Homomorphic encryption is an evolving field that currently has significant limitations. For example, only polynomial functions can be evaluated, built up from additions and multiplications of integers modulo n. Many mathematical operations used in even the simplest neural networks, such as comparisons and non-polynomial activation functions, are not directly supported when training a model on homomorphically encrypted data and must be approximated. The methodology is still maturing.
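To make the core idea concrete, below is a minimal sketch of a textbook Paillier scheme, which is additively homomorphic: multiplying two ciphertexts modulo n² yields a ciphertext of the sum of the two plaintexts. The primes and message values are toy parameters chosen purely for illustration; this is not a secure or production implementation (real deployments use primes of 1024+ bits and a hardened library).

```python
import random
from math import gcd

# Toy Paillier-style additively homomorphic scheme.
# Illustrative only: tiny parameters, NOT secure for real use.

def lcm(a, b):
    return a * b // gcd(a, b)

def keygen(p=293, q=433):
    # p, q: small demo primes; real deployments use >=1024-bit primes.
    n = p * q
    g = n + 1                      # standard choice of generator
    lam = lcm(p - 1, q - 1)
    # mu = (L(g^lam mod n^2))^{-1} mod n, where L(x) = (x - 1) // n
    x = pow(g, lam, n * n)
    mu = pow((x - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)
    while gcd(r, n) != 1:          # r must be coprime to n
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n

pub, priv = keygen()
c1, c2 = encrypt(pub, 17), encrypt(pub, 25)
# Multiplying ciphertexts adds the underlying plaintexts modulo n:
c_sum = (c1 * c2) % (pub[0] ** 2)
assert decrypt(pub, priv, c_sum) == (17 + 25) % pub[0]
print(decrypt(pub, priv, c_sum))   # 42
```

Paillier supports only addition of plaintexts (and multiplication by a plaintext constant); fully homomorphic schemes such as BFV or CKKS additionally support ciphertext-ciphertext multiplication, which is what makes polynomial evaluation possible.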
A key practical attraction of homomorphic encryption is that one does not need to remove any values from the dataset, or mask/anonymize personal data in any way: the data stays encrypted throughout the computation. However, as of the time of writing, there is not enough evidence to state that these techniques are ready for production-level use, and there are few functional homomorphic encryption pipelines.
Imagine a situation where all personal data has been removed from the dataset (or anonymized and stored separately from other values). Most likely, even after the removal of the personal data, quasi-identifiers (QIs) are still left in the database.
The biggest problem with storing quasi-identifiers is that, if the database is attacked, it is not very difficult to combine QI values with other open data sources and reveal a person's identity along with their personal/sensitive information. A well-known example is the Netflix Prize competition: when the anonymized competition data was linked with public IMDb movie ratings, the entire movie-watching history of individuals could be recovered. Likewise, insecure data science pipelines that make predictions using such datasets and QIs can expose potentially sensitive/personal information even after the personal/sensitive data itself has been removed. One must make sure that no queries with the potential to reveal individual personal information can be leveraged, and that no inference about a data subject can be made by running multiple predictions through machine learning algorithms.
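The sketch below illustrates such a linkage attack on entirely hypothetical data: a "de-identified" medical table that still contains QIs (ZIP code, birth year, gender) is joined against a public voter-roll-style list, re-identifying the sensitive records. All names, columns, and values here are invented for illustration.

```python
import pandas as pd

# Hypothetical "de-identified" dataset: direct identifiers removed,
# but quasi-identifiers (QIs) remain alongside sensitive attributes.
deidentified = pd.DataFrame({
    "zip":        ["02139", "02139", "10001"],
    "birth_year": [1984, 1990, 1984],
    "gender":     ["F", "M", "F"],
    "diagnosis":  ["diabetes", "asthma", "hypertension"],  # sensitive
})

# Hypothetical public dataset (e.g. a voter roll) sharing the same QIs,
# but with names attached.
public = pd.DataFrame({
    "name":       ["Alice Smith", "Beth Jones"],
    "zip":        ["02139", "10001"],
    "birth_year": [1984, 1984],
    "gender":     ["F", "F"],
})

# The linkage attack is a simple join on the shared quasi-identifiers.
linked = deidentified.merge(public, on=["zip", "birth_year", "gender"])
print(linked[["name", "diagnosis"]])
#           name     diagnosis
# 0  Alice Smith      diabetes
# 1   Beth Jones  hypertension
```

This is essentially what happened in the Netflix/IMDb case, with movie ratings and their timestamps serving as the quasi-identifiers that made the join possible.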