Homomorphic Encryption: The main idea behind homomorphic encryption is that the inferences we make from computations on encrypted data should be as accurate as if we had used the decrypted data. Homomorphic encryption is an evolving field and currently has certain limitations. For example, only polynomial functions can be computed, and only additions and multiplications of integers modulo n are allowed. Most mathematical operations used in even the simplest neural networks, such as non-polynomial activation functions, are therefore not available when training a model on homomorphically encrypted data. The methodology itself is still under active development.
A further advantage of homomorphic encryption is that we don’t need to remove any values from the dataset, or mask/anonymize personal data in any way. However, as of the time of writing, there is not enough practical evidence to state that it can be used in production-level systems; furthermore, there are few functional homomorphic encryption pipelines.
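To make "computing on encrypted data" concrete, here is a minimal sketch of an additively homomorphic scheme. It uses a toy version of the Paillier cryptosystem, which is an assumption chosen for illustration and not a scheme named in this text. Paillier supports only addition of ciphertexts and multiplication by a plaintext constant; fully homomorphic schemes also allow ciphertext multiplication. The function names and the tiny primes are hypothetical and far too small to be secure. The sketch assumes Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random

def keygen(p, q):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; this shortcut holds for g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n with generator g = n + 1."""
    n_sq = n * n
    while True:
        r = random.randrange(1, n)  # fresh randomness for every ciphertext
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(n, priv, c):
    """Recover the plaintext modulo n from ciphertext c."""
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    return (c1 * c2) % (n * n)

def mul_by_plain(n, c, k):
    """Raising a ciphertext to a plaintext power k multiplies the plaintext by k."""
    return pow(c, k, n * n)

# Illustrative values only: the party doing the arithmetic never sees 17 or 25.
n, priv = keygen(1009, 1013)
c_a, c_b = encrypt(n, 17), encrypt(n, 25)
total = decrypt(n, priv, add_encrypted(n, c_a, c_b))
scaled = decrypt(n, priv, mul_by_plain(n, c_a, 3))
print(total, scaled)  # 42 51
```

All results are reduced modulo n, which is why such schemes work on bounded integers and why operations like division or non-polynomial activations have no direct counterpart.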
Let’s imagine a situation where we’ve removed all personal data from the dataset (or anonymized it and stored it separately from the other values). Most likely, even after removal of the personal data, quasi-identifiers (QIs) are still left in the database.
The biggest problem with storing quasi-identifiers is that, if the database is attacked, it isn’t all that difficult to combine QI values with other open data sources and reveal a person’s identity together with their personal/sensitive information. A good example is the Netflix Prize competition: when its open data was combined with IMDb’s movie ratings dataset, the entire movie-watching history of individuals was compromised.
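The linkage described above can be sketched in a few lines. The records, field names, and the `link_on_quasi_identifiers` helper below are all hypothetical and invented purely for illustration; they are not taken from the Netflix or IMDb data. The sketch assumes the released dataset has had names removed but still carries three quasi-identifiers that also appear in a public source.

```python
# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
released = [
    {"zip": "10001", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "10002", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "10003", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical open data source that includes names and the same attributes.
public = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1985, "sex": "F"},
    {"name": "Bob Example", "zip": "10002", "birth_year": 1990, "sex": "M"},
    {"name": "Carl Example", "zip": "10002", "birth_year": 1990, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def link_on_quasi_identifiers(released_rows, public_rows, qi):
    """Return released rows that match exactly one public record on all QIs."""
    # Index the public records by their quasi-identifier combination.
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in qi), []).append(row)

    reidentified = []
    for row in released_rows:
        candidates = index.get(tuple(row[k] for k in qi), [])
        if len(candidates) == 1:  # a unique match pins down one person
            reidentified.append({**candidates[0], **row})
    return reidentified

matches = link_on_quasi_identifiers(released, public, QUASI_IDENTIFIERS)
for m in matches:
    print(m["name"], "->", m["diagnosis"])
```

In this toy data only the first released row is re-identified: the second matches two public records and stays ambiguous, and the third has no public counterpart. The rarer a QI combination is, the more likely a unique match becomes.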
As a result of insecure data science pipelines that make predictions using datasets containing QIs, potentially sensitive/personal information can be revealed even after the personal/sensitive data itself has been removed. We need to make sure that no queries with the potential to reveal individual personal information can be leveraged. Furthermore, we must make sure that no inference on the data subject can be made by running multiple predictions using machine learning