Part 5: Machine Learning Methods to Process Datasets With QI Values

Published On: July 31, 2019

Differential Privacy (DP): The big advantage of this mathematical framework is that it makes it possible to control how much the model ‘remembers’ or ‘forgets’ about potentially sensitive data. The best-known DP mechanism is ‘noisy counting’: samples drawn from a Laplace distribution are added to query results, so the dataset reports perturbed values instead of the real ones.
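As a minimal sketch of noisy counting (using only the Python standard library; the `noisy_count` helper, the example data, and the `epsilon` parameter are illustrative, not from the original post):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from a zero-mean Laplace distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(records, predicate, epsilon=1.0):
    """Answer a counting query with Laplace noise added.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so a noise scale of 1/epsilon
    gives epsilon-differential privacy for a single query.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 52, 29, 44, 61, 38]
answer = noisy_count(ages, lambda a: a > 40, epsilon=1.0)  # true count is 4; answer is noisy
```

The consumer of the query sees only the noised answer, never the exact count.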

However, the main disadvantage of Differential Privacy is that an attacker can estimate the actual value from repeated queries. A single prediction made on a differentially private dataset is accurate enough on its own, but each new query the attacker makes releases a little more sensitive information, and averaging many noisy answers eventually reveals the true value.
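A rough sketch of this averaging attack (the sensitive value, query count, and helper are all hypothetical):

```python
import math
import random
import statistics

def laplace_noise(scale: float) -> float:
    """Draw one sample from a zero-mean Laplace distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

TRUE_COUNT = 127  # the sensitive value the mechanism is trying to hide

# The attacker repeats the same query; each answer is freshly noised.
answers = [TRUE_COUNT + laplace_noise(1.0) for _ in range(10_000)]

# Averaging cancels the zero-mean noise, so the estimate converges to the true value.
estimate = statistics.mean(answers)
```

This is why practical DP systems track a cumulative privacy budget and refuse (or degrade) answers once it is exhausted.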

Federated Learning: The core idea of federated learning is similar to distributed learning: the model is trained on subsets of the data. It is a powerful method, provided we can effectively train the model on separate devices, each holding a different subset of the data, and gradually improve a shared global model by aggregating their updates.
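A toy sketch of this idea, assuming a one-parameter linear model and FedAvg-style weighted averaging (the learning rate, client data, and helper names are all illustrative):

```python
def local_update(w, data, lr=0.01):
    """One gradient-descent step on a client's private data for the model y = w * x."""
    grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def federated_average(local_weights, sizes):
    """Aggregate client models, weighting each by its dataset size (FedAvg-style)."""
    total = sum(sizes)
    return sum(w * n for w, n in zip(local_weights, sizes)) / total

# Two clients hold disjoint subsets of data generated by y = 3x; raw data never leaves them.
clients = [[(x, 3.0 * x) for x in range(1, 6)],
           [(x, 3.0 * x) for x in range(4, 9)]]

w = 0.0
for _ in range(50):
    local_ws = [local_update(w, data) for data in clients]  # training happens on-device
    w = federated_average(local_ws, [len(d) for d in clients])  # only weights are shared
# w converges toward the true slope 3.0
```

Only model parameters travel to the aggregator; the clients' raw examples stay local.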

‘Private Aggregation of Teacher Ensembles’ (PATE): This framework combines pieces of the other privacy methods, storing personal/sensitive data in a way that does not reveal any individual's information. The core idea of PATE is that if several models trained on separate data agree on the same outcome, sharing that outcome with the consumer is unlikely to leak sensitive data about any specific user. The training methodology resembles federated learning (and bagging techniques, of course): the dataset is first split into smaller subsets, and a separate ‘teacher’ model is trained on each. Predictions are made by aggregating the votes of the different teachers and injecting noise into the aggregate.
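A minimal sketch of the noisy aggregation step, assuming Laplace noise on the per-class vote counts (the teacher predictions below are just illustrative labels):

```python
import math
import random
from collections import Counter

def laplace_noise(scale: float) -> float:
    """Draw one sample from a zero-mean Laplace distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_aggregate(teacher_predictions, epsilon=1.0):
    """Return the class with the highest noisy vote count (PATE-style noisy max)."""
    votes = Counter(teacher_predictions)
    noisy_votes = {label: count + laplace_noise(1.0 / epsilon)
                   for label, count in votes.items()}
    return max(noisy_votes, key=noisy_votes.get)

# 250 teachers, each trained on a disjoint data subset, vote on one example.
predictions = [1] * 200 + [0] * 50
consensus = noisy_aggregate(predictions, epsilon=1.0)  # strong agreement survives the noise
```

When the teachers strongly agree, the noise almost never changes the winning label, yet no single teacher's vote (and hence no single training subset) is exposed.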

Another important feature of PATE is that a downstream ‘student’ model is then trained on this ‘noisy’ labelled data, and the user is ultimately served the ‘student’ model rather than the ‘teacher’ models, which ensures that sensitive/personal data is not revealed during the inference phase.
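Continuing the sketch in the same hypothetical setting: the ‘teachers’ here are simple threshold classifiers on 1-D data, the public data is an unlabeled grid, and the ‘student’ is a nearest-neighbor model that only ever sees the noisy aggregate labels, never the teachers or their training data:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from a zero-mean Laplace distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

# Hypothetical teachers: each learned a slightly different threshold on its own subset.
teacher_thresholds = [0.45, 0.50, 0.55, 0.48, 0.52]

def noisy_label(x, epsilon=1.0):
    """Label one public example via the teachers' noisy vote counts."""
    votes = [0, 0]
    for t in teacher_thresholds:
        votes[1 if x > t else 0] += 1
    noisy_votes = [v + laplace_noise(1.0 / epsilon) for v in votes]
    return noisy_votes.index(max(noisy_votes))

# Label a public, unlabeled grid with the noisy aggregate, then "train" the student on it.
public_xs = [i / 20 for i in range(21)]
student_data = [(x, noisy_label(x, epsilon=5.0)) for x in public_xs]

def student_predict(x):
    """1-nearest-neighbor student: it has only ever seen noisy aggregate labels."""
    nearest = min(student_data, key=lambda point: abs(point[0] - x))
    return nearest[1]
```

Only `student_predict` would be exposed to users; the teachers and their sensitive training subsets stay behind the aggregation barrier.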

We would love to hear your thoughts on this series. Please feel free to respond here with comments/questions/other feedback.

This is the fifth part of a five-part series about machine learning methodologies for de-identifying and securing personal data by