Part 5: Machine Learning Methods to Process Datasets With QI Values
sensitive data, machine learning, qi values, privacy, security

Differential Privacy (DP)[i]: This mathematical framework makes it possible to control how much a model ‘remembers’ or ‘forgets’ about potentially sensitive data, which is its big advantage. The best-known DP mechanism is ‘noisy counting’: samples are drawn from a Laplace distribution and added to the query results, so that what is released is a perturbed value rather than the real one. The main disadvantage of Differential Privacy is that an attacker can estimate the actual value from repeated queries, because independent noise averages out over many answers. Predictions made using differentially private datasets are accurate enough, but each new query made by the attacker releases a little more sensitive information, so the total number of queries has to be limited (the ‘privacy budget’).
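To make ‘noisy counting’ and the repeated-query weakness concrete, here is a minimal sketch using only the Python standard library. The dataset, the helper names (`laplace_noise`, `noisy_count`) and the value of `epsilon` are illustrative assumptions, not part of any particular DP library.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential samples with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, predicate, epsilon, rng):
    # A counting query changes by at most 1 when one record is added or
    # removed (sensitivity 1), so the Laplace scale is 1 / epsilon.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
ages = [23, 35, 41, 52, 29, 64, 38, 47]  # hypothetical sensitive column

def is_40_plus(age):
    return age >= 40

# A single answer hides the true count (4) behind noise.
single = noisy_count(ages, is_40_plus, epsilon=0.5, rng=rng)

# Repeating the same query and averaging cancels most of the noise,
# which is why every answered query has to be charged to a budget.
repeated = [noisy_count(ages, is_40_plus, epsilon=0.5, rng=rng) for _ in range(2000)]
averaged = sum(repeated) / len(repeated)

print(f"single answer: {single:.2f}, average of 2000 answers: {averaged:.2f}")
```

The smaller `epsilon` is, the larger the noise and the stronger the privacy of each individual answer, at the cost of accuracy.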

Federated Learning[ii]: The core idea of federated learning is very similar to distributed learning: instead of training the model on all of the data at once, we train it on subsets of the data. Each device holds its own subset and trains the model locally, and the local results are combined to gradually improve a shared model, so the raw data does not have to be collected in one place. This is quite a powerful method as long as the model can be trained and improved effectively on separate devices.
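The training loop described above can be sketched as a simple federated-averaging round. This is a toy illustration under stated assumptions: a one-parameter linear model `y = w * x`, three hypothetical clients, and weighting by client dataset size. The function names `local_update` and `federated_round` are made up for this example.

```python
def local_update(w, data, lr=0.01, epochs=5):
    # Runs on the client: a few gradient-descent steps on the mean
    # squared error of y = w * x, using only that client's own data.
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, clients):
    # Runs on the server: it only sees the returned weights, never the
    # raw (x, y) pairs, and averages them weighted by dataset size.
    total = sum(len(data) for data in clients)
    return sum(local_update(w_global, data) * len(data) for data in clients) / total

# Hypothetical per-device datasets, all generated from y = 3 * x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)],
    [(0.5, 1.5)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)

print(f"learned weight: {w:.4f}")
```

In a real deployment the clients would be separate devices and the updates would travel over a network; here they are plain lists so that the sketch stays self-contained.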

‘Private Aggregation of Teacher Ensembles’ (PATE): This framework builds on differential privacy, that is, on handling personal/sensitive data in a way that doesn’t reveal any individual’s personal information. The core idea of PATE is that if models trained on separate data agree on some outcome, sharing that outcome with the consumer is less likely to leak sensitive data about a specific user. The training methodology is quite similar to federated learning (and to bagging techniques, of course): as a first step, we split the dataset into smaller subsets and train a different ‘teacher’ model on each of them. Predictions are made by aggregating the predictions of the different models and injecting noise into the result. Another important feature of PATE is that a downstream ‘student’ model is trained on these ‘noisy’ predictions, and the user is shown only the ‘student’ model, not the ‘teacher’ models, which helps ensure that sensitive/personal data is not revealed during the inference phase.
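The aggregation step can be illustrated with a short sketch: count the teachers’ votes per class, add Laplace noise to each count, and release only the winning class. The vote lists, the noise parameter `gamma` and the function names are assumptions made for illustration; training the teachers and the student is left out.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential samples with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_aggregate(teacher_votes, num_classes, gamma, rng):
    # Count how many teachers voted for each class, perturb every count,
    # and release only the class with the highest noisy count.
    counts = Counter(teacher_votes)
    noisy_counts = [counts[c] + laplace_noise(1.0 / gamma, rng) for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: noisy_counts[c])

rng = random.Random(0)

# Hypothetical votes from 100 teachers, each trained on a disjoint
# subset of the sensitive data, for two unlabeled public examples.
votes_per_example = {
    "example_a": [1] * 95 + [0] * 5,
    "example_b": [0] * 90 + [1] * 10,
}

# The student model would be trained on these noisy labels only and
# never sees the sensitive data or the teacher models.
student_labels = {
    name: noisy_aggregate(votes, num_classes=2, gamma=1.0, rng=rng)
    for name, votes in votes_per_example.items()
}
print(student_labels)
```

When the teachers agree strongly, as in these two examples, the noise almost never changes the winning class, which matches the intuition that a consensus outcome reveals little about any single record.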

We would love to hear your thoughts on this series. Please feel free to respond here with comments/questions/other feedback.

This is the fifth part of a five-part series about machine learning methodologies for de-identifying and securing personal data. For part one, click here. For part two, click here. For part three, click here. For part four, click here.
