Part 4: Standard Ways to Process Datasets with QI Values
  • k-anonymity. This approach is quite different from the one that I described earlier. With k-anonymity, we’re not aiming to ‘hide’ any data, but rather to softly ‘mask’ the QI values. The most popular techniques used in k-anonymity are purging and generalization. Purging simply replaces QI values with a fixed placeholder such as ‘-’ (the idea is similar to suppression). Generalization doesn’t remove QI values completely but replaces exact values with ranges (e.g. 20-30 years old). The main goal of k-anonymity is to guarantee that no query on the QI values of a dataset can narrow a group down to fewer than ‘k’ individuals. Strictly speaking, ‘k-anonymity’ ensures that every equivalence group of a dataset contains at least ‘k’ records (an equivalence group is the subset of records that share the same values for all QIs). For instance, in a 3-anonymous dataset, any query on the QI values that a potential attacker performs matches either no one or at least 3 individuals, who cannot be distinguished from one another based on the QI values.
  • l-diversity. Unfortunately, k-anonymized data may still be subject to attacks, usually because an equivalence group lacks diversity in its sensitive attribute. The extreme case is when all records of an equivalence group share the same sensitive value, which lets the attacker easily infer that value for anyone known to be in the group. l-diversity makes sure that each equivalence group contains at least ‘l’ distinct, well-represented sensitive values.
  • t-closeness. Purging and generalization change how values are distributed across the equivalence groups, so it is worth noting that the distribution of a sensitive attribute within each equivalence group should be similar to its distribution in the whole dataset. Specifically, the distance between the two distributions should not exceed a pre-specified threshold ‘t’. Earth Mover’s Distance is typically used to measure this distance.
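
To make the three definitions concrete, here is a minimal sketch in Python that generalizes a toy table and then measures k, l and t on the result. The records, the column layout and all function names are my own assumptions for illustration, not part of any library. The l value is the simple "distinct" variant of l-diversity, and the t value assumes a categorical sensitive attribute with equal distance between any two values, in which case Earth Mover’s Distance reduces to half the sum of absolute differences between the two distributions.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (age, zip_code, diagnosis).
# Age and ZIP code are treated as QIs; diagnosis is the sensitive attribute.
RECORDS = [
    (23, "47677", "flu"),
    (27, "47602", "asthma"),
    (29, "47678", "flu"),
    (34, "47905", "diabetes"),
    (36, "47909", "flu"),
    (38, "47906", "asthma"),
]


def generalize(record):
    """Generalization: age -> 10-year range, ZIP code -> 3-digit prefix."""
    age, zip_code, diagnosis = record
    low = age // 10 * 10
    return (f"{low}-{low + 9}", zip_code[:3] + "**", diagnosis)


def equivalence_groups(rows):
    """Collect the sensitive values of all rows that share the same QI tuple."""
    groups = defaultdict(list)
    for *qi, sensitive in rows:
        groups[tuple(qi)].append(sensitive)
    return dict(groups)


def k_anonymity(groups):
    """Size of the smallest equivalence group."""
    return min(len(values) for values in groups.values())


def l_diversity(groups):
    """Distinct l-diversity: fewest distinct sensitive values in any group."""
    return min(len(set(values)) for values in groups.values())


def t_closeness(groups):
    """Largest distance between a group's distribution and the overall one.

    Assumes equal ground distance between categories, so the Earth Mover's
    Distance equals half the sum of absolute probability differences.
    """
    all_values = [v for values in groups.values() for v in values]
    overall = Counter(all_values)
    total = len(all_values)
    worst = 0.0
    for values in groups.values():
        local = Counter(values)
        distance = 0.5 * sum(
            abs(local[c] / len(values) - overall[c] / total) for c in overall
        )
        worst = max(worst, distance)
    return worst


raw_groups = equivalence_groups(RECORDS)
groups = equivalence_groups([generalize(r) for r in RECORDS])

print("k before generalization:", k_anonymity(raw_groups))
print("k after generalization: ", k_anonymity(groups))
print("l after generalization: ", l_diversity(groups))
print("t after generalization: ", round(t_closeness(groups), 4))
```

On this toy data every raw record is unique (k = 1), while the generalized table is 3-anonymous and 2-diverse. A real dataset would need generalization hierarchies chosen per attribute, and possibly a stricter l-diversity variant than the distinct count used here.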

Satisfying all of the rules defined by k-anonymity, l-diversity and t-closeness at the same time can lead to complex combinatorial problems. At this point, machine learning techniques become quite useful, since they can operate on data in separate hyperplanes and perform the computations there, which is a very complex task for the approaches described earlier.
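
To give a feel for where the combinatorial cost comes from, here is a small brute-force sketch that tries every combination of generalization levels for two QIs and keeps the ones that are k-anonymous. The rows, the three-level hierarchies and the function names are hypothetical and only serve as an illustration. With m QIs and h levels each, a search like this has to examine h^m candidates, and each candidate would also have to be checked against l-diversity and t-closeness.

```python
from collections import Counter
from itertools import product

# Hypothetical toy rows: (age, zip_code). Both columns are QIs.
ROWS = [
    (23, "47677"), (27, "47602"), (29, "47678"),
    (34, "47905"), (36, "47909"), (38, "47906"),
]

# Hypothetical generalization hierarchies; level 0 keeps the raw value,
# the last level purges the value completely.
AGE_LEVELS = [
    lambda a: str(a),
    lambda a: f"{a // 10 * 10}-{a // 10 * 10 + 9}",
    lambda a: "*",
]
ZIP_LEVELS = [
    lambda z: z,
    lambda z: z[:3] + "**",
    lambda z: "*****",
]


def smallest_group(rows, age_fn, zip_fn):
    """Size of the smallest equivalence group after generalizing both QIs."""
    counts = Counter((age_fn(a), zip_fn(z)) for a, z in rows)
    return min(counts.values())


def search(rows, k):
    """Try every level combination; return the k-anonymous ones.

    Each hit is (total_level, age_level, zip_level), sorted so that the
    least generalized combination comes first.
    """
    hits = []
    for i, j in product(range(len(AGE_LEVELS)), range(len(ZIP_LEVELS))):
        if smallest_group(rows, AGE_LEVELS[i], ZIP_LEVELS[j]) >= k:
            hits.append((i + j, i, j))
    return sorted(hits)


candidates = len(AGE_LEVELS) * len(ZIP_LEVELS)
solutions = search(ROWS, k=3)
print("combinations examined:", candidates)
print("k-anonymous combinations:", solutions)
```

Two QIs with three levels each give only 9 combinations, but the count multiplies with every additional QI and every additional level, which is why exhaustive search stops being practical on realistic datasets.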

This is the fourth part of a five-part series about machine learning methodologies for de-identifying and securing personal data by 1touch.io. For part one, click here. For part two, click here. For part three, click here. For part five, click here.
