Terms like ‘sensitive data’ and ‘personal data’ have been in the air ever since GDPR, CCPA, and similar privacy regulations came into force for companies across the globe. One challenge is that the complexity of these laws, and the complicated terminology they use to define what they cover, makes them difficult for people in technical roles to truly grasp. As a result, it is harder than ever for data scientists to work out the main challenges of processing datasets that contain sensitive information, and how that data should be anonymized properly.
The main idea behind these regulations is the need to protect data subjects’ rights. One way to do that is to avoid storing any data your business does not actually need. Another objective is to protect data from possible breaches, which unfortunately have been happening quite often to the world’s biggest companies (such as the recent British Airways data breach). When it comes to developing machine-learning algorithms that analyze possibly sensitive datasets, no one actually needs real personal data to create a functioning data science pipeline.
After researching this topic and the reasoning behind it, it seems that an engineer’s first priority is to anonymize potentially sensitive data so that it cannot leak. The catch is that even partially ‘anonymized’ datasets, which no longer contain any directly personal data, can still reveal personal information under an effective attack. Here’s why, broken down by the kinds of information a dataset can hold:
Personally identifiable information (PII). As the name suggests, this data can be used to uniquely identify a person (e.g. passport ID, national ID, tax ID). When performing any type of anonymization (anonymization types are covered in more detail later), this data is often removed or replaced with random strings.
Sensitive information. This information doesn’t identify a person on its own, but it describes the person in ways that should be protected (e.g. HIV status).
Quasi-identifiers (QI). These attributes also don’t reveal PII on their own, but combined with other information they can be used to uniquely identify a person. For instance, a ZIP code cannot identify a person on its own, but the combination of state, gender, and ZIP code can.
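To make the three categories concrete, here is a minimal Python sketch. It assumes a toy dataset invented for illustration: the records, the field names (`passport_id`, `state`, `gender`, `zip`, `hiv_status`), and the helper functions `replace_pii` and `unique_on_quasi_identifiers` are all hypothetical. It replaces the PII column with random strings, then counts how many records are still the only one with their combination of quasi-identifiers.

```python
import secrets
from collections import Counter

# Hypothetical toy records; every value here is invented for illustration.
records = [
    {"passport_id": "P1001", "state": "NY", "gender": "F", "zip": "10001", "hiv_status": "negative"},
    {"passport_id": "P1002", "state": "NY", "gender": "F", "zip": "10001", "hiv_status": "positive"},
    {"passport_id": "P1003", "state": "NY", "gender": "M", "zip": "10001", "hiv_status": "negative"},
    {"passport_id": "P1004", "state": "CA", "gender": "F", "zip": "94105", "hiv_status": "negative"},
]

PII_FIELDS = ("passport_id",)                    # uniquely identifies a person
QUASI_IDENTIFIERS = ("state", "gender", "zip")   # identifying only in combination


def replace_pii(rows, pii_fields=PII_FIELDS):
    """Return copies of the rows with each PII value swapped for a random string."""
    masked_rows = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for field in pii_fields:
            new_row[field] = secrets.token_hex(8)
        masked_rows.append(new_row)
    return masked_rows


def unique_on_quasi_identifiers(rows, qi_fields=QUASI_IDENTIFIERS):
    """Return the rows whose quasi-identifier combination occurs exactly once."""
    def key(row):
        return tuple(row[field] for field in qi_fields)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] == 1]


masked = replace_pii(records)
exposed = unique_on_quasi_identifiers(masked)
print(f"{len(exposed)} of {len(masked)} masked records are still unique on {QUASI_IDENTIFIERS}")
```

In this toy data, two of the four records remain unique on state, gender, and ZIP code even after the passport IDs are gone. Anyone who already knows those three attributes about a person could link them to the matching row and read off the sensitive `hiv_status` value, which is why removing PII alone does not make a dataset anonymous.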