Terms like ‘sensitive data’ and ‘personal data’ have been in the air ever since GDPR, CCPA, and similar privacy laws were introduced to companies across the globe. One challenge is that the complexity of these laws, and the rather intricate terminology they use to define the relevant concepts, makes them difficult for people in technical roles to truly grasp. It is even harder for data scientists to work out the main challenges of processing datasets that contain sensitive information, and how such data should be properly anonymized.
The main idea behind these regulations is to protect data subjects’ rights. One way to do this is to avoid storing any data the business doesn’t actually need. Another objective is to protect the data from breaches, which unfortunately happen quite often, even at the world’s biggest companies (such as the recent British Airways data breach). Regulatory fines are issued not because a database was breached, but because personal data was mishandled.
When it comes to developing machine-learning algorithms that analyze potentially sensitive datasets, no one actually needs real personal data to build a functioning data science pipeline. After researching this topic, the first priority that comes to an engineer’s mind is to anonymize potentially sensitive data so it can’t leak. But there is a subtler problem: even a somewhat ‘anonymized’ dataset that contains no direct personal data can still reveal personal information when an effective attack is performed on it. Here’s why:
Possible Sources of a Data Breach
- Personally identifiable information (PII). As the name gives away, this data can uniquely identify a person on its own (e.g. passport ID, national ID, tax ID). During any kind of anonymization (we’ll talk about anonymization types later), this data is usually removed or replaced with random strings.
- Sensitive information. These records don’t reveal the person’s identity, but they contain information about the person that must be protected (e.g. HIV status).
- Quasi-identifiers (QI). These records also don’t reveal PII on their own, but combined with other attributes they can uniquely identify a person. For instance, a ZIP code by itself can’t identify anyone, but the combination of ZIP code, gender, and date of birth very often can.
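To make the quasi-identifier risk concrete, here is a minimal sketch of a k-anonymity check: a dataset is k-anonymous if every combination of quasi-identifier values is shared by at least k records. The dataset, column names, and the `k_anonymity` helper below are hypothetical, invented for illustration only; real pipelines would run such checks over actual tables.

```python
from collections import Counter

# Hypothetical toy dataset: no names or IDs, yet quasi-identifiers remain.
records = [
    {"zip": "02138", "gender": "F", "birth_year": 1954, "diagnosis": "flu"},
    {"zip": "02138", "gender": "M", "birth_year": 1954, "diagnosis": "cold"},
    {"zip": "02139", "gender": "F", "birth_year": 1960, "diagnosis": "flu"},
    {"zip": "02139", "gender": "F", "birth_year": 1960, "diagnosis": "cold"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "birth_year")

def k_anonymity(rows, qi_columns):
    """Return the smallest group size when rows are grouped by the
    quasi-identifier columns. A result of k == 1 means at least one
    record is uniquely identifiable from its quasi-identifiers alone."""
    groups = Counter(tuple(row[col] for col in qi_columns) for row in rows)
    return min(groups.values())

# The first two rows each have a unique (zip, gender, birth_year)
# combination, so the dataset is only 1-anonymous:
print(k_anonymity(records, QUASI_IDENTIFIERS))  # -> 1

# Coarser attributes form larger groups; birth year alone gives k = 2:
print(k_anonymity(records, ("birth_year",)))  # -> 2
```

Note that no single column here is identifying; it is the combination that singles records out, which is exactly why removing obvious PII is not enough on its own.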