Part 1: Introduction and Sources of Data Breaches
Terms like ‘sensitive data’ and ‘personal data’ have been in the air ever since GDPR, CCPA, and similar privacy regulations came into force for companies across the globe. One challenge is that the complexity of these laws, and the legal terminology they use to define their subjects, makes them difficult for technical people to fully grasp. As a result, data scientists often struggle to pin down the main challenges of processing datasets that contain sensitive information, and to work out how that data should be properly anonymized.
The main idea behind these regulations is to protect the rights of data subjects. One way to do that is simply not to store any data your business does not need. Another objective is to protect data from breaches, which unfortunately keep happening even to the world’s biggest companies (the British Airways data breach is one recent example). When it comes to developing machine-learning algorithms on potentially sensitive datasets, real personal data is generally not needed to build a functioning data science pipeline.
Research into this topic and the reasoning behind it suggests that an engineer’s highest priority is to anonymize potentially sensitive data so that it cannot leak. A further problem is that even an ‘anonymized’ dataset with no direct personal data left in it can still reveal personal information under an effective attack, for example when its records are linked with another dataset.
Possible Sources of a Data Breach
Personally Identifiable Information (PII): As the name suggests, this is data that can uniquely identify a person (e.g. passport ID, national ID, tax ID). Under any type of anonymization (the types are covered in more detail later), this data is usually removed or replaced with random strings.
Sensitive Information: This information does not identify a person on its own, but it describes something about the person that should be protected (e.g. HIV status).
Quasi-Identifiers (QI): These attributes also do not reveal PII on their own, but combined with other information they can be used to uniquely identify a person. For instance, a Zip Code cannot identify a person on its own, but the combination of state, gender, and Zip Code may be enough to do so.
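The three categories above can be illustrated with a small Python sketch. It is a minimal example under stated assumptions, not a complete anonymization method: the records, the field names, and the helper functions `replace_pii` and `unique_on_quasi_identifiers` are all hypothetical and invented for illustration. It replaces the PII field with a random string and then checks which rows are still unique on their quasi-identifiers.

```python
import secrets
from collections import Counter

# Hypothetical toy records; every value here is made up for illustration.
records = [
    {"passport_id": "X1234567", "state": "NY", "gender": "F", "zip_code": "10001", "hiv_status": "negative"},
    {"passport_id": "X7654321", "state": "NY", "gender": "M", "zip_code": "10001", "hiv_status": "positive"},
    {"passport_id": "Y1112223", "state": "NY", "gender": "F", "zip_code": "10001", "hiv_status": "negative"},
    {"passport_id": "Z9998887", "state": "CA", "gender": "F", "zip_code": "94105", "hiv_status": "negative"},
]

# Assumed classification of the columns, following the categories above.
PII_FIELDS = ("passport_id",)
QUASI_IDENTIFIERS = ("state", "gender", "zip_code")


def replace_pii(rows, pii_fields=PII_FIELDS):
    """Return copies of the rows with each PII value replaced by a random string."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for field in pii_fields:
            if field in new_row:
                new_row[field] = secrets.token_hex(8)  # 16 random hex characters
        cleaned.append(new_row)
    return cleaned


def unique_on_quasi_identifiers(rows, qi_fields=QUASI_IDENTIFIERS):
    """Return the rows whose quasi-identifier combination occurs exactly once."""
    def key(row):
        return tuple(row[field] for field in qi_fields)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] == 1]


anonymized = replace_pii(records)
at_risk = unique_on_quasi_identifiers(anonymized)

for row in at_risk:
    print(row["state"], row["gender"], row["zip_code"], "->", row["hiv_status"])
```

In this toy data, two of the four rows remain unique on (state, gender, Zip Code) even though the passport ID is gone. Anyone who already knows those three values for a person could read off the sensitive attribute in that row, which is why removing PII alone is not enough.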