Inventorize Your Personal Data
It has come to my attention that recent regulations require us to manage personal data in a very specific way. I trust that the security teams that report to you are now managing sensitive data such as our customer info in a very controlled way, backed up by an effective security policy. Please take a few minutes to look and assure me that we meet these new requirements.
P.S. Not 100% sure, but I think it’s called GDPR or CCPA or PIPEDA or LGPD … Don’t know, they all look the same to me….
So what’s the plan? I was taught in the army that every good plan is built from three parts. In this case: Inventorize! Manage! Protect!
- 1. Create an inventory of all the personal data that you process, store and share.
- 2. Build a policy, tools, and processes to manage it effectively.
- 3. Use existing security tools and add new ones to protect it.
Part one of the plan, Inventorize, is the foundation stone for everything else. If you have a reliable and up-to-date inventory, managing and protecting will naturally evolve from it. So let’s inventorize….
To inventorize, I’ve tried to find a simple answer to the question “Where to look for personal data?”
I’ve read many articles, blogs, and journals which explained this in general or legal terms, but none truly answered the question in a way that would clearly explain where I should look, and with what tools. While creating a list to answer the question, I’ve built a table that I’m happy to share with you all. I’ve probably missed something and will be more than happy to get your feedback to be able to maintain one table for the benefit of all.
Where to Look for Personal Data
HTTP, SOAP, FTP, SMTP, SMB, CIFS, POP3, IMAP
Application transactions that contain personal data and that indicate that a network element processed it.
The main activity that you do as an organization is process personal data. You may have one CRM database that stores personal data but 10,000 endpoints that process it on a daily basis. They are all part of the inventory. If you don’t know about them, how can you manage and protect the data?
Don’t try to integrate with every application or expect the IT team to develop an API for each of the hundreds of homegrown applications. Just analyze the transactions in the network and all applications will be covered.
Note: There is no need to analyze every transaction, yet you do need to figure out the encryption part when relevant. It’s solvable and not that complicated.
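To make the idea of analyzing transactions concrete, here is a minimal sketch in Python. It assumes you already have decoded (decrypted) payloads paired with the endpoint that handled them. The two patterns and the function names are hypothetical and deliberately naive; a real discovery tool needs far more careful detection.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration only).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_personal_data(payload):
    """Return {label: [matches]} for every pattern found in a decoded payload."""
    hits = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(payload)
        if matches:
            hits[label] = matches
    return hits


def inventory_endpoints(transactions):
    """Map each endpoint to the set of personal data types it was seen processing.

    `transactions` is an iterable of (endpoint, payload) pairs.
    """
    inventory = {}
    for endpoint, payload in transactions:
        for label in find_personal_data(payload):
            inventory.setdefault(endpoint, set()).add(label)
    return inventory


transactions = [
    ("10.0.0.5", "POST /crm name=Ann&email=ann@example.com"),
    ("10.0.0.9", "GET /health"),
]
print(inventory_endpoints(transactions))
```

Only the endpoint that actually handled an email address ends up in the inventory, which is the point: the network tells you who processes personal data without touching each application.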
Unstructured Textual Files
PDF, DOCX, PPT, TXT, RTF, GDOC
Any “readable” file format that your organization has used in order to document and communicate while processing personal data for business use.
You have your main repositories and applications used for processing personal data, but you will find that an endless number of copies are made in emails, forms, documents… containing between 1% and 100% of your personal data inventory.
You want to know about it, make sure that it’s stored in the right place, protected by your security tools and accessed only by authorized employees.
Once there is no business need for a copy, delete it to reduce the attack surface and minimize risk.
Since personal data only becomes interesting if you process it for business purposes, focus on file formats that have been used by your organization to process personal data.
For example, if you don’t use Google Docs as part of your business flow, it would be good to analyze it just in case, but it should be of lower priority.
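One way to picture this prioritization is a small Python sketch that sorts discovered files into "high" and "low" priority by extension. The split between the two sets is an assumption for illustration; you would replace it with the formats your own business flows actually use.

```python
from collections import Counter
from pathlib import PurePath

# Assumption: formats this (hypothetical) business uses in its flows.
BUSINESS_FORMATS = {".pdf", ".docx", ".txt"}
# Assumption: readable formats that exist but are not part of the business flow.
OTHER_TEXT_FORMATS = {".ppt", ".rtf", ".gdoc"}


def prioritize(paths):
    """Count files per extension, split into high and low scan priority."""
    counts = {"high": Counter(), "low": Counter()}
    for path in paths:
        ext = PurePath(path).suffix.lower()
        if ext in BUSINESS_FORMATS:
            counts["high"][ext] += 1
        elif ext in OTHER_TEXT_FORMATS:
            counts["low"][ext] += 1
        # Anything else is out of scope for this category.
    return counts


found = ["hr/contract.PDF", "sales/notes.gdoc", "tmp/photo.png"]
print(prioritize(found))
```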
Structured Data Files
XLSX, CSV, XML, HTML, NUMBERS, GSHEET
Whenever data is stored in a matrix format such as a table, or in a hierarchical format such as XML it’s considered structured data.
The main difference is that in a structured format the metadata that describes the data is attached to the actual value (column name in a table, tag name in XML…). This enriches the data and makes it easier to interpret for anyone with access to it, and therefore makes it more sensitive.
In structured files, you reveal very important and relevant information about the personal data – the entity relationship. If you have 100 names and 100 phone numbers, once you arrange them in a table, it’s clear which belongs to whom.
Quantity! Since it’s the favorite format for reports and data analysis, this is where you will potentially find the largest amount of personal data. This format therefore carries a higher risk of data leaks and unauthorized access.
You can use best-of-breed solutions to protect your databases, yet one Excel file in the wrong directory and it’s game over.
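Because structured files carry their own metadata, column names alone are often a cheap first signal. The sketch below, in Python, flags CSV columns whose header suggests personal data. The keyword list and the function name are hypothetical, and a header match is only a hint; the values still need to be checked.

```python
import csv
import io

# Hypothetical keyword hints; extend for your own languages and naming habits.
PERSONAL_HINTS = ("name", "email", "phone", "address", "ssn", "birth")


def personal_columns(csv_text):
    """Return the header columns whose names hint at personal data."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return [
        column
        for column in header
        if any(hint in column.lower() for hint in PERSONAL_HINTS)
    ]


sample = "Full Name,Phone Number,Order Total\nAnn,555-0100,12.50\n"
print(personal_columns(sample))
```

Two flagged columns in the same table also tell you something the raw values do not: which name belongs to which phone number.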
One or more files of any kind wrapped into one container file
The primary reason is obvious. We don’t care about the container file but what’s in it.
But the secondary reason is more due to human behavior. Privacy is no different than security in that manner. People like to .zip things when they feel like they’re not 100% aligned with the company policy.
Just like with security, password-protected archive files should be treated as sensitive personal data. Either you know what’s in there or you take the most conservative approach to reduce risk.
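For ZIP archives specifically, the "either you know or you assume the worst" rule can be sketched with Python's standard `zipfile` module, which exposes each entry's general-purpose flags; bit 0 marks an encrypted entry. The function names and the two classification labels are assumptions for illustration, and other archive formats would need their own handling.

```python
import io
import zipfile


def encrypted_members(archive):
    """Return the names of ZIP entries whose encryption flag (bit 0) is set."""
    with zipfile.ZipFile(archive) as zf:
        return [info.filename for info in zf.infolist() if info.flag_bits & 0x1]


def classify_archive(archive):
    """Conservative rule: if anything is password-protected, treat it as sensitive."""
    if encrypted_members(archive):
        return "sensitive-unknown"  # hypothetical label: contents cannot be inspected
    return "inspect-contents"  # hypothetical label: unpack and scan the members


# Build a small unencrypted archive in memory to try it out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("report.csv", "name,phone\n")
print(classify_archive(buffer))
```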
Your backup processes are also part of this family of archive files. When you deal with privacy, don’t go in there. Spending time to inventorize backup tapes or “backup to disk” storages will divert you from the main goals of privacy.
At the end of the day, the data subject wants to know if you process his data, why and with whom. Not that you have his data in a 3-year-old tape that doesn’t participate in any personal data processing flow.
If your security policy regarding backup was good enough before privacy joined the game, it’s most likely still good enough without any extra effort.
Oracle, MS SQL, Informix, BigTable, Cassandra, DB2, PostgreSQL
“Standard” SQL database, BigData, Cloud-based… they are all the same.
They are a structured way to store and process all your personal data in a way that makes it readily consumable by your applications and analysts.
This is where your root data assets are stored. It is where you store and process your personal data for most of your business activities and where you document the entity relationship to create an “Identifiable” data subject (PII).
Rely on it to inventorize your PII list. If it’s not in one of your databases, it’s probably not a manageable data subject.
Note that you probably have a problem that, until now, didn’t concern you as a CISO: your business units are not accurate when it comes to managing personal data. You may find that one human being (data subject/PII) exists in your databases as two or more different customers. This is a problem when it comes to DSARs (data subject access requests). Don’t expect the business units to solve it for you; it’s on you to merge the records into one unique PII.
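A minimal sketch of that merge, in Python, might group customer records by a normalized key. Here the key is the email address, which is an assumption; real record matching usually needs several attributes and fuzzy comparison. The field names `customer_id` and `email` are hypothetical.

```python
from collections import defaultdict


def normalize_email(email):
    """Trim whitespace and lowercase so trivially different spellings match."""
    return email.strip().lower()


def merge_data_subjects(records):
    """Group customer IDs that appear to belong to the same data subject."""
    subjects = defaultdict(list)
    for record in records:
        subjects[normalize_email(record["email"])].append(record["customer_id"])
    return dict(subjects)


records = [
    {"customer_id": "C-1", "email": "Ann@Example.com "},
    {"customer_id": "C-7", "email": "ann@example.com"},
    {"customer_id": "C-9", "email": "bob@example.com"},
]
print(merge_data_subjects(records))
```

With a mapping like this, a single access request from Ann can be answered from both customer records instead of only the one she happened to mention.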
GIF, JPEG, BMP, PDS, PNG
A binary representation of a picture.
Print screens, scanned documents and IDs, people’s faces… all of them can contain data in general and personal data in particular.
In many business flows, you will find that your organization stores and processes images containing personal data. In some cases, it’s even required by other regulations.
Similar to archive files, employees find that print screens are a great way to get around company policy, mainly to share data with others and to work from home.
A picture of a person is considered sensitive personal data, since not only can you identify the person based on the picture but you can also learn from that picture sensitive categories such as religion, ethnicity, sexual orientation and more.
If you try to OCR your entire organization, you will find yourself overwhelmed by false positives.
Align your image strategy with your business use cases and focus on that alone. For example, if your business has a need to scan customer IDs and store them, look for images that contain IDs and categorize them as one in the inventory linked to the specific data subject.
So every picture is sensitive personal data? NO!
Only if it is used by your organization as part of business processes and you, as a controller or processor, actively collected it. Two examples to clarify this:
- A picture of an employee’s family as their background wallpaper – NO.
- Pictures of an employee as part of the company yearly trip – YES.
If you understand why, you’ve got it 🙂
MP3, WAV, OGG, MIDI, RAW
A binary representation of a voice stream.
Same as images, except that if it’s part of your business flows, such as recording conversations with customers as part of your customer services procedures, you need to inventorize it and link it to the relevant data subject.
The best practice would be to mark all customer service recording files as personal data since that’s probably the case anyway.
Once you inventorize them, make sure you don’t have copies outside the recording system. If you do, make sure you have proper business reasons and that it’s not more than the minimum time needed to reach your business goals.
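Checking for copies outside the recording system can be sketched with content hashes. The Python example below assumes you can enumerate files as (path, bytes) pairs and that the recording system lives under a known path prefix; both the function name and the paths are hypothetical. An exact hash only catches byte-identical copies, so a re-encoded or trimmed recording would be missed.

```python
import hashlib


def stray_copies(files, system_root):
    """Return paths outside `system_root` whose content matches a file inside it.

    `files` is a list of (path, content_bytes) pairs.
    """
    known = {
        hashlib.sha256(data).hexdigest()
        for path, data in files
        if path.startswith(system_root)
    }
    return sorted(
        path
        for path, data in files
        if not path.startswith(system_root)
        and hashlib.sha256(data).hexdigest() in known
    )


files = [
    ("/recordings/call1.wav", b"AAA"),
    ("/home/bob/call1-copy.wav", b"AAA"),
    ("/home/bob/notes.wav", b"BBB"),
]
print(stray_copies(files, "/recordings/"))
```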
AVI, FLV, WMV, MOV, MP4, MKV
A binary representation of a video stream.
In most cases, if your business uses video as part of business processes toward customers, you probably know about it and have dealt with it in the past as part of your security responsibility.
From a privacy point of view, a video is a combination of image and voice into one file. More than that, the context is significantly richer than in an image or sound separately, so potentially more sensitive for the data subject.
Most people would really want to know if you store video of them, and will probably not like it. As part of the DSAR process, make sure you really have the right business reasons, a strict policy, and legal backup.
Don’t forget your CCTV system. Your physical security team records all employees all the time, and in some cases your customers too, and these recordings are sensitive personal data.
There is no need to analyze your video files – they are all sensitive. Just like audio files. Secure them and make sure they are not distributed or stored with the wrong access permissions.
Since you already deployed a proper discovery tool, sample other video files your discovery found. This is not related directly to privacy but from a security perspective, it’s good to know why your organization stores or processes video files and by whom.
Marketing, HR, just browsing the internet during work hours… I get it. But I found some interesting use cases that as a CISO I would not have been happy about. The sys admins found a great way to document activities in the data center. You can see all the root passwords that were entered into the consoles. They are now saved as video files in the public folder, used for training by all other team members. 🙁
In each business, I have found uncommon file formats and unique transaction protocols that characterize the industry they belong to. Some of them are the result of “old” systems and others are industry standards.
From a security perspective, these formats store and process the most sensitive data assets; from a personal data perspective, you can easily find them if there is a link.
For example, an architect who stores designs in CAD files sounds innocent, except for the fact that the drawing may contain the owner details.
For a credit card clearing process where a proprietary VISA protocol has been used for more than 10 years with no change, it’s incredibly important that your discovery tool is able to identify and analyze it. Whoever is using it is actually the main processor of personal data in your organization.
Don’t expect to find off-the-shelf tools that will support your proprietary formats and protocols. Make sure that the tool you choose has the ability to add support in a very easy and cost-effective way.
Having a partial inventory is as good as having no inventory at all. Whatever your solution is unable to do, you will have to do manually. We’ve all been there and we know how that ends.
You may not find a tool that can do it all at the moment, but make sure you choose one that has a proven method for delivering the missing pieces during deployment. As long as these pieces simply add another protocol, repository or file format to an existing proven engine, that’s OK. If the tool has no ability to address both data in motion and at rest, known and unknown data, and structured and unstructured data, you’re gambling on the wrong horse. A year from now you will still maintain a partial inventory supported by ineffective manual mapping processes.