Data Wrangling for Machine Learning Professionals Part II – Data Privacy and Security

Security and privacy are routinely overlooked during the data wrangling process. Expert machine learning teams need to build both into the earliest phases of the model development lifecycle.

Vin Vashishta | Originally Published: April 15th, 2020


We have all gotten the compliance lecture. It covers the basics, but data wrangling has grown beyond simple guidelines. At the very earliest stages of data analysis, security is often forgotten. Databases are wide open. Datasets go home on laptops or sit in open-access clouds and repositories that are public by default.

Aside from basic security measures, privacy needs to be looked at upfront. Personally identifiable data points are just the beginning. What data scientists typically work with at the data wrangling stage is the kitchen sink. That is any and all data they can get their hands on. The scope of data under analysis expands quickly during the discovery and sourcing process.

Data scientists have access to all areas of business data. Anything the company has been gathering is within reach. Anything the company can start gathering is fair game as well. There is a lot of trust given with almost no initial oversight. Oversight comes later, after the model is developed. By that time, tracing the source data is exceedingly difficult and breaches have gone unnoticed.

In the last post, I said you can pick apart most models by looking at their data wrangling processes. Poorly secured data or the unaudited use of private data allows regulators, courts, or public opinion to tear apart a company.

This is really part 1.5 of the series. I discussed source tracing in the last post with regard to the credibility of external sources. In this post I will explain source tracing with regard to privacy and security.

External Data Privacy

I’m returning to the COVID example, a case where privacy can easily be overlooked. The sources for the data are at the bottom. The CDC source is benign. There is no personal information and the conclusions are generalized. The data is aggregated behind the scenes and the source exposes no access to the more granular records. No red or grey flags here.

There’s low risk if:

  • The data is not about people in a granular presentation
  • The data is publicly available
  • The data is presented transparently

That is typical of single data points and ranges like this one. How useful is this source? It is hard to gauge certainty without a view into the underlying pieces. That is why most sources expose some or all of the underlying data to support their conclusions. The WHO source (warning: PDF download) is where we begin to wander into a slightly grey area.

The data is more granular. It includes specific cases that are used to draw the conclusions. There are two questions to ask:

  • Can data points be tracked back to an individual?
  • Would data deanonymization reveal personal or private information?

In this example, I do not see a simple way to backtrack, so we are clear on question 1. On question 2, this is an example of the top level of personal data: healthcare records. Here is the slightest of grey. It does not mean that the data cannot be used. It means it should be traceable.
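
One rough way to probe question 1 is to count how many records share the same combination of indirectly identifying fields. This is a k-anonymity style check, which is my illustration and not something either source uses. The sketch below assumes records are plain dictionaries; the field names and rows are hypothetical.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records that share the same values
    for the given fields. A result of 1 means at least one record is
    unique on those fields and may be easier to trace to a person."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical, made-up rows; not taken from any real source.
cases = [
    {"age_band": "30-39", "region": "North", "outcome": "recovered"},
    {"age_band": "30-39", "region": "North", "outcome": "recovered"},
    {"age_band": "70-79", "region": "South", "outcome": "hospitalized"},
]

k = smallest_group_size(cases, ["age_band", "region"])
print(k)  # 1: the single 70-79 / South row is unique on these fields
```

A small result does not prove a record can be traced and a large one does not prove it cannot. It is only a quick signal for whether the source belongs in the grey category.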

    There are datasets with anonymized location data. I am not linking to a specific example to avoid the implication of wrongdoing against a specific group. Location data can be back traced to individuals. Location data is personal data. Here is a deeper grey area.

    What level of personal data is this? The New York Times ran a feature on tracing location data to individuals. Reporters talked to people whose location data was easily traced. Many people did not care. Where does that leave our data source and accountability to secure data?

Again, we can use the data, but these concerns need to be logged. In cases this murky, there should be a review of ethical and legal concerns. Data wrangling is a workflow and it involves the whole company.

Business Applications – External Source Documentation

    Every data wrangling effort has a data gathering stage and each source needs to be examined. Anything grey needs to be documented. It is simple. Add it as a set of notes that stays with the dataset.

  • Source
  • Data
  • Date Gathered
  • Gathering Method
  • Privacy Concern Type
  • Threat Level

Put it in source control with the dataset. It is like comments in code and documentation for an open source project used in production.
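
A minimal sketch of such a note, assuming it is stored as a JSON file beside the dataset so the two are committed together. The `SourceNote` class, the `write_source_note` helper, the `.source.json` suffix, the threat scale, and every value in the example are hypothetical; only the six field names come from the list above.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class SourceNote:
    """One note per external source, mirroring the six fields listed above."""
    source: str
    data: str
    date_gathered: str         # ISO 8601 date string
    gathering_method: str
    privacy_concern_type: str
    threat_level: str          # hypothetical scale: none / low / medium / high


def write_source_note(note: SourceNote, dataset_path: Path) -> Path:
    """Write the note next to the dataset and return the note's path."""
    note_path = dataset_path.with_name(dataset_path.name + ".source.json")
    note_path.write_text(json.dumps(asdict(note), indent=2), encoding="utf-8")
    return note_path


# Illustrative values only; the file name and field contents are made up.
note = SourceNote(
    source="WHO report (PDF)",
    data="case-level records used to support the report's conclusions",
    date_gathered="2020-04-01",
    gathering_method="manual download",
    privacy_concern_type="healthcare records, no simple path back to individuals",
    threat_level="low",
)

# A temporary directory stands in for the project repository.
with tempfile.TemporaryDirectory() as workdir:
    dataset_path = Path(workdir) / "who_cases.csv"
    dataset_path.write_text("placeholder\n", encoding="utf-8")
    note_path = write_source_note(note, dataset_path)
    saved = json.loads(note_path.read_text(encoding="utf-8"))

print(note_path.name)         # who_cases.csv.source.json
print(saved["threat_level"])  # low
```

Plain JSON keeps the note readable in a diff, so a change to the threat level shows up in code review like any other change.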

    Why do this? At a junior level, this is simply a best practice. We teach it to new data scientists as part of data wrangling basics. It provides technical leaders with a way to keep track of their work and team leadership with a way to track progress.

At an expert level, there is a different expectation. We are expected to know and mitigate privacy risks. That is not just a data governance and compliance role. It is a matter of reputation. A data scientist who publicly fails at this stage becomes unemployable.

All external data sources have associated risks. Once that is in the log, it is out in the open. The last two fields, Privacy Concern Type and Threat Level, make a data scientist report risk. Accountability and transparency are part of the data sourcing process. Customers expect it, especially in B2B relationships.

Open Data Security

    InfoSec best practices for open data are immature. Open data streams and repositories are attack vectors. Sources must be verified before the first download. Streams need continuous validation. Anti-virus and firewalls do not provide comprehensive protection.

Why not? Open data is distributed across several environments. Each has its own security protocols. Many have no security beyond the basics. Bad data, adversarial samples, and malware have all been introduced into datasets. That vulnerability is only now being discovered.
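
One basic check before a downloaded file goes anywhere near a pipeline is comparing it against a checksum the publisher provides. The sketch below assumes the publisher lists a SHA-256 digest and that the digest was obtained over a channel you trust. The function names, file name, and payload are hypothetical, and a temporary file stands in for the download.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large datasets do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: Path, published_hex: str) -> bool:
    """True only if the file's SHA-256 equals the published digest."""
    return hmac.compare_digest(sha256_file(path), published_hex.strip().lower())


# Made-up payload standing in for a downloaded dataset.
payload = b"region,count\nnorth,12\n"
published = hashlib.sha256(payload).hexdigest()

with tempfile.TemporaryDirectory() as workdir:
    download = Path(workdir) / "open_dataset.csv"
    download.write_bytes(payload)
    intact = matches_published_checksum(download, published)
    # Simulate a file that was altered after the digest was published.
    download.write_bytes(payload + b"south,999\n")
    tampered = matches_published_checksum(download, published)

print(intact, tampered)  # True False
```

A matching checksum only shows the file is the one the publisher hashed. It says nothing about whether the publisher's copy was clean, which is why the source itself still needs to be verified.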

    There are also the streams and pipelines created to ingest data. External APIs need regular evaluation for security flaws. The services built to consume external data need to be evaluated as well. If that data goes to a distributed test environment or personal laptop/desktop, security on that endpoint needs to be evaluated.
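
For streams, one piece of continuous validation is checking every incoming record against the shape you expect and setting aside anything that does not fit. This is a minimal sketch, assuming records arrive as already-parsed dictionaries. The schema, field names, and sample batch are hypothetical.

```python
# Hypothetical schema for an external feed: field name -> required type.
EXPECTED_FIELDS = {"id": int, "region": str, "count": int}


def validate_record(record):
    """Return a list of problems; an empty list means the record passed."""
    if not isinstance(record, dict):
        return ["record is not an object"]
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif type(record[name]) is not expected_type:
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    unexpected = sorted(set(record) - set(EXPECTED_FIELDS))
    if unexpected:
        problems.append(f"unexpected fields: {unexpected}")
    return problems


def split_stream(records):
    """Separate records that pass validation from those to quarantine."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            quarantined.append((record, problems))
        else:
            accepted.append(record)
    return accepted, quarantined


# Made-up batch with one clean record and three suspicious ones.
batch = [
    {"id": 1, "region": "north", "count": 12},
    {"id": "2", "region": "south", "count": 7},
    {"id": 3, "region": "east"},
    {"id": 4, "region": "west", "count": 3, "script": "<payload>"},
]

accepted, quarantined = split_stream(batch)
print(len(accepted), len(quarantined))  # 1 3
```

Quarantining instead of silently dropping keeps the rejected records and the reasons available for the security team to review.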

There are layers of access which cannot be addressed by passwords and firewalls. There are ways malicious data or code can find its way into data science infrastructure and then propagate from there. A data scientist’s environment is extremely attractive for data theft because of all the access it has.

Internal Data Privacy and Security Audits

Internal environments feel safe and that is the root of both security and privacy lapses. External security is not up to us. Data scientists can only evaluate the risks and try to mitigate them. Third party data gathering practices are often opaque. We are limited to what vendors expose and to back tracing as much as we can. Internally, there is more control but just as much room for complacency.

The issue of privacy takes on a new aspect: internal data gathering and usage. Zoom is a good example. Reviewing individual conference calls served legitimate purposes. Terms and conditions allow for broad data use and the customer signs off before using the product.

    Data usage was not tracked or reviewed properly. It was revealed/alleged that engineers monitored conferences of women for unethical purposes.

Once data becomes available, access needs to be traced. The privacy level of the data needs to be logged. Data access needs to be granted based not just on need but on trust. Trust levels allow for usage to be monitored. Groups in a higher trust level with access to more private data should be audited more frequently. Project-based access privileges can be reviewed. Data may have been useful for the previous project, but not applicable to what a data scientist is currently working on.
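
The review described above can be sketched as a small record per access grant plus a rule for when it needs another look. This is an assumption-heavy illustration: the `AccessGrant` fields, the three trust levels, the audit intervals, and all names and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical policy: a higher trust level means access to more private
# data, so the interval between audits is shorter.
AUDIT_INTERVAL_DAYS = {1: 180, 2: 90, 3: 30}


@dataclass
class AccessGrant:
    user: str
    dataset: str
    privacy_level: int   # logged privacy level of the data
    trust_level: int     # key into AUDIT_INTERVAL_DAYS
    project: str         # project the access was granted for
    last_audit: date


def audit_due(grant: AccessGrant, today: date) -> bool:
    """True when the grant has gone a full interval without an audit."""
    interval = timedelta(days=AUDIT_INTERVAL_DAYS[grant.trust_level])
    return today - grant.last_audit >= interval


def grants_to_review(grants, active_projects, today):
    """Grants that are due for audit or tied to a project the user has left."""
    return [
        grant for grant in grants
        if audit_due(grant, today)
        or grant.project not in active_projects.get(grant.user, set())
    ]


# Made-up grants and project assignments.
grants = [
    AccessGrant("analyst_a", "support_calls", 3, 3, "churn_model", date(2020, 3, 1)),
    AccessGrant("analyst_b", "web_logs", 1, 1, "pricing", date(2020, 3, 1)),
    AccessGrant("analyst_c", "crm_export", 2, 2, "old_project", date(2020, 4, 1)),
]
active_projects = {
    "analyst_a": {"churn_model"},
    "analyst_b": {"pricing"},
    "analyst_c": {"forecasting"},
}

flagged = grants_to_review(grants, active_projects, today=date(2020, 4, 15))
print([grant.user for grant in flagged])  # ['analyst_a', 'analyst_c']
```

Here analyst_a is flagged because a high-trust grant has gone 45 days without an audit, and analyst_c because the grant belongs to a project that person no longer works on.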

    Access can be revoked to avoid the Zoom scenario. Most privacy violations happen because of availability, lack of accountability, and boredom. Periodic audits take care of the first two. They also remove any doubt as to fault. If a data scientist misuses data, they are accountable for that lapse. If the audit log or policy was not clear or not enforced, the data scientist is off the hook. Companies without privacy audits and policies can easily blame the data scientist to avoid taking accountability themselves.

Internal security falls into two main categories: endpoints and access. Internal data is safe from many of the vulnerabilities that external data brings with it. The risks start with where the data resides. Data scientists work on local environments and on-demand distributed environments. Both are endpoints, and comprehensive security on them is easily overlooked.

    Once data leaves a secured repository, the security measures from that repository need to follow it. For that to happen, the security team needs to know where data goes during the model development lifecycle. Tracing makes that quite simple. A logged event helps security teams to track data flow and address security.


    No one teaches any of this. I looked for examples online. I searched guidelines for public repositories from the top names in tech and government agencies. There are no guidelines to follow. We are on our own when it comes to security and privacy.

    GDPR and the CCPA provide a flimsy framework. They were built with the best of intentions. Regulations are difficult to craft so that they cover all bases without making some businesses unable to function. That means some activities are fine from a regulatory perspective but insufficient to protect the business or the data scientist from liability.

Data wrangling is a complex workflow with deep implications to consider. The next post in this series will address incomplete, inconsistent, or unreliable data. This is where bias enters a model. Bias takes hours to fix during wrangling, weeks to resolve in test, and sometimes months to resolve in production.