Data Wrangling for Machine Learning Professionals Part I – Sourcing and Validation

Enterprise machine learning leans heavily on quality data. The earliest stages of data gathering and wrangling have a huge impact on model performance. Here is how to do sourcing and validation right.

Vin Vashishta | Originally Published: April 11th, 2020

Book time with me for Career Coaching or sign up for my Business Strategy for Data Scientists class.

How a model performs depends on the data. There is no dispute there. Wrangling, putting data into a format that models can consume, is an art form. It is not taught beyond the basics: filling in missing data, reconciling data in different formats and from different sources, streaming, loading, transforming, and so on.
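Those basics are easy to sketch. Here is a minimal example in pandas, assuming two hypothetical CSV extracts (daily_cases.csv and county_population.csv) with mismatched column names and a join that exposes missing values; the file and column names are placeholders, not any specific dataset:

```python
import pandas as pd

# Two hypothetical extracts that use different column names and formats.
cases = pd.read_csv("daily_cases.csv", parse_dates=["report_date"])
population = pd.read_csv("county_population.csv")

# Normalize column names so the sources can be joined.
cases = cases.rename(columns={"report_date": "date", "fips_code": "fips"})
population = population.rename(columns={"county_fips": "fips"})

# Merge, then handle the missing values the join exposes.
df = cases.merge(population, on="fips", how="left")
df["population"] = df["population"].fillna(df["population"].median())
```

None of that is the hard part. The hard part comes before any of this code runs.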

You can pick apart most models by revealing the flaws in the data wrangling process. The largest companies have models in production which are broken because they are based on garbage data. A data scientist or team thought they had pulled the data into shape. However, these models fail over time, revealing fundamental flaws in the data.

I have cleaned up enough of these models to see what has been done wrong, and it is easy to criticize. That is not what I want to do with this post. Instead, I am talking through the process of working with mission-critical data. I will build a notebook and post it on GitHub a bit later with the details. This post will be better served by illustration and a deep dive.

What’s mission-critical data? COVID. That is the example I am going to use. I have been working for almost two months on COVID modeling for clients. The use cases revolve around understanding the impacts of COVID on different areas of the US. They have ranged from unemployment to workforce absenteeism to customer behavioral analysis to supplier viability and on and on. The point is, I have been nose deep in this data and I know it well.

Data Sourcing

Wrangling fails at step one: sourcing. Deep learning cannot compensate for sourcing issues, and sourcing is the most complex part of the data wrangling workflow. Data wrangling is a workflow in its own right, and there is a lot to cover there; I will leave that for another post. Suffice it to say that doing data wrangling companywide requires complex, optimized workflows.

Research is required. Research is not pulling down a dataset for analysis after googling it. This is the science part of data science. While we do not have to do all the legwork of creating datasets from scratch, we must, at the very least, validate the data and trace its provenance. We all know the basic steps here, but they are insufficient for most model development.

Let’s start with COVID data. We have all seen the charts, but data sourcing is often taken for granted. Look at a detailed write-up of COVID’s incubation period. How long is it between exposure and symptoms? This link is as well done as possible when it comes to data provenance and presentation. This is expert-level research presentation. Stumbling upon it is not research on my part.

They did research. As a data scientist, you must do some of the same research to understand the data. It cannot be used in its current form. Data sourcing must be retraced. At the bottom of well-presented data is every source for every claim or data point. Check out the bottom of this webpage; Worldometers has listed its sources.
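One way to keep that discipline in your own pipeline is to carry provenance alongside the values, so every data point can be retraced later. A minimal sketch, with hypothetical columns and illustrative values only:

```python
import pandas as pd

# Hypothetical extract: every value carries its source URL, the collection
# method, and the retrieval date. Numbers here are illustrative only.
cases = pd.DataFrame({
    "region": ["A", "B"],
    "confirmed": [120, 85],
    "source_url": ["https://example.org/report-1", "https://example.org/report-2"],
    "collection_method": ["lab_confirmed", "probable_plus_confirmed"],
    "retrieved": pd.to_datetime(["2020-04-01", "2020-04-01"]),
})

# Refuse to pass along anything that cannot be traced back to a source.
untraceable = cases["source_url"].isna() | cases["collection_method"].isna()
assert not untraceable.any(), "data points without provenance found"
```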

A Business Case for Sourcing

Let us talk about a business parallel. Some data comes with a detailed list of sources tied to each data point. We do not get that in all cases. An accounting of the methods used to generate a given data point is even less common.

We cannot assume the data we have is in its raw form. Data comes pre-wrangled, and my point is that we must do the back tracing before we do the modeling. The deeper you question your data, the more holes you will find in the way it was gathered. Each one adds an error to your models that will go undetected because flawed data becomes part of both training and testing.

Tracing COVID Incubation Sources

CNN, Al Jazeera, and Reuters all need to be thrown out. Those are secondhand sources, basically sound bites that represent a partial presentation of findings. Unreliable. They are fine for Worldometers and others whose goal is to present the most complete picture possible, but it would be irresponsible for anyone to build a model on them.
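In practice, that triage step can look like the sketch below, assuming a hypothetical table of candidate sources; the names and labels are placeholders:

```python
import pandas as pd

# Hypothetical source list; entries and labels are placeholders.
sources = pd.DataFrame({
    "source": ["CNN", "Al Jazeera", "Reuters", "peer-reviewed study A", "agency report B"],
    "kind": ["news", "news", "news", "peer_reviewed", "agency_report"],
})

# Secondhand reporting is dropped; only primary sources go to expert review.
primary = sources[sources["kind"] != "news"].copy()
print(primary["source"].tolist())
```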

The remaining expert sources need to be evaluated for bias or incomplete data. That means each paper needs to be read and evaluated. Are you an expert in this domain? I am not, so I had to consult experts. I am an expert in deep learning. That does not mean I can grab some data and act like a virologist on Twitter.

After being walked through the papers and additional publications around the topic, I better understood the reliability of each number. Specifically, the best available data on incubation periods are ranges. Ranges are often used in modeling, so that type of data is appropriate and consistent with best practices in the field. There is always a “but.” COVID’s range has not been established with certainty. What range should I use? The answer: “Which study do you like better?”

That is an expert’s way of saying, “You’ve asked a stupid question.” I wanted one number, and there were a lot of numbers. That is data sourcing: keep digging, and eventually there is a reliable dataset at the end of those efforts. Unfortunately, there is not one yet, and domain experts are the only ones qualified to keep a data scientist from pulling data points that only look certain to a statistical observer.
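What I can do in the meantime is keep each study’s range side by side instead of forcing a single number. A minimal sketch, with placeholder study names and illustrative values, not the actual literature:

```python
import pandas as pd

# Placeholder studies and illustrative values, not the actual literature.
incubation = pd.DataFrame({
    "study": ["study_a", "study_b", "study_c"],
    "low_days": [2.0, 4.0, 3.0],
    "high_days": [14.0, 7.0, 11.0],
})

# The honest summary is an envelope plus the disagreement, not an average.
envelope = (incubation["low_days"].min(), incubation["high_days"].max())
upper_disagreement = incubation["high_days"].max() - incubation["high_days"].min()
print("envelope (days):", envelope, "| disagreement in upper bound:", upper_disagreement)
```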

Back to Business: Handling Uncertainty

That is a common business problem. I learned about it in manufacturing. We had weekly meetings with the leads and supervisors who worked on the floor. I would show a set of conclusions. They would laugh and ignore them.

The only thing I was doing right was talking to experts. After a few meetings I took a trip to one of the factories and got on the floor. I spent a day wandering around with no real clue what I was looking at. I was looking for a sensor that was not working. In my mind, that was the only way the data would be unreliable. That is the kind of assumption that kills model performance.

One of the leads decided the only way to get rid of me was to explain why I did not know what I was talking about. I was feeding the models the data I had, not the data I needed to build an accurate model. There is a significant difference.

The result of wrangling can be, “I need more data.” That needs to be coupled with a method for gathering the data. In some cases, that relies on someone outside of the business. In the COVID case, I cannot get this data because there has not been enough time for a consensus to form across multiple studies.

This sounds like a dead end. Far from it. I have a range, and I have a strong understanding of the shortcomings of that range. The data is useful up to a point, and that is fine as long as I do not overstep the bounds of its utility. I need to build my model to account for the flaws in this range. The model cannot be heavily dependent on the range, and it may be helpful to create a small model around it to account for known flaws.
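One way to check whether I am overstepping the range is a simple sensitivity sweep: rerun whatever consumes the number across the whole disputed range and see how far the answer moves. The sketch below uses a hypothetical stand-in for the downstream calculation:

```python
import numpy as np

def downstream_estimate(incubation_days: float) -> float:
    # Hypothetical stand-in for whatever model consumes the incubation value.
    return 0.02 * incubation_days

candidates = np.linspace(2.0, 14.0, 25)   # the full disputed range (illustrative)
outputs = np.array([downstream_estimate(d) for d in candidates])

print("output varies from", outputs.min(), "to", outputs.max())
```

If that spread is wider than the decision it feeds, the range is being overstepped and the model needs restructuring, not a prettier point estimate.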

Deep learning models hide errors in the data. Once a flawed assumption or bias is in, it will not be seen until after the fact. We can account for bias after the model has been built, but that is inefficient. It also assumes we know what to look for when it comes to debiasing. Again, that assumption carries the same errors that are baked into the model.
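Here is a toy illustration of why the error stays hidden, not a claim about any particular model: when a flawed assumption is baked into the labels before the train/test split, held-out metrics look fine because the same flaw sits on both sides of the split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
exposure = rng.uniform(0, 10, 1000).reshape(-1, 1)

ASSUMED_VALUE = 5.0   # the flawed, single-point assumption
BETTER_VALUE = 7.0    # what better sourcing would have given us

# The flawed assumption is baked into the labels before any split happens.
labels = exposure.ravel() * ASSUMED_VALUE + rng.normal(0, 1, 1000)

X_train, X_test, y_train, y_test = train_test_split(exposure, labels, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Held-out error looks great because the same flaw is in train and test.
print("MAE on held-out split:", mean_absolute_error(y_test, model.predict(X_test)))

# Against labels built from the better number, the error becomes visible.
corrected = X_test.ravel() * BETTER_VALUE
print("MAE vs corrected labels:", mean_absolute_error(corrected, model.predict(X_test)))
```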

I am splitting this up into different posts. Halfway through writing, I realized this would become a monster if I covered everything in a single post. The next post will cover how to assess data that is presented as complete but really is not. Data scientists can add value to researchers by exposing flaws in their data. For COVID, there is significant inconsistency. This is another danger of partial wrangling. Many of the charts and projections you are seeing are built on flawed data. That is irresponsible, and I want to show you how to avoid the same mistakes with your own data.