Why Do Models Drift?
Data drifts. Models lose accuracy. They require monitoring to detect drift and retraining to prevent failures. Why? Let me explain in non-math terms.
February 5th, 2021
Data drifts. That means the data you initially gathered for model development does not cover the complete distribution of the data you will encounter in the real world. That is completely normal.
Why does data drift? No dataset that I have seen is a complete representation of the system we are trying to model. For the data to be complete, I would need all distributions of all data points that describe the system I am modeling.
Put another way, a complete dataset would have every piece of data needed to predict every possible outcome. If you have ever gotten one of those, you live a charmed and magical life. I have been called a unicorn, but I do not have magic powers. That means I have to handle drift.
We start out with a distribution of distributions. Our dataset was gathered over some time and during that time we saw some set of behaviors. That time and set of behaviors is a part of some larger distribution. The system has been generating data for longer than we have been gathering it. Complex systems produce a set of behaviors. Our data gathering has observed some part of the complete set.
The data points created by a specific behavior are another set. For each emergent behavior, there is a (possibly diverse) set of data points. For some behaviors (if we are lucky), we are gathering all created data points. For some, we are not. Our dataset is a partial representation of a distribution of distributions.
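To make the "distribution of distributions" idea concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the behavior names, their probabilities, and the Gaussian measurements are invented for illustration and do not come from any real system. Each behavior has its own distribution of data points, and the gathering process can only observe some of the behaviors.

```python
import random

# Hypothetical system: each behavior occurs with some probability and
# emits a measurement drawn from its own distribution (mean, std dev).
BEHAVIORS = {
    "browse": {"prob": 0.70, "mean": 10.0, "std": 2.0},
    "buy": {"prob": 0.25, "mean": 25.0, "std": 5.0},
    "return": {"prob": 0.05, "mean": 60.0, "std": 8.0},
}


def sample_system(n, observable, rng):
    """Draw n events; keep only those whose behavior we are able to observe."""
    names = list(BEHAVIORS)
    weights = [BEHAVIORS[name]["prob"] for name in names]
    records = []
    for _ in range(n):
        behavior = rng.choices(names, weights=weights)[0]
        if behavior in observable:
            spec = BEHAVIORS[behavior]
            records.append((behavior, rng.gauss(spec["mean"], spec["std"])))
    return records


rng = random.Random(0)
# The gathered dataset only ever sees two of the three behaviors.
gathered = sample_system(5000, observable={"browse", "buy"}, rng=rng)
# The real world produces all of them.
real_world = sample_system(5000, observable=set(BEHAVIORS), rng=rng)

seen_in_training = {behavior for behavior, _ in gathered}
seen_in_production = {behavior for behavior, _ in real_world}
unseen = seen_in_production - seen_in_training
print("Behaviors the gathered dataset never saw:", unseen)
```

The gathered data is a perfectly clean sample of what was observable, and it still misses an entire behavior and the data points that behavior creates.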
I am not going to start declaring variables, but this is foundational math. You can mathematically describe the initial dataset with respect to the system being modeled. Do that on one of your projects and you will realize we are pitifully underequipped when faced with the task of modeling a system of any complexity.
The data we start out with is insufficient and suffers from assumptions built into the data gathering process. The term "dirty data" glosses over these fundamental weaknesses in the datasets we use to build models. Even the cleanest data has these significant gaps that need to be understood before we can understand model drift.
We start the process by observing a system’s behavior and gathering data from those observations. For the set of all observed emergent behaviors from a system, we have some set of data points that describe those behaviors. For a given emergent behavior, our dataset contains a partial or complete distribution of data for some or all data points created by that behavior.
As we learn more about the system from observing and then modeling its behavior, we start to gather environmental data. These are the data points that, based on our early models, we believe influence the behavior of the system.
This is our starting point for building a model that drifts. Our input data represents the starting state or environment the system finds itself in. Given that starting state, what behavior will the system exhibit? This is a simple description of how models work.
For a frequently observed behavior, we may have a sufficient distribution of all the data points needed to accurately predict that behavior given a starting state. However, does that set of data points generalize? Is that set of data points all the data points required to predict all probable emergent behaviors?
What if it is not? What if 10 data points are all the model needs to predict 85% of observed behaviors? Let’s extend that to say those 10 data points are enough to predict the other 15% of observed behaviors with 65% accuracy. If we do not dive too deeply into our model, that would give us great accuracy metrics.
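The arithmetic behind that scenario is worth spelling out. The sketch below assumes the well-covered 85% of behaviors are predicted essentially perfectly (the text does not give a number for them, so 100% is my assumption) and the remaining 15% at 65%. Under that assumption the headline number is 0.85 × 1.00 + 0.15 × 0.65 = 0.9475.

```python
def blended_accuracy(segments):
    """Overall accuracy as the share-weighted average of per-segment accuracy."""
    total_share = sum(share for share, _ in segments)
    return sum(share * acc for share, acc in segments) / total_share


segments = [
    (0.85, 1.00),  # assumption: well-covered behaviors are predicted perfectly
    (0.15, 0.65),  # the remaining behaviors at 65% accuracy (from the text)
]
overall = blended_accuracy(segments)
print(f"Headline accuracy: {overall:.2%}")
```

A single number near 95% looks great, yet more than a third of the predictions in the weaker segment are wrong. The aggregate hides exactly the part of the system the model does not understand.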
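We are basing accuracy on the available data. Our dataset is partial not only with respect to all data points created by system behaviors but also the complete distributions of those data points. There are three dimensions of incomplete data:
- Data points the system creates that we never gathered
- Incomplete distributions for the data points we did gather
- Behaviors that are unobserved, unlabeled, or mislabeled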
The first two bullet points pertain to our input data. The third pertains to our prediction or labels. Each behavior is a class. We can have an incomplete set of labels and some starting states can be mislabeled.
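As a rough sketch of how these gaps show up once new data arrives, the hypothetical helper below compares a new batch against the training data on the input side (data points we never gathered, values outside the range we covered) and on the label side (behaviors we never saw). The function name, the dictionary record layout, and the toy values are all assumptions made for illustration, and a min/max range is only a crude stand-in for a full distribution.

```python
def incompleteness_report(train_rows, train_labels, new_rows, new_labels):
    """Compare new data against training data on inputs and labels."""
    train_features = set().union(*(row.keys() for row in train_rows))
    new_features = set().union(*(row.keys() for row in new_rows))

    # Input side: data points (features) the training set never captured.
    unseen_features = new_features - train_features

    # Input side: values outside the range the training set covered.
    out_of_range = {}
    for feature in train_features & new_features:
        train_values = [row[feature] for row in train_rows if feature in row]
        low, high = min(train_values), max(train_values)
        count = sum(
            1 for row in new_rows
            if feature in row and not low <= row[feature] <= high
        )
        if count:
            out_of_range[feature] = count

    # Label side: behaviors that never appeared in training.
    unseen_labels = set(new_labels) - set(train_labels)
    return unseen_features, out_of_range, unseen_labels


train_rows = [{"age": 30, "income": 50}, {"age": 45, "income": 80}]
train_labels = ["buy", "no_buy"]
new_rows = [{"age": 70, "income": 60, "region": 3}, {"age": 35, "income": 55}]
new_labels = ["buy", "refund"]

features, ranges, labels = incompleteness_report(
    train_rows, train_labels, new_rows, new_labels
)
print(features, ranges, labels)
```

Mislabeled starting states are not caught by a check like this. Finding those takes prior knowledge of the system.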
Gathering data is limited by what we can observe about the starting state and resulting behavior of a system. It is also limited by our understanding of the system itself. Complex systems are built with smaller systems. Those systems interact with the starting state and each other to create the emergent behavior of the larger system.
This is what a deep learning model is built to account for. We need math to pull in our partial pictures (each data record in our training dataset) of the starting state and create an architecture that represents all those smaller systems’ interactions that lead to an emergent behavior. Then we use another set of partial pictures of the starting state to interrogate our model’s performance. The goal of testing and validation is to determine how well the model represents the system under different starting states.
Training, testing, and validation metrics are powerful indicators of performance to someone who is unaware of the true state of our data. Understanding the science side of our field, we are aware that those indicators are flawed. By exploring our dataset and using prior knowledge of the system we are trying to model, we have a chance to present those flaws along with the final model.
Those are uncertainty metrics. They become the data points we measure while the model runs in production to predict imminent failure. We have essentially built a model to detect the types of drift that will require model retraining. Sometimes drift is so significant that retraining also includes comparing multiple models and selecting a new one to deploy that performs better on the new normal, that is, the changed distribution of distributions.
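One common way to watch a single input in production is the Population Stability Index (PSI). To be clear, this is a technique I am swapping in for illustration, not a definition of the uncertainty metrics described above. The sketch assumes equal-width bins over the training range and a retraining threshold of 0.2, which is a rule of thumb some practitioners use and should be treated as a tunable assumption.

```python
import math
import random


def population_stability_index(reference, current, bins=10):
    """PSI between two samples, using equal-width bins over the reference range."""
    low, high = min(reference), max(reference)
    width = (high - low) / bins or 1.0  # guard against a constant reference

    def bin_shares(values):
        counts = [0] * bins
        for value in values:
            # Clamp so values outside the reference range land in the edge bins.
            index = min(max(int((value - low) / width), 0), bins - 1)
            counts[index] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


RETRAIN_THRESHOLD = 0.2  # assumption: a rule-of-thumb cutoff, tune per use case


def needs_retraining(reference, current, threshold=RETRAIN_THRESHOLD):
    """Flag drift when the PSI exceeds the chosen threshold."""
    return population_stability_index(reference, current) > threshold


rng = random.Random(42)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
stable = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # same distribution
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # mean has moved

print("stable  ->", needs_retraining(training, stable))
print("shifted ->", needs_retraining(training, shifted))
```

A check like this only sees one data point at a time. It says nothing about behaviors or data points that were never gathered, which is why it complements rather than replaces the exploration described above.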
In other cases, analyzing our initial model to understand how it works is better than sending it to production for continuous retraining. The successes and failings of the model indicate which features actually have an impact on the system versus those environmental variables which do not.
Some features represent the behavior of a smaller system that we need to better study. Features are sometimes aggregations of multiple data points and can obscure the fact that we have baked in assumptions about the subsystem which produces that feature. We can also mix independent, unrelated subsystems’ behavior data into the same feature. Our model needs better engineered features, new features, and often new architectures to improve based on what we find.
On the other side of the model, the prediction, we have an equally large set of failure cases to explore. A prediction can be a house price. It can be a buy or do-not-buy decision. It can be the time to machine failure. All of these seem like solid predictions, but they hide a complex set of subclasses for each top class.
Buying is a result of a decision chain, a string of decisions leading to a behavior. The problem is framed by a binary behavior when the reality is many different decision chains lead to buying or not buying. Breaking out subclasses can reveal more about the system than the top-level category alone. We usually find new, unobserved classes and that starts the process of detecting and gathering data for them.
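Here is a small sketch of what breaking out subclasses can look like in practice. The decision-chain names ("researched", "impulse") and the counts are invented for illustration. The same helper computes accuracy for the top-level class and for each hypothetical subclass underneath it.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Accuracy per group; each record has 'actual', 'predicted', and the key."""
    hits, totals = defaultdict(int), defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += record["actual"] == record["predicted"]
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical buyers, tagged with the decision chain that led to the purchase.
records = (
    [{"actual": "buy", "predicted": "buy", "chain": "researched"}] * 9
    + [{"actual": "buy", "predicted": "no_buy", "chain": "researched"}] * 1
    + [{"actual": "buy", "predicted": "buy", "chain": "impulse"}] * 2
    + [{"actual": "buy", "predicted": "no_buy", "chain": "impulse"}] * 3
)

top_level = accuracy_by_group(records, "actual")
by_chain = accuracy_by_group(records, "chain")
print(top_level)
print(by_chain)
```

The top-level "buy" class looks passable at 11 correct out of 15, while the breakdown shows the model handles researched purchases well (9 of 10) and impulse purchases poorly (2 of 5). That gap points at a decision chain the input data does not describe.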
All this new understanding of the system we are modeling comes from experimentation. The models that drift provide useful data about the system. We can run experiments where we compare different models and use the resulting data to verify or refute an assumption about the system.
A model is a math equation with variables for each data point and weights attached to each data point. What deep learning does is use several layers of equations aggregated into representations of complex concepts. When we deconstruct a model into these equations and aggregations, we can propose experiments to validate or reject an equation.
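In code, that description looks roughly like the sketch below. It is a deliberately tiny illustration, not how any production framework is implemented: a single weighted-sum equation, a layer made of several such equations, and a two-layer network that feeds one layer's outputs into the next. The tanh nonlinearity and all of the weights are arbitrary choices for the example.

```python
import math


def linear_model(inputs, weights, bias):
    """One equation: a weighted sum of the input data points plus a bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def layer(inputs, weight_rows, biases):
    """A layer is several such equations, each followed by a nonlinearity."""
    return [
        math.tanh(linear_model(inputs, weights, bias))
        for weights, bias in zip(weight_rows, biases)
    ]


def tiny_network(inputs, hidden_weights, hidden_biases, out_weights, out_bias):
    """Two layers: hidden representations, then one equation over them."""
    hidden = layer(inputs, hidden_weights, hidden_biases)
    return linear_model(hidden, out_weights, out_bias)


single = linear_model([1.0, 2.0], weights=[0.5, -1.0], bias=0.25)
stacked = tiny_network(
    [1.0, 2.0],
    hidden_weights=[[0.3, -0.2], [0.1, 0.4]],
    hidden_biases=[0.0, 0.1],
    out_weights=[1.5, -0.5],
    out_bias=0.2,
)
print(single, stacked)
```

Each hidden unit is an equation you can pull out and inspect on its own, which is what makes it possible to propose an experiment that validates or rejects it.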
Some experiments are unrealistic. Others are possible but cost more than using models that drift in predictable ways and handling drift with retraining. Building models that drift less is a cost-benefit analysis and a feasibility study. Often, the work I have outlined results in better uncertainty metrics and retraining automation. The research role in machine learning organizations handles this process.
Twitter’s content recommender curates the tweets each user sees on their timeline. They retrain the model daily because users’ content preferences change so often. It is a competitive advantage to rapidly react and keep users engaged using the most recent view of their preferences.
Twitter’s research is focused on optimizing their model’s retraining time and hardware usage. They run experiments to validate the complex assumptions in their model. However, a model that does not drift would read a user’s mind. There is no feasible experiment to do that yet. I say yet because there are scientific fields working to map stimulus to brain activity to decisions. Sometime in the future, it could be feasible to do this experiment.
That is drift. Data is incomplete. Models are imperfect. Applied causal machine learning is a work in progress. What we have now can be cost prohibitive for most business cases.
However, there is a middle ground. Accuracy metrics and failure analysis provide a picture of model health and performance. Automating tasks like monitoring, retraining, and model selection can reduce the costs associated with models that drift. Researchers can work to better understand the model and validate as much as is feasible.
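The model selection piece of that automation can be sketched in a few lines. This is a toy illustration under stated assumptions: the "models" are hypothetical threshold rules, the recent data is invented, and plain accuracy on recent labeled data is the only selection criterion. A real pipeline would also weigh cost, latency, and stability.

```python
def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(labels)


def select_model(candidates, recent_rows, recent_labels):
    """Score every candidate on recent data and return the best name and all scores."""
    scores = {
        name: accuracy(model, recent_rows, recent_labels)
        for name, model in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical candidates: the deployed rule and a retrained challenger.
candidates = {
    "current": lambda x: "buy" if x > 5 else "no_buy",
    "challenger": lambda x: "buy" if x > 8 else "no_buy",
}
recent_rows = [3, 6, 7, 9, 10]
recent_labels = ["no_buy", "no_buy", "no_buy", "buy", "buy"]

best, scores = select_model(candidates, recent_rows, recent_labels)
print(best, scores)
```

On this invented data the deployed rule has fallen to 3 correct out of 5 while the challenger gets all 5, so the challenger would be promoted.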
Drift is a deep rabbit hole, and this is the simplest I could make the concept. The math-heavy side of machine learning explains how to evaluate data and models, build experiments, discover and define uncertainty metrics, validate or refute causal features, and measure the impacts of drift.