Introduction to Unit Testing for Machine Learning

Unit testing for machine learning has different objectives than software developers are used to. Unit tests add stability to the data pipeline and extra granularity to model validation.

Vin Vashishta | Originally Published: October 23rd, 2020


Yes, we unit test our models and code. This is a best practice that is often overlooked. Why build unit tests?

  • Catch issues in your data pipeline early.
  • Validate model inference under different request scenarios.
  • Verify changes to the model will not impact functionality and integration.

The earlier you catch defects, especially in the data phase, the less of an impact they will have downstream. Unit tests make the debugging process easier and model maintenance faster in subsequent improvement cycles.

Code coverage is the traditional unit test metric, but it does not apply to machine learning. Focus on covering the three bullet points above. That keeps the level of effort low while providing automated detection for high-impact defects.

What is a Unit Test?

It is an automated test that takes only a few lines of code to write. A unit test checks that a specific part of the code functions as expected. This is a well-established part of software development, so the tools and frameworks are very mature.
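As a minimal sketch of what that looks like, here is a small function with two pytest-style tests. The function and test names are hypothetical, chosen only to show the shape of a unit test:

```python
# Minimal pytest-style unit tests for a hypothetical helper function.
# pytest discovers and runs any function whose name starts with "test_".

def normalize(values):
    """Scale a list of numbers into the 0-1 range."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate case: all values equal, avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def test_normalize_bounds():
    result = normalize([2, 4, 6])
    assert min(result) == 0.0
    assert max(result) == 1.0

def test_normalize_constant_input():
    assert normalize([5, 5, 5]) == [0.0, 0.0, 0.0]
```

Each test pins down one expectation, so when a test fails, its name tells you which behavior broke.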

Here is a set of tutorials for Python, R, and Java (if you want to see more of the developer's deep-dive view, check out the Java one). Most languages have multiple unit test framework options, but the fundamentals are universal.

We build unit tests because we want to reduce the time between introducing a defect and finding it. Unit tests run as part of the code check-in and build process. When a unit test fails, it gives you a good idea of where to start debugging. Quicker discovery and easier resolution mean more time doing machine learning.

Here is a good tutorial on unit testing specifically for machine learning. I agree with some parts of it and diverge from others, but the technical explanation is excellent.

Machine learning has code for the pipeline, model training, model testing, validation, and serving. You could unit test the training and testing code, but I do not think it is worth the time. In my opinion, it is a bit redundant.

The pipeline and serving are worth the time to add unit tests. Those tests will catch defects early and prevent future updates from breaking existing functionality. Unit tests add another facet to model validation, providing better model transparency and greater certainty of production accuracy.

Unit Testing for Data Pipelines

Available data is imperfect. We build pipelines to handle this. Unit tests can validate that our model gets the expected data for training. This is not really an issue during our first pass; we built the pipeline specifically for that model and typically cover all our needs without error.

We usually go back to the pipeline and make changes. That is where unit tests are worth their weight in gold. Transformation changes and new data sources can break downstream processes. We may not see the signs of bad data until after one or more iterations of training.

Working in a team makes data pipeline failure scenarios more likely. Unit tests serve as a second form of documentation. They prevent team members from stepping all over each other.

During model training, I was working with data about products that had been available for sale for at least 3 months. Model training and testing went well. During the validation phase, a unit test on quantity sold failed. Values should always be positive, right?

New products could have negative quantity sold due to recalls. The model would have had an obscure production accuracy failure, one that would have been difficult to detect and trace back. The failed test also led us to consider how to handle recalls. Should we use them for training? Are there any other assumed-positive datapoints in the training set which could also be negative? For that matter, are there other production product classes we have failed to include?
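The quantity-sold scenario above can be expressed as a pipeline unit test. This is a sketch assuming a simple list-of-dicts record format; the field names and the validation function are hypothetical, not from the original pipeline:

```python
# Sketch of a data-pipeline unit test for the quantity-sold scenario.
# Field names (sku, quantity_sold, recalled) are illustrative.

def validate_quantity_sold(records):
    """Return rows that violate the sign assumption on quantity_sold.

    Negative quantities are allowed only for recalled products.
    """
    violations = []
    for row in records:
        if row["quantity_sold"] < 0 and not row.get("recalled", False):
            violations.append(row)
    return violations

def test_negative_quantity_only_for_recalls():
    records = [
        {"sku": "A1", "quantity_sold": 120, "recalled": False},
        {"sku": "B2", "quantity_sold": -15, "recalled": True},   # recall: allowed
        {"sku": "C3", "quantity_sold": -3,  "recalled": False},  # defect: flagged
    ]
    violations = validate_quantity_sold(records)
    assert [row["sku"] for row in violations] == ["C3"]
```

A test like this documents the sign assumption explicitly, so the next person changing the pipeline learns about the recall exception from the failing test instead of from a production accuracy drop.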

Unit Testing for Model Outputs

Your assumptions when building the model need to be added as unit tests on the model output. Pricing models get a lot of bad assumptions built in, and I have had to clean up that mess. Let's say that product margin increases with product price; that is my assumption. I build a unit test where I feed the trained model the same product data points with only price changed and compare the results. If my assumption is correct, the outputs pass the unit test.

They will not, because I have made a bad assumption. Margin has more components than just price. So, let's try to fix my assumption by changing the unit test parameters. Now my inputs are two examples of the same product data with the same purchase cost and different prices. This time my outputs should pass.

They still do not. Cost has more components than just purchase cost. What I am catching here is not typically found, at this depth, by looking at model performance metrics. This is a toy example with simplistic relationships. The deeper the model, the more complex, and therefore more obscure, the implications of bad assumptions become.
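The monotonicity check described above can be sketched as a unit test. The model here is a hypothetical stand-in function; in practice you would call a trained estimator's predict method with the same structure:

```python
# Sketch of a monotonicity unit test on model output: holding all other
# inputs fixed, a higher price should predict a higher margin.
# predict_margin is a toy stand-in for a trained model, not a real one.

def predict_margin(price, purchase_cost, shipping_cost):
    # Toy relationship: margin is price minus all cost components.
    return price - purchase_cost - shipping_cost

def test_margin_increases_with_price():
    # Same product, same costs, only the price changes.
    base = {"purchase_cost": 8.0, "shipping_cost": 2.0}
    low = predict_margin(price=15.0, **base)
    high = predict_margin(price=20.0, **base)
    # The assumption under test: margin is monotonic in price.
    assert high > low
```

With a real model, this test fails whenever training has picked up a relationship that contradicts the stated assumption, which is exactly the kind of defect performance metrics hide.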

As a framework, unit tests are a simple way to track assumptions that can cause model opacity and model failure in production. Unit tests become the starting-point framework for production logging. Relationships that were supported in model testing and subsequent validation with production data may be invalidated over time. Relationships can also weaken over time. Logging is a continuous validation tool that gives you granular information about failure and degradation.

Unit Testing for Integration

Integration is two-sided. Your model has to accept requests for inference and serve inference. Remember that you cleaned the data; it was often gathered in a dirty format from the same interface that will be sending your inference requests.

You will need two types of integration tests for inputs: bad requests and adversarial requests. The first type is easy to handle. Have the team or developer who wrote the request service(s) write you a set of unit tests. It is hard for you to guess what could be sent your way and time-consuming to figure it out yourself.
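Bad-request tests usually exercise a validation layer in front of the model. A sketch, assuming a JSON-like dict payload; the field names and the validation function are hypothetical:

```python
# Sketch of bad-request unit tests for a serving endpoint.
# validate_request and its field names are illustrative assumptions.

REQUIRED_FIELDS = {"sku", "price", "category"}

def validate_request(payload):
    """Reject malformed inference requests before they reach the model."""
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(payload["price"], (int, float)) or payload["price"] <= 0:
        raise ValueError("price must be a positive number")
    return True

def test_missing_field_rejected():
    try:
        validate_request({"sku": "A1", "price": 9.99})
        assert False, "expected ValueError"
    except ValueError as exc:
        assert "category" in str(exc)

def test_well_formed_request_accepted():
    assert validate_request({"sku": "A1", "price": 9.99, "category": "toys"})
```

The team that writes the request service is best placed to supply the malformed payloads these tests should cover.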

Adversarial attacks need to be accounted for as well. Unit testing forces you to think of adversarial scenarios. Model development should include robustness. There are frameworks for model evaluation against adversarial attacks and I cover the topic in depth here.

Once the model has been through that evaluation and hardening, unit tests provide a sanity check. Do not go overboard. You could spend months trying to think of all the possible scenarios. Just cover what matters.

For my pricing model, I built unit tests around scenarios that could cost the company a lot of money. A price of 1 penny would not be good. I covered cases around business rules. Price drops and increases had specific upper and lower bounds; anything outside of those required a person to authorize it. The rules were complex, with different bounds depending on product category, price, inventory, etc.
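The bounded price-change rule can be encoded as a unit test on the served price. The bounds and helper below are illustrative, not the actual business rules from that project:

```python
# Sketch of a business-rule unit test: price changes outside configured
# bounds must be flagged for human authorization. Bounds are illustrative.

BOUNDS = {"electronics": 0.10, "grocery": 0.05}  # max fractional change

def requires_authorization(category, old_price, new_price):
    """True when the proposed change exceeds the category's bound."""
    limit = BOUNDS.get(category, 0.02)  # conservative default bound
    change = abs(new_price - old_price) / old_price
    return change > limit

def test_one_penny_price_flagged():
    # A model output of $0.01 on a $50 product is an extreme drop.
    assert requires_authorization("electronics", 50.00, 0.01)

def test_small_change_auto_approved():
    # A 3% grocery price change sits inside the 5% bound.
    assert not requires_authorization("grocery", 10.00, 10.30)
```

Tests like these make the expensive failure modes explicit: any future model update that starts emitting out-of-bounds prices fails the build instead of reaching customers.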

What kind of adversarial data did I mock up? One set was based on competitors' prices, an organic, non-malicious adversarial attack. Was there a set of prices that would cause a unit test to fail? Was there a change in prices over time that would? The other scenario I covered was someone intentionally trying to reverse engineer the pricing model. This is not a traditional unit test, so I am going to gloss over it, but I do cover it in my tutorial.

Model outputs get covered here as well, from an internal team perspective. External teams who build the services that consume model inference need to build their own unit tests. It is worth holding an hour-long meeting to discuss any scenarios they have not already covered. The difference between what the model is allowed to do and what external teams expect it to do can cause unhandled exceptions.
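One lightweight way to close that gap is a contract test on the response shape, shared between the model team and consuming teams. The schema and serve function below are hypothetical stand-ins:

```python
# Sketch of a contract test pinning down the model's response shape,
# so serving code and consuming services share one checked expectation.
# RESPONSE_SCHEMA and serve are hypothetical illustrations.

RESPONSE_SCHEMA = {"sku": str, "price": float, "model_version": str}

def serve(request):
    # Stand-in for the real serving code returning an inference response.
    return {"sku": request["sku"], "price": 19.99, "model_version": "1.4.2"}

def test_response_matches_contract():
    response = serve({"sku": "A1"})
    # Exactly the agreed fields, each with the agreed type.
    assert set(response) == set(RESPONSE_SCHEMA)
    for field, expected_type in RESPONSE_SCHEMA.items():
        assert isinstance(response[field], expected_type)
```

If either side changes the contract, this test fails on the next build, which is far cheaper than an unhandled exception in a downstream service.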


Unit testing for machine learning follows its own set of rules and best practices. This may sound like a lot of work. Remember to cover what matters. Other teams aim for code coverage; you want to build to validate functionality. For machine learning, functionality is not completely code-based, so code coverage really does not matter.

It is important to understand the built model well enough to write unit tests that validate inference. This concept makes the need for unit testing clearer. Unit testing forces you to understand the data coming in and why the model functions the way it does. It forces a more complete validation. Requiring these types of unit tests prevents someone from cutting corners.