
Why you need to test the tests in machine learning

  • Gianluca Gindro
  • Jun 18, 2020
  • 5 min read

What can still go wrong even if you meticulously separate training and test datasets

If you have ever interacted with a machine learning product, chances are you know about the importance of separating the training and test data of a model, to avoid over-fitting and to make sure the model will generalize well to unseen data.


The background on training vs testing separation

To those in the field, the focus on training and testing might seem obvious these days, but it has not always been so. Before the advent of data science, its ancestor, statistics, had not yet fully embraced this concept.

Why?


Before machine learning and big data, models were simpler; now, more data has allowed us to build more complex models. And as we know, over-fitting becomes increasingly problematic as model complexity increases. If you are running a linear regression, you can afford to worry less about over-fitting. Also, interpretability is normally easier with simpler models, and some of the over-fitting could be compensated for by meticulous manual work.

So, separating train and test datasets came about more as a necessity than as a best practice.


Is training vs testing separation enough?

Let’s say you are a business consumer or product manager of a machine learning product. Your data science team has developed a new ML product with the gold standard approach of separating the training and test sets.

You need to make the call whether to release the product into production. But you want to be 100% sure of its performance before making the final decision.

So you ask the Data Science team to show you the model and the data. They show you the full dataset and how it was split into a 70% training and 30% test subset, swearing that the final model was trained purely on the training dataset and that its performance metrics were derived only from the test dataset.
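In code, that gold-standard split might look something like the minimal sketch below, using scikit-learn on synthetic data (the dataset and parameters are purely illustrative, not the team's actual setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the team's dataset (purely illustrative)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 70% training / 30% test: the model only ever sees X_train / y_train,
# while X_test / y_test are reserved for the final performance numbers
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```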

You now feel relieved. Nothing can go wrong. Right?

Well, there are three common things that can still go wrong.


1. The test dataset has already been used before

In theory, the rule seems straightforward: the test dataset should never be used until the very last stage. In practice, though, this is rarely respected 100%, and the impact can range from small over-estimations of performance to pretty bad mistakes.

Two questions need to be asked in order to check this.


On which dataset has the initial data exploration been performed?

The first (and generally most time-consuming) part of data science is data exploration and data wrangling. Was this performed before or after dividing the data into train and test datasets?

From a purely theoretical perspective, you shouldn't see the test data in any way before the very final step of model assessment. Anything short of that might give you over-confidence in your predictive capabilities.

However, this is often not so easy in practice, particularly if your data comes from a business context. What if, as part of your data exploration, you need to assess missing data from manual or external sources and decide what to do with it (e.g. whether to fill or drop missing rows or columns)? The fact that you take those decisions looking at the full dataset deviates from the ideal scenario, but it is often impossible to avoid in practice.

This point becomes particularly critical if feature selection and engineering were performed in this phase. If you have chosen your features based on their correlation with the target variable on the full dataset, this will surely bias your estimate of the generalization error.
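One common way to keep these steps honest is to wrap imputation and feature selection in a pipeline that is fitted on the training fold only. The sketch below is one possible way to do this with scikit-learn, again on synthetic data, and is not meant to mirror any specific project:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Illustrative dataset (no real missing values, so the imputer is a stand-in
# for whatever missing-data strategy you decide on)
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Imputation and feature selection are fitted on the training data only,
# so the test set never influences which features are kept
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```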


How many different models have been assessed on the test dataset?

The most common way a test set can lose its pure status is when you test multiple models on it. This can sometimes be done maliciously, but in my experience it is most often the result of badly planned iterations.

Typical case: your data science team selects and builds a model; at some point they feel confident it is final, so they evaluate its performance on the test set. At this stage the model is discussed with the business, which gives feedback and suggests trying to incorporate another feature, or giving less importance to some variables. Doing such iterations two or three times might go unnoticed, but the more this is done, the bigger the problem becomes.

A caveat: what if you are using cross-validation? When you need to tune hyper-parameters, it is normally best practice to further split the 'training' data into training and validation sets. If you have limited data, cross-validation can come to the rescue by reusing the same subset for both training and validation purposes. However, cross-validation should not be used as a replacement for the test set, otherwise the same problems mentioned previously would still apply.
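Put together, the workflow could look roughly like this: cross-validated tuning happens entirely inside the training data, and the test set is touched exactly once at the end. The model and parameter grid below are placeholders, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hyper-parameters are tuned by cross-validation inside the training data only
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 5]},
    cv=5,
)
search.fit(X_train, y_train)

# The untouched test set is used exactly once, for the final number
print("Test accuracy:", search.score(X_test, y_test))
```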


2. The test data contains some form of data leakage

Data leakage is when your model uses data that would not be available in production. This can falsely inflate the model's performance metrics, basically by cheating.

Data leakage issues are not always obvious to spot. Two common ways this happens in practice:

  • The data was only available after the moment in time when the prediction had to be made

  • The data was theoretically available at that time, but not technically available to the prediction system

A scenario I have faced in my work is trying to predict hotel booking cancellations. When a hotel is booked, certain data is available at that time (e.g. about the customer, the trip, booking seasonality), but other data is only gathered in the time span between booking and the planned check-in date.

The problem is, some of that data might be tracked in the same tables of the data warehouse, or potentially in the same fields, which are then constantly updated. Without a historical view of those fields, they are unusable for model training purposes.

At the extreme, some fields might only be filled when the cancellation is actually processed. For example, we might only collect certain customer information when the cancellation request is sent. If we tried to use these fields to train our model, we would get perfect predictors of the cancellation outcome!
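A simple guard against this kind of leakage is to explicitly restrict the training features to those known at prediction time. The toy pandas sketch below illustrates the idea; the column names are hypothetical stand-ins for a real warehouse schema, not the actual one:

```python
import pandas as pd

# Toy bookings table (hypothetical columns, illustrative values only)
bookings = pd.DataFrame({
    "booking_id": [1, 2, 3],
    "booked_at": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-05"]),
    "room_rate": [120.0, 95.0, 210.0],              # known at booking time
    "special_requests_added_at": pd.to_datetime(    # only arrives after booking
        ["2020-01-10", pd.NaT, "2020-01-20"]
    ),
    "cancellation_reason": [None, None, "price"],   # only filled on cancellation
    "cancelled": [0, 0, 1],
})

# Keep only the features available at prediction (booking) time; fields that
# are populated later, or only when a cancellation is processed, are dropped
features_known_at_booking = ["room_rate"]
X = bookings[features_known_at_booking]
y = bookings["cancelled"]
```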


3. You still can’t test data from the future

Now, let's imagine you have been super careful in using your test dataset and in avoiding any data leakage.

Can you still trust the tested model performance will be reflected in production?

Well, at the very least your generalization performance still refers to data from the past and, most likely, you are going to use your model to predict a phenomenon in the future.

The basic underlying assumption of every model is that, within certain limits, the phenomenon you are trying to predict will not change too rapidly in the future; otherwise you could not infer anything from past data.

How big a problem this can be is difficult to tell, and it surely depends a lot on the application. Trying to model it could also become tricky, possibly requiring assumptions about future expectations (e.g. with a Bayesian model), which would normally not be trivial.
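Short of modelling the change itself, a cheap sanity check is to evaluate on a time-ordered split rather than a random one: train on older data and score on the most recent period, as a rough proxy for "the future". The sketch below uses a synthetic time-stamped dataset and an arbitrary cut-off date, purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy time-stamped dataset; the column names and cut-off are illustrative only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "booked_at": pd.date_range("2019-01-01", periods=500, freq="D"),
    "feature": rng.normal(size=500),
})
df["cancelled"] = (df["feature"] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Train on older bookings, evaluate on the most recent ones, mimicking how
# the model will actually be used: predicting forward in time
cutoff = pd.Timestamp("2020-01-01")
train = df[df["booked_at"] < cutoff]
test = df[df["booked_at"] >= cutoff]

model = LogisticRegression().fit(train[["feature"]], train["cancelled"])
print("Accuracy on the most recent period:",
      model.score(test[["feature"]], test["cancelled"]))
```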

In my experience, this aspect is often neglected in a business context. However, in a post-COVID world this might change. Think of training a model to predict hotel cancellations just before the COVID outbreak (like I did!).

And as life slowly goes back to normal post-COVID, it will become increasingly difficult to trust past data to predict the future without some sort of adjustment. But that is possibly the topic for another post!
