
Don’t trust data scientists to set performance metrics

  • Gianluca Gindro
  • Jun 18, 2020
  • 8 min read

How to align a machine learning product with the business

The importance of choosing the right metrics

You are the founder of an e-commerce start-up selling bikes and bike accessories. Your website is currently the same for all users, but you want to introduce a personalization feature that shows the three most relevant products at the top of the page.

You walk to the desk of the Data Science team and explain the problem. “How should we decide which products to show?”, they ask. “The ones that are most relevant to the user”, you explain. “Relevant in what sense?” “Well, the ones users are most likely to click on”, you reply, as that seems straightforward to you.

But a couple of weeks after the new feature goes live, your reports show that it has generated lots of additional clicks but very few sales. Apparently, lots of ‘click-bait’ products have been promoted! Think of a fancy bike helmet advertised with a big discount, but out of stock. This is not good.


So you complain to your Data Science team and ask them to change the concept of relevance. “Of course it does not make sense to promote products that do not convert. We need only products that the user will eventually buy!” you explain.


Good, so your data scientists go back to work, and this time their algorithm maximizes the number of items sold. But then you discover that a lot of $3.99 pedals were sold!

You now make it clear that total revenue needs to be maximized, as selling lots of cheap items won’t help your bottom line much. But wait, is it the total price of the item or your gross profit that you want to maximize? And what if it’s generally the gross profit, but you still want to sell certain low-margin products because they tend to bring high-value, more loyal customers?
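To make the point concrete, here is a toy sketch of how the ranking objective changes which product gets promoted. The product list and all the numbers are made up for illustration:

```python
# A toy sketch of how the ranking objective changes which product gets
# promoted. Products and all numbers are made up for illustration.
products = [
    # (name, click rate, conversion rate, price, unit margin)
    ("discounted helmet (out of stock)", 0.20, 0.00,  80.0,  30.0),
    ("cheap pedals",                     0.10, 0.08,   3.99,  0.50),
    ("road bike",                        0.02, 0.01, 900.0, 200.0),
]

def top_product(score):
    """Return the name of the product that maximizes the given score."""
    return max(products, key=score)[0]

clicks  = top_product(lambda p: p[1])                # click-bait wins
units   = top_product(lambda p: p[1] * p[2])         # cheap pedals win
revenue = top_product(lambda p: p[1] * p[2] * p[3])  # road bike wins
profit  = top_product(lambda p: p[1] * p[2] * p[4])  # road bike wins
print(clicks, units, revenue, profit, sep=" | ")
```

Each objective is perfectly well-defined, yet each one promotes a different product: that choice is the business decision.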


The bottom line is that a machine learning product is only as good as the objectives you give it, and setting those objectives is a business-driven decision that is often not straightforward.

But what are the levers you can control?


Working on three dimensions

Three crucial things need to be decided in every ML product:

  1. Setting proper performance metrics to measure success

  2. When the model is being built, choosing a proper optimization (loss) function for the model

  3. When the model is deployed into production, looking at the bigger picture of how the product helps the business overall

Let’s look at them one by one.


Setting performance metrics, i.e. defining success

Performance metrics of a machine learning model can represent business variables, but they don’t have to. A business focus is particularly useful when the model lacks interpretability, which can then be compensated for by more user-friendly performance metrics.

A key aspect to take into account when setting metrics and targets for a predictive model is the cost of making a mistake for the business. Let’s see that in action with the example of a classification engine.


The cost of making mistakes

An online hotel booking platform wants to predict the customers who will likely cancel their reservation.

The machine learning model flags each customer as either ‘likely to cancel’ or ‘not likely to cancel’.


The Data Science team comes back with a model that has an accuracy of 90%, that is, 90% of the customers are predicted correctly. Should you congratulate your Data Science team?

Let’s look at the outcome in more detail. There are 1000 customers, out of which 900 have been predicted correctly.


But when we look closer, we realize that not all errors are born equal. There is a difference between a customer predicted to cancel who eventually travels, and a customer predicted not to cancel who eventually cancels.


There are two questions in particular that we want to ask:

  1. When the model says a booking will cancel, how confident can I be? This is defined as precision

  2. What proportion of the canceled bookings has been correctly identified? This is called sensitivity or recall

In this case, when the model predicts a customer will cancel, it is only right 25% of the time. And of the customers who eventually cancel, it only catches around 5% of them! Hardly a good classifier, even though its accuracy is 90%. Consider also that a dummy classifier, simply predicting NO for every customer, would achieve 91% accuracy in this scenario (correctly classifying the 895 + 15 = 910 customers who don’t cancel). This is a typical problem with imbalanced classes, well known, for example, to those designing medical tests for rare diseases.
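A quick sketch of how these metrics follow from the underlying counts. The confusion matrix is reconstructed from the percentages above, so treat the exact numbers as an assumption:

```python
# Confusion matrix reconstructed from the percentages above; the exact
# counts (an assumption) are chosen to match accuracy 90%, precision 25%
# and a dummy-classifier accuracy of 91%.
tp, fp = 5, 15      # predicted 'will cancel'
fn, tn = 85, 895    # predicted 'will not cancel'

total = tp + fp + fn + tn             # 1000 customers
accuracy = (tp + tn) / total          # 0.90
precision = tp / (tp + fp)            # 0.25 - confidence when flagging
recall = tp / (tp + fn)               # ~0.056 - share of cancellers caught
dummy_accuracy = (tn + fp) / total    # 0.91 - predict NO for everyone

print(accuracy, precision, round(recall, 3), dummy_accuracy)
```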

What is the best metric to choose, then?

  • The ‘right’ metric to measure a predictive model entirely depends on your business process and what you are trying to achieve

  • Often, you need to consider at least a couple of metrics and look at the right balance between them. See precision-recall curves as a way to find the right balance

Let’s view now a couple of business applications of this model that would have different business priorities and hence would require different metrics to be optimized.


Hotel cancellation prediction for an overbooking use case

The Optimization Team wants to overbook the reservations of customers with a very high probability of canceling. The cost of a mistake is high (a customer could be left without a place to sleep!), but the team does not necessarily need a huge list of flagged customers. What should we optimize this model for?

This model will be optimized for precision: it will flag few bookings, but it will be highly confident about those.


Hotel cancellation prediction for customer service reminder calls

Consider now the same model used by Customer Service to schedule reminder calls to the customers most likely to cancel, to try to engage with them and prevent the cancellation.

In this case, the cost of making the phone call can be quite low, and the cost of making a mistake (e.g. calling someone who was not even considering canceling) is not too high. So you might want to flag a larger share of customers with different levels of confidence, and you will end up with a classifier with lower precision but higher recall/sensitivity.
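Both use cases can often be served by the same underlying model, simply by moving the decision threshold on its predicted probabilities. A minimal sketch with made-up scores and labels:

```python
# Hypothetical predicted cancellation probabilities with true labels
# (1 = cancelled). Sweeping the decision threshold trades precision
# against recall: a high threshold suits the overbooking use case,
# a low threshold suits the reminder-call use case.
data = [(0.95, 1), (0.90, 1), (0.85, 0), (0.70, 1), (0.60, 0),
        (0.55, 0), (0.40, 1), (0.30, 0), (0.20, 0), (0.10, 0)]

def precision_recall(threshold):
    tp = sum(1 for p, y in data if p >= threshold and y == 1)
    fp = sum(1 for p, y in data if p >= threshold and y == 0)
    fn = sum(1 for p, y in data if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.9, 0.5, 0.1):
    p, r = precision_recall(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With these toy numbers, threshold 0.9 gives perfect precision but misses half the cancellers, while threshold 0.1 catches every canceller at the cost of many false alarms.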


Choosing the right optimization function

At this stage, you might have chosen an ideal set of performance metrics to measure your machine learning model ex-post, but chances are you also want the model to be optimized for those metrics while it is being built in the first place.

This is achieved through the so-called loss function, that is, the error function the model tries to minimize while it is trained (i.e. while it learns from past data); it can also be seen as the mirror image of an objective function to be maximized.

This step involves quite technical and mathematical considerations, even though many pre-built algorithms work best with certain default functions. For example, if you are running a linear regression, it’s quite uncommon to deviate from least squares as the loss function.
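To see how the choice of loss function shapes the fitted model, here is a toy sketch: the same one-dimensional data fitted by minimizing squared error vs absolute error. The data and the brute-force grid search are illustrative assumptions, not a production approach:

```python
# A sketch of how the loss function shapes the fitted model: fitting
# y = b * x by grid search under two different losses. The data point
# (5, 20.0) is a deliberate outlier.
xs = [1, 2, 3, 4, 5]
ys = [1.1, 2.0, 2.9, 4.2, 20.0]   # last point is an outlier

def fit(loss):
    """Grid-search the slope b in 0.00 .. 5.00 minimizing total loss."""
    best_b, best_loss = None, float("inf")
    for i in range(0, 501):
        b = i / 100
        total = sum(loss(y - b * x) for x, y in zip(xs, ys))
        if total < best_loss:
            best_b, best_loss = b, total
    return best_b

b_l2 = fit(lambda e: e * e)   # least squares: pulled up by the outlier
b_l1 = fit(lambda e: abs(e))  # least absolute deviations: more robust
print(b_l2, b_l1)
```

The squared-error fit chases the outlier, while the absolute-error fit stays close to the bulk of the data; which behavior you want is, again, a business question.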

But this is not always the case, and in some situations, this can also become a business decision.

For example, if you are trying to cluster shops similar to each other, you need to consider both a ‘category’ dimension (the type of shop) and a geographical component (its location).

The clustering model will try to define clusters based on a distance criterion, but there is no single way to define such a distance: depending on your business use case, you might want to use a broader definition of categories (e.g. all food shops together) or wider geographical areas (e.g. all shops in a city rather than a neighborhood).
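As a sketch of such a business-driven distance, one could blend a category term and a geographic term with a weight chosen by the business. The function, fields, and weights below are illustrative assumptions:

```python
import math

# A sketch of a combined distance for shops: a category term plus a
# geographic term, with a business-chosen weight.
def shop_distance(a, b, geo_weight=0.5):
    # category distance: 0 if same category, 1 otherwise
    cat_dist = 0.0 if a["category"] == b["category"] else 1.0
    # geographic distance (Euclidean, in km, on a small local grid)
    geo_dist = math.dist((a["x"], a["y"]), (b["x"], b["y"]))
    return (1 - geo_weight) * cat_dist + geo_weight * geo_dist

bakery = {"category": "food", "x": 0.0, "y": 0.0}
grocer = {"category": "food", "x": 3.0, "y": 4.0}   # 5 km away

# Raising geo_weight makes location dominate the clustering criterion
print(shop_distance(bakery, grocer, geo_weight=0.2))
print(shop_distance(bakery, grocer, geo_weight=0.8))
```

A distance like this could be plugged into any clustering algorithm that accepts a custom metric; the point is that `geo_weight` encodes a business preference, not a mathematical truth.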

A question that could arise is:

Why do we need to define a loss function separately from the performance metrics, rather than simply use the same one?

There are two reasons for that:

  1. A model can only have one loss function, while you can have several performance metrics. For example, as we have seen, you might want to measure both the impact on gross profit and revenues of your algorithm, but the algorithm cannot maximize both at the same time

  2. Models are implemented in code, and there are operational efficiency considerations to be taken into account. This is particularly true when models need to run at scale. For example, in predicting house prices, you might ultimately want to evaluate your algorithm by median percentage error. But optimizing for a median can be tricky in practice, so you are likely to choose a different metric as the loss function.
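To make the house-price example concrete, here is a minimal sketch of evaluating predictions ex post by mean vs median absolute percentage error; the numbers are made up:

```python
import statistics

# Evaluating house-price predictions by absolute percentage error:
# the mean is sensitive to a few large misses, while the median
# (the ex-post metric in the example) is more robust to them.
actual    = [100_000, 200_000, 300_000, 400_000]
predicted = [110_000, 190_000, 330_000, 400_000]

pct_errors = [abs(p - a) / a for a, p in zip(actual, predicted)]
mean_ape   = statistics.mean(pct_errors)
median_ape = statistics.median(pct_errors)

print(f"mean APE = {mean_ape:.3f}, median APE = {median_ape:.3f}")
```

Nothing stops you from training on squared error and reporting the median percentage error afterwards, as long as you check that the two stay broadly aligned.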

In general, though, the loss function should be aligned with the evaluation metrics. The model only ever sees its loss function while it’s being trained, but you will judge it by the evaluation metrics at the end; if the two diverge, the model will get better at something you don’t actually care about.


Looking at the big picture

Measuring the performance of a machine learning product per se is only part of the story, as the ultimate objective is normally to improve the business overall. Business and product objectives can differ for several reasons; here are two common examples, one negative and one positive:

  • Your product is cannibalizing revenues from other channels

  • Your product is driving cross-sales to other channels

Let’s consider an example of a personalized newsletter. You might be able to accurately measure the product-specific metrics vs a control group, such as increased revenues, conversions, etc, by attributing additional sales when the user clicks on the newsletter and then purchases a product.

But what if the customer sees a product in the newsletter, doesn’t click on it, but then later in the day remembers it and buys it directly on the website? In most attribution models, this would be considered as a ‘direct channel’ purchase, hence not attributed to the newsletter.

On the opposite side of the spectrum, your newsletter could drive certain sales that might have occurred anyway from other channels.

A way to measure these wider effects is to evaluate the uplift based on the customer as a whole, regardless of the channel. In this example, if you are comparing treated vs control customers for your newsletter, you would need to measure their activity across the entire platform (e.g. whether they bought a product that day across all channels).
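A minimal sketch of this whole-customer comparison, with made-up conversion counts; treated customers received the newsletter, control customers did not, and conversions are counted across all channels:

```python
# A sketch of whole-customer uplift measurement: compare conversion
# across ALL channels for treated (got the newsletter) vs control
# customers. Counts are illustrative assumptions.
treated = {"newsletter": 30, "direct": 25, "none": 445}  # 500 customers
control = {"newsletter": 0,  "direct": 40, "none": 460}  # 500 customers

def conversion_rate(group):
    total = sum(group.values())
    converted = total - group["none"]    # bought via any channel
    return converted / total

uplift = conversion_rate(treated) - conversion_rate(control)
print(f"uplift = {uplift:.3f}")
```

Note that a newsletter-only attribution would credit just the 30 newsletter conversions, missing the extra direct-channel purchases the treated group made.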

This is however easier said than done for two main reasons:

  • from a technical point of view, a global attribution might require a deep level of data integration, not necessarily available within the company

  • more subtly, other internal products might run similar experiments, treating the same customer differently (e.g. showing the same customer a personalized mobile app page)

Even when it’s impossible to account for all these variables quantitatively, it’s important to at least keep them in mind from a qualitative point of view.


Key takeaways

Setting the right performance metrics can be a tricky task and needs to be a business decision. If you are from the business side, involve your data scientists, but remember that at the end of the day it’s your call and needs to be based on business priorities. If you are a data scientist, make sure you are fully aligned with the business and try to put yourself in their shoes before making these decisions.

In summary:

  • Consider metrics at various levels of the value chain: a click-through rate (CTR) might be easier to measure, but it often lacks end-to-end focus

  • One single performance metric is normally not enough: you need to consider multiple metrics and often evaluate a balance and a compromise between them

  • Always keep in mind the cost of making a wrong prediction when setting metrics. Often the same predictive model can be easily tweaked to make it more or less conservative in predicting an outcome (as in our hotel cancellation model)

  • No matter how many evaluation metrics you look at, remember that a machine learning model can only work with a single optimization (loss) function. You need to make sure this is aligned with your other metrics and goals

  • Always look at the big picture from a business point of view, even if that entails considering aspects that are not quantifiable, such as customer retention, cross-channel interactions, or long-term engagement.


 
 
 

©2020 by THE DATA MBA
