
Do you have enough data?

  • Gianluca Gindro
  • Jun 18, 2020
  • 4 min read

Going deep, going wide or going quality: how to find your data bottleneck

“If only I had access to more training data, my model accuracy would massively increase”, “We should acquire more data via API”, “Data quality at source is so bad that we can’t use it”


Data is the foundation of every machine learning or analytics project, but despite the fact that we now have more data than ever, excuses about not having enough data, or the right type of it, are not in short supply.


But how can you tell whether these are real concerns or just excuses? In other words, how can you tell if data is the limiting factor of a project?


Finding the data bottleneck


There are three different ways you can act on your data:

  • Going deep: increase the number of data points

  • Going wide: increase the number of data sources

  • Going quality: fix the mess!


Going deep

In this scenario you don’t change your data structure; you simply increase the number of data points.

You do not always have control over this (e.g. you can’t easily increase the number of customers you have), but often you do, at least to some extent.

There are a few different scenarios where data quantity helps.


A/B testing or experiments

If you are running an experiment, you need enough data points to achieve statistical significance in your results. How many points you need also depends on other factors, such as the margin of error, the confidence level you want, and the variance of the distribution. For each experiment you are trying to run there is a minimum data volume threshold: if you have already reached it, you can move on, as additional data points won’t help. Otherwise, this might be your bottleneck.
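As a rough illustration, here is a minimal sketch of that threshold check in Python, using statsmodels; the baseline conversion rate and the minimum uplift worth detecting are made-up numbers, so plug in your own.

```python
# Minimal sketch of a sample-size check for an A/B test on a conversion rate.
# The baseline rate and minimum detectable effect below are assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10      # hypothetical current conversion rate
minimum_effect = 0.02     # smallest uplift worth detecting (2 percentage points)

effect_size = proportion_effectsize(baseline_rate + minimum_effect, baseline_rate)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # significance level
    power=0.80,            # chance of detecting the effect if it is real
    alternative="two-sided",
)
print(f"Minimum data points per variant: {n_per_variant:,.0f}")
```

If you already have more points per variant than this number, collecting more data won’t help the experiment; if you don’t, volume is your bottleneck.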


Prediction accuracy in machine learning

If you are training a predictive model, prediction accuracy increases with more data, but only up to a certain ‘saturation’ point. How do you find out whether you have reached that point? You can re-train your model on a varying number of training points and plot the prediction accuracy against the data volume. If your curve is not flattening yet, chances are you could still benefit from additional data.
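A minimal sketch of this check with scikit-learn; the random forest and the synthetic dataset below are just placeholders for your own model and data.

```python
# Plot cross-validated accuracy against the number of training points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Stand-in dataset; replace with your own X and y.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

train_sizes, _, test_scores = learning_curve(
    RandomForestClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),   # 10% to 100% of the training set
    cv=5,
)

plt.plot(train_sizes, test_scores.mean(axis=1), marker="o")
plt.xlabel("Number of training points")
plt.ylabel("Cross-validated accuracy")
plt.show()
```

If the curve is still rising at its right-hand edge, more data is likely to help; if it has flattened, volume is not your bottleneck.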


Source: Kim and Park, ResearchGate

Enabling deep learning

While traditional machine learning models can also run on smaller data volumes, the more complex your model is, the more data it requires. At the far end of the spectrum, deep learning models simply cannot work without a very large amount of data available. For them, big data is a requirement rather than a nice-to-have to increase performance.


Analytics & Insights

Even if you are not using your data for prediction purposes, and only want to enrich your reports or run a one-off analysis to support a decision, data volume can still be a bottleneck. This is particularly true if your data is very heterogeneous and you need to analyse it at different levels of granularity. For example, if you have a large sales force and a wide product range, each salesperson might have sold only a subset of the products. If you want to compare how good they are at selling a particular product, you might not be able to.
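One way to spot this is to count how many observations you actually have at the level of granularity you care about. A minimal sketch, assuming a hypothetical sales table with salesperson, product and amount columns:

```python
import pandas as pd

sales = pd.read_csv("sales.csv")   # hypothetical file and column names

# Count observations for every salesperson x product combination.
counts = sales.groupby(["salesperson", "product"]).size()

min_points = 30   # arbitrary threshold for a meaningful comparison
coverage = (counts >= min_points).mean()
print(f"{coverage:.0%} of salesperson/product pairs have at least {min_points} sales.")
```

A low coverage figure tells you that, at this level of granularity, data volume really is the limiting factor.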


Going wide

Variety of data can be key, but in my experience this aspect is also often overestimated.


In one of my previous jobs, I worked for a startup predicting house prices with machine learning. Our strategic advantage was the variety of data we had, so we aimed to incorporate every possible data source that could help in making predictions on real estate.

A key decision was which sources to acquire in order to increase the predictive power of the model.

How do you assess the costs and benefits of acquiring new data?


Source: Gianluca Gindro

Assessing the benefits of new data involves two main questions: how strongly does the new data correlate with the target variable we are trying to predict (hopefully as strongly as possible), and how strongly does it correlate with the data we already have (hopefully as weakly as possible)? Unfortunately this is not always easy to quantify analytically, but a bit of qualitative judgment can help filter out the best candidates.
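If you can join a sample of the new source onto what you already have, a quick sketch like the one below gives a first read on both questions. All file and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical file: existing features plus a candidate column from the new source.
df = pd.read_csv("listings_with_candidate.csv")

existing_features = ["floor_area", "n_rooms", "year_built"]
candidate = "school_rating"   # column coming from the prospective data source

# 1) How strongly does the new column relate to the target? (higher is better)
corr_with_target = df[candidate].corr(df["price"])

# 2) How redundant is it with the data we already have? (lower is better)
corr_with_existing = df[existing_features].corrwith(df[candidate]).abs().max()

print(f"Correlation with target:         {corr_with_target:.2f}")
print(f"Max |corr| with existing fields: {corr_with_existing:.2f}")
```

This only captures linear, pairwise relationships, so treat it as a first filter rather than a definitive answer.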


Assessing the costs of new data is best thought of as the total cost of ownership of that data. Sometimes there are direct costs in purchasing the data or paying for an API, but that is only part of the story. Other factors, often the largest ones, need to be taken into account:

  • One-off vs recurrent ingestion

  • Complexity of data transformation and storage

  • Data quality and data cleaning required

  • Data processing and parsing


Going quality

There is a very stimulating talk by Harvard professor Xiao-Li Meng arguing that ‘data quality is far more important than data quantity’. The beauty of the talk is that he quantifies that statement mathematically, looking at statistical measures of data quality and quantity.

Data quality is far more important than data quantity

My business experience reflects that too: companies often want to start acquiring or incorporating more data without first checking whether working on the data they already have would be enough.
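A toy simulation makes the point concrete: a small random sample usually estimates a population average better than a much larger, selection-biased one. The numbers below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=100, scale=15, size=1_000_000)   # true mean ~100

# Small but clean: 500 points drawn completely at random.
clean_sample = rng.choice(population, size=500, replace=False)

# Big but messy: 50,000 points drawn with a bias towards higher values,
# e.g. a source that over-represents one customer segment.
weights = np.exp(population / 25)
weights /= weights.sum()
biased_sample = rng.choice(population, size=50_000, p=weights)

print(f"True mean:           {population.mean():.1f}")
print(f"Small clean sample:  {clean_sample.mean():.1f}")
print(f"Large biased sample: {biased_sample.mean():.1f}")
```

Despite being a hundred times bigger, the biased sample typically lands several points away from the true mean, while the small clean sample lands close to it.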

Data quality is often a problem, a big problem. This can be due to manual input errors, inaccuracy of raw data, issues in aggregation or processing layers, data missing for periods of time, and so on.

This can require lots of work, often tedious work, but it can also bring the most rewarding results.


Conclusion

Try to identify where your data bottleneck is, if there is one.

Issues of data quantity can often be recognized by simple checks on statistical significance or on the accuracy curve. If this is not the issue, move on.

Variety of data is, in my experience, often overstated: not because new data won’t be useful, but because new sources might contain information you have already captured in some way, particularly if you already have a relatively rich dataset.

Data quality is key, and it is far better to focus on a smaller but cleaner dataset than on a large and messy one.
