In an environment where machine learning developers are encouraged to build and deploy pipelines, it can be tempting to find the nearest machine learning use case, hook up the hose and let things flow. But taking some time before deployment to actually look at your data and do an exploratory data analysis (EDA) will reveal information that will change the way you implement your machine learning models.
Exploratory data analysis helps ensure your machine learning implementations are reliable and robust, and it can surface insights before you engineer the pipeline and push to production.
Here are three ways that EDA can improve your machine learning implementations:
Any machine learning practitioner will tell you that the best way to improve your modelling output is not by picking a more sophisticated model, but by boosting your data quality and quantity.
Quantity is often the quickest criterion to evaluate through exploration. As a rule of thumb, the more complex the machine learning model you’re looking to train, the more data you’ll need to train it well. A simple regression model may only need a few hundred data points for useful insights, whereas a recommendation model meant to detect the subtle content preferences of an entire user base would likely require hundreds of thousands. Taking a quick look at your data will tell you whether or not you’re set up for success.
The behaviour of machine learning systems is largely an artifact of the data that they’re built on.
Taking an early glance into your dataset will help you understand whether you have substantial amounts of missing data, duplicate data, data whose type is incorrect or inconsistent, or simply erroneous values -- data quality issues that can degrade model quality.
Depending on the platform you’re using, your models may still train smoothly with any or all of those issues present, but their predictive performance will suffer. These types of issues can generally be raised and addressed by data professionals in collaboration with the business.
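As a minimal sketch of what that early glance could look like, the snippet below uses pandas to count missing values, duplicate rows, unparseable dates, and out-of-range values. The toy data, the column names, and the `quality_report` helper are all hypothetical, and the checks are only a starting point rather than a complete audit.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize per-column dtype, missing-value share, and distinct-value count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_frac": df.isna().mean(),
            "n_unique": df.nunique(dropna=True),
        }
    )


# Hypothetical toy data containing one of each problem described above.
df = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3, 4],
        # -7 is an erroneous value and NaN is a missing one.
        "age": [34, 51, 51, -7, np.nan],
        "signup_date": [
            "2021-01-05",
            "2021-02-11",
            "2021-02-11",
            "not a date",
            "2021-03-01",
        ],
    }
)

report = quality_report(df)

# Rows that are exact copies of an earlier row.
n_duplicate_rows = int(df.duplicated().sum())

# Values that fail to parse as dates point to inconsistent types in the column.
parsed_dates = pd.to_datetime(df["signup_date"], errors="coerce")
n_bad_dates = int(parsed_dates.isna().sum())

# A simple range check flags values that are impossible for the feature.
n_bad_ages = int((df["age"] < 0).sum())

print(report)
print(n_duplicate_rows, n_bad_dates, n_bad_ages)
```

What counts as an "impossible" value depends on the feature, so range checks like the one on `age` are best agreed with the people who know the business context.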
Part of the job of machine learning practitioners is to assess a dataset’s readiness for machine learning training given the business problem at hand, and to prevent bias from creeping into the pipeline along the way. We are more susceptible to biases than we believe, and so are our ML systems. There can be biases in the way your data has been collected and analyzed (e.g. sampling bias, measurement bias, labelling bias), in the way your models are evaluated after training, or even in the way that your business problem is formulated in the language of data. (For a more in-depth look at biases in ML, see this paper.) Some of these biases can be assessed in the data exploration phase, leading to a more robust machine learning product when things move into production.
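One bias that can be probed during exploration is sampling bias. The sketch below compares each group's share in a sample against its share in a reference population. The groups, the shares, and the 0.5 cut-off are all made up for illustration, and a real check would need a trustworthy reference distribution.

```python
import pandas as pd

# Hypothetical training sample and reference population shares (both invented).
sample = pd.Series(
    ["urban"] * 70 + ["suburban"] * 25 + ["rural"] * 5, name="region"
)
population_share = pd.Series({"urban": 0.50, "suburban": 0.30, "rural": 0.20})

# Share of each group in the sample.
sample_share = sample.value_counts(normalize=True)

# Align the two sets of shares by group label and compare them.
comparison = pd.DataFrame({"sample": sample_share, "population": population_share})
comparison["ratio"] = comparison["sample"] / comparison["population"]

# Flag groups whose sample share is less than half their population share
# (the 0.5 threshold is an arbitrary choice for this example).
under_represented = comparison.index[comparison["ratio"] < 0.5].tolist()

print(comparison)
print(under_represented)
```

A check like this only covers representation. Measurement and labelling bias usually need a closer look at how the data was produced.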
After considering bias and basic cleaning (that is, exploring the problem space and raw materials), we can move on to exploring the patterns inherent in the data. This is often the most overlooked step in machine learning preparation, and that’s a shame because it delivers a ton of value down the road.
This is where data visualization is key. Visualize the distribution of your numerical features to determine whether they’re normal, skewed, exponential, etc. Take a look at the correlations or mutual information shared between your features to see whether they’re highly related or mostly independent. Not only do these findings matter, but these visualizations can start to shed light on the patterns and trends behind the business problem you’re investigating.
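Before plotting anything, a couple of summary numbers can already hint at what the charts will show. The sketch below computes sample skewness and a Pearson correlation matrix with pandas on synthetic data. The feature names and distributions are assumptions chosen to make the contrast obvious.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2_000

# Hypothetical features: one roughly normal, one right-skewed,
# and one that is largely determined by the first.
height = rng.normal(170.0, 10.0, n)
income = rng.lognormal(mean=10.0, sigma=1.0, size=n)
weight = 0.9 * height + rng.normal(0.0, 5.0, n)

df = pd.DataFrame({"height": height, "income": income, "weight": weight})

# Sample skewness: near 0 for symmetric distributions,
# large and positive for a long right tail.
skewness = df.skew()

# Pearson correlation matrix: values near +/-1 indicate strongly
# (linearly) related features, values near 0 indicate little linear relation.
corr = df.corr()

print(skewness.round(2))
print(corr.round(2))
```

Pearson correlation only captures linear relationships. Mutual information can pick up non-linear dependence, but it needs a separate estimator such as scikit-learn's `mutual_info_regression`, which this sketch leaves out.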
Often businesses turn to machine learning as a way to get insights about their data.
EDA gives you the opportunity to ask yourself: Is my question already answered after having simply visualized the data? The insights you desire may already exist in your data without ML and can be surfaced using an EDA.
If the desired insights already exist, you may not even need to pursue a machine learning solution; you may just need a well-designed dashboard fed by a pipeline of up-to-date data. You’ve just saved your organization a ton of time and effort that would have gone down an ML rabbit hole when the optimal solution was a much simpler one.
Of course, many times we’ll still need to move forward with that initial machine learning vision, but now you’ll be moving with much more clarity. Visualizing the distributions and relationships across your features can reveal that you need some feature transformation to improve your modelling results. For example, you might take the logarithm of features with highly skewed distributions, take the ratio of two features to develop a more relevant metric, or perform principal component analysis on a set of highly correlated features. These actions all depend on your dataset, and there’s no better way to determine when a particular action is needed than to actually look at your data.
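Here is a rough sketch of those three transformations on synthetic data. All column names are hypothetical, and the PCA step is done by hand with NumPy's SVD on standardized columns to keep the example dependency-light. A library implementation such as scikit-learn's `PCA` would be the more usual choice in practice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1_000

# Hypothetical raw features; every name here is illustrative.
revenue = rng.lognormal(mean=8.0, sigma=1.2, size=n)  # heavily right-skewed
visits = rng.integers(1, 200, size=n)                 # always >= 1
base = rng.normal(0.0, 1.0, n)
sensor_a = base + rng.normal(0.0, 0.1, n)             # two noisy readings of
sensor_b = base + rng.normal(0.0, 0.1, n)             # nearly the same signal

df = pd.DataFrame(
    {
        "revenue": revenue,
        "visits": visits,
        "sensor_a": sensor_a,
        "sensor_b": sensor_b,
    }
)

# 1. Log transform to compress a long right tail (log1p also handles zeros).
df["log_revenue"] = np.log1p(df["revenue"])

# 2. Ratio of two features as a potentially more relevant metric.
df["revenue_per_visit"] = df["revenue"] / df["visits"]

# 3. PCA on the correlated pair via SVD of the standardized columns.
X = df[["sensor_a", "sensor_b"]].to_numpy()
X = (X - X.mean(axis=0)) / X.std(axis=0)
_, singular_values, components = np.linalg.svd(X, full_matrices=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Project onto the first principal component to replace the pair with one feature.
df["sensor_pc1"] = X @ components[0]

print(df[["revenue", "log_revenue"]].skew().round(2))
print(explained_variance_ratio.round(3))
```

In this toy setup the first component carries almost all of the variance of the two sensor columns, which is the kind of evidence that would justify collapsing them into one feature.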
Exploring your data can also reveal that it doesn’t carry enough signal to bring predictive power. This is when you would look to data augmentation techniques. Adding features to your dataset, whether from elsewhere in the business or from external sources, can bring new information that boosts the power of your ML products.
All of these feature transformations and data augmentation actions can be encoded into a solution like a feature store so that once the actions are explored and finalized, they’ll be performed automatically as new data flows in.
Lastly, a solid Exploratory Data Analysis can actually make the modelling process more focused, and quicker as a result. Remember those visualizations and trends you explored earlier? Those will help determine just how complex an ML model you’ll need for your product.
Machine learning is a special type of endeavour where bringing in the most powerful tool possible can actually make your product worse, because an overly flexible model can fit the noise in your training data rather than the underlying trend. Imagine if having Michael Jordan play on your schoolyard basketball team would almost guarantee you’d lose. That’s the world we find ourselves in with machine learning. A good ML practitioner will tell you that the best model for a product is (a) one that is appropriate for the problem and data at hand, and (b) the simplest one for the job. In the spirit of Occam’s Razor, the simplest viable model is usually the best one.
If the trends in your data are linear, you don’t want a complex non-linear model, even if that model is state-of-the-art. If you have a limited volume of data, you’ll want the model with fewer parameters to fit.
Exploring your data will give you insight into the complexity of the relationships in your data, guiding your decision-making process when choosing a set of models to prototype.
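To make the point about complexity concrete, here is a small sketch that fits a straight line and a degree-10 polynomial to the same limited, linearly trending data and compares their error on held-out points. The data is synthetic, the degrees and split sizes are arbitrary choices for illustration, and polynomials stand in for "simple" and "complex" models in general.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical small dataset whose true trend is linear plus noise.
x = rng.uniform(-1.0, 1.0, 30)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, 30)


def mean_holdout_mse(degree: int, n_splits: int = 200) -> float:
    """Average test MSE of a polynomial fit over repeated random 20/10 splits."""
    errors = []
    for _ in range(n_splits):
        order = rng.permutation(len(x))
        train, test = order[:20], order[20:]
        # Fit on the training points only, then score on the held-out points.
        coeffs = np.polyfit(x[train], y[train], deg=degree)
        predictions = np.polyval(coeffs, x[test])
        errors.append(np.mean((predictions - y[test]) ** 2))
    return float(np.mean(errors))


mse_linear = mean_holdout_mse(1)    # 2 parameters
mse_complex = mean_holdout_mse(10)  # 11 parameters fitted to 20 points

print(f"linear: {mse_linear:.3f}  degree-10: {mse_complex:.3f}")
```

With this little data, the flexible model tends to chase noise and its held-out error comes out higher than the straight line's. The same kind of comparison can be run on whichever candidate models your exploration suggests.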
Take some time before deployment to actually look at your data. The EDA process will reveal information that will change the way you implement your ML models.
Of equal importance is the attention paid to concerns after the model has been trained. How and where will it be deployed? How frequently will it be updated with new data? Are you monitoring your models for drift? These are some of the considerations that fall under the umbrella of MLOps, and they ensure that our ML products bring real business value over time.
If you’re looking for help with Exploratory Data Analysis or with anything under the MLOps umbrella, get in touch.