Automated Machine Learning is broken
According to Wikipedia, Automated Machine Learning (AutoML) is the automation of the end-to-end process of applying machine learning to real-world problems.
A typical machine learning application consists of:
- Data preprocessing
- Feature extraction, engineering and selection
- Algorithm selection
- Hyperparameter optimization
A lot of steps! AutoML aims to automate them so you can spend your precious time on more challenging tasks. Sounds cool, but is it? Let’s look closer …
Before starting - what is your problem?
Before starting your ML pipeline (automatic or manual), you need to know:
- What are you trying to achieve?
- What problem are you trying to solve?
- Do you really need ML?
This is very important! Perhaps you don’t need a complex Machine Learning solution. Many problems are not ML problems. They might be optimization problems, exploratory data analysis tasks, or problems that can be solved with simple statistics.
OK, now you are certain that you need ML.
Data problems - what? where? how?
Then, you need data!
What data do you need?
- Do you already have it (somewhere)?
- Do you need to buy it?
- Do you need manual labeling?
- Where is the data located and how can it be accessed?
- How do you combine data from multiple sources? (this can be non-trivial)
Many say that the steps outlined above are the most time-consuming (and thus the most expensive) part of building ML solutions. At least some basic ETL will be required to run this part in a production environment.
Let’s train models (the easy part, the AutoML part)
Once the data is prepared, AutoML can be applied. With one mouse click you can launch hundreds or even thousands of models, which are then trained in parallel in the cloud. You end up with a set of models, or even an ensemble of models, trained to optimize some metric under some validation scheme. The best single (heavily tuned) model will be a few percent better than a single model with default hyperparameters. An ensemble of all your models will give you the next few percentage points of improvement. Let’s say you end up with an improvement of roughly 10-25%. Pretty good, isn’t it?! In some cases that is a lot and can be converted into a huge ROI - think of trading solutions or credit scoring tasks.
But! You end up with a complex model which is:
- A hard-to-understand black box (yes, you can use AI explainers to try to explain the model and its predictions)
- Slow to compute predictions (in the case of an ensemble, you need to compute predictions from all models in the ensemble)
- Hard to maintain …
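To make the prediction-time point concrete, here is a minimal sketch of an averaging ensemble. Everything in it is an assumption for illustration: the toy `make_linear_model` stand-ins, their weights, and plain averaging as the combination rule (AutoML tools may blend or stack models differently). The thing to notice is that one ensemble prediction costs one call to every member model.

```python
from statistics import mean
from typing import Callable, Sequence

# A "model" here is just a function from a feature vector to a score in [0, 1].
Model = Callable[[Sequence[float]], float]


def make_linear_model(weights: Sequence[float], bias: float) -> Model:
    """Build a toy model: a clipped weighted sum (stand-in for a trained model)."""
    def predict(features: Sequence[float]) -> float:
        raw = sum(w * x for w, x in zip(weights, features)) + bias
        return min(1.0, max(0.0, raw))
    return predict


def ensemble_predict(models: Sequence[Model], features: Sequence[float]) -> float:
    """Average the scores of all models; every model has to be evaluated."""
    return mean(model(features) for model in models)


# Three hypothetical models with made-up weights.
models = [
    make_linear_model([0.2, 0.1], 0.1),
    make_linear_model([0.3, 0.0], 0.2),
    make_linear_model([0.1, 0.2], 0.0),
]

record = [1.0, 2.0]
score = ensemble_predict(models, record)
print(f"ensemble score: {score:.2f} from {len(models)} model calls")
```

With three models this is cheap, but with hundreds of heavily tuned models the same loop runs hundreds of times per record.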
Machine learning maintenance
What do you mean by model maintenance? Well, it is not enough to train an accurate and fancy AI/ML model. To use the ML model in production you should:
- Monitor data quality and be aware of possible drifts (for example, in NLP the language takes years to change significantly, but customer data can change within days)
- Monitor predictions (the model computes predictions on completely new data, unseen during training, so no one knows what will happen)
- Provide a feedback loop which can be used for monitoring purposes and model retraining
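As a rough illustration of what monitoring for drift can mean, the sketch below compares a live batch of one numeric feature against the training data. The `mean_shift` and `has_drifted` helpers, the threshold of 1.0 and the sample ages are all made up for this example; real monitoring usually looks at more than the mean.

```python
from statistics import mean, stdev
from typing import Sequence


def mean_shift(train: Sequence[float], live: Sequence[float]) -> float:
    """Distance between live and training means, in training standard deviations."""
    spread = stdev(train)
    if spread == 0:
        raise ValueError("training feature is constant; shift is undefined")
    return abs(mean(live) - mean(train)) / spread


def has_drifted(train: Sequence[float], live: Sequence[float],
                threshold: float = 1.0) -> bool:
    """Flag the feature when the shift exceeds an (arbitrary) threshold."""
    return mean_shift(train, live) > threshold


# Made-up customer ages: one live batch looks like training, one does not.
train_ages = [30, 32, 34, 36, 38]
live_same = [31, 33, 35, 37]
live_shifted = [50, 52, 54, 56]

print(has_drifted(train_ages, live_same))
print(has_drifted(train_ages, live_shifted))
```

A check like this can run on every incoming batch and feed the monitoring and retraining loop described above.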
The boring software
Training ML models is exciting! This is what Tigger likes most! Building software around ML is not as exciting. The boring software needed for an ML model to operate in production consists of:
- Software to prepare input data for training and production - your ETL solution
- Software for serving the ML model and for monitoring its data quality and feedback loop
- Software for requesting predictions from the ML model on new data records and acting on them (yes, you need to do something with the predictions, and you need code for that as well)
When running ML locally, the last two steps are usually connected. All the software used needs some maintenance as well.
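Here is a small sketch of the third kind of boring software: code that takes predictions and does something with them. The `Customer` record, the `churn_score` field, the 0.8 threshold and the `send_email` callback are all hypothetical; in a real system the callback would wrap whatever email or trading API you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Customer:
    email: str
    churn_score: float  # hypothetical model output between 0 and 1


def act_on_predictions(
    customers: Iterable[Customer],
    threshold: float,
    send_email: Callable[[str], None],
) -> List[str]:
    """Email every customer whose score reaches the threshold; return who was contacted."""
    contacted = []
    for customer in customers:
        if customer.churn_score >= threshold:
            send_email(customer.email)
            contacted.append(customer.email)
    return contacted


# Stand-in for a real email client: just record the recipients.
outbox: List[str] = []
scored = [
    Customer("ann@example.com", 0.91),
    Customer("bob@example.com", 0.15),
    Customer("cat@example.com", 0.80),
]
contacted = act_on_predictions(scored, threshold=0.8, send_email=outbox.append)
print(contacted)
```

Passing the sender in as a callback keeps the decision logic testable without touching a real mail server.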
What to do with predictions
You were successful with all the steps above. Congratulations! You can now compute predictions for your new data records. What do you do with them?
- You can use them to get insights (I think this is the easiest way, and in fact it is not really using the ML model at all; it is exploratory data analysis)
- You can use them to take actions. There are many possibilities: for example, a prediction can be used to submit a trade on an exchange or to send an email to a customer. I think this is the most beautiful part of the whole ML pipeline. Act based on data.
Summary
- AutoML solutions can tune the ML model and improve its accuracy under some validation scheme and data. They can save a lot of the time data scientists usually spend checking different algorithms and validations (you don’t need to code, check library versions or interfaces). For sure, having AutoML is a huge advantage and is necessary. However, in my opinion, you need a lot of resources (money, time and people) to handle ML in your company, even if you have an AutoML magic box.
- You need to provide boring software for manipulating the data and making use of ML predictions. I think this part of the ML pipeline is badly underrated, even though it is the workhorse of a data-driven solution.
- What is more, many times you don’t need a complex, high-accuracy model. I think it is best to start with a model that is as simple as possible, and check whether you can apply it in real life and whether it provides an improvement/ROI for your business. Even simple ML models should be better than no model at all (i.e., random treatment). Simple models have a beautiful advantage over complex ones: they are readable for humans (yes, you can understand what the model is doing!).
- Use ML to take actions. You will feel the beautiful breeze of automation. The ML model learns by itself from your data (you don’t need to code it), and then it helps you predict the future. Use it to take actions.
- The commercial AutoML solutions are pricey (50k - 200k per annum), so only big enterprises can afford them. Besides that, you still need a lot of resources to successfully deploy ML in the business. I think small and medium businesses will need to wait until the technology matures and becomes more affordable.
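The advice above to start simple and check against no model at all can be sketched in a few lines. The data is invented, the 30-day cutoff is an arbitrary choice, and I use a majority-class baseline as a stand-in for "no model" (it is not the same thing as random treatment). A real check would also score the rule on held-out data rather than on the data used to pick the cutoff.

```python
from collections import Counter
from typing import List, Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def majority_baseline(labels: Sequence[int]) -> List[int]:
    """'No model': always predict the most common label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return [most_common] * len(labels)


def threshold_rule(values: Sequence[float], cutoff: float) -> List[int]:
    """A human-readable one-line model: predict 1 when the value reaches the cutoff."""
    return [1 if v >= cutoff else 0 for v in values]


# Made-up data: days since last purchase, and whether the customer churned (1) or not (0).
days_inactive = [2, 5, 40, 60, 3, 90, 7, 45]
churned = [0, 0, 1, 1, 0, 1, 0, 0]

baseline_acc = accuracy(majority_baseline(churned), churned)
rule_acc = accuracy(threshold_rule(days_inactive, cutoff=30), churned)
print(f"baseline: {baseline_acc:.3f}, simple rule: {rule_acc:.3f}")
```

If a rule this simple already beats the baseline and you can act on it, that may be enough to prove the business case before paying for anything more complex.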