Have you heard of the Darwin Awards? Look them up on YouTube; they are usually darkly funny. It is an ironic honor recognizing people for spectacularly reckless attempts to do something they find cool: one takes a selfie with an injured bear, another straps a jet engine to a skateboard. These bold stunts end in fatal mistakes, dire consequences, and sarcastic comments. Spoiler alert: unfortunately, they all die. You don't want your startup to "die" from machine learning mistakes.
In the past 25 years, I have seen people make mistakes thousands of times, but I have never seen a machine make one. Today, a mistake in a machine learning project can cost a company millions and years of wasted work. That is why I have collected here the most common machine learning errors, related to data, metrics, validation, and technology.
Data
The chances of making a mistake when working with data are quite high; crossing a minefield safely is easier than getting a dataset right. The most common errors include:
- Unprocessed data. Raw, uncleaned data is waste that gives you no confidence in the adequacy of the resulting model. Only preprocessed data should form the basis of any AI project.
- Anomalies. Check the data for deviations and anomalies and eliminate them; getting rid of such errors is a priority in every machine learning project. Data can be incomplete or incorrect, and some information may simply be missing for a given period.
- Too little data. Running ten experiments and drawing a conclusion may be the simplest approach, but it is rarely the most correct one: a small, unbalanced dataset leads to conclusions far from the truth. If you need to train a network to distinguish penguins from bears, a handful of bear photos will not do, even if you have thousands of penguin pictures.
- Too much data. Sometimes limiting the amount of data is the only correct decision; it can give, for example, a more objective picture of how a person will act in the future. Our world and the people in it are remarkably unpredictable, and predicting someone's response from their behavior in 1998 is like reading tea leaves: the result will almost certainly be far from reality.
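The data checks above can be sketched as a small audit step run before training. This is a minimal, illustrative example in plain Python; the field names, the toy dataset, and the 10% imbalance threshold are assumptions, not a standard.

```python
def audit(rows, label_key):
    """Flag missing values and starved classes before training begins."""
    # Rows with any missing field should be repaired or dropped.
    missing = [r for r in rows if any(v is None for v in r.values())]
    labels = [r[label_key] for r in rows if r[label_key] is not None]
    counts = {c: labels.count(c) for c in set(labels)}
    # Call a class "starved" if it has under 10% of the largest class.
    top = max(counts.values())
    starved = [c for c, n in counts.items() if n < 0.1 * top]
    return {
        "missing_rows": len(missing),
        "class_counts": counts,
        "starved_classes": starved,
    }

# Toy dataset echoing the penguins-vs-bears example: thousands of
# penguins, a handful of bears, and a few broken records.
rows = (
    [{"pixels": 1, "label": "penguin"}] * 1000
    + [{"pixels": 1, "label": "bear"}] * 30
    + [{"pixels": None, "label": "bear"}] * 5
)
report = audit(rows, "label")
```

Here the audit would flag the five broken records and report "bear" as a starved class, which is exactly the situation where training should not proceed.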
Accuracy is an essential metric in machine learning, but chasing absolute accuracy can become a problem for an AI project, particularly when the goal is a predictive recommendation system. Accuracy can easily reach an incredible 99% if an online grocery store simply recommends milk: the buyer will take it, and the recommendation system will appear to work. But a city dweller who buys milk daily would have bought it anyway, so such a recommendation is pointless. What counts in these systems is an individual approach and the promotion of goods the buyer did not already have in the basket.
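The milk example can be made concrete with a few lines of code. This is a deliberately trivial sketch with made-up numbers: a "recommender" that always predicts the most common purchase scores 99% accuracy while adding no value at all.

```python
# 99 of 100 baskets contain milk; one contains something rare.
purchases = ["milk"] * 99 + ["saffron"]

# A trivial model that always recommends milk, learning nothing.
predictions = ["milk"] * 100

# Accuracy looks superb...
accuracy = sum(p == t for p, t in zip(predictions, purchases)) / len(purchases)

# ...but the model never surfaces anything the buyer would not
# have bought anyway, so the metric is misleading here.
```

A metric tied to business value, such as uplift from recommendations the buyer would not otherwise have made, tells a very different story from raw accuracy.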
A child learning the alphabet gradually masters letters, simple words, and phrases, learning and processing information at a certain level. Yet the analysis of scientific articles remains incomprehensible to the toddler, even though the words in those articles are made of the same letters he has learned.
The model in an AI project likewise learns from a specific dataset. However, the quality of the model must not be verified on the same data it was trained on. To evaluate the model, use information specially set aside for verification and never shown during training; only then can you get an accurate assessment of the model's quality.
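The point about held-out data can be shown with a deliberately naive "model" that memorizes its training set. The data below is synthetic and the 1-nearest-neighbour model is chosen only because it memorizes perfectly: it scores 100% on the data it has seen, so only the held-out set reveals its real quality.

```python
import random

random.seed(0)

# Synthetic task: class is 1 when x > 0.5, with 20% label noise.
data = []
for _ in range(200):
    x = random.random()
    y = int(x > 0.5)
    if random.random() < 0.2:
        y = 1 - y  # flip the label to simulate noisy data
    data.append((x, y))

train, test = data[:150], data[150:]

def predict(x, memory):
    # 1-nearest-neighbour: return the label of the closest stored point.
    return min(memory, key=lambda p: abs(p[0] - x))[1]

def accuracy(points, memory):
    return sum(predict(x, memory) == y for x, y in points) / len(points)

train_acc = accuracy(train, train)  # evaluated on memorised data
test_acc = accuracy(test, train)    # evaluated on unseen data
```

The training accuracy is a perfect 1.0, because every point's nearest neighbour is itself; the held-out accuracy is noticeably lower, and that is the number that actually describes the model.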
The choice of technology is another common source of mistakes in AI projects, with consequences that are rarely fatal but often serious, affecting both the efficiency and the timeline of the project.
No wonder: you can hardly find a more hyped topic in machine learning than neural networks, often presented as a universal algorithm suitable for every task. But this tool is not the most efficient or the fastest choice for every task.
The clearest example is Kaggle competitions. Neural networks do not always take first place; on the contrary, decision-tree ensembles such as random forests and gradient boosting are more likely to win, mainly on tabular data.
Neural networks are more often used to analyze visual information, speech, and other complex data.
Reaching for a neural network by default may look like the simplest solution these days, but the project team needs to understand clearly which algorithms suit a particular task.
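The rule of thumb above can be written down as a first-guess lookup. This is a toy heuristic reflecting the article's point, not an authoritative rule: the task categories and suggested starting families are my own illustrative choices.

```python
# Illustrative mapping from task type to a sensible *first* algorithm
# family to try, echoing the Kaggle observation above.
FIRST_CHOICES = {
    "tabular": "gradient-boosted trees / random forest",
    "image": "convolutional neural network",
    "audio": "neural network on spectrograms",
    "text": "transformer-based neural network",
}

def first_algorithm(task_type):
    # Default to a simple baseline when the task is unfamiliar,
    # rather than defaulting to a neural network.
    return FIRST_CHOICES.get(task_type, "start with a simple baseline")

choice = first_algorithm("tabular")
```

The useful habit is the fallback branch: when in doubt, a simple baseline both validates the pipeline and sets a bar any fancier model must beat.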
I truly believe that the hype around machine learning will not prove false, exaggerated, or baseless. Machine learning is another engineering tool that makes our lives simpler and more comfortable, gradually changing them for the better.
For many large projects, this article may just be a nostalgic retrospective of mistakes they have already made but managed to survive, overcoming serious difficulties along the product's path.
But for those just starting their AI adventure, it is a chance to understand why taking a selfie with an injured bear is not the best idea, and how not to join the endless list of "dead" startups.