Building a predictive model from scratch

There is a lot of talk right now about the value that artificial intelligence can bring to businesses, and the logistics industry, given its complexity and the fact that e-commerce depends on it, is no exception.

Imagine that your e-commerce business needs to ship an order from San Francisco to Seattle and you have promised delivery within 2 days. It's 3:34 pm, and USPS, UPS, FedEx and OnTrac all have different cutoff times at their sorting facilities. It will take your warehouse between 15 and 45 minutes to pick and pack the order, and there is a 62% chance of a thunderstorm in San Francisco tonight. Do you ship it by air (express) or by ground?

If you ship it by air, you lose your entire profit margin. If you ship it by ground, your margin is healthy, but the package may arrive late and you may lose the customer. The only way to make this decision in real time, thousands of times a day for your growing business, is to predict the future. There are far too many variables for a human to weigh: you need AI. You need a predictive model. And if you do not have one and your competitors do, you are ceding ground and losing your competitive advantage.
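The air-versus-ground trade-off above can be written as a comparison of expected profit. What follows is only a minimal sketch under stated assumptions: the names `ShippingOption`, `expected_profit` and `choose` are hypothetical, every number is invented for illustration, and the on-time probabilities are exactly what a predictive model would have to supply.

```python
from dataclasses import dataclass


@dataclass
class ShippingOption:
    name: str
    cost: float        # assumed carrier charge for this service
    p_on_time: float   # assumed output of a predictive model, between 0 and 1


def expected_profit(option, order_margin, late_penalty):
    """Margin before shipping, minus shipping cost, minus the expected cost of a late delivery."""
    return order_margin - option.cost - (1.0 - option.p_on_time) * late_penalty


def choose(options, order_margin, late_penalty):
    """Return the option with the highest expected profit."""
    return max(options, key=lambda o: expected_profit(o, order_margin, late_penalty))


# Illustrative numbers only; none of them come from real carrier data.
air = ShippingOption("air", cost=20.0, p_on_time=0.98)
ground = ShippingOption("ground", cost=8.0, p_on_time=0.80)

best = choose([air, ground], order_margin=20.0, late_penalty=40.0)
print(best.name)
```

With these made-up inputs ground wins, because air consumes the whole margin. Raising `late_penalty` to 100.0, to reflect a customer you really cannot afford to lose, flips the choice to air. The arithmetic is trivial; the hard part is producing a trustworthy `p_on_time`.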

Start with the data

That is the promise of AI and machine learning (ML): collect a mountain of data, feed it into a predictive model and reap the benefits! Unfortunately, it is not that simple. Even the best neural networks struggle to produce accurate predictions for very complex real-world problems.

In 2016, DeepMind's AlphaGo used neural networks to beat an 18-time world champion at Go, a game that is arguably more complex than chess. Training a neural network to play a game such as chess or Go is not easy, but games differ from the real world in that you have perfect and accurate data at all times. You know the position and possible moves of every piece on the board, and you know instantly when they change. This is rarely the case for the tough business questions you want answered in order to gain a competitive advantage or reduce costs.

Your data probably comes from multiple sources of varying quality, there is no guarantee that it will reach you in real time, and there is far too much of it: more noise than signal. Before you start loading all your data into TensorFlow or Google Cloud AutoML Tables, you need to understand your domain in depth and hire a data scientist.

Statistical processing has been around for decades, and only a trained data scientist will be able to process the petabytes of data you've collected and clean them up so that your forecasts are accurate. The appeal of AI and ML is that we are supposed to get better models with much less work: no more tedious feature extraction or variable selection! But that's just not the case … for the moment. Almost none of your raw data will fit neatly into a predictive model; it will all have to be reshaped into different formats for each specific application.

It is common for newcomers to the field to be excited about the ease of use of modern AI and ML tools, but the devil is in the details. Even the simplest model will give you a prediction, but the accuracy of those predictions will be so poor that you will not be able to extract any commercial value from them. The difference between a naive model and a sophisticated model developed by a data scientist shows up in the accuracy of its predictions and the confidence you can place in them.

Our experience

At EasyPost, we try to predict when shipments will arrive at their destination. Even with tens of billions of data points on previous shipments, this is extremely difficult to do. When we first attempted these forecasts using only our tracking data, the results were catastrophic. However, when we began to pair data scientists with transportation experts, we were able to make tremendous progress in terms of speed and accuracy.

Human intelligence can meaningfully improve artificial intelligence: our human experts understand the importance of cutoff times at sorting facilities in the logistics industry. By adding data from the domain experts, in this case the cutoff times for each type of facility in the carriers' networks, we were able to significantly improve our results. By adding relevant domain-specific data to our data scientists' toolbox, we are able to create a smarter model than AI alone could.
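To make the sorting-facility cutoff (time limit) idea concrete, here is a minimal sketch of how such expert knowledge could be turned into explicit model inputs. It is not EasyPost's actual feature pipeline: the function `cutoff_features`, the feature names, the date and the 4:00 pm cutoff are all hypothetical, and only the 3:34 pm order time and the 15 to 45 minute packing window come from the example earlier in the article.

```python
from datetime import datetime, timedelta


def cutoff_features(order_time, cutoff_time, pack_minutes_min, pack_minutes_max):
    """Describe how much slack an order has before a carrier's cutoff.

    Slack is measured in minutes. A negative value means the package
    would be ready only after the cutoff has already passed.
    """
    earliest_ready = order_time + timedelta(minutes=pack_minutes_min)
    latest_ready = order_time + timedelta(minutes=pack_minutes_max)
    slack_best = (cutoff_time - earliest_ready).total_seconds() / 60
    slack_worst = (cutoff_time - latest_ready).total_seconds() / 60
    return {
        "slack_best_min": slack_best,
        "slack_worst_min": slack_worst,
        "always_makes_cutoff": slack_worst >= 0,
        "never_makes_cutoff": slack_best < 0,
    }


# Order placed at 3:34 pm; the 4:00 pm cutoff and the date are assumptions.
order_time = datetime(2024, 1, 15, 15, 34)
cutoff_time = datetime(2024, 1, 15, 16, 0)

features = cutoff_features(order_time, cutoff_time, 15, 45)
print(features)
```

In this scenario the order has 11 minutes of slack if packing is fast and misses the cutoff by 19 minutes if packing is slow, so neither flag is set. That uncertainty is something tracking data alone would not reveal, which is why the cutoff has to be supplied by someone who knows the carrier's network.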


In our experience, a complex question like the earlier one about shipping times contains too many variables for even the best neural networks to learn and solve on their own. Fortunately, they do not have to, but you will need data scientists working with domain experts to properly weight the importance of, say, air humidity levels in the Bay Area.

The future of predictive models is promising, but do not ignore the past! Statistical processing and data science are the key to framing and simplifying complex problems so that AI and advanced ML can tackle them.

Sawyer Bateman

Sawyer studied computer science at the University of Alberta in 2001 before joining Dreammates, one of the first dating sites, as CTO. After his stint in matchmaking, he ran one of the largest networks of online Flash gaming sites. In early 2013, he met Jarrett Streebin and partnered with him to create EasyPost, believing that logistics had to be made developer-friendly. They started by making shipping easy and, in 2016, decided to also be the best company at making it faster and cheaper by introducing their own distribution network. Today, Sawyer leads 60 of the world's best engineers to solve these three problems: easy, fast and economical.