00:00:06 Machine learning generational development intro.

00:00:38 1st gen: 1950s/60s statistical forecasting algorithms.

00:03:26 Transition to 2nd gen: late 80s/90s nonparametric models.

00:06:20 Statistical vs. machine learning convergence.

00:07:55 Tech improvements’ impact on machine learning evolution.

00:09:54 Deep learning’s effect on forecasting, contrasting standard ML.

00:11:31 Parametric models, avoiding deep learning overfitting.

00:13:01 Deep learning’s relationship with hardware, GPUs, linear algebra.

00:14:50 Cloud computing’s role in deep learning data processing.

00:16:01 GPU challenges, cloud computing benefits for supply chain forecasting.

00:17:22 Future of ML: rise of differentiable programming.

00:19:13 Supply chain industry’s ML investments, big data adaptation.

00:22:44 Tech change pace, supply chain executive’s adaptation.

00:25:24 Conclusion: SaaS, cloud computing’s importance in tech evolution.

### Summary

In an interview, Lokad founder Joannes Vermorel discussed the history of machine learning (ML), from its origins in 1950s time series forecasting algorithms to the advent of deep learning. He emphasized ML’s applications in supply chain management, his company’s specialty. Vermorel outlined the progression from simple, data-driven models to nonparametric statistical models capable of learning any pattern with sufficient data. The conversation covered key ML milestones, the role of technology, and the challenge of overfitting. Vermorel predicted future ML developments, including differentiable programming, and the ongoing focus on voice and image recognition. He concluded by advocating for Software as a Service to help supply chain executives keep pace with rapid technological change.

### Extended Summary

The interview between host Kieran Chandler and Joannes Vermorel, founder of Lokad, delves into the evolution and development of machine learning, with a particular emphasis on its application in supply chain management.

Vermorel suggests that the origins of machine learning can be traced back to the 1950s and 60s, with the emergence of the first time series forecasting algorithms. These algorithms, while not traditionally recognized as machine learning at their conception, exhibited key machine learning characteristics, such as being data-driven, statistical, and designed to learn patterns from data. Vermorel further highlights that the initial use of these algorithms was closely related to supply chain optimization, an area that his company, Lokad, specializes in today.

In terms of the specific methods utilized in this early phase of machine learning, Vermorel cites several that would be familiar to supply chain practitioners. These include moving averages, exponential smoothing, and more complex models such as the Holt-Winters and Box-Jenkins methods. He characterizes these initial algorithms as relatively simple, designed primarily to meet the computational capabilities of computers available during this period. These early models needed to be quick and efficient, capable of processing hundreds of data points with thousands of operations within the constraints of limited processing power and memory.
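As a rough illustration of how simple these first-generation methods are, the sketch below implements a moving average and simple exponential smoothing in plain Python. The function names and the sales figures are hypothetical, chosen only for this example; they do not come from the interview. Each forecast takes a single pass over the history, which is in line with the "few thousand operations" budget Vermorel describes.

```python
def moving_average_forecast(history, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(history):
        raise ValueError("window must be between 1 and len(history)")
    recent = history[-window:]
    return sum(recent) / window


def exponential_smoothing_forecast(history, alpha):
    """Simple exponential smoothing with one adjustable parameter, alpha."""
    if not history:
        raise ValueError("history must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = history[0]  # initialise the level with the first observation
    for value in history[1:]:
        # Blend the newest observation with the previous level.
        level = alpha * value + (1 - alpha) * level
    return level


weekly_sales = [12, 15, 11, 14, 13, 16]  # made-up demand history
ma = moving_average_forecast(weekly_sales, window=3)           # (14 + 13 + 16) / 3
ses = exponential_smoothing_forecast(weekly_sales, alpha=0.5)  # 14.53125
print(round(ma, 2), ses)
```

Both models are parametric in the sense used in the interview: the moving average has one parameter (`window`) and exponential smoothing has one (`alpha`), regardless of how much history is available.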

Shifting the conversation towards the progression of machine learning, Vermorel shares that the next significant leap occurred in the late 80s and into the 90s, marked by the emergence of nonparametric statistical models. This evolution from the first generation’s parametric models, characterized by a fixed number of adjustable parameters (typically no more than a dozen), represented a critical development.

Parametric models, limited by their fixed parameters, could only adapt to a certain range of data patterns. In contrast, nonparametric models did not have a predetermined form, allowing them to potentially learn any pattern, provided there was sufficient data. This shift signaled a breakthrough in machine learning’s capabilities and flexibility, providing the foundation for the more complex and versatile machine learning applications seen today.
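The contrast can be sketched with a toy example. The interview names support vector machines and random forests as nonparametric models; the sketch below uses a k-nearest-neighbours average instead, purely because it fits in a few lines. The data, function names, and choice of `k` are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Parametric model: a straight line, always exactly two parameters."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def knn_predict(xs, ys, query, k):
    """Nonparametric model: average the k training points closest to `query`.

    The 'model' is the training data itself, so it grows with the data.
    """
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - query))[:k]
    return sum(y for _, y in nearest) / k


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]  # a curved pattern that a straight line cannot follow

slope, intercept = fit_line(xs, ys)
line_at_zero = slope * 0 + intercept             # 4.0, far from the true value 0
knn_at_zero = knn_predict(xs, ys, query=0, k=3)  # about 0.67, much closer
print(line_at_zero, round(knn_at_zero, 2))
```

The line is stuck with two parameters no matter how many points are supplied, whereas the nearest-neighbour estimate has no fixed form and tends to track the curve more closely as more points are added.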

Vermorel also highlights the challenge of overfitting in early forecasting models, where increasing the number of parameters could lead to models that fit historical data perfectly but offered no predictive capabilities for the future. This was a major puzzle in the field for decades until the end of the 90s, when satisfying solutions emerged with the advent of nonparametric models.
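The overfitting problem described above can be shown with a deliberately extreme toy case. The numbers below are made up, and the "memorising" model is a hypothetical construction for this sketch rather than a method from the interview: it has one parameter per historical period and simply replays the past.

```python
def mse(predictions, actuals):
    """Mean squared error between two equally long sequences."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)


history = [10, 14, 8, 12, 9, 13, 11, 7]  # made-up past demand
future = [12, 9, 11, 13, 8, 10, 12, 9]   # made-up demand for the next periods

# Model A: a single parameter, the average level of the history.
level = sum(history) / len(history)
mean_train = mse([level] * len(history), history)  # 5.25
mean_test = mse([level] * len(future), future)     # 2.75

# Model B: one parameter per period, i.e. the history memorised verbatim.
memorised = list(history)
memo_train = mse(memorised, history)  # 0.0, a perfect fit on the past
memo_test = mse(memorised, future)    # 6.75, worse than the one-parameter model

print(mean_train, mean_test, memo_train, memo_test)
```

The memorising model looks flawless on the data it was fitted to and does worse than a plain average on the data it has not seen, which is the sense in which accuracy "on the data that you do not have" is the quantity that matters.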

He then discusses the two camps in the field that emerged around this time: statistical learning and machine learning. The former comprised mathematicians doing statistics with extensive use of computers, while the latter consisted of computer professionals gradually moving towards statistical algorithms. He notes that these perspectives gave rise to different modeling styles. For instance, the statistical learning camp favored support vector machines, a model well-understood mathematically. On the other hand, the machine learning camp was more attracted to neural networks, which involved a lot of numerical manipulation.

Chandler then steers the conversation towards the role of technology in the evolution of these fields. Vermorel mentions a significant breakthrough at the end of the 90s, the idea that more data leads to better results. This concept extended not just to longer time series but also to more diverse data sets. Initially, this approach was a slow process as it required waiting for more history to accumulate. However, advancements in machine learning and statistical learning allowed for leveraging data from more products, leading to more accurate demand forecasts.

Vermorel cites the introduction of models like support vector machines in the late 90s and random forests in the early 2000s as significant steps forward in capturing information from larger, more diverse data sets.

The discussion then moves to the advent of deep learning. Vermorel explains that the gradual accumulation of critical insights made deep learning considerably different from standard machine learning. One of the key benefits of deep learning is its ability to learn more complex functions with less data compared to shallow learning algorithms.

Interestingly, Vermorel points out that deep learning does not necessarily outperform classical algorithms on small data sets. But, it excels when dealing with very large data sets, where shallow learning algorithms fail to leverage the extra information available.

In a surprising turn, deep learning brought back the use of parametric models, albeit with multiple millions of parameters, in contrast to the early parametric models that had a fixed number of parameters. The challenge here was to avoid massive overfitting, which was overcome through a series of clever techniques.

Vermorel further discussed the role of Graphics Processing Units (GPUs) in the advancement of machine learning. These are essential for deep learning tasks but are expensive and power-intensive. Cloud computing platforms alleviated this issue by providing on-demand GPU farms, effectively addressing cost and energy consumption issues. This has been particularly beneficial for supply chain optimization, where statistical forecasts typically run once daily, requiring GPU allocation for only a short duration.
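A back-of-the-envelope calculation gives a feel for why on-demand allocation suits a once-a-day forecast. The 500 W figure is the upper end of the range Vermorel quotes in the transcript, and the one-hour run is his example; the fleet size of four cards is a hypothetical value chosen for this sketch, and energy is only one of several cost factors.

```python
HOURS_PER_DAY = 24

gpu_watts = 500        # upper figure quoted in the interview
gpus = 4               # hypothetical fleet size, for illustration only
run_hours_per_day = 1  # one daily forecasting run, as in the interview

# Energy drawn per day if the cards stay allocated around the clock.
always_on_kwh = gpus * gpu_watts * HOURS_PER_DAY / 1000     # 48.0 kWh

# Energy drawn per day if the cards are held only for the daily run.
on_demand_kwh = gpus * gpu_watts * run_hours_per_day / 1000  # 2.0 kWh

# Fraction of the day the cards are actually needed.
utilisation = run_hours_per_day / HOURS_PER_DAY              # about 4%

print(always_on_kwh, on_demand_kwh, round(utilisation, 3))
```

Under these assumptions the cards are needed for roughly 4% of the day, which is the kind of workload where renting capacity by the hour is attractive.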

Transitioning to the future of machine learning, Vermorel predicted a shift back to nonparametric models within the deep learning spectrum. He pointed to a new approach, “differentiable programming,” where the structure of the deep learning model is adjusted during the learning phase. This dynamic approach could be the next significant phase in machine learning and statistical learning.

When asked about the current focus of big tech companies, Vermorel mentioned that voice recognition, voice synthesis, image recognition, and natural language translation are currently receiving substantial investment. These are core areas of research and development, driving the future of machine learning. However, supply chain companies, including Lokad, are slightly behind, as they lack the resources to invest heavily in machine learning technologies.

Supply chain optimization presents unique challenges for machine learning application, particularly because it deals with smaller data chunks compared to other fields like image processing. This requires a balanced utilization of both CPUs and GPUs.

Chandler then raised the issue of rapid technological change and the challenge it poses to supply chain executives, whose implemented solutions risk becoming quickly outdated. Vermorel advised that Software as a Service (SaaS) could be a viable solution. He highlighted Lokad as an example of a SaaS provider that constantly updates and optimizes their services, thereby easing the burden on their clients.

### Full Transcript

**Kieran Chandler**: Today on Lokad TV, we’re going to go back to the start and investigate the generational development of machine learning and also understand if this gradual progress can give us any clues as to what the future of machine learning may hold. So, Joannes, what did this first generation of machine learning look like? When did it come about?

**Joannes Vermorel**: Interestingly enough, I would say the first machine learning algorithms were, in a way, related to supply chain with the very first time series forecasting algorithm that emerged during the 50s and 60s. It had all the core ingredients: it was data-driven, statistical, and indeed, it was trying to learn patterns from the data. At the time, people would not refer to that as machine learning; they were just forecasting algorithms. But all the ingredients were there.

**Kieran Chandler**: So what kind of methods were used? I mean, most supply chain practitioners would know them, right?

**Joannes Vermorel**: They would know moving average, exponential smoothing, and then there are more fancy methods from this era, such as the Holt-Winters model, the Box-Jenkins models, etc. So there was a series of relatively simple algorithms that emerged right at the beginning of computers. It’s interesting to see that as soon as we had computers in companies, they were actually used to optimize supply chains, albeit for relatively modest purposes.

**Kieran Chandler**: Back then, things were very different in the world of computational analysis. What was the main focus in those days?

**Joannes Vermorel**: The main focus was on having so little processing power, memory, and capability to do a lot of calculations. All those first-generation models, dating back to the 60s and 70s, were focused on being super fast. That means if you had 100 data points to process, you would have only a few thousand operations to do on those data points. These algorithms were designed for machines that had only kilobytes of memory and processor frequencies below 1 MHz.

**Kieran Chandler**: I imagine back then there was much less resource being applied to computational analysis compared to today, where you’ve got hundreds of thousands of people working on it. How long did it take for the next generation to come about? Did it take a long time for that to happen?

**Joannes Vermorel**: It was a gradual evolution. We had the first wave of models that emerged in the 60s and 70s, and they were all parametric models. These were statistical models with a fixed number of parameters, typically no more than a dozen.

**Kieran Chandler**: What does that mean, a parameter?

**Joannes Vermorel**: A parameter is like a number. So, your statistical model had a couple of numbers that you could adjust for the model to fit the data. The essence of the learning phase is to find those parameters. Typically, you’d have about half a dozen, maybe up to a dozen for the more fancy models, and that was it. What happened during the late 80s and more strongly in the 90s was the emergence of nonparametric statistical models. That was interesting because the first generation of models could not fit any kind of time-series patterns or any kind of demand patterns; they had a very limited number of parameters, so they were very limited in what they could learn by observing historical data.

The second generation, going from parametric to nonparametric, was significant. If you had enough data, you could potentially learn any pattern. This breakthrough at the end of the 90s led to the development of models with appealing mathematical properties. Given an arbitrarily large amount of data, you could get arbitrarily close to the best model without ending up with an overfitting problem. Overfitting, of course, occurs when you increase the number of parameters to a point where the model fits perfectly with your historical data but loses predictive capabilities about the future. Overfitting is a puzzling problem: it’s about having a forecasting model that is accurate on the data that you do not have. This problem puzzled decision-makers for decades, until some satisfying solutions emerged with the introduction of nonparametric models at the end of the 90s.

**Kieran Chandler**: With these models, we started seeing the advent of machine learning. How did that come about and what impact did it have?

**Joannes Vermorel**: It’s interesting. In terms of terminology, we had several camps. We had the camp of statistical learning where mathematicians, who were doing statistics, came to use computers extensively to support their work. On the other hand, machine learning was the opposite. It was computer folks who encountered these sort of problems and started to gradually move toward statistical algorithms. It was more of a perspective difference.

For example, in the statistical learning camp, you had support vector machines that were well-understood from a mathematical perspective, which appealed to the hardcore statistical community. On the other side, you had neural networks, a lot of numerical cooking that appealed to the machine learning community. These were different perspectives on the domain, and they did gradually converge.

**Kieran Chandler**: Regardless of the camp you belonged to, what was evolving around you was technology and the capabilities of what you could achieve with it. So, what were the significant technological improvements and breakthroughs that really helped with all of this?

**Joannes Vermorel**: The breakthrough at the end of the 90s was the idea that if you had more data, you would get better results. And I don’t mean just longer time series, but also more time series. For supply chain, that means, can you get a more accurate demand forecast just because you have more history? But the problem is, if you want one more year of sales history, you need to wait another year, which is a very slow process. Moreover, with new products being launched and some products being phased out, you never get much more history.

There were some breakthroughs in being able to leverage more data from more products. This didn’t come at the end of the 90s; it came more in the 2000s. What made it possible were breakthroughs in machine learning and statistical learning, all related to those nonparametric models.

There was a series of these statistical models that represented breakthroughs, like support vector machines, published around ‘96 with working implementation by ‘98, and then random forests around 2001. These models started to work very nicely in capturing information from larger datasets with more diversity in terms of features.

**Kieran Chandler**: Deep learning, what was the impact of this and what was the key difference between deep learning and just standard machine learning?

**Joannes Vermorel**: It’s interesting because deep learning is the conjunction of probably a dozen critical insights, but it was all very gradual. Putting all those things together, it did make quite a big difference. One key benefit of deep learning is the capability to learn more complex functions with less data. The problem with second-generation machine learning algorithms, like shallow learning, is that they can learn any statistical pattern if given enough data, but in practice, it takes a huge amount of data to get there, which is completely impractical. Deep learning, in a way, was capable of making better use of very large datasets.

First, deep learning does not necessarily outperform classical algorithms on small data sets, but when the datasets become very large, those shallow learning algorithms do not leverage all the extra information that is there as fully as is actually possible, while deep learning can. So what makes deep learning different? We started with parametric models, which were used as early as the 1950s or 1960s. These have a fixed number of parameters. Then we went to nonparametric models, where the number of parameters is dynamic. Now, with deep learning, we are back to parametric models, but the big difference is that these models have multiple millions of parameters. Our models can have anywhere up to 20 million parameters.

To avoid massive overfitting, there were a series of very clever tricks uncovered as part of the deep learning movement. Another key ingredient was to think of statistical models that had maximal affinity to the computing hardware we had, such as graphics processing units (GPUs), which are very efficient at linear algebra. One of the computational tricks of deep learning is to bring everything back to linear algebra. By switching from CPU calculations to GPU calculations, we gained two orders of magnitude of extra computation, making a lot of things that were not possible suddenly become possible.

**Kieran Chandler**: You talk about the hardware progressing and the processing capability, what were the other technical improvements that were made in the industry that made this possible? How did the advent of the cloud fit into things?

**Joannes Vermorel**: The cloud really helped facilitate gathering all the data. If you want deep learning to be really of interest, you need a lot of data. Shuffling around terabytes of data is actually much easier with the cloud.

**Kieran Chandler**: It seems that cloud computing platforms have simplified things for everyone. For example, you no longer have to deal with disk quotas or manually managing your storage across multiple physical drives. Is that correct?

**Joannes Vermorel**: Absolutely. Cloud computing platforms have eliminated a lot of the manual processes associated with storage management. Also, they have facilitated the consolidation of all the necessary layers for deep learning.

**Kieran Chandler**: What about the cost of deep learning and GPUs? They are quite expensive and consume a lot of power, don’t they?

**Joannes Vermorel**: Indeed, graphics cards can easily consume around 400 to 500 watts. If you start to have several of them, it can turn into an electrical problem. However, cloud computing has eased this by offering on-demand GPU farms. In the specific case of supply chain, it’s very convenient because typically, you only need to do your statistical forecast once a day. You can allocate your GPUs for one hour, do all your calculations, and then return them to your preferred cloud computing platform, be it Microsoft Azure, Amazon Web Services or Google Cloud.

**Kieran Chandler**: Machine learning has developed gradually over the past few decades. Can we take any clues from this to predict the future of machine learning? What can we expect to see next?

**Joannes Vermorel**: Interestingly, everything goes in cycles. We started with parametric models and time series forecast, then moved to nonparametric models with the first generic machine learning algorithms. Next, we transitioned to hyperparametric models with deep learning. Now, what’s emerging are nonparametric models again in the deep learning spectrum. These are more sophisticated deep learning methods that adjust the very structure of the model during the learning phase. If I had to bet on the buzzword of tomorrow, it would be “differentiable programming”. This approach is similar to deep learning, but it’s much more dynamic in the way the model is built and evolves during the learning phase.

**Kieran Chandler**: So, differentiable programming is the new buzzword. The supply chain industry is often a little behind the big four in terms of what they’re focusing on. What are they investing research into at the moment and what big developments can we expect in the next year or so?

**Joannes Vermorel**: As far as machine learning is concerned, the big problems the tech giants are investing billions in are voice recognition, voice synthesis, image recognition, and natural language translation. These are core problems for information-driven learning and are ahead in terms of research and development. Supply chains, including those developing machine learning software, are a bit behind. No one in the supply chain has the resources to invest a billion dollars a year for better demand forecasting.

**Kieran Chandler**: There’s been substantial investment in forecasting, but it seems like it’s a small fraction of what’s necessary. It appears to lag a couple of years behind the big developments. What are your thoughts on this?

**Joannes Vermorel**: You’re correct. The big development right now is adapting the techniques found in other areas, like image and voice processing, into supply chain situations. This requires significant redevelopment. For instance, those big problems typically have large chunks of data to process. An image, for example, will be several megabytes. Therefore, it doesn’t require a sophisticated pipeline to move your data from the CPU to the GPU. Your image is a large object with a lot of information that will stay in the GPU for quite a while before the calculation is done.

On the other hand, supply chains have different requirements. The objects you want to optimize, like stock keeping units (SKUs), are smaller data-wise but numerous. Your entire history of movements for a SKU will fit in a few kilobytes, but you have tens of millions of them. Therefore, adapting these techniques developed for big machine learning problems to supply chains presents a series of challenges. It requires us to make the most of both the CPU and the GPU because there are still a lot of calculations that are better done on the CPU side.

**Kieran Chandler**: It sounds like the industry is constantly evolving and changing. Implementations tend to become outdated quickly. How can a supply chain executive possibly keep up, and do you have any tips for that?

**Joannes Vermorel**: The pace of change is indeed a challenge. But it has always been a problem as far as computers are concerned. My suggestion is to opt for Software as a Service (SaaS) solutions like Lokad. For example, we are at the fifth generation of our forecasting engine, but our clients don’t have to do anything technical to upgrade. We upgrade them from one version to the next on their behalf, as part of the package.

With the advent of SaaS software, this problem becomes much easier to manage. You don’t have to dedicate resources just to keep up - your vendor does that for you. This was not the case with on-premises software, where upgrading from one version to the next was typically a big project.

By the way, cloud computing platforms have solved this very same problem for us. So, a supply chain manager using a SaaS app like Lokad, which delivers advanced predictive analytics to optimize your supply chain, will keep up with the pace of change. Lokad, in turn, keeps up with the pace of change because the cloud computing platform we use is Platform as a Service (PaaS), and it constantly upgrades many things for us.

**Kieran Chandler**: It sounds like everyone’s essentially keeping up with the technology advances, that’s quite insightful. Thank you for sharing your thoughts, Joannes. We’ll continue this discussion next time. Thanks for watching, everyone.