00:00:08 A/B testing and its applications in marketing and supply chains.
00:01:47 Examples of A/B testing in marketing and supply chains.
00:03:41 Problems with A/B testing in supply chains and how it displaces problems.
00:06:02 Displacement issues and interconnectedness in supply chain A/B testing.
00:07:45 Supply chains as interconnected systems and the challenges of A/B testing.
00:09:58 A/B testing limitations in supply chain management.
00:11:45 Applying reinforcement learning to supply chains.
00:13:22 Balancing exploration and exploitation in decision-making.
00:15:01 Randomness for better supply chain insights.
00:17:08 Companies exploring alternative suppliers and markets.
00:19:39 Quantifying the worth of knowledge in business decision-making.
00:20:52 How Lokad optimizes business decisions by considering second-order effects.
00:23:42 The future importance of exploration and quantifying its value for companies.

Summary

In this interview, Kieran Chandler talks to Joannes Vermorel, Lokad’s founder, about A/B testing and its limitations in supply chain optimization. They discuss the history and applications of A/B testing, which is popular in marketing but less so in supply chain management. Vermorel argues that A/B testing is insufficient for supply chain optimization due to the interconnected nature of supply chains and the limited learning it provides. Instead, he suggests adopting a machine learning approach and introducing randomness in decision-making. By continuously exploring alternative options and quantifying knowledge, Vermorel believes state-of-the-art companies can enhance their supply chain processes, driving optimization and improvement over time.

Extended Summary

In this interview, Kieran Chandler discusses A/B testing and its applications in supply chain optimization with Joannes Vermorel, the founder of Lokad, a software company specializing in supply chain optimization. They begin by explaining what A/B testing is and its history before delving into its applications, limitations, and alternatives.

A/B testing, a subset of experimental design, involves testing two variants against each other to determine their effectiveness. The method likely originated in the late 19th century, although records are unclear due to its intuitive nature. A/B testing is part of the scientific method and the broader field of design of experiments, which aims to acquire nuggets of truth about statements or hypotheses.

A/B testing is particularly popular in marketing, where it is used to assess the effectiveness of promotional materials, such as newsletters or advertisements. An example of A/B testing in marketing is splitting a customer database into two random groups and sending version A of a newsletter to the first group and version B to the second group. The outcomes are then measured to determine which version performed better.
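
As a minimal sketch of such a split and its evaluation (the function names, customer counts, and conversion numbers are illustrative, and the two-proportion z-test is one standard way to read the outcome, not a prescribed methodology):

```python
import random
from math import sqrt

def ab_split(customers, seed=42):
    """Randomly split a list of customer ids into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(customers)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test on the observed conversion counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

group_a, group_b = ab_split(range(10_000))
# ...send newsletter A to group_a and newsletter B to group_b, then count conversions...
print(z_score(conv_a=320, n_a=5_000, conv_b=280, n_b=5_000))  # |z| > 1.96 ~ 95% confidence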

In the early 2000s, Google conducted a series of A/B tests to determine the optimal number of search results to display on their search engine results page. The tests helped the company balance page load times and user satisfaction, ultimately leading to the decision to display around 10 results per page.

Although A/B testing is less popular in supply chain management, Lokad is often asked, either explicitly or implicitly, to conduct A/B tests for their clients. In the supply chain context, A/B testing usually involves comparing the performance of a set of stores managed by Lokad’s inventory optimization system against a set of comparable stores managed by the client’s existing system. The comparison is conducted over a period, such as three months, and can be referred to as a benchmark or pilot.

Vermorel argues that A/B testing may seem like a rational approach to compare two methods, but it can be problematic for supply chain optimization due to the interconnected nature of supply chains.

Vermorel explains that in a supply chain, problems are often displaced rather than solved. When comparing the performance of two different optimization techniques, they may not be independent, as they are competing for the same resources. This leads to a situation where the optimization of one technique can be done at the expense of the other. The interconnected nature of supply chains also means that when one part is affected, it can influence other parts, making it difficult to isolate and measure the impact of a single variable.
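
To make the displacement effect concrete, here is a toy simulation (the policies, targets, and stock figures are all invented for illustration): two groups of stores replenish from a single distribution center, and a "greedy" policy in the test scope wins the benchmark mostly by starving the control scope of the shared stock.

```python
DC_STOCK = 600        # units of shared stock at the distribution center
stores_a = [0] * 10   # on-hand stock of the 10 stores in the test scope
stores_b = [0] * 10   # on-hand stock of the 10 comparable control stores

def replenish(stores, target, dc):
    """Pull stock from the DC up to a per-store target, while DC stock lasts."""
    for i in range(len(stores)):
        pull = min(target - stores[i], dc)
        stores[i] += pull
        dc -= pull
    return dc

# The "test" policy is greedy: it targets 50 units per store and is served first;
# the incumbent policy targets a modest 30 units and is served second.
remaining = replenish(stores_a, target=50, dc=DC_STOCK)
remaining = replenish(stores_b, target=30, dc=remaining)

def service_level(stores, demand=25):
    """Fraction of stores able to cover a demand of `demand` units."""
    return sum(stock >= demand for stock in stores) / len(stores)

print(service_level(stores_a), service_level(stores_b))  # 1.0 vs 0.3: A "wins" by starving B
```

The benchmark would credit the test scope with a perfect service level, yet the system as a whole is no better off: the gain was displaced, not created.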

Another issue with A/B testing in the context of supply chains is the limited learning it provides. A/B testing only tests one hypothesis at a time, generating a small amount of information. This may be sufficient if one is looking for absolute certainty on something they feel strongly about, but supply chains are constantly changing, and the slow pace of A/B testing may not keep up with the evolving needs.

Vermorel also points out the problem of seasonality, which can affect the validity of A/B testing results. To account for this, a test may need to run for 12 months, but this is often not feasible as it only provides a single bit of information about which system is better. In addition, different systems may be better for different types of products or situations, further limiting the usefulness of A/B testing.
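
Vermorel's "single bit" can be made precise with a back-of-envelope calculation (our framing, not a formula from the interview): if the test answers the binary question "is A or B better?" and its verdict can be trusted with probability $p$, the information gained is at most

$$I = 1 - H(p), \qquad H(p) = -p \log_2 p - (1 - p)\log_2(1 - p),$$

so a test concluded at 90% confidence delivers roughly $1 - H(0.9) \approx 0.53$ bits, for what may be a full year of operations.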

Instead of relying on A/B testing, Vermorel suggests looking at the problem from a machine learning perspective. This approach focuses on actively extracting information from data, which can be more effective for understanding complex and interconnected systems like supply chains. By considering how decisions influence observations, it becomes possible to better learn about demand and optimize supply chain operations.
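
A small sketch of why decisions shape what can be learned (a toy model, not Lokad's): sales records are a censored view of demand, and a product kept out of the assortment produces no observation at all.

```python
def observe_sales(demand, stock):
    """What the sales history records for one product, store, and day."""
    if stock is None:          # product not in the assortment: no signal at all
        return None
    return min(demand, stock)  # sales are censored by the available stock

true_demand = 12
print(observe_sales(true_demand, stock=20))    # 12: demand fully observed
print(observe_sales(true_demand, stock=5))     # 5: a stockout hides any demand above 5
print(observe_sales(true_demand, stock=None))  # None: the decision suppressed the signal
```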

Vermorel explains that businesses should balance optimizing their current processes with exploring alternative options. This might involve introducing randomness to their decision-making, which can help prevent companies from getting stuck in a local minimum – a situation where they think they have found the best solution, but a better one exists if they were to deviate from their current approach.
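
A deliberately artificial sketch of the local-minimum problem (the cost curve and parameters are invented; only the mechanism matters): greedy descent settles in the nearest shallow valley, while a small fraction of random jumps lets the search reach the deeper one.

```python
import random

def cost(x):
    """Invented cost curve: a shallow local minimum at x=0, a deeper one at x=5."""
    return min(x ** 2 + 3, (x - 5) ** 2)

def optimize(x, exploration, steps=2000, seed=1):
    rng = random.Random(seed)
    for _ in range(steps):
        if rng.random() < exploration:
            candidate = x + rng.uniform(-4, 4)      # exploratory random jump
        else:
            candidate = x + rng.uniform(-0.1, 0.1)  # greedy local adjustment
        if cost(candidate) < cost(x):               # keep only improvements
            x = candidate
    return round(x, 2), round(cost(x), 2)

print(optimize(x=-1.0, exploration=0.0))  # settles near x=0 with cost ~3 (local minimum)
print(optimize(x=-1.0, exploration=0.1))  # reaches x~5 with cost ~0 (the deeper minimum)
```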

One way to introduce randomness is through experimenting with different products or suppliers. For example, a retail network could add a few random products to their assortment in each store or test alternative suppliers to gauge their reliability and product offerings. Companies in the automotive aftermarket industry have even implemented this approach, passing a portion of their orders to suppliers that do not initially offer the best prices or conditions, simply to test the waters.
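
One plausible mechanization of this idea is an epsilon-greedy scheme (the supplier names, lead times, and 10% exploration rate below are invented, not taken from the companies Vermorel mentions): route most orders to the best-known supplier, but a small random fraction elsewhere, so the estimates stay current for every supplier.

```python
import random

rng = random.Random(7)

# Running average of observed lead times in days (priors are invented).
lead_time = {"supplier_a": 4.0, "supplier_b": 6.0, "supplier_c": 5.5}
orders_seen = {name: 1 for name in lead_time}
EXPLORATION = 0.1  # fraction of orders deliberately routed away from the best

def pick_supplier():
    if rng.random() < EXPLORATION:
        return rng.choice(list(lead_time))    # explore: any supplier at random
    return min(lead_time, key=lead_time.get)  # exploit: best known lead time

def record_delivery(supplier, observed_days):
    """Fold a new observation into the running average for this supplier."""
    orders_seen[supplier] += 1
    lead_time[supplier] += (observed_days - lead_time[supplier]) / orders_seen[supplier]

# Invented ground truth: supplier_b has quietly improved to ~3-day lead times.
true_mean = {"supplier_a": 4.0, "supplier_b": 3.0, "supplier_c": 5.5}
for _ in range(1000):
    chosen = pick_supplier()
    record_delivery(chosen, rng.gauss(true_mean[chosen], 0.5))

print(min(lead_time, key=lead_time.get))  # exploration eventually reveals supplier_b
```

A purely exploitative policy would never place an order with supplier_b again and would never learn that it had become the best option.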

Although it might seem counterintuitive for companies to introduce such randomness into their processes, Vermorel argues that this approach can actually improve profitability in the long run. By continuously learning about their market, businesses can uncover new insights that may have significant impacts on their bottom line. For instance, they might discover that they could raise or lower their prices without impacting sales, leading to increased revenues or economies of scale.

Incorporating randomness into decision-making allows companies to test alternative markets, suppliers, price points, and even supply chain organization structures. This investment in exploration helps businesses discover slight variations that are better suited for their operations, which in turn can drive growth and enhance their overall performance.

Joannes Vermorel, founder of Lokad, discusses the importance of exploring and quantifying knowledge within a company. He references a paper he published over a decade ago introducing an algorithm called POKER (Price of Knowledge and Estimated Reward) that can help quantify the cost and reward of exploration. Vermorel emphasizes that companies should optimize for actual gains, such as dollars, rather than arbitrary targets. He predicts that state-of-the-art companies will increasingly introduce exploration and randomization into their supply chain processes to drive optimization and improve over time.
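
The following is not the POKER algorithm itself, only a back-of-envelope calculation in its spirit, with invented numbers: exploration is worth paying for when the expected gain, exploited over the remaining horizon, outweighs the one-off cost of the trial.

```python
def exploration_value(p_better, gain_per_order, trial_cost, remaining_orders):
    """
    Expected net value of running a trial: the chance the alternative is better,
    times the per-order gain, times the horizon over which the finding can be
    exploited, minus the one-off cost of the trial itself.
    """
    return p_better * gain_per_order * remaining_orders - trial_cost

# Invented numbers: a 20% chance that an alternative supplier saves $2 per order,
# a $500 expected cost for the trial, and 5,000 orders left over the horizon.
print(exploration_value(p_better=0.2, gain_per_order=2.0,
                        trial_cost=500.0, remaining_orders=5_000))  # 1500.0 > 0: worth trying
```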

Full Transcript

Kieran Chandler: Today, we're going to discuss why this technique is profoundly weak and understand some of the alternative techniques we can use to test our supply chains more effectively. So, Joannes, perhaps you should just start off, as always, by telling us a bit more about what A/B testing is.

Joannes Vermorel: A/B testing is a method to test whether a hypothesis is true or not, typically by comparing two groups, though it can be more than two. It's a bit fuzzy when it was invented. I would guess somewhere in the late 19th century, but the records are fuzzy, probably because it's something so intuitive that people thought about it much earlier; it was just not very clearly documented and not necessarily called A/B testing. The interesting thing is that it's part of the scientific method, within the field of design of experiments, a scientific way to acquire nuggets of truth about any statement that you can make. It's not going to prove that any statement is true, but it can give you a scientific answer to the question of whether your hypothesis is true or not.

Kieran Chandler: So, what sort of types of experiments are we really talking about here?

Joannes Vermorel: A/B testing is super popular in marketing. In supply chain, it’s much less popular. In marketing, it’s heavily used for things like promotional newsletters. For example, if you’re advertising one product first and another product second, you can split your customer database into two random groups, send version A of your newsletter to the first group and version B to the second group, and then measure the outcome. That’s a fairly efficient way to do A/B testing.

Kieran Chandler: So, the idea is that you’re sending out two things and seeing which one performs better?

Joannes Vermorel: Exactly. You’re testing a hypothesis. Google, for example, very famously did a series of A/B tests in the early 2000s just to determine how many search results were optimal in terms of display. They found a balance through A/B testing, which was around 10 results at the time.

Kieran Chandler: Why is this something that’s of interest to us here at Lokad? Is it something our customers are really asking for?

Joannes Vermorel: In supply chain, we are frequently asked, either explicitly or sometimes implicitly, about doing A/B tests. In supply chain, A/B testing takes another form. For example, people would say, “Let’s have Lokad manage 10 stores with their inventory optimization system, while 10 other comparable stores are managed through the old system. We’ll run that for three months and compare the results.” They might call it a benchmark, but it’s actually an A/B test.

Kieran Chandler: There’s sort of an A/B test going on, and it sounds fairly rational. It seems like you need a way of comparing these two different approaches. So, how does it actually work in the real world?

Joannes Vermorel: The crux of the problem is that it looks obvious and reasonable. You might say it seems like a reasonable way to compare those two methods: I just change one variable, like the software driving the stock, and make sure my experiment is representative, so I take multiple stores and a longer period, like three months, to ensure statistical significance. All of that looks fairly reasonable and rational. But there is a "but": it's more complicated than it seems. The problems I have with those benchmarks are, in my book, examples of naive rationalism. It looks very scientific, but it's not actually super scientific or rational; it just appears to be.

The problem in supply chain management is that you tend to displace problems rather than solve them. For example, you have those 20 stores in the test. It looks super rational. The problem is that all those stores compete for the same stock at the distribution center. If Lokad wanted to cheat, the software could boost its own results by consuming a lot of stock, improving the performance of its scope at the expense of the other stores. And if you have a benchmark that says the goal is to maximize the performance of those ten stores, the mathematical optimization will do it at the expense of the other stores. So, there is a feedback loop between the stores because they are competing for the same stock at the distribution center. This always happens in supply chains; it's a system, and it's interconnected by design.

Supply chains allow for massive gains in terms of efficiency, reliability, cost, and economies of scale. But the downside is that, because it’s one system, if you touch one part, you tend to influence the other parts.

Kieran Chandler: What would be a better approach then? Should you try one technique for six months at twenty locations and then another technique for six months?

Joannes Vermorel: Another problem I have with this sort of benchmark is that you learn very little about your system. What is typically underappreciated about A/B testing is that you're testing only one hypothesis at a time. In terms of information, we're talking about a bit of information, just a zero or one. It's not even a byte, but a bit. And it's not even a full bit, because you're only going to have a degree of confidence in your results. So, what you learn is like a fraction of a bit, which sounds very little, and actually, it is very little. The main criticism of A/B testing is that you learn very little about your system.

Joannes Vermorel: Testing is good if you want to have absolute certainty on something where you're feeling very strongly about it. For example, you can do an A/B test to have the final confirmation that you were right, but the problem is that you're assuming that you already know the truth. That's why it works very well for science. In scientific methods, people gather clues in very indirect ways, and once they have gathered a mountain of clues, they perform an A/B test to confirm their hypothesis in a more direct way. But it's going to be very expensive and slow, and that will be the final confirmation, putting the nail in the coffin and closing the case forever.

The problem with supply chains is that things are changing all the time. Your network is an ever-changing beast. If you want to do an A/B test for supply chain optimization, you might need 12 months instead of three due to seasonality. But then, who can afford 12 months just to get one bit of information about which of the two systems is the best? There are so many other alternatives in the market, and only so many trials you can conduct. System A might be better for slow movers, while system B could be better for high movers. Having just one bit of information is very weak, and it won't give you any insight into the best option.

The problem with A/B testing is that you're only testing two possible paths, and in a supply chain, there are millions of possible paths.

Kieran Chandler: So, in a supply chain, we’ve got millions of possible paths. How can we possibly generate information on all those possibilities?

Joannes Vermorel: That’s a very interesting question, and a more modern perspective on the case would be reinforcement learning. When you want to think about how a learning engine works, you can extract information from the data passively, like data comes and you want to learn, or actively, where what you do has an influence on what you observe, which is the case in supply chain management. For example, if you decide not to put a product on sale in a store, you’ll never observe the demand for this product in this store.

A/B testing is a way to acquire knowledge, but it's incredibly sluggish. If a baby had to learn to walk through A/B tests, it would take a million years to learn walking. It's very powerful for scientific certainty, but it cannot be the process that drives a journey to the truth.

In supply chain management, a more modern perspective is reinforcement learning, where you think about a trade-off between exploration and exploitation. You have a guess of what the good thing is, but you’re not completely convinced it’s always the best, so you want to do what is called exploration. You randomize your actions a bit to learn more about the system.

You have your optimization process that is trying to optimize according to specified metrics, some algorithm that drives you to what you think is the optimal according to your own measurements. But the problem is that if you do that, you can be stuck in a way of doing things, which is, mathematically, what people call a local minimum. You try to minimize your cost function and get stuck in an area, a local minimum, where it looks good: if you deviate slightly from this point, it seems that you are at the optimum, but actually, if you want to have something that is way better, you need to diverge.

Kieran Chandler: So basically, we're talking about introducing a certain percentage of your decisions which might not be correct and might not actually go with your optimization, introducing this certain percentage of potential error just in order to find out more about what could possibly work?

Joannes Vermorel: Exactly, and obviously, this is about experimenting. You don't want to do crazy things, but for example, if you have a large retail network, the idea would be to change your assortment. You can decide that, all the time, in every single store, you're going to introduce a few products that are not usually part of the assortment, pretty much at random. Obviously, you will not try to do that with super expensive items, like an expensive gardening machine in a store that is in the middle of a city. You don't do things that are completely absurd, but you introduce some randomness to see if some products might, completely unexpectedly, get a lot of traction just because you've tried them in a city center where you were thinking that this product was not a good fit for the area. It may turn out that it is. So you want to introduce some kind of randomization.

It can be done in supply chain, for example, by sometimes trying other suppliers to test the water in terms of lead times. You have your routine supplier, and you just pass a few orders to competitors to see how it goes. I've even seen companies in the automotive aftermarket, for example, have that in place automatically, where a certain fraction of orders are not passed to the suppliers that offer the best price and conditions, but are passed to others just to test the water and see if the supplier is super reliable, and if the products meet expectations in terms of the ordering process, meaning that when you order a certain part, it's really that part that you get and not another one.

Kieran Chandler: It seems very surprising, because companies, on the whole, are normally so based around profitability and acting as efficiently as possible, maximizing that bottom line. Yet they're actually introducing these different suppliers just in order to test things. Is that a difficult thing to incorporate?

Joannes Vermorel: That’s, again, I would say the naive rational approach would say, “Oh, we just directly optimize.” But that’s neither rationalism nor the best approach. If you start thinking about the second-order effects, the idea is that you want to always be learning about your market. You want to test alternative suppliers, alternative markets for your clients, alternative price points because the idea is that knowledge has a price, and it’s valuable. You can have big rewards.

You might be stuck, for example, you might realize that you’re selling your product at a certain price point, but actually, you could raise your price, and it would still more or less sell the same. It’s just that you’ve never tried; you didn’t think that people were perceiving your product as valuable as they are.

The reality is that usually, you know, you're stuck in what you've been doing so far. Or maybe sometimes the opposite is true: you're actually selling your product at a price that is too high, and if you were to lower the price, you would vastly increase the demand, and then economies of scale would kick in. Then you could actually produce at a cheaper price and have things snowball in terms of achieving a lot of growth for the company. So the idea is that this randomization that can be introduced is actually an investment you make in the idea that you're going to discover slight variations that are better suited for your company. It can be variations in your price points, in your suppliers, or even in your supply chain organization, such as which warehouse is supplying which plants, or vice versa.

Kieran Chandler: Is there any way of quantifying this knowledge and working out how much it is actually worth to a company?

Joannes Vermorel: Actually, yes. I even published a paper over a decade ago called "POKER: Price of Knowledge and Estimated Reward." So if you really want to do it the fancy way, you can literally quantify the cost of exploration versus its reward, what you gain over a certain horizon. Because obviously, you have to keep playing; it's the idea of an iterated game where you play the same game over and over. And when you explore, you do things that are typically less optimal, but sometimes you hit a sweet spot, and then afterward, you can exploit this finding. But the idea is that in order to do that, you need to have an algorithm, especially on the machine learning side, that can really take advantage of this noise in your data and leverage it to learn not just a bit of information but a lot more. Again, this is not just like an A/B test where you're basically establishing a percentage or something. It's something that is able to capture much fuzzier patterns, where you have tons of effects that are interconnected, and that can drive better performance in a very high-dimensional situation.

Kieran Chandler: How does this approach fit in with what we do here at Lokad? Because what we're doing here at Lokad is optimizing those business decisions that can be made at any one time, and introducing this kind of noise means doing things that are intentionally a bit wrong.

Joannes Vermorel: Yeah, and it only goes against that belief until you really factor in the second-order effects. At Lokad, we try to apply not naive rationalism, but to be genuinely rational, taking into account those other effects, which are wicked. During the first decade of Lokad, the vast majority of our clients were not really optimizing anything. They were optimizing percentages of error, which, in my book, is not even an optimization. If you optimize percentages of error, you don't even know what you're doing for your company. You need to optimize dollars. The first step is to move toward an optimization process where you actually try to optimize, as opposed to just pursuing targets that are completely arbitrary. Now, what we see with our most advanced clients, especially on the e-commerce side, is that now that this optimization process is in place, the idea of exploration starts to emerge. It typically starts with things like pricing, which, from my perspective, is very much within the supply chain scope, because that's where the demand comes from. You need to have a good price, and the price explains the demand for a large part. But the price is certainly not the only area where you want to do exploration. What I'm seeing for the next couple of years is that, for companies to remain state-of-the-art, they need to have the ambition to be state-of-the-art as far as their supply chain goes. They will increasingly introduce a bit of exploration and randomization, just to generate results that drive the optimization process itself and make it better over time.

Kieran Chandler: So, to conclude, you can see that in the future there'll be a time when we place a much higher importance on this kind of exploration and on quantifying how much it gives you as a company in terms of knowledge.

Joannes Vermorel: Exactly, perfect.

Kieran Chandler: All right, we’re going to have to wrap it up there for today. Thanks for your time.

Joannes Vermorel: That’s everything for today. Thanks very much for tuning in, and we’ll see you again next time.

Kieran Chandler: Thanks for watching.