Data preparation in supply chain


When Lokad tackles any quantitative supply chain initiative, we invest about 80% of our efforts in data preparation, sometimes also referred to as data cleansing, data cleaning or data preprocessing. The fact that such efforts are required is usually poorly understood, even by experienced supply chain professionals. Yet, our own observations indicate that data problems are probably the number one cause of supply chain optimization failures across nearly all industry verticals, from fresh food to aerospace. While some supply chain failures are spectacular enough to make the headlines, most are quietly swept under the carpet. Gaining insight into those failures is of primary importance, precisely to avoid repeating them.

Data-driven failures

Supply chain optimization projects often fail. Based on anecdotes gathered from Lokad’s client base, failure rates of software-driven supply chain optimization initiatives are probably hovering above 80% across most verticals. Few vendors would even acknowledge that failure is the "normal" outcome for their clients. One paradox of modern supply chain is that there is less risk involved in physically moving warehouse stocks from one continent to another than in moving warehouse data from one computer to another located one meter away.

When things go wrong, neither the client nor the vendor has much incentive to disclose anything. Hence, the vast majority of failures are discreetly forgotten, never to be discussed again. From time to time, however, the failures are so spectacular that they make the news.


While such failures do not have one single explanation, each time something goes very wrong with the data, and the resulting supply chain decisions end up being spectacularly bad. Most failures, however, are less spectacular: supply chain practitioners maintain a healthy distrust of the numbers produced by the new, supposedly better system, stick to their good old Excel sheets, and after a couple of months the system gets shut down. No serious harm has been done to the business, but time and energy have been wasted.

Making sense of data is tough

Historical business data is profoundly ambiguous and subtle. These aspects are quite counter-intuitive, and consequently tend to be misunderstood. Our experience indicates that this very misunderstanding is probably one of the key root causes behind the data woes so frequently encountered in supply chain initiatives.

Let’s consider a simple example of sales history: a table listing the quantities sold per product per day for the last few years. This sales history should be able to give a sense of where the business is heading.

Yet, this is not the case because:
  • The history might contain bundles, with a product being a bundle of other products, so quantities might be going down while business is going up.
  • The history might contain returns, possibly representing a large fraction of the sales, so business might look like it is going up while it is actually going down, because returns are growing faster (see the sketch after this list).
  • The history might contain promotions, where the margin has been sacrificed in order to liquidate inventory, and the generated demand is non-representative of the demand for full-price goods.
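To make the first two points concrete, here is a minimal pandas sketch; the column names, the negative-quantity convention for returns and the bundle mapping are all hypothetical, not an actual client schema:

```python
import pandas as pd

# Hypothetical raw sales history; the schema below is an assumption for illustration.
sales = pd.DataFrame({
    "date":     ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"],
    "product":  ["A", "BUNDLE_AB", "A", "A"],
    "quantity": [5, 3, 4, -2],  # in this sketch, a negative quantity denotes a return
})

# Naive reading: sum the quantities exactly as they appear in the table.
naive = sales.groupby("date")["quantity"].sum()

# Cleaned reading: expand bundles into their components and exclude returns.
bundle_contents = {"BUNDLE_AB": {"A": 1, "B": 1}}  # hypothetical bundle definition

def expand(row):
    contents = bundle_contents.get(row["product"], {row["product"]: 1})
    return pd.DataFrame({
        "date": row["date"],
        "product": list(contents.keys()),
        "quantity": [row["quantity"] * n for n in contents.values()],
    })

expanded = pd.concat([expand(r) for _, r in sales.iterrows()], ignore_index=True)
cleaned = expanded[expanded["quantity"] > 0].groupby("date")["quantity"].sum()

print(naive)    # what a naive aggregation reports
print(cleaned)  # gross demand after bundle expansion, with returns excluded
```

Even in this toy example, the naive series and the cleaned series tell different stories about where the business is heading.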

Then, when we refer to a quantity per day relating to a specific date, the date alone comes with its own set of ambiguities. It might be the day when:
  • The client has made an order
  • The client has confirmed a pre-order
  • The product has been shipped to the client
  • The order entry finally reached the ERP
  • The order entry was last modified within the ERP
  • The payment from the client has been received
  • Etc.

Each variant is correct, but each variant also comes with its own business twists. Yet, too frequently, it’s tempting to think that, well, it’s just a plain old sales history. When historical sales are involved, (almost) nothing is ever plain.
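A minimal sketch, assuming a hypothetical order table that carries several candidate dates, illustrates how much hinges on this choice: the same order lines yield three different daily series depending on which date column is treated as the demand signal.

```python
import pandas as pd

# Hypothetical order lines carrying several candidate dates; the schema is an assumption.
orders = pd.DataFrame({
    "order_date":   pd.to_datetime(["2023-03-01", "2023-03-01", "2023-03-02"]),
    "ship_date":    pd.to_datetime(["2023-03-03", "2023-03-05", "2023-03-04"]),
    "invoice_date": pd.to_datetime(["2023-03-10", "2023-03-12", "2023-03-11"]),
    "quantity":     [10, 4, 7],
})

# The "daily sales" series depends entirely on which date is chosen as the demand signal.
by_order   = orders.groupby("order_date")["quantity"].sum()
by_ship    = orders.groupby("ship_date")["quantity"].sum()
by_invoice = orders.groupby("invoice_date")["quantity"].sum()

print(by_order, by_ship, by_invoice, sep="\n\n")  # three series, one raw table
```

Whichever column is picked, the choice needs to be documented, because it silently shapes every forecast built on top of the resulting series.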

In addition, such twists tend to become exponentially more damaging when they are combined. Few things generate overstocks as reliably as an upward bias inflating sales combined with a downward bias understating incoming stock.

As a rule of thumb, at Lokad, we routinely observe that properly documenting data involves writing nearly one page of description per column, for every column, in every table. Yet, we consider ourselves lucky when just one line of documentation per column (per data field) is available to us at the very beginning of a project.

Even when documentation does exist, it frequently misses the point entirely. There is no doubt that a column named OrderDate contains, well, dates. There is no doubt either that a column named StatusCode contains a short list of codes. Most technical documentation focuses on the IT aspects, setting aside business concerns entirely. Yet, it is precisely these concerns that are of primary interest from an optimization perspective.

In computer science, there is an old saying: Garbage In, Garbage Out. If the input data does not make sense, the results won't either. Therefore, breaking down the data and documenting all our findings is a key objective whenever Lokad takes over a supply chain initiative. Most of our clients are surprised by the amount of documentation that Lokad produces while reviewing their systems, data and processes. Yet, in our experience, we have never once regretted documenting too much.

No two set-ups are exactly alike

A frequent question during the course of a quantitative supply chain initiative is: Is this optimization compatible with system XYZ? The short answer is nearly always yes; the longer answer is yes, but with some effort. Moving data around is fairly easy: FTP (file transfer protocol) has been around for decades, and even a modest internet connection can move quite a lot of data. The true challenge is to make complete sense of the data being crunched. Companies do not always realize how flexible their IT systems really are: while these systems might appear quite rigid to practitioners, in practice, they very rarely are.

Indeed, software vendors have been competing on flexibility features for decades, and systems that truly lock workflows into one single way of doing things are rare. Most systems offer a myriad of different ways to achieve the same outcome. As a result, even when data originates from one and the same software, it is rare for the data preparation to remain the same. In fact, data preparation is highly dependent on the practices and workflows in place in a given company, and the same data records might not be interpreted in exactly the same manner by two different firms. At first, such discrepancies might seem minor, but from a supply chain optimization perspective, they frequently cause significant misalignments between the optimization logic and the actual business needs.

Strategy is data

The good news about computers is that they do what you tell them to do.
The bad news is that they do what you tell them to do.
Ted Nelson

Most business executives implicitly trust their team to execute the business strategy even if the strategy is not formalized or not even written down. Yet, quantitative optimization leaves no room for implicit goals: numbers produced out of input data exactly reflect the numerical framework that has been put in place. As a result, initial results produced by quantitative optimization solutions tend to be baffling: spot on in many respects, but also deeply missing the point in many others.

Since the exact business goals are rarely specified from the very beginning, it frequently takes the first results produced by the quantitative optimization logic to realize what’s missing. It might be that:
  • some clients are VIP and require much higher service levels
  • some products are on the edge of obsolescence and entail a lot more inventory risk
  • some suppliers have MOQs (minimum order quantities) both per SKU and per order (see the sketch after this list)
  • some warehouses are nearly full and reorders need to be thin
  • etc.
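To make the MOQ point concrete, here is a minimal sketch; the suggested quantities and the MOQ thresholds are invented for illustration, and the "postpone or pad" policy is only one of several reasonable business answers:

```python
# Minimal sketch of applying supplier MOQs, both per SKU and per order.
suggested = {"SKU-1": 3, "SKU-2": 18, "SKU-3": 1}    # quantities suggested by the optimization
moq_per_sku = {"SKU-1": 5, "SKU-2": 10, "SKU-3": 5}  # hypothetical per-SKU minimums
moq_per_order = 40                                   # hypothetical per-order minimum

# Round each SKU up to its own MOQ (dropping the line instead is also a valid business choice).
order = {sku: max(qty, moq_per_sku[sku]) for sku, qty in suggested.items() if qty > 0}

# If the order total still falls short of the per-order MOQ, the order must either be
# postponed or padded with extra units; neither option is free of cost.
total = sum(order.values())
if total < moq_per_order:
    print(f"Total of {total} units is below the per-order MOQ of {moq_per_order}: postpone or pad.")
else:
    print("Order satisfies both MOQ constraints:", order)
```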

Addressing such challenges requires additional data, and companies usually start realizing that the relevant data cannot be found anywhere, except maybe among the many Excel sheets dispersed throughout the company. These Excel sheets are too often treated as unimportant, and many of them end up archived in email inboxes.

As a result, it is not only the transactional data, such as the data extracted from an ERP, that needs to be prepared; the high-level data that shapes the business strategy needs to be prepared as well. To add to the challenge, high-level data tends to be more difficult to prepare: unlike transactional data, little effort has usually been made to keep such data consistent company-wide. Consequently, as this type of data gets prepared, it is not infrequent to realize that the company’s high-level business strategy does not quite make sense for some edge cases.

In order to align the optimization logic with the business, the most straightforward approach usually consists of re-expressing everything in Euros or Dollars (or any other currency for that matter). This is not because of some intrinsic superiority of the financial viewpoint, but merely because currencies are the only units of count that can be made consistent across the entire company with little or no coordination between teams. In practice, this financial viewpoint tends to create friction when a company is already used to working with non-financial metrics, or worse, with no metrics at all.
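As a hedged illustration of what this looks like in practice (the margin, carrying cost and demand probabilities below are invented, and the model is deliberately simplistic), every candidate stock level can be scored by its expected outcome in dollars, which makes decisions directly comparable across products:

```python
# Minimal sketch of re-expressing a stocking decision in dollars.
unit_margin = 12.0        # hypothetical gross margin per unit sold, in dollars
unit_carrying_cost = 3.0  # hypothetical cost of holding one unsold unit, in dollars

# Hypothetical probability that demand reaches at least k units over the period.
prob_demand_at_least = {1: 0.95, 2: 0.80, 3: 0.55, 4: 0.30, 5: 0.10}

def expected_dollars(q):
    # Each extra unit either earns the margin (if demand reaches it) or incurs the carrying cost.
    return sum(
        prob_demand_at_least[k] * unit_margin
        - (1 - prob_demand_at_least[k]) * unit_carrying_cost
        for k in range(1, q + 1)
    )

for q in range(1, 6):
    print(q, round(expected_dollars(q), 2))  # the stock level maximizing the dollar outcome wins
```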

There is never enough documentation

The secret ingredient for all good data preparation is good documentation. This ingredient is simple, but nearly always either overlooked or dismissed as a non-urgent, unimportant task. Yet, our experience at Lokad indicates that good documentation of data is both urgent and important.

In particular, documentation needs to answer the why question for every single data source, and even for every data field. Documenting the intent of the data is critical to making sure that this data is correctly processed afterwards. And yet, too frequently, when documentation does actually exist, it merely tends to paraphrase the field names.

Poor documentation example
ORDERS.DATE: the date associated with the order line.

Better documentation example
ORDERS.DATE: the date when the client first expressed their intention to purchase the good; represents the core demand signal. This date can be compared to ORDERS.DELIVERYDATE in order to assess the time elapsed between the initial order and the delivery, from the client’s perspective.

Documenting input data should not be seen as an IT deliverable, but rather as a business asset. No matter how motivated and supportive the IT department might be, the IT team usually does not have access to the business insights required to gain a deep understanding of the data.

As a result of all this, when Lokad tackles any new client initiative, we tend to invest time in writing documentation early on; typically a few bits at a time after each conversation with our client (who provides us with the necessary insights into their business). The value delivered by this process isn’t typically obvious at the very beginning, as other things seem more pressing at the start. Yet, as the weeks go by, some of the subtle details relating to the data get forgotten, and the written documentation becomes the only way to avoid bumping into the same issues over and over again.