00:00:03 Data preparation in data science overview.
00:00:46 Underestimating data preparation complexities.
00:02:01 Typical data preparation project duration.
00:03:19 Challenges in data preparation speed and accuracy.
00:06:07 Importance of data preparation documentation.
00:08:00 Interpreting ‘order date’ in supply chains.
00:09:02 Complications in data interpretation from system upgrades.
00:10:07 Understanding data semantics to avoid errors.
00:10:15 Case study: Supply chain system peculiarities.
00:14:53 Need for data documentation in business operations.
00:16:01 Importance of data tracking in supply chains.
00:17:24 Broadening data scope in automated decision making.
00:18:42 Risks of relying on individuals for data recall.
00:19:02 Challenges and expectations in data preparation.
00:20:13 Data preparation as a company-wide effort.
00:21:56 Judging data interpretation correctness via real-world effectiveness.
00:23:02 Incorrect data interpretation consequences and traceability importance.
00:24:37 Difficulties and results of poor data preparation.
00:24:49 The concept of ‘good’ data preparation.
Summary
In this Lokad TV episode, host Kieran Chandler and Lokad founder Joannes Vermorel discuss the intricacies of data preparation in data science, a process that is often underestimated but has recently gained prominence because of GDPR compliance. Vermorel emphasizes that data preparation, which frequently consumes several months and extensive resources, is vital to avoid the “garbage in, garbage out” problem. It requires transforming inconsistent or incomplete data into an understandable format, which calls for thorough documentation. The process is complex, shaped by the multifaceted nature of business problems and the historical context of the data. Vermorel advocates a distributed approach involving various teams across the organization, and maintains that good data preparation should be accessible and support clear decision-making.
Extended Summary
In this episode of Lokad TV, host Kieran Chandler and Joannes Vermorel, the founder of Lokad, discuss the complex yet pivotal role of data preparation in data science. With recent GDPR compliance laws, data has become a key focus for many executives, and a recent survey estimates that companies spend over $450 billion on data preparation alone. The goal of data preparation is to convert raw data, which is frequently inconsistent or incomplete, into a format that is easy to interpret and apply.
Vermorel addresses the frequent underestimation of the complexity of data preparation. Although businesses invest substantial resources in it, most projects fall behind schedule or overshoot their budgets. According to Vermorel, many bugs in IT systems trace back to problems at the data preparation stage. He explains that business complexities reemerge as data preparation steps, which makes the task open-ended.
Regarding timelines, Vermorel says that large-scale data preparation projects take at least a few months, and typically about six. Although one might assume that better tools or more scalable software should speed up the process, he argues that the ecosystem’s lack of maturity slows progress. To genuinely avoid the “garbage in, garbage out” problem, data first needs to be documented and clarified. This process, he argues, accounts for the longer timeline.
When asked about the feasibility of accelerating this process, Vermorel explains that it’s not as simple as just adding more resources. The data being dealt with was not originally produced for data preparation purposes but is instead a byproduct of enterprise systems. For example, he describes how a point of sale system’s primary function is to process customer payments, not collect data. However, even these systems produce data that may be inconsistent or flawed due to practical operational reasons, such as barcode errors. These inconsistencies require extensive preparatory work for effective use in supply chain optimization.
Vermorel also speaks about the importance of documentation in data preparation. At Lokad, a data preparation project typically starts with less than one line of documentation per field per table and ends with one page of documentation per field per table. This comprehensive documentation is vital to prevent poor-quality input data from leading to poor-quality output, or “garbage in, garbage out”. The six-month timeline for data preparation therefore includes the work of creating this extensive documentation.
Vermorel starts by addressing the complexities and potential misinterpretations that can arise from one simple piece of data: the date on a historical order. He explains that ‘date’ isn’t as straightforward as it may appear, as there are multiple potential interpretations, like when the client clicked on a product, when they validated their basket, when the payment was processed, or when the goods were made available in the warehouse.
He points out that the interpretation of such data may change over time due to system upgrades or changes in business practices. Thus, it’s crucial to understand not only the data itself but also the historical context in which it was produced. If these complexities aren’t acknowledged, businesses can face a “garbage in, garbage out” problem where incorrect interpretations lead to poor decision-making.
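One way to picture a field whose meaning changes over time is a small documentation record that attaches a validity period to each interpretation. The sketch below is only an illustration under stated assumptions: the field name, the cut-over date, and the two meanings are hypothetical and do not come from the episode.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SemanticsPeriod:
    """One documented meaning of a field, valid over a date range."""
    valid_from: date            # first day this meaning applies (inclusive)
    valid_to: Optional[date]    # last day it applies (inclusive); None means still current
    meaning: str


# Hypothetical documentation entry for an 'order date' field.
# The dates and meanings are invented for illustration only.
ORDER_DATE_SEMANTICS = [
    SemanticsPeriod(date(2010, 1, 1), date(2016, 6, 30),
                    "date the client validated the basket"),
    SemanticsPeriod(date(2016, 7, 1), None,
                    "date the payment was validated by the card processor"),
]


def meaning_on(day: date, periods: list[SemanticsPeriod]) -> str:
    """Return the documented meaning of the field for a record dated `day`."""
    for p in periods:
        if p.valid_from <= day and (p.valid_to is None or day <= p.valid_to):
            return p.meaning
    # An undocumented period is treated as an error rather than guessed at.
    raise LookupError(f"no documented semantics for {day}")
```

Calling `meaning_on(date(2012, 3, 1), ORDER_DATE_SEMANTICS)` would return the pre-upgrade meaning, while a record dated outside every documented period raises an error instead of silently inheriting an interpretation.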
Vermorel highlights a case study with one of Lokad’s clients to illustrate his point. This client operates a demanding industrial setup with short lead times, where receiving the precise quantity of goods ordered is essential. The client’s system has a feature where if the delivered quantity doesn’t exactly match the order, the entire order is rejected and sent back. This leads to a predicament: if they receive slightly more than ordered, they have to modify the original purchase order in the system to match the delivered quantity. This workaround enables them to accept the delivery and avoid disruptions in their industrial operations.
However, this workaround was exploited by savvy suppliers, who began delivering slightly more than ordered, knowing it would be accepted. This results in purchase orders that seem inflated compared to actual needs, creating data artifacts that misrepresent the purchasing team’s performance. Vermorel underlines that this complexity needs to be documented to avoid incorrect interpretations. He insists that the issue arose not from poor performance by the purchasing team, but from limitations in the system and the way users coped with them.
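Once such a workaround is documented, it becomes possible to check for it in the data. The sketch below is a minimal illustration that assumes something the episode does not state: that an audit log preserves the quantity first entered on each purchase order line. All names here (`PurchaseOrderLine`, `qty_first_entered`, `qty_on_record`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchaseOrderLine:
    po_id: str
    supplier: str
    qty_first_entered: int   # quantity at PO creation (assumed to come from an audit log)
    qty_on_record: int       # quantity currently stored, possibly edited at receipt


def raised_share_by_supplier(lines):
    """Share of each supplier's PO lines whose quantity was raised after creation."""
    totals, raised = {}, {}
    for line in lines:
        totals[line.supplier] = totals.get(line.supplier, 0) + 1
        if line.qty_on_record > line.qty_first_entered:
            raised[line.supplier] = raised.get(line.supplier, 0) + 1
    return {supplier: raised.get(supplier, 0) / n for supplier, n in totals.items()}
```

A high share for a given supplier would suggest that the recorded quantities reflect deliveries rather than what the purchasing team actually requested, which is exactly the kind of caveat the documentation needs to capture.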
Shifting focus, Vermorel discusses who cares about historical data besides Lokad, which uses it for probabilistic forecasts. He points out that companies closely monitor the money they expect to receive or pay, and that those that fail to do so disappear over time. This is, in his words, a form of business “Darwinism”. He suggests that companies that pay attention to their financial transactions over time naturally care about their historical data.
The conversation then turns to who should be responsible for data preparation. Vermorel emphasizes that data isn’t inherently “clean” or fully understood. He argues that data preparation isn’t solely an IT issue; it’s about understanding all aspects of business data to address all business angles. The IT department, he notes, cannot be expected to master every business angle and should not bear sole responsibility for data preparation.
Vermorel suggests a distributed approach involving teams with distinct expertise across the organization. Preparing data relevant to purchasing, for example, should involve the purchasing teams. Similarly, preparing data for a supplier scorecard should involve the sourcing teams. This approach brings in the insights needed for effective data preparation.
Addressing how to be sure about data interpretation, especially when information is incomplete, Vermorel relates it to scientific theories; it’s not possible to know a theory is right, but it can be validated when it withstands scrutiny. The correctness of data preparation is established when the decisions derived from the interpretation are correct. If incorrect data preparation leads to nonsensical decisions, the cause can be traced back, corrected, and reevaluated.
Vermorel then describes what good data preparation should look like, especially in complex supply chain scenarios. He likens it to a well-written book, providing relevant business insights and perspectives, not just technical details. It should be accessible and distributed throughout the organization, fostering shared understanding. It requires an ongoing effort to document, maintain, and understand the data.
Lastly, Vermorel stresses that the data transformation logic should be a direct translation of a valid understanding of the data itself. Once this understanding is established and maintained, writing the logic is fairly straightforward. Good data preparation, then, is both a well-written guidebook and a shared understanding that allows for clear and effective decisions in the supply chain.
Full Transcript
Kieran Chandler: In today’s episode, we’re going to be discussing data preparation, one of the fundamentals of data science. Given recent GDPR compliance laws, data is very much at the forefront of many executives’ minds. It’s big business too, with a recent survey estimating that companies are spending over 450 billion dollars on data preparation alone. Data preparation is all about taking raw data and transforming it into an easy-to-understand format so that it can be put to good use. This is no mean feat when you consider that data is coming in from many different sources and can often be inconsistent, incomplete, and can also contain errors. So Joannes, why are we talking about data preparation today? I mean, if these businesses are investing over 450 billion dollars, it should be something that we understand by now.
Joannes Vermorel: Yes, absolutely. Data preparation is a field that’s quite well-known, yet systematically underestimated in terms of challenges. It’s interesting because most data preparation projects end up missing their deadlines and incurring budget overruns. The core problem is that many actual bugs that you see in IT systems, enterprise software in general, trace back to problems at the data preparation level. It’s extremely complex. Although it’s a well-known problem, the core issue is that business complexities reemerge as data preparation steps, making it an unbounded domain. There’s no final recipe for data preparation.
Kieran Chandler: So how long should it take to prepare a sizable amount of data?
Joannes Vermorel: I’ve never seen a large-scale data preparation project that took less than a few months. Typically, it’s more like six months. People may argue that with better tools and more scalable software it should be faster. However, the reality is that there is so little maturity in the ecosystem that, for pretty much any company except some data champions like Google, the data first needs to be documented and clarified. There are so many things that need to be done with this data to avoid the ‘garbage in, garbage out’ problem. So it takes a couple of months, and six months is a reasonable target if you have a complex supply chain involved.
Kieran Chandler: Six months sounds like quite a long period of time. Is there a way we can speed up this process? If I’m a big organization, can I not just throw more people at the problem?
Joannes Vermorel: You end up with a specific problem here: can nine women make a baby in one month? The issue is understanding the type of problems we’re dealing with. First, the data you have has not been produced to be data in the first place. It’s a byproduct of your enterprise system that just happens to operate your company. Let’s take the example of point of sale software, the system where you pay in a supermarket. Its primary function is to process customers who want to exit the store while paying what they should be paying. So, if a barcode doesn’t scan for whatever reason, the cashier is likely going to scan a similarly priced product twice. In the end, you’re going to pay the right price, but in terms of data, you’re going to have one product counted twice.
And one product counted zero times, which can create problems in terms of inventory management because your electronic records become off. This isn’t a good solution, and it’s advisable to avoid it. But the reality is, if you have the choice between solving a data problem and letting your company operate more smoothly, those on the ground, the people who need to operate supply chains physically, will always favor a solution that doesn’t disrupt the flow of goods, clients, service, and everything else. Operating the company is paramount and data is only a second-order byproduct. It’s never treated as a first-class citizen. That’s why you always have all this work that needs to be done: the data was not collected for the purpose of letting you optimize the supply chain.
Kieran Chandler: Is that where all these challenges come from?
Joannes Vermorel: Indeed, that’s where all those challenges come from.
Kieran Chandler: Let’s talk about that documentation you mentioned earlier. What are you expecting to see in this documentation and how does it help? Just to understand, if we go back to the six-month period, what kind of documentation volume do we expect?
Joannes Vermorel: A rule of thumb would be, typically, when we at Lokad start a project, we have less than one line of documentation per table per field. Usually, we don’t even have that. We start many projects where we don’t even have one line of documentation per table in the ERP, MRP, WMS, or whatever system is used to run your supply chain. When we are done, we end up with one page of documentation per field per table. So, if you have 20 tables with 20 fields, we are talking about 400 pages of documentation. Yes, it takes six months to produce those 400 pages of documentation.
Kieran Chandler: That sounds like a huge amount of documentation. Is it really all needed?
Joannes Vermorel: It’s all needed if you want to avoid garbage in, garbage out.
Kieran Chandler: Why is that?
Joannes Vermorel: Consider a practical case: let’s say I have a table named ‘orders’. It contains my historical orders and has a date. But is it that simple? What kind of date are we really talking about? Is it the date when the client clicked on a product to put it in the basket on the ecommerce site? Or when the client validated the basket and made payment? Or when the payment was validated by the credit card processor? Or when the entry was registered into the system? Or when the purchase order was last modified in the system? There are about 20 different interpretations for just this ‘date’ field.
Plus, if your company has more than a decade of history, chances are that the fine line of the interpretation of the ‘order date’ has changed over the years. You might end up with a situation where you had an upgrade of a system, and the semantics of this column changed.
So the meaning is not necessarily homogeneous across your entire history. Then, you can have further complications, such as edge cases. For example, this date is supposed to be the date when the client validated the basket, except if the payment was ultimately rejected as fraud. In this case, it’s the date and time when the order was rejected as fraud.
Again, it’s not actually a very good design, but companies running complex supply chains have complex systems and a lot of history. So, IT wasn’t necessarily done perfectly from day one, and you have to cope with all these historical complications. They end up being reflected in this documentation. If you fail to acknowledge all those complications, you’ll face issues whenever you want to analyze this data.
Kieran Chandler: So when you try to generate an optimized decision for your supply chain, you can end up with a ‘garbage in, garbage out’ problem if you interpret the data incorrectly. What you’re saying is that it’s all about clarifying the semantics of the data, right?
Joannes Vermorel: Exactly. You need to understand that data is more than just numbers. It represents various factors that might be combined in one cell. It’s not just about understanding the software that produces data, but also how people interact with the software. Your documentation needs to take into account the human angle of what people are doing.
Kieran Chandler: Do you have a good example of how one of your clients has encountered this issue in the past and how it affected their company?
Joannes Vermorel: Yes, I can give an example. We had a client who was running a highly demanding operation that required a high level of availability. They passed purchase orders with very short lead times to their suppliers. Their system had an interesting design feature. If the quantity of goods delivered didn’t match the quantities that were originally requested, then the entire purchase order would have to be rejected and sent back to the supplier.
Let’s say, for example, you order 1,000 units, but the supplier delivers 1,050 units, then you would have to reject it. The problem with that is if you reject it, it could lead to serious operational issues. The system didn’t allow them to accept a delivery whose quantity differed from the order, so what ended up happening was that when the delivered quantities didn’t match the ordered quantities, they would modify the original purchase order to reflect the delivered quantity.
Kieran Chandler: So, what you’re saying is they would change the original purchase order to match what was delivered?
Joannes Vermorel: Exactly. However, this opened up another issue. Suppliers caught onto this practice. They realized that they could deliver more than what was ordered, knowing that the company needed those supplies. They wouldn’t deliver an outrageous amount, but something the company would consider accepting.
In the data, it would appear as if the initial order was for the larger quantity. This resulted in some strange data where purchase orders seemed too large compared to what was needed, which made it look like the purchasing team was bad at choosing the right quantities. However, the issue was not with the purchasing team but with the limitations of the system and how people were coping with those limitations.
All of these details needed to be documented to avoid arriving at the wrong conclusion. The purchasing team was not bad at their job. The problem was that they had a system with limitations that they were just trying to navigate.
The system was generating all those bizarre side effects. It takes a page of explanation to make sense of it, but there is no escape. It’s just the complexity of the business itself that is reflected in this data.
Kieran Chandler: Let’s move away from those sneaky suppliers. That’s definitely an entertaining example. So, you mentioned the human angle. Obviously, at Lokad we’re using historical data to make probabilistic forecasts for the future. Who else is actually bothered about the historical data, other than us?
Joannes Vermorel: Typically, anything that directly touched the amount of money you expected to receive or pay has always garnered a lot of attention. Companies that weren’t closely monitoring the money they should receive tended to disappear over time. That’s like Darwinism in action. If you don’t even pay attention to that, you just disappear. That’s why, like five centuries ago, double entry accounting was invented by some Italian monks. If you don’t pay attention, your monastery just collapses due to bad accounting practices. It’s not exactly a new problem, but there is a lot of data that we did not consider mission-critical in the past that is now becoming mission-critical.
To give you an example: in the past, for accounting purposes, you were tracking what you were purchasing, so that you knew what you should be paying your suppliers, and what you were selling, so that you knew what your customers should be paying you. But what about tracking the historical stock-outs? As long as you have a supply chain practitioner who decides the quantities to be purchased by hand and remembers that there were some bizarre periods with stock-outs, you don’t need those historical records. That knowledge is part of your system, in that person’s head.
The problem is that as soon as you want to transition towards something that is more quantitative, like what we do at Lokad with automated decisions for all those mundane tasks such as deciding how much to order, having accurate records of the historical stock levels becomes much more important. Otherwise, your automation will confuse a lack of sales caused by a stock-out with a lack of demand.
If you want to have higher automation running your company, then you need to pay attention to a broader spectrum of data, not just the raw accounting thing. Your accountant does not care about your days of stock-outs, but your supply chain optimization software will. You need to expand the circles of data that are really part of your scope, that is documented, and where you need quality control and assurance.
Kieran Chandler: So, we’re quite reliant on this single person who remembers what happened in the past. Shouldn’t we be better prepared for data as it’s coming in? Shouldn’t the IT department or someone else be preparing that data and making sure it’s clean from the get-go? It seems an easier way to do things.
Joannes Vermorel: Yes, but the problem is not with IT competence. There is no such thing as clean data. The point is that data is not naturally understood with enough depth and not all business angles are properly covered.
Kieran Chandler: It’s often said that companies are investing billions into AI, but the reality is that it’s the complexity of the business itself that emerges in this task, this challenge of preparing the data. And saying, “oh, the IT department should take care of that”, is akin to expecting IT to run the company and be knowledgeable on every single business angle.
Joannes Vermorel: Absolutely, that suddenly creates an organizational problem because you’re expecting IT to be as much an expert in Human Resources, marketing, purchasing and so on. I mean, you expect complete mastery of all the business angles from the IT department. But that’s asking too much. The IT department already has to deal with all the IT changes, so they shouldn’t be expected to address every single business problem in the company. Alternatively, you could redefine IT as being your entire company, but that defeats the purpose.
Back to the case at hand, data preparation has to be a fairly distributed effort within the company because, ultimately, the only people who can provide the level of insight it takes to prepare the data relevant for, let’s say, purchasing, are the purchasing teams. Similarly, if you want to establish a supplier scorecard and have the data prepared with enough accuracy so that it really makes sense, you’ll have to talk to your teams responsible for sourcing.
Every time you tackle a problem, you need to have the people who are specialists in that problem within your company involved because they are the people who will give you the insight necessary for the preparation of the data to make sense. It’s not strictly an IT problem. It’s about collecting all the required understanding so that when you process data, you don’t end up with data that is nonsensical with respect to the business problem that needs to be solved.
Kieran Chandler: So, we’re not quite there yet from the sounds of things. Some companies are there, but they’re the exception, not the rule. How can you be sure, if you haven’t got all of the information and there are gaps you’re interpreting, that your interpretation is the correct one? There could be many possibilities.
Joannes Vermorel: Absolutely, and that’s an interesting point because it’s similar to scientific theories. You never know that your theory is right, you just know that it’s good enough and when it’s challenged in the wild, it works. You don’t have anything better to make it work better.
So, what does that mean for data preparation? It means that you know that your data preparation is correct when, at the end of the data pipeline, the decisions that you generate automatically based on this interpretation are correct. If you have correct decisions, it means that your optimization logic is efficient, your machine learning layers are accurate, and plenty of other things. Fundamentally, it’s just not possible to end up with correct decisions generated by an incorrect interpretation. Usually, when you don’t correctly interpret and prepare your data, it corrupts your results so profoundly that there’s no chance for it to work.
Bottom line is, there’s no workaround but to do the preparation, be confident about it, then generate the decisions. If the decisions are nonsensical, you walk back, trace back the problem to its root cause – frequently it’s data preparation – and fix it. If, at the end of the day, the decisions that come out of your system make sense to a practitioner who’s using their own human mind to assess them, then you know you’ve got it right.
You might say, I believe that the data is properly prepared, but usually, it’s all about shades of gray. The supply chain practitioner might say, ‘It’s a good decision, but it could be further improved, for example, if we were taking into account the prices of our competitors, because that explains unusual spikes or drops in demand, and we don’t have that data yet.’ So, it’s not a black and white situation.
Kieran Chandler: We’ve spoken quite a lot about the difficulties with data preparation and what poor data preparation looks like. But to sum things up, what does good data preparation look like?
Joannes Vermorel: In a moderately complex supply chain situation, good data preparation would look like a well-structured book. Let’s think about it like a 400-page book, with one page per field for twenty fields in each of 20 tables. But it’s not enough for it to be just a book, it has to be a well-written book.
If you write something incredibly boring, no one is going to read it, and it will not have any effect on your organization. So, it has to be well-written. And by well-written, I mean it should be readable. It also needs to be written from a business perspective. It’s not an IT documentation. Data preparation is really not an IT problem, it’s more about having all the business insights.
The valid perspective in business is always something that changes. If the competitive landscape in your industry changes, then the valid perspective on a given problem changes as well. So this book needs to be well-written and maintained.
This is a very distributed effort in your company because, for example, only the merchandising team has the proper insight and perspective to know how the merchandising tables should be documented in the first place. This data preparation looks like clean, well-written materials that are widely distributed and accessible in your company.
The interesting thing is that once you have all those insights, data transformation, which is logic, becomes a straightforward interpretation of the valid understanding of the data itself. You need to put in a huge amount of effort to understand the data, document it, write it, and maintain it. But once you have done all of that, writing the logic becomes straightforward. So what does good data preparation look like? It’s like a well-written book, a shared understanding, a sort of supply chain bible that is internal to your company.
Kieran Chandler: Sounds good! So, data shades of grey, is that going to be a new bestseller coming from the organization?
Joannes Vermorel: Possibly, who knows?
Kieran Chandler: Okay, well, we hope you’ve enjoyed today’s episode on data preparation. As always, get in touch if you’ve got any further questions and we’ll see you again next time on Lokad TV. Goodbye for now.