00:00:00 Data better than you think
00:03:43 Why “bad data” becomes a scapegoat
00:07:26 Transactional facts versus parameter junk
00:11:09 Reporting-layer “cake” breaks decision-making
00:14:52 Decision-first view of useful information
00:18:35 Automate parameters: lead times, seasonality, newness
00:22:18 ERP complexity isn’t bad data
00:26:01 Date fields hide multiple real-world meanings
00:29:44 Semantics proven by generated decisions
00:33:27 Outliers expose bad methods, not bad data
00:37:10 Data lakes should copy, not “improve”
00:40:53 Data quality measured in dollars
00:44:36 AI readiness: reliable transactions, semantics first
00:48:19 Dual-run demands fully unattended execution
00:52:02 Vendor blame games: Lidl-SAP cautionary tale
00:55:45 Quality equals fitness for decisions
Summary
“Bad data” is often a scapegoat. Most firms’ transactional records—what was bought, sold, shipped, returned—are good enough, or they wouldn’t survive. The real mess is the mountain of manually maintained parameters and the confusion over what fields actually mean (semantics), especially in sprawling ERPs. Don’t “clean” reality to save weak methods: outliers are often the business. Judge data by results: if decisions improve profitably, the data was sufficient; if outputs are insane, fix the interpretation.
Extended Summary
A familiar complaint in supply chain is that “bad data” blocks progress. Yet much of what gets called bad data is neither bad nor even central to the problem. It is often a convenient excuse—sometimes for vendors who need someone else to blame when their software fails, and sometimes for analysts trained on tidy classroom datasets who are shocked by the sheer sprawl of real enterprise systems. An ERP with thousands of tables is not evidence of bad data; it is evidence of complexity.
The conversation draws a hard line between two kinds of “data.” First is transactional reality: what was bought, sold, shipped, returned, scrapped, paid—events with financial consequences. This information is usually reliable, for a simple reason: firms that cannot keep basic transactional truth straight do not last long. Markets punish that level of confusion quickly. Errors exist, but typically at low rates.
Second is a mountain of manually maintained parameters—service level targets, safety-stock coefficients, seasonality flags, lead times entered by clerks. These “numerical artifacts” are routinely stale, inconsistent, and costly to maintain at scale (millions of SKUs times multiple parameters). But the important point is that they are often unnecessary. Many of these inputs should be inferred from observed history or automated through better methods. Treating them as “core data” creates a self-inflicted burden.
A major hidden issue is semantics: what a field means. The same column can change meaning over time, across business processes, or even by sign (sale vs return). Documentation is usually thin at the start. The only reliable way to validate semantics is not by endless workshops, but by putting interpretations to the test: generate real decisions—what to buy, produce, stock, price—and see whether the outputs become absurd. When they do, you reverse-engineer the pipeline to find the mistaken assumption.
This also reframes “noisy data.” If customers sometimes order 1 and sometimes 100, that is not bad data—it is the business. Methods that collapse under outliers are defective; the data should not be falsified to rescue weak mathematics.
Finally, on “AI readiness”: the bar is not moral purity of data. It is fitness for purpose. If you know what you buy, make, and sell, you can begin. The real work is mapping semantics system by system, then iterating rapidly until decisions are sane. In the end, quality is not a slogan; it is measured by the economic performance of decisions.
Full Transcript
Conor Doherty: This is Supply Chain Breakdown, and today we will be breaking down why your data is, in fact, better than you think. Many companies say to us that their data is what prevents them from initiating the projects they want. Well, today we’re here to challenge that idea. We’re being constructive.
Who are we? Well, I’m Conor, Marketing Director here at Lokad. To my left, as always, Lokad founder Joannes Vermorel. Now, before we start, let us know down below: one, where in the world are you watching us from? We’re in Paris. And two, do you agree, or do you even think, that master data is in fact a bottleneck for digital transformation projects, the kinds that companies really want to get into these days? Get your comments and questions in below. We’ll get to them a little bit later.
And with that, Joannes, to the discussion. Now, before we get into it, a little bit of setting the table on how this discussion came about. To pull back the curtain: at the end of every single episode, I say, “Hey, if you want to continue the conversation, reach out to Joannes and me privately. We’re happy to talk.” All of which is true. Well, some people do that. They want to talk.
And recently, collectively and individually, we’ve had discussions, and a recurring theme for practitioners is kind of bemoaning the state of data: like, “My master data is spaghetti,” “My ERP is Swiss cheese,” whatever. But like, that’s the problem. That’s what’s preventing me from doing all the cool stuff I want to do. And that’s how we got to today’s topic.
So, Joannes, why does data get such a bad rap in supply chain?
Joannes Vermorel: I can think of several reasons, depending on the audience.
For software vendors, bad data is the ultimate scapegoat. It’s a way to politely but firmly blame the clients for whatever defect the software product has. So it’s extremely convenient, and good vendors, talented vendors, will manage to convince the clients that it’s all on them if the whole thing blew apart because they had insufficient data practices, quality assurance, and whatnot.
But my take here: it’s really scapegoating. The second audience is data scientists, especially data scientists who have done, I would say, Kaggle competitions. They are used, especially at university, to data sets that are super neat and tidy, and they think this is the standard: a data set where you have five tables, each table has five fields, and you have extensive documentation for every field, and it’s very, very clear. That is, for me, a problem of wildly unrealistic expectations.
In companies, I mean, if we take mid-market ERPs, we are going to talk about 10,000 tables. Some tables have 50 fields. That’s what we’re talking about. So here we have a problem of data scientists having completely unreasonable expectations about what corporate data actually is.
And there is a third segment, which is practitioners. The thing is: what they are looking at is typically the parameterization of their business system. They are not looking at anything real, as in: a unit being sold at this date, at this price point. That is not typically the concern. This is not about those key transactional events in the past.
What they’re looking at is: “Oh, but look, the seasonality settings that we entered in the APS two years ago, they are completely off now because of this, because of that.” “Look, we have so many corrective coefficients for safety stock formula, and they are completely off as well,” etc. So in fact, when those practitioners complain about bad data, they complain very specifically about numerical artifacts—things that do not represent the past or the future. They represent some kind of parameterization of the policies of the company.
And my take is: if we go back to the first case, which was the software vendor, the problem is that if the software solution that you use depends so heavily on this manually maintained parameterization, then it is guaranteed that you’re heading for trouble. Thus, the only solution is to stop treating those things as data. They should not even be part of the picture, so that you don’t have to cope with them.
Conor Doherty: So you can just correct me if I’m wrong, but it does sound like what you’re saying is that there are broadly two categories of data. So when people use the term “data,” inherently they’re thinking of two categories. One is what you call the numerical artifacts—the things that they’ve set for themselves, certain targets—but the other is the actual raw transactional data, and you see that as being much, much more important.
Joannes Vermorel: Yeah. I mean, the real transaction data is the stuff that happened. Again: you sold or you didn’t. You paid the supplier, you passed an order, you picked a unit from the stock, you destroyed one unit that was actually perishable and passed its shelf life, etc., etc. Those are transactional, and this is usually excellent.
However, what you have is also, in many business solutions, you have a mountain of parameters that are supposed to be maintained by clerks—people who just operate the system. And the thing is: why do you even need to have those zillion parameters? Because very frequently we’re talking of a huge amount of parameters.
Just to have a sense of scale: a company that has a million SKUs, which is not that huge actually—if you have, let’s say, 10 parameters per SKU, and I’m conservative here, it can be a lot more—10 parameters per SKU, so we are talking of 10 million parameters to be more or less manually maintained. And that is insanity.
Then people say, “Oh, this is crap because we can’t really maintain it.” I would say: yes, absolutely. But the reality is that you don’t really need it. And all those things, they don’t truly convey information because the way those parameters were actually set was by looking at the rest of the data anyway.
When people set, let’s say, a service level target, they did it by having a look at the transactional history. So you see, this information is completely, in fact, transient, and it should not be considered as part of your core data. That’s why I say: bad data is a problem that is much, much smaller than what people expect. It’s just because they treat as data a big portion of things that they should not.
There is a second angle also, but that’s a separate problem. It is when people want to do further analysis based on pre-processed data, especially the one that is obtained from the reporting layer.
Again, I split enterprise software in three classes: systems of record—that’s going to be ERPs minus the planning, because there is no planning involved—systems of reports, that would be business intelligence and all those systems for dashboarding and reporting, and then systems of intelligence, the ones that do decision-making.
The only source of truth for the data lies in the systems of record. But very frequently, because people are sometimes misguided, or busy, or whatever, they want to exploit the data as it is presented from the system of reports. And here we are in a situation that, as an analogy, is very much like a cook.
You imagine you have a kitchen, you have all the raw ingredients—flour, sugar, salt—and that is a system of records, the raw ingredients. And then the cook, through the system of reports, can cook a cake. You have a cake, it’s nice, you can consume the cake. It’s intended for human consumption, so it works.
Now you’re asking the cook: “Just take this cake and do something else with it.” And that’s exactly what happens when you’re trying to use the data that comes from your reporting layer to, for example, drive decision-making processes. That goes very, very badly. It’s not that the cook is a bad cook; it’s just that if you start from the cake, you can’t do anything else. It’s already the finished product. Trying to disentangle the sugar, the flour, and whatnot—everything is lost.
So you need to go back to the raw ingredients if you want to do something else. That’s also a typical mistake that I’ve seen many companies do: they look at the data in the reporting layer, which is intended for human consumption, and they try to extract this data and reprocess it for other purposes. That is a mistake. You need to go back to the raw ingredients.
Conor Doherty: Well, again, on that note—the idea of going back to the raw ingredients, going back to your foundational data—I actually have a… think what you will about the source. I’m just using it for a little bit of background context to show that we’re not just speculating here.
I’m looking at a Gartner report now; it surveyed hundreds of very large firms and confirmed that most firms don’t even measure data quality consistently. What actually happens is there’s just this feeling that our data is bad, but it’s not necessarily a measured fact.
So there’s two questions here. One: does that overall surprise you? And two: how could people quickly diagnose the health of their data?
Joannes Vermorel: First, if we look at transactional data, the answer is simple: does your company make profits? If so, your company has high-quality data. Why? Because I’ve never seen a company survive that didn’t know what it was buying, or selling, or producing.
If you don’t even know that, you will go bankrupt at the speed of light. If you don’t know what your suppliers sent you, then you’re going to be charged multiple times. If you don’t know what your customers actually bought, then you’re not actually charging them correctly. It’s very fast. It’s Darwinism at play. Markets ruthlessly eliminate those companies; they are gone.
So, as a rule of thumb: if you are in a company that is not on the verge of bankruptcy for mismanagement, the transactional history is most likely very high quality. Yes, a clerical error may slip in, maybe one line in a thousand; that is typically the order of magnitude that I see in most companies. You will have between, let’s say, 0.1% and sometimes up to 0.5% of clerical errors, but it’s very, very low. Many companies are even below that.
Now, if we are talking about all the optional data—for example, you have a parameterization in your system that lets you decide the target service level for a SKU—what does it even mean to have high quality data here? This is nonsense.
If I were playing devil’s advocate, I would have to look at how close this setting is to maximizing the rate of return on the inventory investment of the company when this parameter is in place. We’re never going to do that. It’s way too complicated. If you actually embrace the idea that your supply chain should maximize rate of return, great, but then you will very, very quickly throw away all those non-economic approaches.
Bottom line: in this parameterization, yes, there tends to be a lot of garbage. The reality is that companies can live with that because, let’s say, an inventory planner will just look at the recommended replenishment, which will not make much sense, but the same planner will have a look at the recent history, a few things, and in one minute the person will say, “Okay, order 50 units,” and move on to the next SKU.
So yes, if you want to robotize, or do anything automated, based on this parameterization data, the quality of the data is very, very low. But the only solution is to think of software systems as systems of intelligence that do not rely on this parameterization, which exists only to support a workflow-centric process.
Conor Doherty: Well, I’m just writing down a thought to follow up on that because, again, where possible I like to amplify any points of constructive thought.
Our perspective, as is well known to anyone who follows us, is that the purpose of data, the purpose of even just going to work in a company, is making better decisions—better decisions that actually produce more money. Now, you’ve made the point there that better decisions don’t reside in the system of reports; the needed information is already in your transactional data.
So essentially, for people to make better decisions—define “better” however you please, as we define it it will be maximizing rate of return—but for everyone listening, if you have transactional data, you can already start making positive decisions.
Joannes Vermorel: What I’ve noticed for pretty much all the clientele of Lokad—and we’re talking of 20 billion plus of annual flow of merchandise that Lokad pilots—is that 99% of the information, and actually probably 99.99% of the mass of information, comes from transactional systems.
On top of that, you will probably have a few dozen meta-parameters—economic drivers—that need to be manually maintained. But here we have to be very conservative, and in practice it’s just a few dozen. So data quality here must be high. It’s difficult, but we’re talking of a very small number of important parameters—parameters that are sufficiently important to be worth several meetings each.
But in the end it has to be a strategic, economic, high-level parameter, not something that is at the SKU level. All of that needs to be completely robotized away, and it can.
That’s why I say that data is usually excellent, because what you need is transactional history. If you approach a problem right, you don’t need to have millions of parameters that live in your system, that need to be manually maintained, and that can take so many forms.
Many systems ask practitioners to enter the lead time. But why do you have to enter the lead time? You observe the lead times from your supplier. So lead times need to be forecast, not entered by the user.
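As a minimal sketch of that idea, assuming a hypothetical purchase order extract with order and receipt dates (table and column names are illustrative, not anyone’s actual schema), lead times can be inferred directly from history rather than typed in by a clerk:

```python
import pandas as pd

# Minimal sketch: infer supplier lead times from purchase order history
# instead of asking a clerk to type them in. Table and column names are
# hypothetical stand-ins for an ERP extract.
po = pd.DataFrame({
    "supplier": ["ACME", "ACME", "ACME", "GLOBEX"],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-02", "2024-01-20"]),
    "receipt_date": pd.to_datetime(["2024-01-19", "2024-03-01", "2024-03-18", "2024-02-12"]),
})

po["lead_time_days"] = (po["receipt_date"] - po["order_date"]).dt.days

# Keep the whole observed distribution rather than a single "true" value:
# quantiles give a view of how late a supplier can plausibly be.
lead_time_quantiles = po.groupby("supplier")["lead_time_days"].quantile([0.5, 0.9]).unstack()
print(lead_time_quantiles)  # median and 90th-percentile lead time per supplier
```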
Many systems ask the user to classify and say, for example, “Is this a seasonal item?” Why should you manually do this sort of thing when it should be completely automated, even if the only thing that you have is the product label? Nowadays with LLMs, it has never been easier to automate this sort of detection. “I have a new product; will this thing exhibit seasonal patterns?” It’s fairly straightforward.
You don’t need a human to step in to say, “Oh yes, a ski suit, yes, that’s going to be seasonal. Okay, thank you.” That’s the reality: those are super basic questions, and yet all these systems keep asking practitioners to enter so many things that are completely obvious and that can be automated away.
Even things that are sometimes super baffling: practitioners have to manually enter whether this is a new product. Why do you need people to tell the system it’s a new product? You can see that there is no transaction in the history. The product was created in the system recently, there are zero transactions—why do you need to manually flag it as a new product? That is tons of nonsense.
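A minimal sketch of that newness check, assuming hypothetical product and sales extracts: a product is flagged as new when it was created recently and has no transactions yet.

```python
import pandas as pd

# Minimal sketch: infer "new product" status instead of asking users to flag it.
# Table and column names are hypothetical stand-ins for ERP extracts.
products = pd.DataFrame({
    "sku": ["A1", "B2", "C3"],
    "created_on": pd.to_datetime(["2025-06-01", "2023-02-15", "2025-05-20"]),
})
sales = pd.DataFrame({"sku": ["B2", "B2", "C3"], "qty": [3, 7, 1]})

today = pd.Timestamp("2025-06-15")
has_sales = products["sku"].isin(sales["sku"])
recently_created = (today - products["created_on"]).dt.days <= 90

# A product is "new" if it was created recently and has no sales history yet.
products["is_new"] = recently_created & ~has_sales
print(products[["sku", "is_new"]])
```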
But again, my take is that all the garbage in this area, in this parameterization for all those policies, reflects an obsolete way to look at supply chain. So whatever bad data you have on this front, it’s irrelevant. What matters is the transactional data, and this transactional data is good, and it is good because your company is alive.
Conor Doherty: Well, on that—and again, this is to dig down a little bit deeper into the causes of the perception—because it’s not helpful to just say, “I have a problem.” It’s more like, “Okay, what are the causes of this? What is the root of this?”
So listening to you, it does sound like there’s… okay, I’m going to give a concrete example and then you can bounce off of it. I have heard, and I know you have heard, some version of this: “My ERP is a mess.” And that is the system of records that they’re talking about, the book of transactional data: “I’ve got duplicated tables, I’ve got mislabeled columns, it’s a mess.”
Now technically all the transactional data, the raw transactional data, is there. The problem—if there is a problem, and we can discuss that—is: okay, the ERP migration that you did produced a mess. And let the disclaimer be: we don’t sell an ERP, we can work with anything, so we have no skin in that game.
But my question to you is: how big a role does the selection of software play in producing this epidemic of bad data?
Joannes Vermorel: Again, here, I would say it’s wrong expectations. That’s exactly my point about the second audience: data scientists, Kaggle competitions. I never said that the ERP, the systems of record, were going to be neat and tidy. The complexity is going to be off the charts, and this is just fine.
This is not bad data. This is just very complex and very opaque data—different problems. Now yes, when you have 10,000 tables, it is very difficult to pinpoint where the stock level is. It’s difficult, and it can take weeks to chase down where a piece of data actually lives in the system. But again, this is not bad data.
Then indeed you have another problem: the semantic for any given column may be heterogeneous. What do I mean? I mean that the semantic that you can have for a given column of data might vary depending on the line. That is a complication.
Just an example: some misguided ERP vendors years ago decided, for example, that in the orders table, if the quantity was positive, it’s a sale, so you’re selling to a client, and the date is a date of transaction for the sale. But if the quantity is negative, it’s a return, so it’s a date of the return of the item. That means I have a column called “order date,” except it’s not an order date when the quantity is negative: it’s actually a return date.
That’s what I mean by heterogeneous semantics: in the same column, the meaning varies depending on the line, depending on certain conditions. Sometimes it’s simply that the ERP had an upgrade, and from January 1st, 2020, the order date meant something else; due to an upgrade of the system, the semantic changed over time.
Sometimes it can even be the teams working in the company that at some point decide to change the process, and they re-specify the semantic of a given field, and so it has a new semantic. So this is very complex, yes, and uncovering it takes work.
But again, is it bad data? If your ERP vendor, maybe because they were a little bit incompetent in terms of software design, decided that “order date” could be either the date of the sale or the date of the return—yes, it’s misguided, but is the data bad? I would argue the data is just good. It’s just confusing semantics.
We are back to the fact that it’s a lot of work to re-establish, and that I agree. But when people tell me “bad data,” very frequently I say: no, your data is just fine. It’s just that we need to do the work seriously of re-establishing the actual semantic of your data, and that is a lot of work.
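A minimal sketch of what that re-establishment of semantics can look like in the interpretation layer, using the hypothetical order-date example above; the raw data stays untouched, and the column names are illustrative:

```python
import pandas as pd

# Minimal sketch: re-establish semantics in the interpretation layer,
# leaving the raw data untouched. Per the example above, the hypothetical
# "order_date" column means a sale date when qty > 0 and a return date
# when qty < 0.
orders = pd.DataFrame({
    "sku": ["A1", "A1", "B2"],
    "qty": [5, -1, 2],
    "order_date": pd.to_datetime(["2025-03-01", "2025-03-10", "2025-03-04"]),
})

orders["sale_date"] = orders["order_date"].where(orders["qty"] > 0)
orders["return_date"] = orders["order_date"].where(orders["qty"] < 0)
print(orders)
```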
As a rule of thumb, when we start with clients, we typically have not even a line of documentation per field, per table. When we are done, we typically have a page of documentation per field, per table, for the fields that are actually relevant for the Lokad initiative, which is supply chain optimization.
Nevertheless, with 20 tables of 20 fields each, we are talking of 400 pages’ worth of documentation. Not IT documentation: supply chain documentation, because we need to capture what each field means and implies from a supply chain perspective. So yes, that’s a lot of work.
I think that, again, under the scapegoat of bad data, very frequently it is just that a lot of people have not realized the amount of effort that goes into this semantic qualification. On top of that, we have incompetent software vendors who are happily using that to scapegoat the data, which is a polite way to tell the client: “It’s your fault.”
Conor Doherty: Well, on that note, I actually… you didn’t know I was going to do this, but obviously your new book is there, but in preparation for this, I actually went back to your old book.
And for OGs who’ve already read both of Joannes’s books, you can take out your copy now. Obviously the code in the last few hundred pages of this book might not be as relevant today, but the first… I will say the first hundred pages are still very, very relevant.
And for anyone who has their copy to hand: on page 60, on the topic of semantics—and again, this is just to demonstrate that it’s not abstract philosophy that you’re making there—I’m about to give a very concrete example and ask you a very concrete question, but I am just going to momentarily read from this because I find it very enlightening.
So in here, page 60: when we refer to a quantity per day relating to a specific date, the date alone comes with its own set of ambiguities. It might be the date when the client made an order, when the client confirmed a pre-order, when the product was shipped to the client, when the order entry finally arrived in the ERP, when the order entry was last modified within the ERP, or when the payment from the client was finally received. You wrote “etc.”, but you could add the date when the warranty or the return period expired.
Now, those are all concrete semantical meanings for a simple date.
Joannes Vermorel: Yeah. For a simple date.
Conor Doherty: Exactly. But my point being: what’s the damage if I just picked any one of those? Is it like a decision tree where all of a sudden the decisions I can make differ wildly? Is that the scale of the damage? Explain why, so people understand.
Joannes Vermorel: Yes. The tricky part with the semantic is that when you get it wrong, the only reliable way to know that you get it wrong is that you will, at the end of the pipeline, generate insane decisions.
Until you have a completely robotized pipeline that generates endgame decisions—allocation of resources for supply chain: what you buy, what you produce, where do you stock the things, your price points for every article that you sell—as long as you don’t have a fully unattended data pipeline, a numerical recipe that generates those decisions, you are lacking the very instrument that you need to assess, to challenge your semantic.
If, in your documentation, you write it down wrong—you think this order date was when the payment was cleared when, in fact, it’s when you’re ready to ship the article to the client—then the documentation is wrong. You won’t be able to notice that. Nobody will be able to notice that.
Maybe if you go and do a specific deep dive on this field after two days of work, you will be able to correct that. But you have to know that there was a mistake in the first place. We are talking of, even for a small initiative, hundreds of fields. Are you going to do multi-day workshops for every single field to make sure that your semantic is correct? It’s not reasonable.
So the reasonable approach is to make a best effort, a best guess, and then let the decision be generated based on this interpretation. Lo and behold, occasionally decisions will be insane. When we start initiatives, we generate plenty of downright insane decisions.
Then we have a look, and people say, “Oh, this number is nonsense.” Okay. We reverse engineer what brought us to this nonsensical decision, chasing it back through the pipeline. You need to have the instrumentation for that.
You will end up saying, “Ah, this date—oh, we misunderstood.” In fact, the lead time that was applicable in this situation is completely different because we misunderstood the date. Okay, fine. Let’s regenerate the decision with a corrected interpretation for this date. Oh, it looks much more reasonable now. Good. Next.
Fundamentally, the only way to assess if your semantic is correct is ultimately to put it to the test of the real world: generate a decision, and let practitioners say, “Is it making sense?” If not, you need to go back.
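A minimal sketch of that test-by-decision loop, with purely illustrative thresholds and column names: generate the decisions, flag the insane ones, and send them back for reverse engineering.

```python
import pandas as pd

# Minimal sketch: put the interpretation to the test by generating decisions
# and flagging the insane ones. Thresholds and column names are illustrative.
decisions = pd.DataFrame({
    "sku": ["A1", "B2", "C3"],
    "proposed_order_qty": [40, 12_000, 0],
})
history = pd.DataFrame({
    "sku": ["A1", "B2", "C3"],
    "max_monthly_demand": [60, 80, 15],
})

review = decisions.merge(history, on="sku")
# A proposal far beyond anything ever observed is a hint that some field
# (a date, a unit of measure, a price) was misinterpreted upstream.
review["insane"] = review["proposed_order_qty"] > 20 * review["max_monthly_demand"]
print(review[review["insane"]])  # these go back for reverse engineering
```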
The mistake that many alternative tools make is that when the data that is generated is nonsense, they say, “Oh, you have to tweak the parameters.” I say: absolutely not. You should only tweak the parameters if the parameters are deemed to be the root cause of the problem that you’re seeing. If not, this is not the solution.
Very, very frequently, the problem… I can’t overemphasize the importance of semantics. It is very, very tricky. The only way to do it is to look at the generated decisions, which is even more of a problem since many tools in the planning space never generate endgame decisions, and thus those software vendors deprive themselves of the very instrument that would let them assess whether they have the right semantics.
Conor Doherty: Right. Well, again, bridging off of what you’ve just pointed out: we use data to make decisions, whether or not you use probabilistic forecasting, which is what a Lokad approach would advocate. But fundamentally, the data is used to facilitate at least one step: forecasting, and after that, arriving at a decision.
But one of the common things that we do hear from practitioners is: “There’s a lot of noise in the data.” Is noisy data a problem in a forecasting context?
Joannes Vermorel: Absolutely not. If, for example, your business is erratic—you have clients that sometimes they order one, sometimes they order 100—that is the reality of your business.
Many supply chain methods, numerical methods, are extremely defective. When they face something that would qualify as a statistical outlier, numerically the recipe misbehaves. So you have an outlier and the thing derails and gives you complete nonsense.
Then people say, “Oh, we need to go back and modify this history, prune outliers.” They say, “Those outliers are bad, they are symptoms of bad…” Here I completely disagree. If your clients did order, for real, 100 units in the past, this may be an outlier but this is reality. It happened.
Obviously, if you have a record in your system that says a million units were ordered but no such order was ever created, okay, that’s bad data. We go back to transactional data: transactional data are accurate.
But if you have a numerical method that gives you crazy results because it is faced with an outlier in historical data, the problem is not the data. The problem is the numerical method that is just crap. You’re facing a defective method. You should discard this method and put something that is numerically better behaved. That’s the essence.
That’s typically a situation where data is perfectly fine, and where vendors who propose methods that are highly defective will actually convince their own clients that they need to manually correct their bad data, while in fact the data is absolutely correct and the defect lies in the numerical recipe itself.
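A minimal sketch of the contrast, with purely illustrative numbers: the outlier stays in the history, and it is the estimator that has to cope.

```python
import numpy as np

# Minimal sketch: a genuine outlier (a real order of 100 units) should not be
# deleted from history; the numerical method should cope with it instead.
weekly_demand = np.array([1, 2, 1, 0, 3, 1, 100, 2, 1, 2])

naive_point_forecast = weekly_demand.mean()          # dragged up by the outlier
median_forecast = np.median(weekly_demand)           # barely moves
q90_reorder_level = np.quantile(weekly_demand, 0.9)  # still acknowledges spikes

print(naive_point_forecast, median_forecast, q90_reorder_level)
# The point is not that the median is "the" answer, but that the method,
# not the history, is what should absorb erratic demand.
```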
Conor Doherty: Well, that’s actually a perfect opportunity to switch to the audience questions. Two have come through privately. I’m sorry, just give me a second to tidy one up and frame it as a question.
So, on that note, practitioner-based: “We’ve already built a data lake and we obviously have our catalog, yet our users, our end users, say the data is wrong. In your opinion, is the bottleneck tech or semantics? And how does Joannes, or how does Lokad, avoid endless relabeling?”
Because, again, you talked a lot about semantics and their importance.

Joannes Vermorel: It depends on how you build your data lake. Is your data lake a perfect one-to-one copy of your records in the system of record? No pre-processing, no improvements, no joins, no filters of any kind. It’s literally one-to-one. Possibly a small delay, because the data may not be copied live from the ERP, but putting aside the delay for the copy, it is exactly a copy of the data as it exists.

When people complain about that, again, we are back to the second problem with data scientists: “Oh, the data is not neat and tidy. This is not quite like Kaggle experiments. We are confused.” I say: unfortunately, this is the world you live in. You live in a world where the information that lies in your business systems is very complex. There is no escape. There is no alternative.
So you may complain about that, but it’s like complaining about gravity. It’s just a fact of the universe. You have to live with it.
Very frequently, what happens is not the ideal scenario I described for the data lake, just a vanilla pure replicate of the various business systems. And you could ask: why do you have a data lake if it’s just a copy? The short answer is because you do not want to create load on your ERP system, your transaction system. Your transaction system needs to stay super snappy.
If you have people that query, “I want all the sales orders over the last five years, just dump me all of that,” it’s going to slow down the system. It means somebody who is trying to beep something is going to wait multiple seconds because the resources will be starved due to this massive data extraction query. That’s why it’s a best practice to create a replicate and let those massive queries be executed on the replicate, not on the primary instance of the system of records.
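A minimal sketch of that practice, assuming a hypothetical read-only replica and table name: the heavy historical extraction streams from the copy, never from the live ERP.

```python
import pandas as pd
from sqlalchemy import create_engine

# Minimal sketch: heavy historical extractions run against a read-only replica
# (the data lake copy), never against the live ERP. The connection string and
# table name are hypothetical.
replica = create_engine("postgresql://readonly_user@replica-host/erp_copy")

chunks = pd.read_sql_query(
    "SELECT order_id, sku, qty, order_date FROM sales_orders "
    "WHERE order_date >= '2020-01-01'",
    replica,
    chunksize=100_000,  # stream in chunks so neither side runs out of memory
)
for i, chunk in enumerate(chunks):
    chunk.to_parquet(f"lake/sales_orders_part_{i:05d}.parquet")
```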
Back to the original point: what I observe in many data lakes is that the IT team makes a very severe mistake. They reprocess the original data. They want to improve it.
What is the catch? The catch is: they do not generate the decisions. Thus, they do not know the semantic. Thus, the sort of transformation that they apply to the data is guaranteed to be misguided, to be wrong, and as a result you end up with data that is not what you expect, not what you need, and there is no fix.
No matter how competent, no matter how dedicated they are, they are lacking the very instrument, which is reverse engineering the insane decisions. By definition, this is not the role of IT, generating those business decisions. Thus, IT can only deal with the plumbing: set up the databases, make sure the instances are secure, make sure the instances have enough RAM, disk bandwidth, infrastructure—yes.
But if you want semantics, IT cannot be the one dealing with semantics. The semantics are way too specific to every trade. IT cannot be expected to be an accounting specialist, a marketing specialist, a supply chain specialist, a legal specialist, and whatnot. That’s why the semantics can only be in the hands of the practitioners. By definition, it will overwhelm IT if you try to have it fight this battle for you.
Conor Doherty: Well, again, there are two previous comments that I’d received, both personally in calls and in meetings, so I’m choosing which one would be the best follow-up.
I’m going to build on the idea of the conflict between IT and ops. Literally, the comment was: “IT says our data is terrible, yet my ops, the practitioners using it, say it’s fine,” because as you pointed out earlier, if you’re making decisions and you’re making money, it’s therefore fine for business.
So the question is: how do you guys—being us—objectively assess quality and choose what needs to be fixed now versus later versus just rolling something out?
Joannes Vermorel: Rate of return on the generated decisions. If you have data that is supposedly crap, but the decisions that you generate come out fine, is it crap? Maybe not. Maybe it’s completely irrelevant. We don’t even care.
If you have a piece of data that is fundamentally inconsequential, the fact that it’s correct or not correct is moot. We don’t care. That’s why, for us at Lokad, we really assess data quality in terms of dollars: what is the payback of improving—whatever that means, and that varies depending on the case—this piece of data?
If improving this data means millions of dollars per year because we have better decisions, incredible. This should be a priority. Probably we should invest. If this data could be even worse and it makes no difference, then we don’t care.
That’s why I say: the only way to assess the quality of the data… that’s why IT cannot make this assessment, because this assessment is rooted in the decisions that your business ultimately generates. Is it crap or not? Well, it depends.
For example, in aviation we have plenty of fields like this: a field that is incomplete for 99% of the items, where very frequently there is a note that says something like, “The C number is at this place on the component.” Is that good or bad data?

The reality is that for the quasi-totality of airplane parts, locating the C number is super obvious. You don’t need a note telling you where the C number is. You just pick up the part, it’s obvious, you read it. But in some rare cases it is tricky, and it is in a place that is a little bit hard to reach, frequently for mechanical reasons. In those cases you may have a small note that tells you where to look.
If you look from the IT perspective, you would say, “Oh, you’re so inconsistent with your data entries. Look at this field: it’s only like 0.5% of the items that get this attribute being set.” But the reality is: yes, but it’s the only items where it actually matters.
So again, I’m saying: the only way to assess whether data is good or bad is to put it to the test of the decision-making process.
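A minimal sketch of pricing data quality in dollars, with purely illustrative numbers and placeholder logic: back-test the decisions with and without a given fix, and the difference in economic outcome is the value of fixing the field.

```python
# Minimal sketch: data quality priced in dollars. Numbers are illustrative.
# Back-test the decisions twice: once with the field as-is, once with the
# corrected interpretation, and compare the economic outcome.

def annual_return_of_decisions(decisions):
    # Placeholder for a back-test that prices each decision in dollars
    # (margin captured, carrying costs, stock-out penalties, ...).
    return sum(d["expected_margin"] - d["expected_holding_cost"] for d in decisions)

decisions_as_is = [{"expected_margin": 120.0, "expected_holding_cost": 35.0}]
decisions_with_fix = [{"expected_margin": 150.0, "expected_holding_cost": 28.0}]

value_of_fixing_the_field = (
    annual_return_of_decisions(decisions_with_fix)
    - annual_return_of_decisions(decisions_as_is)
)
print(value_of_fixing_the_field)  # near zero means the "bad" field is inconsequential
```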
Conor Doherty: Well, that may, in fact, answer the very next comment. Again, this is very, very specific, but it touches on a topic that I think is the elephant in the room: people want to use their data nowadays for AI. They want AI projects, etc.
So this is from a friend of the channel: “We run 40-plus countries. We use several ERPs and WMS’s to manage inventory for 40 countries. When is our data good enough to start AI, and what do you actually do in the first 90 days, I guess the first quarter essentially?”
Joannes Vermorel: If you have a country where you know what you’re buying, you know what you’re producing, you know what you’re selling, you’re good for AI. That’s it. For us it has always been that. So the bar is not high. That’s the interesting thing.
The bar to be able to robotize the decision-making process—whether you call it AI or not, whatever label you want to put on it—is just that, and it is sufficient. That is what we’re doing. That’s why, very frequently, the maintenance of parameters that govern all the subtle workflow aspects for people to do the work manually is irrelevant: we’re not going to use them.
Fundamentally, AI completely challenges the division of labor. The entire workflow where you have forecasters followed by planners followed by budget managers followed by inventory managers followed by blah blah blah—you know—with a workflow, it does not make sense in the age of AI.
The AI will just process, as a monolith, the raw data coming from the systems of record, and it will directly output the decisions, with all the instrumentation to support them: the economic drivers that explain why each decision has been taken. Perfect.
Your AI system—what I call the systems of intelligence—is fundamentally something like a monolith that takes records in and generates decisions out with supporting instrumentation, and that’s it.
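A minimal sketch of that records-in, decisions-out shape, with placeholder logic and hypothetical field names; the actual numerical recipe is, of course, the hard part.

```python
from dataclasses import dataclass

# Minimal sketch of the "records in, decisions out" shape described above.
# Everything here is illustrative; field names and logic are placeholders.

@dataclass
class Decision:
    sku: str
    action: str             # e.g. "purchase" or "hold"
    quantity: int
    economic_drivers: dict  # the dollar-denominated reasons behind the decision

def system_of_intelligence(transactional_records: list[dict]) -> list[Decision]:
    decisions = []
    for rec in transactional_records:
        # Placeholder logic: a real recipe would forecast demand probabilistically
        # and weigh margins, carrying costs, and stock-out penalties.
        qty = max(0, rec["expected_demand"] - rec["on_hand"])
        decisions.append(Decision(
            sku=rec["sku"],
            action="purchase" if qty > 0 else "hold",
            quantity=qty,
            economic_drivers={"expected_margin": qty * rec["unit_margin"]},
        ))
    return decisions

print(system_of_intelligence([{"sku": "A1", "expected_demand": 30, "on_hand": 12, "unit_margin": 4.0}]))
```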
When people tell me, “We have 40 ERPs,” I would say: I don’t think anyone has 40. That would be… I’ve seen companies… I’ve seen one company that had 17 ERPs in the same country. That is the record holder. I’m not going to name this company. It’s a very, very large one. Same company, same country. That’s where things were really mental.
Bottom line: you will have to carry this effort of re-establishing the semantics ERP by ERP. That’s going to be a pain. Obviously.
He was asking for the first 90 days. Typically it takes us two months to establish the semantics. That’s something we do for one set of business systems that operate, for example, a country. But the real boundaries are not necessarily the country. They are much more related to which IT systems do we need to include in the scope: ERP, WMS, e-commerce platform, and whatnot.
Our scope is very much driven by the boundaries of the IT systems, very frequently precisely because the effort is about establishing the semantic. Then once we have the semantic, the first data pipeline, we will start iterating on the decisions, and typically that takes about two months.
So: two months to establish the data pipeline, to get a first educated opinion on the semantic of every field. But then you need two months of extra iterations to eliminate all the insanity, very frequently by identifying the semantic that you got wrong in the first iteration.
So in 90… typically it’s not 90 days; it’s going to be, let’s say, 120 days. You can get, with zero insanity, production-grade decisions. That’s a typical Lokad initiative.
But the gist is: you need to be able to iterate very rapidly, typically multiple times a day, on those insane decisions, because you will identify so many problems with the semantic of the data.
Conor Doherty: Well, again, and I’m only raising this because you brought it up: you gave the explicit example of how we might do that. A core point there is that what you’re describing, the implementation, would run in parallel with what we call a dual run. It runs in parallel with what you’re currently doing, so you can see the difference with your own eyes.
Joannes Vermorel: Yes. That’s where it’s completely critical to have something that is completely unattended. Why is that? Because if your parallel process, your dual run, requires a lot of manpower, where does this manpower come from?
In the company, let’s say they have 15 planners; they are all busy 100% of the time. If they were not busy close to 100%, the company would only have 12 planners, or 10. By definition, companies don’t have spare employees unless there are special circumstances. As a rule, for most jobs there are no spare employees doing nothing, just sitting there as a backup to test the extra system.
Those people already need their full eight hours just to do their job each day. They can spare maybe half an hour a day to have a quick look at another system, just to identify the outliers, the bad decisions, the insane decisions, but they cannot spend eight hours on their system and then another eight hours on a second system where they have to go through the same workflow.
That’s why I say it is absolutely critical that this new decision-making system needs to be completely unattended; otherwise, operationally, you will not be able to deploy. That’s something that Lokad learned a decade ago.
Conor Doherty: All right. Well, again, I’ve exhausted the list of comments that were posed, or I was explicitly asked, “Please ask you on the air.”
In terms of closing on a constructive note: we’ve covered a lot there. In the next week—or pick any time horizon, but don’t say a year; in the next 30 days, for example—what are some easy-to-implement changes that people can actually make, if not to improve the quality, then at least to improve the internal perception of the current state of their data?
Joannes Vermorel: I’m not sure there is anything to do in the short term. For me, it is really about realizing what you should treat as an actual primary source of information versus a numerical artifact. They are not the same.
Once you do this segregation and you look coldly at your real, event-driven data—the data that represents the true events that have a financial implication for the company—you will most likely notice that this data is excellent. Yes, it’s going to be messy, but it’s excellent. So this is not the problem.
That would be my message: bad data in companies that have been digitalized for a decade or more is never the problem. Lokad has done 100-plus implementations over the last decade and a half.
Sometimes we had problems with systems that were too slow to extract the data from. That was sometimes a problem. By the way, that’s why you want to add a data lake, because sometimes we were running, like, “SELECT * FROM table X,” and the system would actually throw an out-of-memory exception when we did that.
So, okay, we had problems where sometimes extracting the data was extremely complicated because the system would collapse when we’re trying to pull the data out of the system. That is a real concern. I really hope that you’re not in such a situation, but that might happen. That’s the reason why you want to have the data lake.
But other than those very technical problems that are related to ancient infrastructure, we never had really bad data. What we had plenty of is extremely opaque, obscure data, but fundamentally it was something to be solved on the Lokad side.
So yes, it was a big problem for us, but fundamentally it was not a problem of the client. The client was just using their system the way they should. The result was: for someone who wants to implement an automated decision-making system on top of that, it is a challenge. But again, the challenge is in the correct interpretation of the data, not in blaming the client for collecting this data in the first place.
Conor Doherty: Again, we’ve been going for an hour, so I think I’m justified in making a slight marketing comment here, but it’s a core point to put across.
One of the things that I have personally learned from a lot of the conversations I’ve had recently is that when we say, “Hey, look, your data is good enough as it is,” people don’t always realize that if you were to work with Lokad (or maybe not Lokad, any competent vendor), that vendor will take the burden on.
So: you give them the data, you give us the data. It’s not that you have to process it yourself. And they should… again, if the vendor…
Joannes Vermorel: If the vendor doesn’t take this burden on his own shoulders, it’s a recipe for scapegoating.
Conor Doherty: Exactly.
Joannes Vermorel: The vendor will just blame you, and you will end up in a situation where the company wastes half a billion over seven years. In the end, the report concludes that it was all the fault of… and the vendor is just like: “No, nothing to see here. It’s not my fault. Come on.”
By the way, on the Lidl case, that’s very interesting, because they blamed that the data was bad on a specific point. SAP said, “Oh, Lidl, they drive their analysis on the purchase price, and we drive all our analysis on the selling price, and that was the cause.”
For me, I say: guys, this is semantics 101. First, it’s trivial as a problem: is it the selling price or the purchase price? Yes, there is a semantic challenge here: what price are we talking about? But let’s be clear: it is not a very subtle, nuanced distinction. As far as semantic distinctions go, it is a very, very easy one to tackle.
Then the thing that is even more baffling is that obviously, to me, you need both. You need both the purchase price and the selling price so that you can know the margin.
So the idea that the vendor would, after seven years, manage to blame the client by saying, “Oh, you know what, they are organizing all their supply chain around the purchase price, and it’s such a complication for us because we are expecting the selling price,” this is exactly the sort of shenanigans you get if the vendor is not committed to delivering proper, economically performant decisions.

That should be the starting point; otherwise, in the supply chain space, you are going to get a lot of nonsense. At the end of the day, after spending a lot of money, the vendor will manage to put the blame on you by finding bad data.
Again: if we don’t accept the fact that only decisions matter, there are so many pieces of data that are completely inconsequential. Obviously those inconsequential pieces of data are going to be very low quality, and it’s completely fine.
An ERP has 10,000 tables. Every table has dozens of fields. Should all of this data be neat and tidy? Insanity. Why would you ever want that? It’s way too costly.
So if you play this game of bad data with the vendor, the vendor will always be the winner because there will always be many tables and many fields where, objectively, they are crap and inconsequential. That’s the key: inconsequential.
Conor Doherty: Okay. So in conclusion: pick your vendor carefully, if we’re boiling it down.
Joannes Vermorel: Yeah. And really understand: when you say quality of the ERP, what is the intent? What is the intent? There is no such thing as purity of ERP in a moral sense for a company. It serves a purpose.
Quality is fitness for purpose, and that’s it.
Conor Doherty: That’s it. Thank you very much. I’m out of questions. We’ve been going for an hour, so I think we’re out of time.
As always, thank you for your insight, and thank you all for attending and for your private messages. As I said right at the start, and where this topic came from, if you want to continue the conversation, reach out to Joannes and me privately. We’re always happy to chat.
As I said last week, and I’ll say every week, we’re lovely people. Look at us. On that note, we’ll see you next week. I have a special topic for next week, so we’ll announce that on Monday. But on that note, get back to work.