- Containers
- Copacking
- Cross-docking
- Drop shipping
- Decision-driven optimization
- DDMRP
- Economic Drivers
- Initiative (Quantitative SCM)
- Kanban
- Lean SCM
- Manifesto (Quantitative SCM)
- Micro fulfilment
- Product life-cycle
- Resilience
- Sales and Operation Planning (S&OP)
- Supply Chain Management (SCM)
- Supply Chain Scientist
- Test of Performance
- Third Party Logistics (3PL)

- Backorders
- Bill of Materials (BOM)
- Economic order quantity (EOQ)
- Fill Rate
- Inventory accuracy
- Inventory control
- Inventory costs (carrying costs)
- Inventory Turnover (Inventory Turns)
- Lead demand
- Lead time
- Min/Max inventory method
- Minimum Order Quantity (MOQ)
- Phantom inventory
- Prioritized ordering
- Reorder point
- Replenishment
- Service level
- Service level (optimization)
- Stock-Keeping Unit (SKU)
- Stockout

Generalization is the capacity of an algorithm to generate a model - leveraging a dataset - that performs well on previously unseen data. Generalization is of critical importance to supply chain, as most decisions reflect an anticipation of the future. In the forecasting context, the data is unseen because the model predicts future events, which are unobservable. While substantial progress, both theoretical and practical, has been achieved on the generalization front since the 1990s,

A seemingly ineluctable paradox presents itself:

While memorizing the observations can be dismissed as an inadequate modelling strategy, any alternative strategy to create a model is potentially subject to the same problem. No matter how well the model appears to perform on data

By way of anecdotal evidence, in 1979 the SEC (Securities and Exchange Commission), the US agency in charge of regulating capital markets, introduced its famous

Even science itself is struggling with what it means to extrapolate “truth” outside a narrow set of observations. The “bad science” scandals, which unfolded in the 2000s and 2010s around p-hacking, indicate that entire fields of research are broken and cannot be trusted

Under its most far-reaching guise, the generalization problem is indistinguishable from that of science itself, hence it is as difficult as replicating the breadth of human ingenuity and potential. Yet, the narrower statistical flavor of the generalization problem is much more approachable, and this is the perspective that will be adopted in the coming sections.

Generalization remained, by and large, a baffling problem for most of the 20th century. It was not even clear whether it belonged to the realm of natural sciences, governed by observations and experimentations, or to the realm of philosophy and mathematics, governed by logic and self-consistency.

The space trundled along until a landmark moment in 1982, the year of the first public forecasting competition - colloquially known as the M competition

A few decades later, Kaggle, founded in 2010, added a new dimension to such competitions by creating a platform dedicated to general prediction problems (not just time-series). As of February 2023

Visualizing

Figure 1: A composite graph depicting three different attempts at “fitting” a series of observations.

With two parameters, the “under-fitting” model is robust, but it, as the name suggests,

A fairly complex model could be adopted, as per the “over-fitting” curve (Figure 1). This model includes many parameters and exactly fits the observations. This approach has a low bias but a demonstrably high variance. Alternatively, a model of intermediate complexity could be adopted, as seen in the “proper-fitting” curve (Figure 1). This model includes three parameters, has a medium bias and a medium variance. Of these three options, the

These modeling options represent the essence of the bias-variance tradeoff

Historically, from the early 20th century up to the early 2010s, an overfitted model was defined

Numerous variants of cross-validation exist. The most popular variant is the k-fold validation where the original sample is randomly partitioned into k subsamples. Each subsample is used once as validation data, while the rest – all the other subsamples – is used as training data

Figure 2: A sample K-fold validation. The observations above are all from the same dataset. The technique thus constructs data subsamples for validation and training purposes.

The choice of the value

Cross-validation assumes that the dataset can be decomposed into a series of independent observations. However, in supply chain, this is frequently not the case, as the dataset usually reflects some sort of historicized data where a time-dependence is present. In the presence of time, the training subsample must be enforced as strictly “preceding” the validation one. In other words, the “future”, relative to the resampling cutoff, must not leak into the validation subsample.

Figure 3: A sample backtesting process.

Backtesting represents the flavor of cross-validation that directly addresses the time-dependence. Instead of considering random subsamples, training and validation data are respectively obtained through a cutoff: observations prior to the cutoff belong to the training data, while observations posterior to the cutoff belong to the validation data. The process is iterated by picking a series of distinct cutoff values.

The resampling method that lies at the core of both cross-validation and backtesting is a powerful mechanism to steer the modeling effort towards a path of greater generalization. In fact, it is so efficient that there is an entire class of (machine) learning algorithms that embraces this very mechanism at their core. The most notable ones are random forests and gradient boosted trees.

To better grasp what used to be wrong with

Thus, a simple linear regression model with 3 input variables is introduced. The model can be written as Y = a1 * X1 + a2 * X2 + a3 * X3, where

- Y is the output of the linear model (the failure rate that the engineers want to predict)
- X1, X2 and X3 are the three factors (specific types of workloads expressed in hours of operation) that may contribute to the failures
- a1, a2 and a3 are the three model parameters that are to be identified.

The number of observations it takes to obtain “good enough” estimates for the three parameters is largely dependent on the level of noise present in the observation, and what qualifies as “good enough”. However, intuitively, to fit three parameters, two dozen observations would be required at minimum even in the most favorable situations. As the engineers are able to collect 100 observations, they successfully regress 3 parameters, and the resulting model appears “good enough” to be of practical interest. The model fails to capture many aspects of the 100 observations, making it a very rough approximation, but when this model is challenged against other situations through thought experiments, intuition and experience tell the engineers that the model seems to behave reasonably.

Based on their first success, the engineers decide to investigate deeper. This time, they leverage the full range of electronic sensors embedded in the machinery, and through the electronic records produced by those sensors, they manage to increase the set of input factors to 10,000. Initially, the dataset was comprised of 100 observations, with each observation characterized by 3 numbers. Now, the dataset has been expanded; it is still the same 100 observations, but there are 10,000 numbers per observation.

However, as the engineers try to apply the same approach to their vastly enriched dataset, the linear model does not work anymore. As there are 10,000 dimensions, the linear model comes with 10,000 parameters; and the 100 observations are nowhere near enough to support regressing that many parameters. The problem is not that it is impossible to find parameter values that fit, rather the exact opposite: it has become trivial to find endless sets of parameters that perfectly fit the observations. Yet, none of these “fitting” models are of any practical use. These “big” models perfectly fit the 100 observations, however, outside those observations, the models become nonsensical.

The engineers are confronted with the

In the middle of the 1990s, a twofold breakthrough

On the experimental front, models later known as Support Vector Machines (SVMs) were introduced almost as a textbook derivation of what VC theory had identified about learning. These SVMs became the first widely successful models capable of making satisfying use of datasets where the number of dimensions exceeded the number of observations.

By boxing the

Returning to the problem of unscheduled repairs mentioned earlier, unlike the classic statistical models – like linear regression, which falls apart against the dimensional barrier – ensemble methods would succeed in taking advantage of the large dataset and its 10,000 dimensions even though there are only 100 observations. What is more, ensemble methods would excel more or less

The impact on the broader community, both within and without academia, was massive. Most of the research efforts in the early 2000s were dedicated to exploring these nonparametric “theory-supported” approaches. Yet, successes evaporated rather quickly as the years passed. In fact, some twenty years on, the best models from what came to be known as the

What came to be known later as the

The genesis of deep learning is complex and can be traced back to the earliest attempts to model the processes of the brain, namely neural networks. Unpacking this genesis is beyond the scope of the present discussion, however, it is worth noting that the deep learning revolution of the early 2010s began just as the field abandoned the neural network metaphor in favor of mechanical sympathy. The deep learning implementations replaced the previous models with much simpler variants. These newer models took advantage of alternative computing hardware, notably GPUs (graphics processing units), which turned out to be, somewhat accidentally, well-suited for the linear algebra operations that characterize deep learning models

It took almost five more years for deep learning to be widely recognized as a breakthrough. A sizeable portion of the reticence came from the

The contradiction remained largely unresolved until 2019, when

Figure 4. A deep double descent.

Figure 4 illustrates the two successive regimes described above. The first regime is the classic bias-variance tradeoff that seemingly comes with an “optimal” number of parameters. Yet, this minima turns out to be a local minima. There is a second regime, observable if one keeps increasing the number of parameters, that exhibits an asymptotic convergence towards an actual optimal test error for the model.

The deep double descent not only reconciled the statistical and deep learning perspectives, but also demonstrated that generalization remains relatively little understood. It proved that the widely-held theories – commonplace until the late 2010s - presented a distorted perspective on generalization. However, the deep double descent does not yet provide a framework – or something equivalent – that would predict the generalization powers (or lack thereof) of models based on their structure. To date, the approach remains resolutely empirical.

Furthermore, the

In machine learning jargon, modeling demand is an

Simply put, the prediction created by the manufacturer will shape, one way or another, the future the manufacturer experiences. A high projected demand means that the manufacturer will ramp up production. If the business is well-run, economies of scale are likely to be achieved in the manufacturing process, hence lowering production costs. In turn, the manufacturer is likely to take advantage of these newfound economics in order to lower prices, hence gaining a competitive edge over rivals. The market, seeking the lowest priced option, may swiftly adopt this manufacturer as its most competitive option, hence triggering a surge of demand well beyond the initial projection.

This phenomenon is known as a

At this point, the generalization challenge, as it presents itself in supply chain, might appear insurmountable. Spreadsheets, which remain ubiquitous in supply chains, certainly hint that this is the default, albeit implicit, position of most companies. A spreadsheet is, however, first and foremost a tool for deferring the resolution of the problem to some ad-hoc human judgement, rather than the application of any systematic method.

Though deferring to human judgement is invariably the incorrect response (in and of itself), it is not a satisfying answer to the problem, either. The presence of stock-outs does not mean that

Thus, when considering a real-world supply chain, generalization requires a two-legged approach. First, the model must be statistically sound, to the extent permitted by the broad “learning” sciences. This encompasses not only theoretical perspectives like classical statistics and statistical learning, but also empirical endeavors like machine learning and prediction competitions. Reverting to 19th century statistics is not a reasonable proposition for a 21st century supply chain practice.

Second, the model must be supported by high-level reasoning. In other words, for every component of the model and every step of the modeling process, there should be a justification that makes sense

From afar, this proposition might seem vulnerable to the earlier critique addressed to spreadsheets – the one against deferring hard work to some elusive “human judgement”. Though this proposition still defers

1. Why Most Published Research Findings Are False, John P. A. Ioannidis, August 2005

2. Makridakis, S.; Andersen, A.; Carbone, R.; Fildes, R.; Hibon, M.; Lewandowski, R.; Newton, J.; Parzen, E.; Winkler, R. (April 1982). "The accuracy of extrapolation (time series) methods: Results of a forecasting competition". Journal of Forecasting. 1 (2): 111–153. doi:10.1002/for.3980010202.

3. Kaggle in Numbers, Carl McBride Ellis, retrieved February 8th 2023,

4. Grenander, Ulf. On empirical spectral analysis of stochastic processes. Ark. Mat., 1(6):503– 531, 08 1952.

5. Whittle, P. Tests of Fit in Time Series, Vol. 39, No. 3/4 (Dec., 1952), pp. 309-318] (10 pages), Oxford University Press

6. Everitt B.S., Skrondal A. (2010), Cambridge Dictionary of Statistics, Cambridge University Press.

7. Support-vector networks, Corinna Cortes, Vladimir Vapnik, Machine Learning volume 20, pages 273–297 (1995)

8. A Theory of the Learnable, L. G. Valiant, Communications of the ACM, Volume 27Issue 11Nov. 1984 pp 1134–1142

9. Random Forests, Leo Breiman, Machine Learning volume 45, pages 5–32 (2001)

10. Deep Double Descent: Where Bigger Models and More Data Hurt, Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever, December 2019

a. There is an important algorithmic technique called “memoization" that precisely replaces a result thatby could be recomputed by its pre-computed result, hence trading more memory for less compute. However, this technique is not relevant to the present discussion.

b. From the time-series forecasting perspective, the notion of generalization is approached via the concept of “accuracy”. Accuracy can be seen as a special case of “generalization” when considering time-series.

c. The 1935 excerpt "Perhaps we are old fashioned but to us a six-variate analysis based on thirteen observations seems rather like overfitting", from “The Quarterly Review of Biology” (Sep, 1935 Volume 10, Number 3pp. 341 – 377), seems to indicate that the statistical concept of overfitting was already established by that time.

d. The asymptotical benefits of using greater k values for the k-fold can be inferred from the central limit theorem. This insight hints that, by increasing k, we can get approximately 1 / sqrt(k) close from exhausting the entire improvement potential brought by the k fold in the first place.

e. The Vapnik-Chernovenkis (VC) theory was not the only candidate to formalize what “learning” means. Valiant’s 1984 PAC (probably approximately correct) framework paved the way for formal learning approaches. However, the PAC framework lacked the immense traction, and the operational successes, that the VC theory enjoyed around the millennium.

f. One of the unfortunate consequences of Support Vector Machines (SVMs) being heavily inspired by a mathematical theory is that those models have little “mechanical sympathy” for modern computing hardware. The relative inadequacy of SVMs to process large datasets – including millions of observations or more – compared to alternatives spelled the downfall of those methods.

g. XGBoost and LightGBM are two open-source implementations of the ensemble methods that remain widely popular within machine learning circles.

h. For the sake of concision, there is a bit of oversimplification going on here. There is an entire field of research dedicated to the “regularization” of statistical models. In the presence of regularization constraints, the number of parameters, even considering a classic model like a linear regression, may safely exceed the number of observations. In the presence of regularization, no parameter value quite represents a full degree of freedom anymore, rather a fraction of one. Thus, it would be more proper to refer to the number of degrees of freedom, instead of referring to the number of parameters. As these tangential considerations do not fundamentally alter the views presented here, the simplified version will suffice.

i. In fact, the causality is the other way around. Deep learning pioneers managed to re-engineer their original models - neural networks - into simpler models that relied almost exclusively on linear algebra. The point of this re-engineering was precisely to make it possible to run these newer models on computing hardware that traded versatility for raw power, namely GPUs.

j. The vast majority of the data science initiatives in supply chain fail. My casual observations indicate that the data scientist’s ignorance of what makes the supply chain *tick* is the root cause of most of these failures. Although it is incredibly tempting – for a newly trained data scientist - to leverage the latest and shiniest open-source machine learning package, not all modeling techniques are equally suited to support high-level reasoning. In fact, most of the “mainstream” techniques are downright terrible when it comes to the whiteboxing process.