The cross-entropy is a metric that can be used to reflect the accuracy of probabilistic forecasts. It is closely tied to maximum likelihood estimation. Cross-entropy is of primary importance to modern forecasting systems, because it is instrumental in making possible the delivery of superior forecasts, even when those forecasts are judged against alternative metrics. From a supply chain perspective, cross-entropy is particularly important as it supports the estimation of models that are also good at capturing the probabilities of rare events, which frequently happen to be the costliest ones. This metric departs substantially from the intuition that supports simpler accuracy metrics, like the mean square error or the mean absolute percentage error.

The cross-entropy departs from the frequentist perspective by adopting the Bayesian probability perspective. The Bayesian perspective reverses the problem. When trying to make quantitative sense of an uncertain phenomenon, the Bayesian perspective starts with a model that directly gives a probability estimate for the phenomenon. Then, through repeated observations, we assess how the model fares when confronted with the real occurrences of the phenomenon. As the number of occurrences increases, the measurement of the (in)adequacy of the model improves.

The frequentist and the Bayesian perspectives are both valid and useful. From a supply chain perspective, as collecting observations is costly and somewhat inflexible – companies have little control over generating orders for a product – the Bayesian perspective is frequently more tractable.

By adopting the Bayesian perspective, we can evaluate the probability that the model would have generated all the observations. If we further assume all observations to be independent (more precisely IID, Independent and Identically Distributed), then the probability that this model would have generated the collection of observations that we have is the product of all the probabilities estimated by the model for every past observation.

The mathematical product of thousands of values that are typically less than 0.5 – assuming that we are dealing with a phenomenon which is quite uncertain – can be expected to be an incredibly small number. For example, even when considering an excellent model to forecast demand, what would be the probability that this model could generate all the sales data that a company has observed over the course of a year? While estimating this number is non-trivial, it is clear that this number would be astoundingly small.

Thus, in order to mitigate this numerical problem known as an arithmetic underflow, logarithms are introduced. Intuitively, logarithms can be used to transform products into sums, which conveniently addresses the arithmetic underflow problem.
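The underflow problem and its logarithmic fix can be illustrated with a minimal Python sketch. The probability value 0.3 and the count of 5000 observations are hypothetical figures chosen for illustration, as are the function names; they do not come from any real dataset.

```python
import math

def likelihood_product(probs):
    """Naive likelihood: multiply all per-observation probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_likelihood(probs):
    """Same quantity on a log scale: a sum of logarithms instead of a product."""
    return sum(math.log(p) for p in probs)

# Hypothetical case: 5000 observations, each given probability 0.3 by the model.
probs = [0.3] * 5000

naive = likelihood_product(probs)   # too small for a 64-bit float: collapses to 0.0
stable = log_likelihood(probs)      # a large negative but perfectly usable number

print(naive)
print(stable)
```

The product collapses to zero, which makes any two models indistinguishable, while the sum of logarithms remains a finite number that can still be compared across models.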

In information theory, cross-entropy can be interpreted as the expected length in bits for encoding messages, when $Q$ is used instead of $P$. This perspective goes beyond the present discussion and isn’t of primary importance from a supply chain perspective.

In practice, as $P$ isn’t known, the cross-entropy is empirically estimated from the observations, by simply assuming that all the collected observations are equally probable, that is, $p(x)=1/N$ where $N$ is the number of observations. $$H(q) = - \frac{1}{N} \sum_x \log q(x). \!$$ Interestingly enough, this formula is identical, up to the sign, to the average log-likelihood. Minimizing the cross-entropy or maximizing the log-likelihood is essentially the same thing, both conceptually and numerically.
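As a rough sketch of the empirical formula, the Python snippet below computes the average negative log-probability that a model assigns to a list of observations. The Poisson models, their rates and the observed demand values are all hypothetical, picked only to have something to compare; the function names are illustrative as well.

```python
import math

def empirical_cross_entropy(q, observations):
    """H(q) = -(1/N) * sum of log q(x) over the N observations."""
    total = 0.0
    for x in observations:
        p = q(x)
        if p <= 0.0:
            return math.inf  # the model declared an observed value impossible
        total += math.log(p)
    return -total / len(observations)

def poisson_pmf(lam):
    """Return the probability mass function of a Poisson law with rate lam."""
    return lambda k: math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical daily demand observations (units sold per day).
observations = [0, 1, 1, 2, 0, 1, 3, 1]

ce_a = empirical_cross_entropy(poisson_pmf(1.0), observations)  # rate near the data
ce_b = empirical_cross_entropy(poisson_pmf(3.0), observations)  # rate too high

print(ce_a, ce_b)
```

Under these assumptions, the model whose rate is closer to the observed average demand gets the lower cross-entropy, which is the same ranking that the log-likelihood would give.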

From a supply chain perspective, the take-away is that even if the goal of the company is to optimize a forecasting metric like MAPE (mean absolute percentage error) or MSE (mean square error), then, in practice, the most efficient route is to optimize the cross-entropy. At Lokad, in 2017, we collected a significant amount of empirical evidence supporting this claim. Perhaps more surprisingly, cross-entropy also outperforms CRPS (continuous ranked probability score), another probabilistic accuracy metric, even if the resulting models are ultimately judged against CRPS.

It is not entirely clear what makes cross-entropy such a good metric for numerical optimization. One of the most compelling arguments, detailed in Ian Goodfellow et al., is that cross-entropy provides very large gradient values, which are especially valuable for gradient descent, which happens to be the most successful large-scale optimization method available at the moment.

From the CRPS perspective, the model is relatively good, as the observed demand is about 10 units away from the mean forecast. In contrast, from the cross-entropy perspective, the model has an infinite error: the model predicted that observing 1011 units of demand had a zero probability – a very strong proposition – which turned out to be factually incorrect, as demonstrated by the fact that 1011 units have just been observed.
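The contrast between the two metrics can be reproduced with a small Python sketch. It assumes, purely for illustration, a model that spreads its probability mass uniformly over the integers 990 to 1010 (hence a mean of 1000 and a zero probability for 1011); this shape, the function names and the discrete form of the CRPS used here are assumptions of the sketch.

```python
import math

LOW, HIGH = 990, 1010        # hypothetical support of the forecast
N_VALUES = HIGH - LOW + 1    # 21 equally likely demand values

def pmf(x):
    """Probability the hypothetical model assigns to a demand of x units."""
    return 1.0 / N_VALUES if LOW <= x <= HIGH else 0.0

def cdf(x):
    """Cumulative probability of observing a demand of at most x units."""
    if x < LOW:
        return 0.0
    if x >= HIGH:
        return 1.0
    return (x - LOW + 1) / N_VALUES

def crps(observed, lo, hi):
    """Discrete CRPS: sum over integers x of (F(x) - 1[x >= observed])^2."""
    return sum(
        (cdf(x) - (1.0 if x >= observed else 0.0)) ** 2
        for x in range(lo, hi + 1)
    )

def negative_log_probability(observed):
    """Cross-entropy contribution of one observation; infinite if p = 0."""
    p = pmf(observed)
    return math.inf if p == 0.0 else -math.log(p)

observed = 1011
crps_value = crps(observed, 900, 1100)          # finite, roughly 7.5 units
ce_value = negative_log_probability(observed)   # infinite

print(crps_value, ce_value)
```

With these assumed numbers, the CRPS stays at a few units of demand, while the cross-entropy is infinite because the model gave a zero probability to a value that was actually observed.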

The propensity of CRPS to favor models that can make absurd claims like