Columnar Random Forests

January 11, 2019

technology

Joannes Vermorel

Many supply chain challenges can be framed as either classification or regression problems. For example, forecasting demand can be seen as a regression, while deciding whether it is acceptable to align a price with a competitor's price point can be seen as a classification.

A random forest is a machine learning technique that can be used to learn patterns from data, typically with the intent of performing either a classification or a regression.

While random forests are no longer state-of-the-art machine learning (deep learning outperforms them in many if not most situations), they retain distinctive practical advantages, which have been nicely summarized by Ahmed El Deeb in “The Unreasonable Effectiveness of Random Forests”.

Indeed, when Ahmed El Deeb points out that “It’s really hard to build a bad Random Forest!”, I concur, and this represents a significant practical advantage. In contrast, deep learning models are, well, finicky to say the least: a trove of obscure parameters can improve, or degrade, performance in ways that are not always clear to the data scientist.

Thus, random forests are now built into Envision. Bonus: the predictions of random forests are returned as random variables, which makes a nice combo with probabilistic approaches to supply chain optimization.
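To illustrate what treating a forest's output as a random variable can look like, here is a minimal Python sketch. It assumes one common construction, in which each tree's prediction counts as one equally weighted sample; this post does not describe how Envision actually builds its random variables. The function names `empirical_distribution` and `probability_above`, as well as the sample numbers, are hypothetical.

```python
from collections import Counter

def empirical_distribution(tree_predictions):
    """Turn per-tree predictions into a discrete distribution {value: probability}.

    Assumption: every tree carries the same weight.
    """
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    return {value: counts[value] / total for value in sorted(counts)}

def probability_above(distribution, level):
    """Probability that the random variable strictly exceeds `level`."""
    return sum(p for value, p in distribution.items() if value > level)

# Hypothetical demand predictions (in units) from 8 trees for one product.
tree_predictions = [0, 0, 1, 1, 1, 1, 2, 3]

demand = empirical_distribution(tree_predictions)
mean_demand = sum(value * p for value, p in demand.items())

# Chance that demand exceeds the single unit held in stock.
stockout_risk = probability_above(demand, level=1)

print(demand)
print(mean_demand, stockout_risk)
```

Keeping the whole distribution rather than a single averaged number is what lets a downstream decision, such as a reorder quantity, weigh the cost of a stockout against the cost of overstock.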

Under the hood, we have rolled out our own highly optimized random forest implementation. We stole many insights from XGBoost. The main insight is that we are leveraging a columnar data processing strategy, unlike earlier approaches, which were tabular. Within the Envision context, this approach yields further performance benefits, as the data itself is already organized in a columnar format within Envision. Also, in a supply chain context, input features are frequently either sparse or of low cardinality, e.g. slow movers. The columnar approach lets us significantly compress the data, which yields further speed-ups for those random forests.
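The columnar idea can be sketched in a few lines of Python. This is not Lokad's implementation, whose details are not given here; it is a simplified illustration under stated assumptions. Each feature is stored as its own list, the split search reads one column at a time and aggregates per distinct value (cheap when cardinality is low), and run-length encoding stands in for the kind of compression that sparse columns allow. The names `best_split` and `rle_encode`, the column names, and the data are all hypothetical.

```python
from collections import defaultdict
from itertools import groupby

def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def best_split(feature_column, target_column):
    """Find the threshold t (rows with feature <= t go left) that minimises
    the squared error of a regression split, reading one feature column only."""
    # Aggregate row count and target sum per distinct feature value.
    stats = defaultdict(lambda: [0, 0.0])
    for x, y in zip(feature_column, target_column):
        stats[x][0] += 1
        stats[x][1] += y
    total_n, total_sum = len(target_column), sum(target_column)
    best_threshold, best_score = None, float("-inf")
    left_n, left_sum = 0, 0.0
    for value in sorted(stats)[:-1]:  # the largest value cannot be a threshold
        left_n += stats[value][0]
        left_sum += stats[value][1]
        right_n, right_sum = total_n - left_n, total_sum - left_sum
        # Maximising this score is equivalent to minimising the squared error.
        score = left_sum ** 2 / left_n + right_sum ** 2 / right_n
        if score > best_score:
            best_threshold, best_score = value, score
    return best_threshold, best_score

# Hypothetical columnar table: one list per feature, not one record per row.
columns = {
    "on_promotion": [0, 0, 0, 0, 1, 1, 0, 0],
    "weekday":      [0, 1, 2, 3, 4, 5, 6, 0],
}
units_sold = [0, 0, 1, 0, 4, 5, 0, 1]

# Each column is scanned independently of the others.
splits = {name: best_split(col, units_sold) for name, col in columns.items()}
best_feature = max(splits, key=lambda name: splits[name][1])

# A sparse column collapses into a handful of runs.
promotion_runs = rle_encode(columns["on_promotion"])

print(best_feature, splits[best_feature][0])
print(promotion_runs)
```

With a low-cardinality column, the inner loop visits a few distinct values rather than every row, which is one reason a columnar layout pairs well with sparse supply chain features.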

Faster random forests may seem a smallish feature; however, performance is a feature. The scarcest resource is usually the supply chain scientist. Spending less time waiting for numerical results to be produced means that more time can be spent thinking about and solving the actual supply chain challenge.
