Update May 2016:
The complications covered by this page have been entirely eliminated through better design in our probabilistic forecasting engine. We strongly recommend upgrading to this technology.
The notion of data aggregation by day, week or month presents several subtle problems, starting with the respective definitions of day, week and month. This article reviews the approach adopted by Lokad to deal with those aggregations, both at the input level (historical data) and at the output level (forecasts). It also provides practical guidance on how to handle differing data aggregation conventions.
Terminology: item and order
The terms item and order refer to the input data format of Lokad. In this article, those terms should be understood as defined by Lokad. An item represents a target to be forecast; depending on the context, an item can be an article, a product, a SKU or a barcode. An order represents a quantity associated with an item at a given date in the past; depending on the context, an order can be a sale, a shipment or a consumption.
Aggregations as defined by Lokad
Lokad relies on the following conventions:
- Days start at 00:00 in the morning; thus, each daily forecast value covers a period starting at 00:00 and ending at 23:59:59.
- Weeks start on Mondays; thus, each weekly forecast value covers a period starting on a Monday (inclusive) and ending on a Sunday (inclusive).
- Months start on the 1st day of the month; thus, each monthly forecast value covers a period starting with the first day of the month and ending with the last day of the month.
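The conventions above can be sketched with a couple of helper functions; this is a minimal illustration in Python, not Lokad code, and the function names are ours:

```python
from datetime import date, timedelta

def week_start(d: date) -> date:
    """Return the Monday that starts the week containing d."""
    return d - timedelta(days=d.weekday())  # weekday(): Monday == 0

def month_start(d: date) -> date:
    """Return the first day of the month containing d."""
    return d.replace(day=1)

# May 16th, 2013 was a Thursday: its week starts on Monday May 13th,
# and its month starts on May 1st.
print(week_start(date(2013, 5, 16)))   # 2013-05-13
print(month_start(date(2013, 5, 16)))  # 2013-05-01
```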
Those conventions cannot be changed in Lokad; however, we review below how to deal with other conventions that might be required in certain companies.
Tolerance toward input aggregation patterns
Lokad tries to be tolerant when it comes to processing the historical data. In particular, while we recommend applying a daily aggregation, that is, for a given item on a given day summing all orders as a single quantity for the day, other situations are acceptable for Lokad.
Raw disaggregated history
The daily aggregation reduces the size of the dataset to be uploaded to Lokad. This reduction comes without any downside as far as forecast accuracy is concerned. However, if the dataset is relatively small from the start, then the gain is mostly irrelevant. Thus, Lokad can also process a fully disaggregated history with one line per transaction. In this case, daily quantities are computed as the sum of order values for the given item and the given day.
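The per-item, per-day summation described above can be sketched as follows; the tuple layout `(item, date, quantity)` is a hypothetical representation of the order lines:

```python
from collections import defaultdict
from datetime import date

# Hypothetical disaggregated order lines: (item, date, quantity),
# possibly several lines per item and day.
orders = [
    ("A", date(2013, 5, 15), 2),
    ("A", date(2013, 5, 15), 3),
    ("B", date(2013, 5, 15), 1),
    ("A", date(2013, 5, 16), 4),
]

# Daily aggregation: sum all quantities for each (item, day) pair.
daily = defaultdict(int)
for item, day, qty in orders:
    daily[(item, day)] += qty

print(daily[("A", date(2013, 5, 15))])  # 5
```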
Weekly or monthly pre-aggregated history
Sometimes the daily aggregated historical data has not been preserved in the enterprise system where the input data originates: only the weekly or monthly aggregated data remains. Lokad can process weekly or monthly pre-aggregated orders.
If the data is weekly aggregated, Lokad should be restricted to classic weekly forecasts. Daily, monthly or quantile forecasts should not be used, as the statistical results would be meaningless. Similarly, if the data is monthly aggregated, Lokad should be restricted to classic monthly forecasts.
The main drawback of pre-aggregated data is the constraint it creates on the nature of the forecasts that can be produced. Another drawback is a minor loss of accuracy: indeed, Lokad can leverage daily patterns to refine the forecasts, even weekly or monthly ones.
When do the forecasts start?
In order to generate a forecast, Lokad defines a threshold that represents the present, that is, the starting date of the forecasts. In order to compute this threshold, Lokad mostly adopts a data-centric viewpoint: the forecasts start when the historical data ends. For example, if the historical data ends on May 15th (inclusive), then the daily forecasts start on the next day, May 16th.
The data-centric viewpoint is convenient, especially for benchmarks. The data can be truncated somewhere in the past, and Lokad generates forecasts that can immediately be compared with the (untruncated) historical data, as the forecasts themselves are also part of the past.
In particular, both classic daily forecasts and quantile forecasts behave similarly: the forecast always starts from the day after the most recent order found in the input dataset.
In the case of classic weekly or classic monthly forecasts, the situation is more subtle. For example, let’s assume that the historical data (i.e. the orders in the Lokad terminology) ends on May 15th: should the forecasts start on May 1st or on June 1st? The answer given by Lokad is that the forecasts start on May 1st. Indeed, as the data for the month of May is still only partial, Lokad truncates the data to the last day of April, so that a regular (*) classic monthly forecast can be made for May.
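The truncation rule for monthly forecasts can be sketched as follows. This is our own illustrative reconstruction of the behavior described above, under the assumption that a month counts as fully observed only when the history extends to its last day:

```python
from datetime import date, timedelta

def monthly_forecast_start(last_order_date: date) -> date:
    """First forecast month under the truncation rule described above:
    the month containing the last order is considered partial unless
    the last order falls on the month's final day."""
    first_of_month = last_order_date.replace(day=1)
    # Adding 32 days to the 1st always lands in the next month.
    next_month = (first_of_month + timedelta(days=32)).replace(day=1)
    last_day = next_month - timedelta(days=1)
    if last_order_date == last_day:
        return next_month      # month fully observed
    return first_of_month      # partial month truncated, re-forecast

print(monthly_forecast_start(date(2013, 5, 15)))  # 2013-05-01
print(monthly_forecast_start(date(2013, 5, 31)))  # 2013-06-01
```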
(*) Technically, it would be possible to design forecasting models able to leverage the data available for the partial period; however, this would represent a complex forecasting feature, and, as such, it is not supported by Lokad at this time.
Weekly or monthly aggregated data
In the case of monthly (or weekly) aggregated data, the truncating behavior of Lokad can lead to unintended results. Let’s assume, for example, that we have a monthly aggregated order history, with a single order on each 1st day of the month. Let’s assume that May 1st, 2013 represents the last date of the historical data. In this case, Lokad interprets the month of May as only partially observed; hence May gets truncated, and as a result, the forecasts start on May 1st. As the data point for May 1st represents the whole month of May, this is not the intended behavior.
In order to avoid the truncation, it is advised to insert a single order line - with an order quantity of zero - at the date that represents the last day of the current month. In this example, the introduction of an order line dated May 31st indicates to Lokad that the month of May is complete, and thus that the forecast should start on June 1st instead. A similar technique, i.e. inserting an order line with a zero quantity, can be used to address the same concern for weekly forecasts.
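The zero-quantity workaround can be sketched as follows, again with a hypothetical `(item, date, quantity)` layout for the order lines:

```python
import calendar
from datetime import date

# Monthly aggregated history: one line per (item, first-of-month, qty).
history = [("A", date(2013, 4, 1), 120), ("A", date(2013, 5, 1), 130)]

# Mark the last month as complete by appending a zero-quantity line
# dated on its last day, so that it is not truncated as partial.
last = max(d for _, d, _ in history)
last_day = calendar.monthrange(last.year, last.month)[1]  # 31 for May
history.append(("A", date(last.year, last.month, last_day), 0))

print(history[-1])  # ('A', datetime.date(2013, 5, 31), 0)
```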
Adding such a “dummy” order line would negatively impact any quantile forecast performed in parallel with the classic monthly forecast in Lokad. In practice, when the data is monthly aggregated, no quantile forecasts should be applied to the dataset.
Sanitization of future data, an exception to the data-centric viewpoint
There is one exception to the data-centric viewpoint: as a sanity check, Lokad filters out orders dated more than 1 week into the future (the future being defined based on the server clock of the machine running Lokad).
Indeed, we routinely observe that some of our clients end up extracting artifacts from their systems dated in the more or less distant future. Those lines are typically not real sales or shipments, but the results of tests performed at some point on the company’s system.
Obviously, from a business perspective, it makes little sense to rely on historical data that should not even exist yet. Thus, Lokad truncates this data as a data sanitization procedure, and resumes the forecasting process afterward.
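The one-week cutoff described above can be sketched as a simple filter; `sanitize` is our own illustrative name, and the tuple layout is hypothetical:

```python
from datetime import date, timedelta

def sanitize(orders, today: date):
    """Drop order lines dated more than one week past 'today'
    (in Lokad's case, 'today' comes from the server clock)."""
    cutoff = today + timedelta(weeks=1)
    return [o for o in orders if o[1] <= cutoff]

orders = [
    ("A", date(2013, 5, 15), 3),   # past: kept
    ("A", date(2013, 5, 20), 2),   # within one week of 'today': kept
    ("A", date(2014, 1, 1), 99),   # far-future test artifact: dropped
]
clean = sanitize(orders, today=date(2013, 5, 16))
print(len(clean))  # 2
```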
Alternative monthly or weekly aggregation
From the Lokad viewpoint, the monthly period always starts on the 1st day of the month. However, some companies require a different convention, for example stating that each period starts on the 25th day of the month.
In order to handle such a situation, we recommend pre-aggregating the order history toward the target periods. In the previous case, the date range from May 25th to June 24th would be aggregated into a single order line. This line represents the custom monthly value for the item, and it should be positioned on the 1st day of the month prior to the original range (that is, on the 1st of May in this case). Then, Lokad generates forecasts starting on the 1st day of the month, which can be translated back into the original convention, for example the 25th day of the month.
It is important to use a date that precedes the date range because, otherwise, part of the resulting data can end up positioned in the future and then be truncated by Lokad based on the policy that future data should be truncated (see the sanitization process described above).
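The repositioning described above can be sketched as follows; `custom_period_anchor` is a hypothetical helper name of ours:

```python
from datetime import date

def custom_period_anchor(period_start: date) -> date:
    """Map a custom period starting mid-month (e.g. May 25th - Jun 24th)
    to the 1st day of the month in which the period starts, so the
    whole aggregate line stays in the past."""
    return period_start.replace(day=1)

# The May 25th - Jun 24th aggregate is positioned on May 1st; Lokad's
# forecasts for the 1st of each month are then translated back by
# shifting to the 25th under the company's own convention.
print(custom_period_anchor(date(2013, 5, 25)))  # 2013-05-01
```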