TSV format of Shelfcheck

TSV format of Shelfcheck


Home » Shelfcheck » Here

Intended audience: database administrators, developers

Shelfcheck is a On-Shelf-Analytics (OSA) webapp for retail networks that delivers Out-Of-Shelf (OOS) alerts.It can import data from a remote file repository of TSV files through FTP or through cloud storage.

See also our BigFiles, our file hosting service that supports FTP. Push your files directly to Lokad.
This document details how to setup the data in the expected format that will allow Shelfcheck to auto-import the data directly.

Big picture

The data to be imported by Shelfcheck falls into 3 categories:

  • Stores which provide high level informations about each store (for example the store name).
  • Items which provide generic information about every single item sold accross the network.
  • Receipts which provide the sales history, with flexibility offered concerning the data granularity.

Shelfcheck relies on statistical analysis; it therefore works better with more data. In particular, we suggest including as much sales history as possible. For example, if you have 3 years of daily sales data available please include the 3 years. Pushing a longer sales history to Shelfcheck is the easiest way to improve the precision of the OOS alerts.


Convention over configuration

Instead of following a rigid data format that might not be appropriate for your business, Shelfcheck relies on a flexible approach based on file naming conventions and column naming conventions. This approach is similar to the Salescast SQL Intermediate Schema, but tailored for the specific context of retail networks.

Shelfcheck can potentially import a lot of data, down to each receipt with its full details, or fall back to pre-aggregated daily data. Shelfcheck lets you choose which data you want to include in the data repository. Nearly all fields are optional. If a field cannot be found, Shelfcheck will continue to process your data and skip the missing field.

Caution: Shelfcheck is case-sensitive, make sure to respect the specified capitalization for fields and file names.

Flat text files and files hierarchy

Dealing with a lot of data is one of the challenges faced by software apps when working with store-level data for any larger retail network. At Lokad, our experience indicates that while SQL databases are great for warehouse data, at the store level, SQL itself becomes a problem as it becomes increasingly hard to tune a SQL database containing millions of rows to deliver the desired performance.

Instead, Shelfcheck relies on a repository of flat tab-separated files. Futhermore, Shelfcheck expects the data to be partitioned according to a set of precise guidelines (detailed below). The files are either hosted on a classic filesystem (reachable through FTP), or hosted in a cloud storage.

FTP repository or Windows Azure Blob Storage

At this time, Shelfcheck supports two repository types (where the data files can be stored):

  • FTP repository (hosted by Lokad)
  • Blob Storage on Windows Azure

An FTP repository can be made available on client request by Lokad. The client is expected to push the data toward her FTP account. The main drawback of FTP is the lack of scalability: FTP is a statefull protocol that is basically attached to a single server. Hence, FTP does not easily scale over multiple machines to distribute the workload.

Once the data to be uploaded toward Shelfcheck exceeds 10GB we suggest switching to cloud storage. Shelfcheck can pull the data from a Blob Container hosted on Windows Azure. A storage key for the Windows Azure Storage can be made available by Lokad upon client request. Unlike FTP, the Blob Storage is stateless and supports a very large number of concurrent uploads.

Tab Separated Value format (TSV)

TSV is a popular format, and Shelfcheck follows the most usual patterns:

  • The file name extension is .tsv.
  • The encoding of strings is UTF-8.
  • The delimiter is the TAB character (Ascii code: 9).
  • Fields can contain any string, excluding TABs.
  • Fields can be left blank (zero characters), but TABs cannot be omitted.
  • Each line contains the same number of fields.
  • The first line contains the names of the fields (headers).

File names and (relative) folder or header names must start with a letter and contain only letters, numbers and hyphen signs (-). These names must be between 3 and 63 characters long.

Headers must start with a letter, and contain only letters and numbers. Headers must be from 3 to 63 characters long.

When a field contains numbers, the number must be formatted as 123456.789 with the dot . used as decimal separator and without separators for thousand blocks.

When a field contains dates, dates should be formatted as ISO 8601, that is 2012-05-16T12:38Z (combined date and time in UTC). Dates with week numbers are not supported.

GZip compression

If the file name extension is .tsv.gz, then Shelfcheck expects the file to be compressed with GZip. This option is offered to facilitate data transfer, as TSV file can typically be compressed from 80% to 90% using GZip.

Files hierarchy

The block below outlines a file hiearchy as it should be exposed to Shelfcheck.

    /items.tsv
    /shelfcheck.config
    /stores.tsv
    /store-123/ ...
              /backorders-2012-05-17.tsv
              /backorders-2012-05-18.tsv
              /backorders-2012-05-19.tsv
              / ...
              /deliveries-2012-05-17.tsv
              /deliveries-2012-05-18.tsv
              /deliveries-2012-05-19.tsv
              / ...
              /items.tsv
              / ...
              /receipts-2012-05-17.tsv
              /receipts-2012-05-18.tsv
              /receipts-2012-05-19.tsv
              / ...
              /status-2012-05-17.tsv
              /status-2012-05-18.tsv
              /status-2012-05-19.tsv
              / ...
              /uploading.tsv
    /store-124/ ...
    /store-125/ ...

The root folder / contains two files /stores.tsv and /items.tsv that contain respectively the list of stores and the list of items. Shelfcheck.config is a special file to configure Shelfcheck itself.

As explained here above, you can compress any TSV files with GZIP. When doing so, you need to replace the file extension .tsv by .tsv.gz. For the sake of concision, we will not repeat this option in the following sections.

Format of stores.tsv

The purpose of stores.tsv is to list the stores contained in the retail network, along with their meta-data.

stores.tsv
StoreIdstring Required. The StoreId represents the identifier used by Shelfcheck to locate the store data in the repository, located at /store-StoreId/. All lines are supposed to contain distinct StoreId identifiers. StoreId must contain only letters and numbers and be between 1 and 57 characters long.
Namestring A human-readable identifier used for reporting purposes.
AdminEmailstring An email address associated to the administrative contact - within the retail network - for managing Shelfcheck for this particular store.
OosAlertEmailstring An email address where the daily OOS (out-of-shelf) alert gets delivered. Leave blank if no email notification is intended for this store.

Format of items.tsv

The file items.tsv contains the description of the items sold in the retail network. This file is found in two locations: at the root / and/or within a store folder /store-123/.

items.tsv
ItemIdstring Required. Uniquely identifies items from a data analytics viewpoint. It means that all the sales performed under the same ItemId will be consolidated as a single unit of need. In particular, OOS alerts are produced at the ItemId level.
GTINsstring Required. A list of comma-separated GTIN identifiers (aka separated by the , symbol). GITNs (Global Trade Item Numbers) include both UPC (North America) and EAN (Europe). The first GTIN of the list is intended as the primary GTIN associated to the item, i.e. the GTIN to be reported within web UIs.
Namestring A human-readable name for the item. When omitted, the ItemId is used instead for reporting.
ReorderCodestring An operating code typically used by store employees to replenish the item. This field is supported to make the reports more practical when iterating over OOS alerts. This value is only intended at the primary reorder code associated to the item. In the receipt's files however, multiple reoder codes might appear associated to the same ItemId.
ShelfCodestring An in-store proximity code. When this field is present, OOS alerts are grouped by ShelfCode, and listed in lexical order. This field is provided to facilate the task of store employees. By grouping alerts according to the location of the items, employees can save time while iterating over OOS alerts through the store.
OosAlertEnabledbool This flag is boolean and should only be associated the strings true or false. When the column is omitted, or when the value is left blank, then true is used by default. This flag indicates whether an OOS alert can be produced for the item. A typical situation where alerts should not be produced is when the stockout is intentional. In this case, there is no need to notify employees about the stockout.
TagFoostring The fields prefixed with Tag (aka TagFoo where Foo is an ad-hoc string) will be used by Shelfcheck as extra information about the underlying time-series. See our guide about tags and events to learn more about which data should be used for tags. In practice, tags will be used to represent product family, sub-family, category ... that is, to reflect the catalog hierarchy.
LabelFoostring The fields prefixed with Label (LabelFoo where Foo is an ad-hoc string) will be used in the Shelfcheck reports. For example, having a column named LabelBrand will enable Shelfcheck to display the brand within the reports. The convention allows adding any arbitrary set of extra fields for reporting purposes (the Label prefix will be removed from the final reports). Those fields are for display only and they do not impact the OSA process itself.
TagLabelFoostring If you wish to combine both Label and Tag behaviors into a single field, then you need to prefix the column with TagLabel.

The file items.tsv located in the store folder complements and overwrites the information available in /items.tsv :

  • if a header Foo is found in /store-123/items.tsv which does not exist in /items.tsv, then the local Foo information is added to items for the store 123 as an extra field. The matching is operated based on the ItemId identifier.
  • if a header Bar is found in /store-123/items.tsv, and the same header already exists in /items.tsv, then the local Bar information overwrites the global Bar information. The matching is operated based on the ItemId identifier.

The GTINs list represents all the barcodes that, in the end, should be merged as a single item. There are multiple reasons to operate such a merge, such as grouping items sold by the unit or in packs, group promoted and non-promoted items, establishing a link between the old and the new item when the supplier is only operating a minor packaging evolution.

Format of receipts-yyyy-MM-dd.tsv

The receipts represent the most heavyweight information to be processed by Shelfcheck. Hence, the intermediate format enforces a daily partitioning of the data files. This structure is particularly convenient to enable the quick delivery of small incremental data files (i.e. no need to resend the whole sales history on a daily basis).

The suffix of the file indicates the date for all the receipts listed within the file. Except for the initial receipt file, the dates listed within the file must strictly match the day as indicated by the suffix.

Caution: the suffix does not represent the date when the data file is pushed. The suffix matches the date of its file content.

receipts-yyyy-MM-dd.tsv
ReceiptIdstring Identifies all the lines that are part of the same market basket (i.e. same receipt). This field can be omitted.
DateTimedatetime Required. Identifies the purchase date associated to the article. This value is typically produced by the barcode scanner itself when items are processed at the checkout. Within each file, lines should be sorted by this column from oldest to newest.
GTINstring Required. Typically read by the barcode scanner.
Quantityfloat Required. As the name suggest, the quantity of items being purchased. Note that Shelfcheck does not require to pre-aggregate lines within a single receipt base on the ItemId.
UnitPricefloat The price per item of the goods being acquired. Dollar cents or Euro cents (or any equivalent fractional currency value) should be represented with a fractional number as well. Fractional numbers are supported, by convention, we suggest not expressing prices in cents here.
Discountfloat An optional discount (expressed in currency unit) that might have been offered to the client. The price that the client is paying is the end must match the equation: Total = Quantity x UnitPrice - Discount. The Discount value is optional. If the field is present but the value is left blank, then it counts as zero.
IsPromotedbool A optional Boolean field indicating whether the purchased item was promoted at the time of the purchase. When present, its value should be true or false. When omitted, it's false by default.
LoyaltyIdstring Represents the client identifier when a loyalty card is used. When the ReceiptId is present, all the lines associated to a particular ReceiptId identifier are expected to be associated with the same LoyaltyId (if this field is present).

Within the file, lines must be ordered against DateTime, starting from the early history and moving toward the recent one.

If the store is closed, then you should include an empty file (properly named). A missing data file for a particular date is interpreted as, well, missing sales data, which is not equivalent to zero sales data. Since this aspect does impact the analytics performed by Shelfcheck, we suggest being careful about this aspect.

The first file (earliest date) may contain the entire sales history observed so far. This feature is supported in order to facilitate the initialization of the Shelfcheck data repository. Indeed, the first data transfer is likely to include as much as 1000x more data (i.e. 3 years) than the routine daily updates that happen afterward.

For a given store, all receipt files must contain the exact same fields (headers). If you need to change the columns of those files, then you need to revise all the files. Yet, you do not need to revise all stores at once, you can iterate from one store to the next. The receipt format is not required to be the same across the entire network.

The data format of the receipt file is not normalized in the relational sense. Indeed, the association between ReceiptId and LoyaltyId is repeated across each line of the receipt. However, our experience indicates that it's easier to tolerate a bit of redundancy rather than to deal with a greater number of files.

Format of backorders-yyyy-MM-dd.tsv

Backorders represent the replenishment orders passed from the store to the suppliers. Like it is done for receipts, the intermediate format also enforces a daily partitioning of the data files. Providing the backorders to Shelfcheck is optional.

The suffix of the file indicates the date for all the backorders listed within the file. Except for the initial backorder file, the dates listed within the file must strictly match the day as indicated by the suffix.

Caution: the suffix does not represent the date when the data file is pushed. The suffix matches the date of its file content.

backorders-yyyy-MM-dd.tsv
BackorderIdstring An optional field to specify a unique identifier of the backorder. This field can be leveraged to match backorders with deliveries made to the store later on.
DateTimedatetime Required. Identifies the date when this line occured. Within each file, lines should be sorted by this column from oldest to newest.
ReorderCodestring An operating code typically used by store employees to replenish the item.
GTINstring Required, unless a BackorderId was specified on the same line.
Quantityfloat Required. The quantity to be backordered.

Format of deliveries-yyyy-MM-dd.tsv

Deliveries represent the flow of goods being delivered to the stores, once backorders have been made. Like it is done for receipts, the intermediate format also enforces a daily partitioning of the data files. Providing the deliveries to Shelfcheck is optional.

The suffix of the file indicates the date for each delivery listed within the file. Except for the initial delivery file, the dates listed within the file must strictly match the day as indicated by the suffix.

Caution: the suffix does not represent the date when the data file is pushed. The suffix matches the date of its file content.

deliveries-yyyy-MM-dd.tsv
BackorderIdstring Optional field that should match its original backorder. This identifier allows to track the lead time, among other uses.
DateTimedatetime Required. Identifies the date when this line occured. Within each file, lines should be sorted by this column from oldest to newest.
ReorderCodestring An operating code typically used by store employees to replenish the item.
GTINstring Required. The GTIN of the delivered product.
Quantityfloat Required. The quantity that was delivered.

Format of status-yyyy-MM-dd.tsv

Status files contain a list of misc information, represented as key-value pairs, about the data processing itself and/or the store context. Providing the status file to Shelfcheck is optional.

This file is primarily intended as a flexible mechanism to provide some ad-hoc extra information for each store on a daily basis.

The suffix of the file indicates the date for each status listed within the file. Unlike for receipts, backorders and deliveries, status do not come with any exceptional behavior with respect of the first file found in history.

status-yyyy-MM-dd.tsv
Keystring Required. Expected to start with a letter, and contains only letter and numbers, and have less than 64 characters.
Valuestring Required.

The following keys are reserved with special meaning (all of them are optional):

  • PreIntermediateDataAvailabilityTime: This key should be attached to a value representing a date-time information. It represents the time when the data was made available to the process actually populating the intermediate TSV format.

Format of uploading.tsv

Because of the nature of the TSV files that could be read line-by-line, it is complicated for Shelfcheck to determine if the daily upload of the data, for a given store, is still in progress or completed already.

Thus, to remove this ambiguity, Shelfcheck supports an upload flag named uploading.tsv. The process goes as follow:

  1. Before starting the upload, create the file /store-123/uploading.tsv.
  2. Upload the data for the store 123.
  3. After completing the upload, remove the file /store-123/uploading.tsv.

Then, once the file uploading.tsv has been removed, Shelfcheck will start its process of the newly added data. The lock is defined per store, thus, it's possible for Shelfcheck to start working on the first stores while data delivery for the others is still in progress.

Format of shelfcheck.config

The file name /shelfcheck.config is reserved. It contains the configuration settings of the Shelfcheck instance.

Annex: Terminology

Id vs Code: you might have noticed that sometime we use FooId, and sometime it's FooCode. Fields with Id as suffix represent identifiers that can be used by Shelfcheck to perform data aggregations. On the other hand, when it's FooCode, it means that the field is typically intended for reporting only. Enforcing unicity on FooCode fields is not required - but typically better as far reporting goes.