Reading tabular files with Envision

Reading tabular files












Home » Resources » Here

Envision loads and processes tabular files that are stored in your Lokad account. Furthermore, Lokad provides native support for a series of third party apps. When your app is supported by Lokad, importing data and creating the necessary files is a process that is completely managed by us internally. However, when this is not the case, or when Lokad cannot retrieve all the relevant data, it is possible to import files into Lokad manually. This page details how these tabular files should be formatted to be compatible with Envision. The Envision syntax relative to how these tabular files are read by the system is also documented below.


In brief

The file formats supported by Lokad are:

  • CSV, TSV and other similar usual flat text files, .csv or .tsv or .txt files.
  • Microsoft Excel 2007+ files, .xlsx files.
  • GZip archives, .gz files; content is expected to be a flat file.
  • WinZip archives, .zip files; content is expected to be a flat file; archive should contain exactly 1 file.
  • 7z archives, .7z files; content is expected to be a flat file; archive should contain exactly 1 file.

The syntax to read files in Envision is:
read "/foo/myorders.csv.gz" as Orders with
  "My Id" as Id
  Quantity : number

// wildcard '*' here, multiple files will be read
read "/foo/promos*.xlsx" as Promos[*]

// max within the wildcard '*', only one file is read
read max "/foo/stocks*.xlsx" as Stocks[*]

// 'unsafe' tolerate parsing issues
read "/foo/backorders.tsv.zip" unsafe as Backorders[Date, *]

A file storage with connectors

Every Lokad account comes with its own file storage service. Simply put, it is possible to create folders, and to upload files into these folders. In this respect, Lokad offers a service similar to other file-hosting services like Box or Dropbox, except that most file-sharing features are absent. In fact, the intent of Lokad’s file storage system is not to be another file-sharing app, but to provide a fully transparent way of having access to all the data used by Lokad whenever your commerce is being analyzed. Thus, whenever a forecast or a dashboard is generated by Lokad, the corresponding data exists within your account in the form of files that can be downloaded, and possibly analyzed independently of Lokad.

For smaller businesses, producing files that gather all the relevant historical data of the company can be a very tedious exercise because such businesses do not always have significant IT resources in the first place. Thus, for many popular commerce apps - Brightpearl, Linnworks, QuickBooks, TradeGecko, Unleashed, Vend … to name a few - Lokad provides built-in connectors. These connectors can, well, connect to the app, typically using an online API (Application Programming Interface), and generate, within a Lokad account files that are directly formatted in a way that make them most suitable to be processed by Envision.

If your business app is not supported (if that is the case , as we tend to support the apps that are most frequently requested), or if Lokad’s built-in support does not cover all the relevant data, in this case it is possible to upload files directly into Lokad as we support manual uploads through the web. However, as data will have to be refreshed on a regular basis, it is a lot more practical if the data transfer can take place in a fully automated manner. Therefore, Lokad also supports protocols such as FTP and SFTP that offer the possibility to automate all file transfers.

Extensive support for tabular files

Lokad supports quite a diverse range of file formats that can contain tabular data. We support popular formats such as Excel sheets or CSV (comma-separated value) files. If you have experience with flat file import/export, you might already know that there are a myriad of small technical details that can make the process quite tedious:

  • Encoding of the file might differ from the encoding expected by the app.
  • Dates or numbers might not be formatted as expected.
  • Column delimiter or the line return encoding might also differ.

We have designed our system to be capable of auto-detecting all the technicalities such as encoding, date and number formats, delimiters … While everything happens automatically, it is a fairly non-trivial process that takes place within Lokad when files are read by our system. For example, we are now supporting the auto-detection of more than 100 distinct date formats.

In practice, the guidelines for the tabular file format are simple: the column names should be listed as the first line of the file (this requirement applies to Excel sheets too), and one has to make sure that the token values do not collide with delimiters when using flat text files such as CSV or TSV.

Also, when flat files grow in size, it might become more practical to compress them before uploading them to Lokad. Lokad supports flat text files compressed with GZip as long as the .gz extension is added at the end of the file name. For example, Lokad_Items.tsv.gz is recognized as a TSV file that is compressed with GZip. The same pattern works with the .zip extension for WinZip archives, and with .7z extension for 7z archives. In case of WinZip and 7z, the archives should only contain a single compressed file; Envision does not support multi-file archives. Excel sheets are already compressed, so there is no need to compress them even if they happen to be very large.

File and column naming guidelines

Lokad can accommodate pretty much any file name and any column name, but if certain guidelines are followed, the resulting Envision script can be made a lot simpler. The sample dataset is a good example of a set of files that follow these guidelines. If you have not yet had a chance to look at this dataset until now, we suggest to start here.

The files are expected to be named following this pattern: in Lokad_TableName.xyz, the TableName is replaced by the name of the table, and xyz is replaced by the actual file extension, for example xlsx for Excel sheets. The table name should not contain any spaces, and it should not begin with a digit either. By following these file naming conventions, Envision is capable of auto-detecting the relevant tables to be uploaded into Lokad without any further effort.

In particular, the items table is a special case, and for many Envision scripts, there are benefits in structuring your data around such an items table. However, the items table is not required, and Envision can run without loading such a table. If a file named Lokad_Items.xyz is provided to Envision, it will be treated as the items table by default.

If a table happens to be split into multiple files, for example because the individual files are too large, then, Envision can consolidate all these files into one table. In order to achieve this, all these files should be named Lokad_TableName_Suffix.xyz where the Suffix varies from one file to another. For example, if daily extracts are produced for sales extracts, then one of the daily sales extracts is likely to be named Lokad_Orders_2015-03-16.tsv. The Suffix plays no specific role except for keeping the files distinct, so anything goes here. Naturally, Envision expects to find the same columns across all files varying only by their suffix, otherwise Envision cannot consolidate these files into one table without additional instructions.

The column names come with little expectations from an Envision point of view. Envision also provides some default behaviors:
  • When a column named Id is found in a flat file, by default, this column is treated as a foreign key to the Id column originally found in the Lokad_Items.xyz file.
  • When a column is named with a name that ends with Date, by default, this column is treated as a date by Envision.

Most column names are acceptable as long as they do not contain spaces and do not begin with a digit. Envision also provides ways to override the behaviors listed above. However, whenever possible, we suggest sticking to the guidelines as it will reduce the amount of scripting overhead required within Envision to start getting results.

Syntax to read files

Envision supports a rich syntax for reading files that can handle situations when our naming guidelines cannot be followed (e.g. when the files cannot be internally modified). The syntax is outlined below: it is not yet the fully general syntax, some additional options are detailed below.
read "/foo/bar*.xyz" as MyTable with
  "Foo1" as Id
  "Foo2" as Date
  "Foo 3" as Foo3
The /foo/bar*.xyz path may contain a wildcard (*), and this wildcard can be replaced by any sequence of characters. Using a wildcard is optional, but when used, the wildcard offers the possibility of capturing multiple files at once. This pattern is similar to the shell syntax when listing files from the command-line.

The first as keyword, found right after the path, indicates the name of the table. If the table is named MyTable, then the Envision script will refer to this table by writing MyTable.Id for example. If the table to be uploaded is the items table itself, then this as MyTable can be skipped.

The instructions that come afterwards in the form of the with keyword act as column renaming instructions. In the example above, the Foo1 column is renamed as Id and the Foo2 column is renamed as Date. Through renaming these two columns, the MyTable table becomes a legitimate Envision table containing historical data as both the Id and Date columns are now properly defined.

The third renaming operation takes place as Foo 3 (notice the space in the original column name) is renamed into Foo3. This illustrates how an incorrect column name can be turned into a correct one. It is also important to note that vector variable names do not allow for spaces. For the sake of concision, we did not include any further renamed pairings, but actually, as long as the pairings are properly delimited by a comma, there is no limit to the number of renaming operations that can take place for any single file. In practice, renaming a column can be useful for making a script more readable, but it can also be used to make the required adjustments when it is necessary to consolidate multiple files into one table while these files have inconsistent column names.

For example, let’s assume that the order history is split between two files. First, we have Lokad_Orders_Old.tsv that contains all the data up to December 31st 2014. This file has three columns named ItemId, OrderDate and Quantity respectively. This file's columns do not follow Envision’s column naming guidelines. Second, we have Lokad_Orders.tsv, a more recent file, that contains all the history starting from January 1st 2015. This file contains three columns too, and these columns are named Id, Date and Quantity, which fits with Envision's default guidelines. The script below illustrates how the two files can be consolidated into a single Orders table.
read "/foo/Lokad_Orders_Old.tsv" as Orders with
  "ItemId" as "Id"
  "OrderDate" as Date

read "/foo/Lokad_Orders.tsv" as Orders
Naturally, it is possible to define as many read statements as it takes to properly cover all the files and all the tables. Envision does not impose a precise location for these statements within your script, so technically they could be placed anywhere, even at the very bottom of the script. However, we advise to keep these statements at the beginning of the scripts because this is where people familiar with programming languages would expect to find them.

Wildcard filters

The wildcard (*) offers the possibility to select multiple files; however, sometimes, the intent is to read only a single file out of many. If the intent is to read only the "last" file when ordering the files based on their file names, this can be done with:
read max "/foo/bar*.xyz" as MyTable with
  "Foo1" as Id
  "Foo2" as Date
  "Foo 3" as Foo3

There are 3 wildcard filters available:

  • min: the first file when ordering files based on their respective names (lexicographic ordering).
  • max: the last file when ordering files based on their respective names (lexicographic ordering).
  • latest: the last file when ordering files based on their last modified date.

For example, if a snapshot of the stock levels is pushed to Lokad on a daily basis with a file named stocks-2016-12-21.csv, the suffix being adjusted to reflect the current date, then most likely, the intent will be to read only the latest file. In this specific case, both max and latest would give the same result because the file naming convention is consistent with the update dates.

Defining the type expectations

Unless specified otherwise, Envision expects any table, besides the Items table, to contain both an Id and a Date column, which can be obtained through renaming the relevant columns, as detailed in the previous section. However, sometimes, a table might not exactly correspond to these expectations, and the Envision syntax can be used to clarify any expectations associated with a table. Similarly, the type of column is typically inferred from the Envision script itself, but sometimes, type inference alone is not sufficient. Again, the Envision syntax offers the possibility to specify any column type that may be necessary.

Envision offers the possibility to define the primary keys for a table, by listing them between parentheses right after the definition. There are only 7 primary key combinations that are allowed:
read "a.csv" as A[*]
read "b.csv" as B[Id]
read "c.csv" as C[Id, *]
read "d.csv" as D[Date]
read "e.csv" as E[Date, *]
read "f.csv" as F[Id, Date]
read "g.csv" as G[Id, Date, *]
Case A[*] defines a table with no constraints at all. It's a wildcard and it accommodates very diverse scenarios, but when there is no Id column, you will be missing some of niceties of the Envision language.

Case B[Id] defines a table with one line per item at most. For example, our inventory settings fall into this category.

Case C[Id, *] defines a table with any number of lines per item. For example, a table representing a distribution of probabilities falls into this category.

Case D[Date] defines a table that contains a single scalar value for certain days. For example, such a table could be used to list the national holidays applicable to a certain geographical location.

Case E[Date, *] defines a table that contains any number of scalar values for certain days. It is an extension of the previous case, where a given day can be associated with multiple factors.

Case F[Id, Date] defines a table that contains at most one value for each [Id, Date] pairing. For example, such a table could be used to list the past stock-out occurrences.

Case G[Id, Date, *] is the default behavior, the one you get when omitting parentheses. In practice, most historical data falls into this case, such as the sales order history for example.

Type constraints on table columns

Envision is a strongly typed language. It means that all vectors are associated with one of the types available in Envision. There are only 4 types available in Envision: text, number, date and boolean, and these types are typically inferred directly from the script itself. However, it is possible to specify the type expectations with:
read "a.csv" as A[*] with
  "Foo" as X : text
  Y : number
  Z : date
Then, going back to the sample dataset, if we write:
// these three lines are standard
// and may be from now on omitted
read "/sample/Lokad_Items.tsv"
read "/sample/Lokad_Orders.tsv" as Orders
read "/sample/Lokad_PurchaseOrders.tsv" as PO

Quantity = sum(Orders.Quantity)
then, the vector Orders.Quantity is implicitly typed as a number because only numbers are eligible to be summed. As a result, this means that when Envision parses the Lokad_Orders.tsv file, it expects the column Quantity to contain tokens that can be successfully parsed as numbers. If we were to try writing:
Nonsense = sum(Orders.Client)
In this case, Envision would try to parse the Client column of the Lokad_Orders.tsv file with numbers, and the process would fail, because this column contains client identifiers that are not numbers. In addition to the type inference mechanism, Envision also provides a syntax to explicitly specify the type to be expected for every column of every table. For example, the sample dataset can be explicitly typed using the following statements:
expect Supplier : text
expect Orders.NetAmount : number
expect PO.ArrivalDate : date
The syntax simply goes as follows - expect MyTable.MyColumn : type, where type is one of the four eligible types: text, number, date or boolean. The most widely used type assertion is the date type, because based on arithmetic operations and typing inference alone, it is not always possible to correctly infer that a column is expected to be a date.

Parsing options

The Envision's file parser is very tolerant, however, there are some situations where the parser needs a bit of help. The skip option can be used to tell parser to skip the N first lines of the flat file. The syntax is as follow:
read "/foo/bar*.csv" skip:2 as MyTable with
  "Foo1" as Id
  "Foo2" as Date
In the example above, the parser skips the first two lines of the flat file and expects the column headers to be found on the third line. The skip option is optional, with skip:0 being the default behavior. This option is intended to cope with systems which introduces meta-data at the very beginning of their flat file extractions.

Unsafe read

By default, Envision's file parser is strict: if a value is expected to be a date or a number, and if this value cannot be successfully parsed as such, the read statement will fail with an error. As a rule of thumb, when Envision encounters parsing errors, the best option consists of investigating why the file is corrupted in the first place. In fact, as Envision is capable of recognizing many formats of dates and numbers, the chances are that if Envision fails, data is likely to be corrupted. Data corruption can lead to all sort of errors. When Envision fails, more often than not, this is merely the visible sign of a problem that happened earlier.

However, with large datasets, minor data corruptions become much more difficult to avoid. In particular, when corrupted data also happens to be old, fixing the data corruption might not even be worth the effort. Consequently, for these specific situations, Envision supports an unsafe read mode, as illustrated by the syntax below:
read "/foo/orders.tsv" unsafe with
  "My Id" as Id
  "My Date" as Date
When unsafe is used, Envision treats the parsing issues as warnings rather than errors. This option allows calculations to go ahead even if values or lines have been discarded by the parser because they could not be read.