Joining tables with the By-At aggregation in Envision

Joining tables with the By-At












Home » Resources » Here

Envision provides an implicit mechanism to join tables whenever those tables contain an Id (item index) field, as it is frequently the case to represent sales history or purchase history. However, this implicit mechanism - called natural joins - is not suitable for more complex situations. Thus, Envision offers a more general join mechanism referred to as the by-at. The by-at is a special kind of aggregation that performs arbitrary joins over tables.

An illustrating script

In this section, we use the sample data set which should be available from the /sample path in your Lokad account. If it's not already done, we suggest first to read the entry reading files with Envision, with a closer look at the file type expectations, as this section is highly relevant to better understand the following script.
read "/sample/Lokad_Items.tsv"
read "/sample/Lokad_Orders.tsv" as O
read "/sample/Lokad_Suppliers.tsv" as S[*]

Moq = same(S.Moq) by S.Supplier at Supplier
show table "Item list" a1d4 tomato with
  Id
  Name
  Supplier
  Moq
The first three lines are our usual statements for reading files. On line 3, the type expectation S[*] specifies that the suppliers table isn't getting any type expectations - it's an independent table with no primary key and no foreign key. Indeed the file Lokad_Suppliers.tsv - unlike all the other .tsv files of the folder - does not contain an Id column. As this column is missing, Envision cannot implicitly join this table with the Items table. As a matter of fact, the supplier's table is aggregated per supplier with one supplier per line, while all the other files are detailed per item.

On line 5, we leverage the by-at aggregation, called this way because it uses the by and at keywords, to perform a join between the Items table and the S table. We could have similarly written:
Items.Moq = same(S.Moq) by S.Supplier at Items.Supplier
However, per Envision's usual conventions, the Items table name is omitted.

In this specific example, we have populated a vector Items.Moq with the data obtained from the table S. A join has been made between the two fields Items.Supplier and S.Supplier.

In the case that a supplier entry found in the Items table does not have a counterpart in the S table, a zero value will be used. However, through the optional or keyword, following the usual behavior of the Envision aggregators, other default values could be used. We could have written:
Items.Moq = same(S.Moq) by S.Supplier at Items.Supplier or 0
On lines 6 to 10, the resulting Items table is displayed in order to illustrate the calculation performed by the by-at the line above. As expected, each item is associated with the MOQ value of its respective supplier.

General syntax for the by-at

The general syntax for the by-at is:
T1.A = agg(T2.B) by [T2.X, T2.Y, T2.Z] if E at [T1.X, T1.Y, T1.Z] or T1.C
Much like the usual aggregators in Envision, the by-at offers the possibility to perform a tuple match, that is, matching n fields at a time. Then, all the usual Envision aggregators can be used: sum, but also min, median, same, etc. The or block can be omitted just like it is done with a regular aggregation.

The semantic of the by-at is as follows:

  • Build all distinct groups of tuples for the by [T2.X, T2.Y, T2.Z], filtering with the condition E if it is present. For each one of those groups, if at least one matching group exists on the left, compute the aggregate agg(T2.B).
  • Build all distinct groups of tuples for the at [T1.X, T1.Y, T1.Z]. For each one of those, assign the matching aggregate from the right.
  • For the groups on the at that don't have a matching group on the right, then fall back on the default T1.C value.

Just like regular aggregators, the or statement is optional, and provides the value to be used as a result of the aggregation if the group is empty.

Tips for the by-at

The by-at is a powerful construct that can be leveraged in many situations, not just to join tables from the usual SQL perspective.

A table can be joined to itself: For example,
O.DaySum = sum(O.Quantity) by O.Date at O.Date
illustrates how to compute daily totals over the O table without resorting to the Day table.

Calendar tables Day and Week can be joined to: For example,
Day.Shift = sum(O.Quantity) by [O.Id, O.Date - 1] at [Day.Id, Day.Date]
illustrates how to shift the quantities from one day through a by-at.

Translating the by-at into SQL

For the readers who are already familiar with SQL, then it can be noted that the Envision expression:
Moq = same(S.Moq) by S.Supplier at Supplier or defaultMoq
has the following equivalent in SQL, which uses an outer left join: UPDATE Items LEFT OUTER JOIN S ON Items.Supplier = S.Supplier SET Moq = COALESCE(S.Moq, defaultMoq) The Envision syntax emphasizes Excel-like calculations more than the relational algebra itself.

Cross Joins

The cross join is another classic relational operation supported by Envision. A cross join operates over a Cartesian product of the lines involved in the join. As a cross join can easily generate a large number of lines to be processed, we suggest to pay close attention to the size of the joined tables in order to keep the processing time under control. The syntax for cross join is the following:
T1.R = sum(T1.A * T2.B) cross (T1, T2) if (T1.C & T2.D) or T1.Default
This statement creates a temporary anonymous table that is the cross join of tables T1 and T2. The aggregator argument, as well as the if filter, are evaluated in this table (and therefore, forcibly inflated from tables T1 and T2). The result of the aggregation is computed into table T1 by aggregating together the filtered lines of the cross join that correspond to each line of T1.

If there is already a relationship between T1 and T2 - such as Items and Orders where the latter can be inflated into the former - or if both tables are the same, then the alias-based variant can be used:
T1.R = sum(T1.A * TAlias.B) cross (T1, T2 as TAlias) if (T1.C & TAlias.B) or T1.Default
The alias table TAlias contains the same variables as T2, but has no relationship with any table (except the cross table itself). The alias table is not available outside the aggregator argument and if filter. An alias name may be used more than once in a given script (in separate aggregators), but must not conflict with input tables, built-in tables, or tables created with a table T = ... statement (as illustrated above).

The size of the cross-join is currently limited to 4e9 (four billion) lines. If crossing a table with itself, this means the table may be no larger than about 60k lines. In practice, Envision will feel noticeably slow much before actual limit at 4 billion lines is reached. We suggest to make sure that joined table are properly filtered before the cross join in order to keep the complexity in control.