Aggregating data with Envision

Aggregating data










Home » Resources » Here

Data aggregation is the process of combining multiple table lines through special functions called aggregators. Whenever sums, averages, counts or medians are involved, the calculations are carried out through aggregators. Aggregation also provides a way of combining data originating from different tables. Envision provides a rich syntax to support all such scenarios. In this page, we detail and illustrate how aggregators work with Envision.

An illustrating script

In this page we use, once again, the sample dataset which should be accessible from the /sample path in your Lokad account. The script below is moderately complex and illustrates some of the data aggregation patterns available in Envision. Data aggregation can take place both outside the tiles and inside the tiles. We advise you to read Making calculations with Envision first, as it should greatly help you in understanding the contents of this page.
read "/sample/Lokad_Items.tsv" as Items
read "/sample/Lokad_Orders.tsv" as O
read "/sample/Lokad_PurchaseOrders.tsv" as PO
expect O.Quantity : number

show label "Aggregating data" a1f1 tomato

oend := max(O.Date)
obegin := oend - 365

totalPurchased := sum(PO.NetAmount)
totalSold := sum(O.NetAmount) when date >= obegin
show table "Total purchased" a2c2 with 
  totalPurchased unit:"$"
show table "Sold over 1 year" d2f2 with
  totalSold unit:"$"

VolumeSold = sum(O.NetAmount)
UnitSold = median(O.Quantity)
show table "Top sellers" a3f4 tomato with
  Name
  VolumeSold as "Sold" unit:"$"
  UnitSold as "Median" 
  order by VolumeSold desc

avgRet := avg(distinct(O.Date) by O.Client)
avgQty := avg(sum(O.Quantity) by [O.Client, O.Date])
show table "Average client returns" a5c5 with
  round(avgRet)
show table "Average backet quantity" d5f5 with
  avgQty

O.Brand = Brand
show table "Top Suppliers" a6f7 tomato with 
  Supplier
  distinct(Category) as "Categories"
  sum(O.NetAmount) as "Sold" unit:"$"
  mode(O.Brand) if (O.Brand != "Fellowes") as "Most frequent brand sold" 
  group by Supplier
  order by sum(O.NetAmount) desc
We suggest to begin by copy-pasting this script into your Lokad account, and running it once in order to observe the resulting dashboard. If everything works, you should see the dashboard illustrated below.

Image

Scalars for single-value aggregations

A variable that has no “item” affinity in Envision is referred to as a scalar variable (the variable is not attached to any particular line of the Items table). With the Excel analogy in mind, Envision variables are akin to Excel columns by default: these variables are vectors and they hold many values at once (one value per item actually). However, it is also possible to have variables that behave like a single Excel cell, and that only hold a single value: these are the scalar variables. The script above illustrates how two scalar variables can be computed; the relevant lines are copied below for the sake of clarity.
totalPurchased := sum(PO.NetAmount)
totalSold := sum(O.NetAmount) when date >= obegin
show table "Total purchased" a2c2 with 
  totalPurchased unit:"$"
show table "Sold over 1 year" d2f2 with
  totalSold unit:"$"
Several aspects can be highlighted in these few lines of code. First, the two assignments on lines 1 and 2 are carried out using :=, the scalar assignment operator. When the operator := is used instead of the plain assignment operator =, this means that the aggregation should result in a single value that is independent of the items. Second, the variables totalPurchased and tableSold begin with lowercased names. While variable names are not case-sensitive in Envision, the spelling of these two variables is not an accident in this case. As a coding practice, we recommend to lowercase the first letter of the variables that contain scalar values. In contrast, the names of the tables and the names of the columns should start with an uppercase letter. Following these guidelines will make your script more readable and easier to debug. Third, the scalar aggregation takes place by default within the show table statement if the table has only a single column. For example, lines 1 and 3-4 could be combined into a single show table statement illustrated just below.
show table "Total purchased" a2c2 with
  sum(PO.NetAmount) unit:"$"
Numeric scalar variables can also be used to do arbitrary calculations just like any numeric variable in Envision. For example - while it may not make much sense from a business perspective - the following calculation could be added at the end of the Envision script, once the variables totalPurchased and totalSold have been defined:
mySum := (totalSold - totalPurchased) / 2
show table "My Sum" with mySum

Multi-value aggregations

Envision also supports the possibility to perform multi-value aggregations, typically aggregating the data of a given table and projecting the result on the lines of another table. Thus, aggregation is the most frequent approach of combining data coming from different tables. In particular, one of the most frequent use cases consists of taking an historical table, that is, a table also indexed with a Date column, and aggregating this table into a vector aligned with the Items table.

The first script above illustrates this pattern with a table that displays the top sellers, that is, the items ordered according to their respective sales volume, putting the highest values at the top. The relevant lines of script are copied below for the sake of clarity.
VolumeSold = sum(O.NetAmount)
UnitSold = median(O.Quantity)
show table "Top sellers" a3f4 tomato with 
  Name
  VolumeSold as "Sold" unit:"$"
  UnitSold as "Median" 
  order by VolumeSold desc
Lines 1 and 2 are performing aggregations in a way that is very similar to the scalar aggregations that we have reviewed in the previous section. However, the assignments are carried out with the equal sign = instead of the := operator used for the scalar assignments. As a result both VolumeSold and UnitSold are vectors which are part of the Items table. As a result, these variables are similar to columns in Excel, and they contain one value per item.

Both sum and median are special functions named aggregators in Envision. There are many other aggregators in Envision such as avg, min, max, first, last, etc. For the sake of concision, we are not reviewing all the aggregators available in Envision in this page; however for more details, you can check out the full list of aggregators.

The result of these two aggregations is displayed in the table defined with the lines 3 to 7. The two vectors VolumeSold and UnitSold are listed as arguments after the keyword with and their values are displayed in the table. Finally on line 7, the order by statement indicates that the table should be sorted starting with the items that have the highest VolumeSold values.

Readers who may be familiar with the GROUP BY syntax in SQL might be wondering how Envision knows which grouping should be used when computing the sum at line 1. By default, Envision performs a grouping using the columns which act as “keys” on the right side of the assignment. In the case of a variable that belongs to the Items table – the table whose name is implicit – the column acting as the (primary) key is the Id column. This explains why a per-item aggregation results from using the = sign.

Explicit aggregation groups with by

So far, the aggregations we have performed relied on the implicit aggregation patterns of Envision. However, the behavior of all the aggregators can be modified with an optional keyword by that is used to explicitly specify the applicable grouping. Let’s illustrate how this keyword by is used:
VolumeSold = sum(O.NetAmount)
SameVolumeSold = sum(O.NetAmount) by Id // Same!
Line 2 is assigned to a second vector named SameVolumeSold but the values of this vector are strictly identical to the one of the vector VolumeSold at line 1. In fact, the option by Id is also used, albeit implicitly, at line 1. Intuitively, when the option by is used, it is as if groups where created first according to the grouping target, and second, as if the aggregator was separately computed for every group. The by option offers the possibility to compose fairly complex aggregations, as illustrated by the script on the top of this page. Let’s review the two lines of script where the aggregations are performed through by options.
avgRet := avg(distinct(O.Date) by O.Client)
avgQty := avg(sum(O.Quantity) by [O.Client, O.Date])
The aggregator distinct counts the number of distinct values observed within each group. At line 1, the lines of the O (for Orders) table are first grouped according to their respective Client values; and then for every client, the number of distinct ordering dates is counted. Intuitively, this aggregation can be interpreted as counting the number of times each client has returned. Then, this result is re-aggregated into a single scalar value with the avg aggregator that contains the inner distinct aggregation.

The scalar avgQty can be interpreted as the number of units purchased per basket. The calculation begins with a sum() by statement, however after the by option, we have not one but two variables separated by commas and listed within brackets: [O.Client, O.Date]. This syntax should be understood as follows: create a group for every pair of Client and Date. From a business perspective, we are treating all units purchased the same day as the same basket, which is a reasonable approximation for most situations. Finally, the external call to avg produces the final average over all the sum aggregates computed for all the pairs.

More generally, the by option supports an arbitrary number of variables following the syntax sum(foo) by [arg1, arg2, …, argN]. In practice, however, it is rather rare to encounter situations where it makes sense to group by more than 4 variables at once. Also, the order of the arguments has no impact on the resulting groups used to compute the aggregates.

Explicit aggregation groups with into

Explicit aggregation within a table with group by

Sometimes, through aggregation, we can produce a new table which is more relevant to be used within a dashboard that the original non-aggregated table. Thus, Envision also supports the possibility of aggregating data directly from the tile declaration statement. The most direct way of visualizing this Envision capability is to aggregate the data which is to be displayed in a table. This is precisely what is being done in the script at the top of this page. Let’s have a second look below at the relevant lines copied below.
O.Brand = Brand
show table "Top Suppliers" a6f7 tomato with 
  Supplier
  distinct(Category) as "Categories"
  sum(O.NetAmount) as "Sold" unit:"$"
  mode(O.Brand) if (O.Brand != "Fellowes") as "Most frequent brand sold" 
  group by Supplier 
  order by sum(O.NetAmount) desc
The show table statement goes from line 2 to line 8, and more specifically, the aggregation is specified on line 7 with the statement group by which is exactly the same semantic than the by option that we have reviewed in the previous section. At this point, you might be wondering why Envision is not using the keyword by instead of group by if the semantic is the same. The if at line 6 is an aggregator filter, discussed in the next section.

The answer is simple: it is possible to use the by option within the list of expressions used after the with for a given tile (such a situation is not illustrated here however). Thus group by allows to differentiate between a by statement that is part of an expression passed to the tile, and a group by statement that applies to the tile as a whole. In other words, the group by applies to all the expressions listed after the with keyword, in contrast to the by statement that only has a local impact.

When group by is used, all expressions passed to the tile after the with keyword should offer the possibility of being aggregated. For example, the 1-line script below is incorrect because Name does not offer an aggregation pattern while group by Supplier is specified.
show table "WRONG!" with
  Name
  group by Supplier
However, if we modify this script by introducing an aggregator, distinct for example, then our Envision script becomes valid.
show table "CORRECT!" with
  distinct(Name)
  group by Supplier
The only exception to this rule is the aggregation of the target itself. In the script at the beginning of this section, we have a group by Supplier statement on line 7. On line 3, the variable Supplier is listed without any aggregator, yet the script remains valid, precisely because the grouping takes place according to the Supplier variable.

The need to expose an aggregator also applies to the order by statement on line 8. Indeed, the table is first aggregated by supplier, and only then is it sorted according to the Id column – unless another sort order is specified. As a result, Envision needs to compute one value per group in order to sort all these groups. This is exactly what we see happen with the statement group by sum(O.NetAmount).

While we illustrated group by with the table tile, this syntax is not specific to the table tile, and the same pattern can be used with most other tiles. For example, it would be possible to extend the script listed at the very top of this page with a barchart tile aggregated by Brand:
show barchart "Sales by brand" with
  sum(O.NetAmount)
  group by Brand
The group by also supports multi-grouping, that is grouping by multiple vectors or multiple expressions, with a syntax that is identical to the one of the by option that we have covered previously.

Aggregation filters

The Envision aggregators also support filters through the if keyword. The script above illustrates the use of such a filter at line 6:
O.Brand = Brand
show table "Top Suppliers" a6f7 tomato with 
  Supplier
  distinct(Category) as "Categories"
  sum(O.NetAmount) as "Sold" unit:"$"
  mode(O.Brand) if (O.Brand != "Fellowes") as "Most frequent brand sold" 
  group by Supplier 
  order by sum(O.NetAmount) desc

Here the brand Fellowes is explicitly excluded from the report through the if filter. Filters are useful within tile statement because they offer the possibility to filter each column separately. In contrast, a where filter positioned outside the tile statement would have filtered all the selected lines of the O table indifferently for all columns.

Filters can also be used inline:
TwoAndMore = sum(O.1) if (O.Quantity >= 2)

FORMATTER ERROR (Transcluded inexistent page or this same page)