Only measuring the "latest" values


(fabio) #1

What’s in a measure?

All too often, we limit ourselves needlessly when we equate measures, metrics, or aggregates with the few basic aggregates that SQL provides out of the box: SUM, COUNT, AVERAGE, maybe MEDIAN if we’re feeling creative and our dialect supports it.

For a moment, though, let’s step back and think about what measures are essentially for - taking a potentially large number of underlying data points, usually within a group/dimension, and condensing them down into a digestible datum. Thousands of orders? Total revenue. Millions of sessions? Average time on site. Hundreds of users per segment? A histogram of visit frequency. A dozen campaigns with hundreds of dates and thousands of vists? A sparkline of visits over the last 30-days.

Ok, some of the above are almost over-the-top, but the point is - your imagination is the limit. Here’s a simple way to use measures when summarizing a set of data points in which only the latest datapoint should be used. Examples are inventory snapshotting, balance sheets, historical state of attributes given a timestamped list of changes, etc.

The “latest” measure

Let’s say you have inventory snapshots that you take for each piece of inventory at the end of each day:

INVENTORY_SNAPSHOT
inventory_id     | taken_at  | amount
------------------------------------
1                | 2018-01-01 | 17
2                | 2018-01-01 | 9
3                | 2018-01-01 | 29
1                | 2018-01-02 | 13
2                | 2018-01-02 | 37
3                | 2018-01-02 | 19
...

For any given window(s) of time, we want to aggregate all the rows for a given inventory ID by selecting the latest one. There are different ways to do this, but I like this one because it’s short, reusable, and has O(n) time complexity and O(1) memory complexity, as opposed to approaches based on sorting/ordering which may have worse complexity:

SELECT
    inventory_id, 
    SUBSTRING(MAX( taken_at::varchar(10) || amount ),11)::int
...
GROUP BY 1
-- Written for Redshift. The :: operator is type casting, || is string concatenation

Let’s explain a bit - for all the rows matching the filter and in a given inventory ID group, we first construct a concatenated field from a fixed-length version of the sorting field (taken_at) and the field we care about (amount). Then, with this calculated field, using the MAX aggregate function gets us the latest row for each inventory id, and then SUBSTRING discards the date, leaving just the value.

Using it in LookML

All well and good, how do we translate this to LookML? You’ll notice that the “latest” aggregate is tightly bound to one grouping (inventory id in our example), so it’s not a great candidate to directly be a measure in our view, where users could pair it with arbitrary dimensions. Instead, it makes sense to put that aggregation and grouping inside of a derived table where the pairing will be hard-coded.

On the other hand, the aggregate is loosely bound to time grouping, since we want users to be able to swap in different levels of granularity - or even no time grouping at all for just the absolute latest data. In addition, this inventory-x-time aggregate often needs to be further aggregated to other uncoupled levels of grouping. With all that in mind, here is one example way to work all these things together using our _in_query feature to write some “dynamic” SQL:

explore: inventory {
  join: inventory_snapshot_dynamic {
    type: left_join
    relationship: one_to_many
    sql_on: ${inventory_snapshot_dynamic.inventory_id} = ${inventory.id} ;;
  }
}
view: inventory_snapshot_dynamic{
  derived_table:{
    sql:
        SELECT inventory_id,
        {% if    inventory_snapshot_dynamic.dynamic_granularity_date._in_query %}
             DATE_TRUNC('day',taken_at)
        {% elsif inventory_snapshot_dynamic.dynamic_granularity_week._in_query %}
             DATE_TRUNC('week',taken_at)
        {% elsif inventory_snapshot_dynamic.dynamic_granularity_month._in_query %}
             DATE_TRUNC('month',taken_at)
        {% elsif inventory_snapshot_dynamic.dynamic_granularity_quarter._in_query %}
             DATE_TRUNC('quarter',taken_at)
        {% elsif inventory_snapshot_dynamic.dynamic_granularity_year._in_query %}
             DATE_TRUNC('year',taken_at)
        {% else %} 0
        {% endif %} as dynamic_granularity
        --------
        {% if inventory_snapshot_dynamic.sum_ending_amount._in_query %}
             , SUBSTRING(MAX( taken_at::varchar(10) || amount ),11)::int as ending_amount
        {% endif %}
        FROM inventory_snapshot
        WHERE {% condition inventory_snapshot_dynamic.dynamic_granularity_date %} taken_at {% endcondition %}
          AND {% condition inventory_snapshot_dynamic.dynamic_granularity_week %} taken_at {% endcondition %}
          AND {% condition inventory_snapshot_dynamic.dynamic_granularity_month %} taken_at {% endcondition %}
          AND {% condition inventory_snapshot_dynamic.dynamic_granularity_quarter %} taken_at {% endcondition %}
          AND {% condition inventory_snapshot_dynamic.dynamic_granularity_year %} taken_at {% endcondition %}
        GROUP BY 1,2
    ;;
  }
  dimension: inventory_id{}
  dimension_group: dynamic_granularity {
    group_label: "History Granularity"
    type: time
    datatype: date
    sql: {$TABLE}.dynamic_granularity ;;
    timeframes: [year,quarter,month,week,date]
  }
  measure: sum_ending_amount{
    type:sum
    sql: ${TABLE}.ending_amount
  }
}

Sharing is caring

The above works well enough in isolation, but when you want this and other similar dynamic views to be composable into one explore, it helps to isolate and externalize the date fields so multiple views can reference them. Here’s an illustrative partial example:

explore: inventory {
  join: dynamic_granularity {sql: ;; relationship: one_to_one} #Field-only view
  join: inventory_snapshot_dynamic {
    type: left_join
    sql_on: ${inventory_snapshot_dynamic.inventory_id} = ${inventory.id} ;;
  }
  join: sales_dynamic {
    type: left_join
    sql_on: ${inventory_snapshot_dynamic.inventory_id} = ${inventory.id}
    {% if inventory_snapshot_dynamic._in_query %}
       AND sales_dynamic.dynamic_granularity = inventory_snapshot_dynamic.dynamic_granularity
    {% endif %} ;;
  }
}
view: dynamic_granularity {
  label: "[History]"
  dimension_group: dynamic_granularity {
    group_label: "Granularity"
    type: time
    datatype: date
    sql: COALESCE(
      {% if inventory_snapshot_dynamic._in_query %}
        inventory_snapshot_dynamic.taken_at, 
      {%endif%}
      {% if sales_dynamic._in_query %}
        sales_dynamic.sale_at, 
      {%endif%}
      NULL ) ;;
    timeframes: [year,quarter,month,date]
  }
}
view: inventory_snapshot_dynamic{
  derived_table:{
    sql:
        SELECT inventory_id,
        {% if    dynamic_granularity.dynamic_granularity_date._in_query %}
             DATE_TRUNC('day',taken_at)
        {% elsif dynamic_granularity.dynamic_granularity_week._in_query %}
             DATE_TRUNC('week',taken_at)
...

Footnotes

  1. The term “semi-additive measure” is often invoked in these kinds of discussions

Limiting a table to the max value per group
Summarizing over Start and End Dates