A cure for the one_to_many blues


(fabio) #1

If you want to build an explore that lets users start from a “one” table and then fan-out into more detailed “many” records from multiple tables, you can often feel constrained by the fact that, by default, these joins will be conflicting.

For example, if you have an application where users can order service and then separately make payments against their account, you’ll find that there isn’t a direct relationship between payments and orders, and so you want to let your explore expose both as child entities for a given user, but not with a specific relationship.

I’ve talked about this before with my Outer Join On False approach, but that approach tries to handle a broader class of possible queries, at the expense of the model being very difficult to generate and maintain.

In this article, instead, I’ll show you how to take an existing explore that starts from a “one” table and has multiple “one_to_many” joins, and how to make it work more smoothly with minimal dynamic SQL fuss. (Or, it could be applied to a specific set of joins within a larger explore too)

Let’s say our example explore started out like this:

explore: users {
   join: orders {
      relationship: one_to_many
      sql_on: ${orders.user_id} = ${users.id} ;;
   }
   join: payments {
      relationship: one_to_many
      sql_on: ${payments.user_id} = ${users.id} ;;
   }
}

We want the orders and the payments to not “mix” or “fan out” one-another. That’s as simple as adding a join:

explore: users {
   join: user_join_paths {
       type: cross
       relationship: one_to_one
       # The user_join_paths view has this SQL that selects two literal rows
       #      SELECT 'orders' as path
       #      UNION ALL
       #      SELECT 'payments' as path
   }
   ...

… and constraining our orders and payments joins to live in their own parts of the query

   join: orders {
      sql_on: ${user_join_paths.path} = 'orders'
      AND ${orders.user_id} = ${users.id} ;;
      relationship: one_to_one
   }
   join: payments {
      sql_on: ${user_join_paths.path} = 'payments'
      AND ${payments.user_id} = ${users.id} ;;
      relationship: one_to_one
   }
}

(You may have noticed I’ve changed the relationships from one_to_many to one_to_one. While this is not technically true, we are managing the fanout on our own, and this declaration tells looker not to worry about applying symmetric aggregates.)

Here’s what the intermediate query looks like after the first join

| User | Path     |
|------|----------|
|Alice | Orders   |
|Alice | Payments |
|Bob   | Orders   |
|Bob   | Payments |

And if we join on our orders and payments:

| User | Path     |Orders.Item |Orders.Amt | Payments.Amt |
|------|----------|------------|-----------|--------------| 
|Alice | Orders   | Widget     | $5        |            ∅ |
|Alice | Orders   | Sprocket   | $1        |            ∅ |
|Alice | Payments |          ∅ |         ∅ |  $1          |
|Alice | Payments |          ∅ |         ∅ |  $1          |
|Alice | Payments |          ∅ |         ∅ |  $1          |
|Bob   | Orders   | Widget     | $5        |            ∅ |
|Bob   | Payments |          ∅ |         ∅ |            ∅ | (Bob has no payments)

From here, any of our familiar dimensioning and measuring operations on orders or payments will return a valid result! Note: Be careful that any counts you do are like COUNT(${TABLE}.id) instead of COUNT(*). You can do this with a measure: my_count {type:count filters:{field:id value:"-null"}}.

As a matter of optimization, we can change that cross join in the middle of our query to a conditional left join depending on what views a user has selected:

   join: user_join_paths {
       type: left_outer
        sql_on: 0=1
            {% if orders._in_query %} OR ${user_join_paths.path} = 'orders' {%endif%}
            {% if payments._in_query %} OR ${user_join_paths.path}  = 'payments' {%endif%}
       ;;
   }

Full Example with FAA Data

explore: carriers {
  view_label: "[Carriers]"

  join: carrier_joinpaths {
    relationship: one_to_one
    sql_on: 0=1
    {% if accidents._in_query %} OR ${carrier_joinpaths.path} = 'accidents' {%endif%}
    {% if flights._in_query %} OR ${carrier_joinpaths.path} = 'flights' {%endif%}
    ;;
  }
  join: accidents {
    relationship: one_to_one
    sql_on: ${carrier_joinpaths.path} = 'accidents'
      AND ${accidents.air_carrier} = ${carriers.name}::varchar;;
  }
  join: flights {
    relationship: one_to_one
    sql_on: ${carrier_joinpaths.path} = 'flights'
      AND ${flights.carrier_id} = ${carriers.id};;
  }
}

view: carrier_joinpaths {
  derived_table:{
    sql:
      SELECT 'accidents' as path
      UNION ALL
      SELECT 'flights' as path
    ;;
  }
  dimension: path {hidden:yes}
}
#Watch out for type:count measures in your views!
view: carriers{...}
view: flights{...}
view: accidents{...}

For only a handful of additional lines of LookML, you get some pretty powerful stuff, like filtering in those sub tables without affecting your other joined tables, and without writing manual derived tables for them:

##FAQ #1: Conforming the paths
After I posted this, someone asked me - what if I want to group by date, which date do I use? For example if I want to see each user’s payments and orders by date? Just group the relevant rows from the two join paths by using a coalesce:

view: combined_date {
 # reference this in your explore as join: combined_date { sql: ;; relationship: one_to_one }
 dimension_group: combined_date {
   sql: COALESCE(
       {% if orders._in_query %}orders.date_col,{% endif %}
       {% if payments._in_query %}payments.date_col,{% endif %}
       NULL
   );;
}

(Of course, the same goes for any other shared denormalized dimension fields, or even for shared foreign keys to a dimension table for joining.)

FAQ #2: Subtotals!

If you’re feeling clever, you can also use this approach to mix subtotals into an explore! For more reading on subtotals, you can refer to this other article which explains a subtotal approach in detail. From there, you can use this pattern to mix the subtotals in as one of these paths.

FAQ #3: Measuring users

Someone pointed out “but symmetric aggregates would still be needed for any measure on users, right?” Looking at the example tables I showed above, that seems true. Afterall, you can see the users are necessarily being repeated. But, let’s back up a bit first.

My premise with this article is that I would rather apply regular aggregates on a non-fanned out table than apply symmetric aggregates on a fanned out table. In other words, dimensions can be taken from fanned out tables, but if I plan to be able to measure something, can I work it into my query in a non fanned out way? (The answer is yes :wink: )

In the users example, the user table is fanned out. If I also needed to measure on the user table, I could join it a second time , on its own join path. The first user join, then, would be useful for dimensions for each of my user, order, and payment records, and the second (unfanned) user join could be used for measuring users. This is actually quite easy to do in LookML by referencing the same view twice and using the fields parameter to exclude measures from the explore, and then to only include measures from the path-restricted view.

FAQ #4: Optimizations on Large Tables

If your tables are large, which is often the case with these types of queries, you’ll want to make sure that your tables have a commonly-filtered sort key / partition column / clustered index. Then, to benefit from these, you’ll want to push these filters down into subqueries on each individual table, using templated filters. (It may sound like we’re heading back towards a dynamic SQL mess, but this is quite maintainable since it only has to be done for a specific column per view, rather than for every dimension/measure/filter you add to the view)


Avoiding inconsistencies across explores
[Analytic Block] Flexible Period-over-Period Analysis
Symmetric Aggregate Performance in BigQuery