Outer Join on False - or: How I learned to stop fanning-out and love the NULL


(fabio) #1

Note: Although everything demonstrated here is basic SQL, there are notable dialect considerations (see the appendix).

TL;DR? Show me the model

Let’s imagine we have the following three tables. (For the time being, we’ll show de-normalized account names to make the examples easier to read)

It feels very natural to write SQL like FROM accounts LEFT JOIN products or like FROM accounts LEFT JOIN managers, so many LookML developers default to writing an explore this way too. Something along the lines of: explore accounts, left join products (1:M on accounts), left join managers (1:M on accounts). You probably already know that in SQL, this can lead to some unexpected results. We informally call this situation “fan-out”.

Illustrative example of fan-out

Let’s say a business user now wants to get counts of products and of managers by account. Assuming you had used the above explore definition, Looker would produce a query of the form:

...
FROM accounts
LEFT JOIN products ON products.account=accounts.id
LEFT JOIN managers ON managers.account=accounts.id
GROUP BY ...

Before we fill in the rest of the query, let’s visualize the result set before any grouping or aggregation happens. This animation shows how that result set is built, and we can see the fan-out in action:

Here is the result of that fan-out, omitting duplicate or id columns for legibility:

You can tell already from all the repeated 80’s that you may run into problems with double-counting…
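The fan-out is easy to reproduce end to end. Here is a self-contained sketch using SQLite via Python, with hypothetical data chosen to match the totals in the example (the second account's name, "Globex", and all product/manager names are made up for illustration):

```python
import sqlite3

# Hypothetical data matching the article's totals: Acme (80 employees,
# 2 products, 3 managers) plus a made-up second account, Globex
# (120 employees, 1 product, 1 manager).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accounts (id INTEGER, name TEXT, employees INTEGER);
CREATE TABLE products (id INTEGER, account INTEGER, name TEXT);
CREATE TABLE managers (id INTEGER, account INTEGER, name TEXT);
INSERT INTO accounts VALUES (1,'Acme',80),(2,'Globex',120);
INSERT INTO products VALUES (1,1,'Anvil'),(2,1,'Rocket'),(3,2,'Gizmo');
INSERT INTO managers VALUES (1,1,'Ann'),(2,1,'Bob'),(3,1,'Cal'),(4,2,'Dee');
""")

# The double LEFT JOIN fans out: 2 x 3 rows for Acme, 1 x 1 for Globex.
row = con.execute("""
    SELECT COUNT(*), SUM(accounts.employees),
           COUNT(products.id), COUNT(managers.id)
    FROM accounts
    LEFT JOIN products ON products.account = accounts.id
    LEFT JOIN managers ON managers.account = accounts.id
""").fetchone()
print(row)  # (7, 600, 7, 7) -- not the true 200 employees, 3 products, 4 managers
```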

But wait! Symmetric Aggregates!

If you’ve been working with Looker for long, you’ll probably know about Symmetric Aggregates. This Looker feature detects when fan-out is possible (via one_to_many declarations in your joins) and automatically applies distinct counts, (rather sophisticated) sums over rows with distinct ids, and so forth, in order to prevent double counting. But double counting is not the only problem…

Specifying the problem(s)

Although we simply talk about “fan-out”, in truth it’s not a single problem, but rather three closely related problems.

Problem 1 - Measures/aggregates and double counting: As you can see from our example result set, aggregates would be wrong on all three of the tables, whether a sum of employees (we would expect 200, but would get 600), a count of products (we would expect 3 but get 7), or a count of managers (we would expect 4 but get 7). As already discussed, Looker automatically applies Symmetric Aggregates when necessary and gets us the right answers. Hoorah!
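Looker’s actual symmetric-aggregate SQL is hash-based and dialect-specific, but the core idea can be sketched with plain DISTINCT aggregates over the same fanned-out join (same hypothetical data as before, redeclared so the snippet stands alone):

```python
import sqlite3

# Same hypothetical data as the fan-out example (Globex is made up).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accounts (id INTEGER, name TEXT, employees INTEGER);
CREATE TABLE products (id INTEGER, account INTEGER);
CREATE TABLE managers (id INTEGER, account INTEGER);
INSERT INTO accounts VALUES (1,'Acme',80),(2,'Globex',120);
INSERT INTO products VALUES (1,1),(2,1),(3,2);
INSERT INTO managers VALUES (1,1),(2,1),(3,1),(4,2);
""")

# COUNT(DISTINCT pk) de-duplicates fanned-out rows. The SUM uses a
# simplified version of the symmetric-aggregate trick: offset each value
# by a large multiple of its primary key, SUM DISTINCT, subtract offsets.
row = con.execute("""
    SELECT
      COUNT(DISTINCT products.id),
      COUNT(DISTINCT managers.id),
      SUM(DISTINCT accounts.employees + accounts.id * 100000)
        - SUM(DISTINCT accounts.id * 100000)
    FROM accounts
    LEFT JOIN products ON products.account = accounts.id
    LEFT JOIN managers ON managers.account = accounts.id
""").fetchone()
print(row)  # (3, 4, 200) -- correct despite the 7-row fanned-out result set
```

The real implementation hashes the primary key (and guards against offset collisions), but the subtraction idea is the same.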

Problem 2 - Unintended relationships in table: If we do not choose to group to the account level, but instead select dimensions on account name, product name, and manager name, we would get precisely the above rows in our final result set. But what does that result set mean? On any given row, you have a product and a manager that are not actually related, but the table suggests they are.

This is a more subtle problem than #1, though it does come up frequently. In fact, the rows of the table do represent a relationship - in our example, they enumerate pairs of products and managers that are related via an account. And, in this context, the count measures do make sense - they are counts of these relationships. However, this relationship and this measure are rarely what the LookML developer or end user intended. In non-trivial datasets, users can often be misled by this or, what is nearly as bad, can realize that the data is nonsensical and lose confidence in it.

To provide an example where this is particularly dangerous, consider if you had a key metric for accounts that was the ratio of products to managers. This would return the right number when grouping by accounts, but would return incorrect numbers when grouping by managers or products.

Problem 3 - Geometric growth of result set: Or, the elephant in the room. If you look at the Acme account in the above dataset, you’ll notice that there are 2 (products) x 3 (managers) = 6 rows. If we had higher cardinality relationships, for example, if you counted pageview events and add-to-cart events by product to calculate a click-through rate, you might easily have 10,000 (pageviews) x 1,000 (add-to-carts) = 10^7 rows per product in your intermediate result set.

This grows geometrically, so that if you wanted to pull a complete funnel, you might have 10,000 (pageviews) x 1,000 (add-to-carts) x 100 (orders) x 10 (upsells) = 10 billion rows per product, which you probably have hundreds of, so your intermediate result set is trillions of rows. For context, if each row were 100 bytes, that would be hundreds of terabytes of data. This would be prohibitive to run on virtually any data warehouse. And even if you could, wouldn’t you want to do this in a more efficient way? Is there one?
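The arithmetic is easy to check (all figures here are the illustrative assumptions above, not real data):

```python
# Back-of-envelope math for the hypothetical funnel example.
pageviews, carts, orders, upsells = 10_000, 1_000, 100, 10
rows_per_product = pageviews * carts * orders * upsells  # per-product cross product
products = 300  # "hundreds" of products, picked arbitrarily

total_rows = rows_per_product * products
print(f"{rows_per_product:,} rows per product")  # 10,000,000,000
print(f"{total_rows:,} rows in total")           # 3,000,000,000,000
```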

The section wherein the reader learns of the point of the article

It turns out there is a pretty easy way to solve this - FULL OUTER JOIN ON FALSE:

...
FROM accounts
FULL OUTER JOIN products ON FALSE
FULL OUTER JOIN managers ON FALSE
...

It looks bizarre, but bear with me. Here is the intermediate result set:

Notice it only grows linearly with the cardinality of the joins, rather than geometrically. From there, we only need to add:

SELECT
  COALESCE(accounts.name, products.account, managers.account),
  SUM(accounts.employees),
  COUNT(products.id),
  COUNT(managers.id)
...
GROUP BY 1

For a final result set of:

To recap, this gives us:

  1. Correct counting/summing/averaging of measures.
  2. No risk of spurious product <-> manager relationships in the final result set, no matter what you end up grouping by.
  3. Elimination of geometric growth of the intermediate result set.
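Not every dialect supports FULL OUTER JOIN (see the appendix), but ON FALSE is semantically just a NULL-padded UNION ALL, which makes the pattern easy to verify anywhere. Here is a runnable sketch in SQLite via Python, using the same hypothetical denormalized data as before (products and managers carry the account name; "Globex" is made up):

```python
import sqlite3

# Hypothetical denormalized data: products/managers carry the account *name*.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accounts (id INTEGER, name TEXT, employees INTEGER);
CREATE TABLE products (id INTEGER, account TEXT);
CREATE TABLE managers (id INTEGER, account TEXT);
INSERT INTO accounts VALUES (1,'Acme',80),(2,'Globex',120);
INSERT INTO products VALUES (1,'Acme'),(2,'Acme'),(3,'Globex');
INSERT INTO managers VALUES (1,'Acme'),(2,'Acme'),(3,'Acme'),(4,'Globex');
""")

# FULL OUTER JOIN ... ON FALSE, emulated as a NULL-padded UNION ALL:
# each table contributes its own rows, with NULLs in the other columns.
# The intermediate set is 2 + 3 + 4 = 9 rows -- linear, not 2x3 per account.
rows = con.execute("""
    WITH base AS (
      SELECT name AS account, employees, NULL AS product_id, NULL AS manager_id
      FROM accounts
      UNION ALL
      SELECT account, NULL, id, NULL FROM products
      UNION ALL
      SELECT account, NULL, NULL, id FROM managers
    )
    SELECT account, SUM(employees), COUNT(product_id), COUNT(manager_id)
    FROM base
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(rows)  # [('Acme', 80, 2, 3), ('Globex', 120, 1, 1)]
```

Plain COUNT and SUM now give the right answers with no symmetric-aggregate machinery, because each source row appears exactly once.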

What about normalization? Dimension tables?

We’ve been using some denormalized example tables up until now. To translate this to a normalized schema, we’ll simply JOIN {normalized_table} ON COALESCE(...). So the full query structure for our example would look something like this:

SELECT
    -- associated_account.name,
    -- associated_product.name,
    -- DATE_TRUNC('month',COALESCE(pageviews.date,orders.date)),
    -- Any measures from any of the 'outer join' tables
FROM accounts
FULL OUTER JOIN products ON FALSE
FULL OUTER JOIN pageviews ON FALSE
FULL OUTER JOIN orders ON FALSE
LEFT JOIN products AS associated_product
  ON associated_product.id = COALESCE(
    products.id,
    pageviews.product_id,
    orders.product_id
    )
LEFT JOIN accounts AS associated_account
  ON associated_account.id = COALESCE(
     accounts.id, 
     products.account_id, 
     associated_product.account_id)
--GROUP BY any of the dimensions suggested at the top
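Here is the normalized variant in runnable form: the fact tables are combined with the emulated outer-join-on-false, and the dimension table is joined back once via COALESCE across the candidate foreign keys. All table, column, and alias names are hypothetical:

```python
import sqlite3

# Hypothetical normalized schema: a products dimension plus two fact tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE products (id INTEGER, name TEXT);
CREATE TABLE pageviews (id INTEGER, product_id INTEGER);
CREATE TABLE orders (id INTEGER, product_id INTEGER, amount INTEGER);
INSERT INTO products VALUES (1,'Anvil'),(2,'Rocket');
INSERT INTO pageviews VALUES (1,1),(2,1),(3,2);
INSERT INTO orders VALUES (1,1,50),(2,2,30);
""")

rows = con.execute("""
    WITH base AS (  -- emulated FULL OUTER JOIN ... ON FALSE
      SELECT id AS pv_id, product_id AS pv_product_id,
             NULL AS order_id, NULL AS order_product_id, NULL AS amount
      FROM pageviews
      UNION ALL
      SELECT NULL, NULL, id, product_id, amount FROM orders
    )
    SELECT associated_product.name, COUNT(base.pv_id), SUM(base.amount)
    FROM base
    LEFT JOIN products AS associated_product  -- the COALESCE join-back
      ON associated_product.id = COALESCE(base.pv_product_id,
                                          base.order_product_id)
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(rows)  # [('Anvil', 2, 50), ('Rocket', 1, 30)]
```

Because each base row has exactly one non-NULL foreign key, the join-back is one-to-one and cannot reintroduce fan-out.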

It gets better!

Not only does this pattern solve our three fan-out problems, but it also gives our query immense adaptability:

Flexibility in groupings - Users won’t be limited to a select few groupings that work - any grouping is up for grabs. Let’s consider again some typical product analytics data:

We might either want to aggregate this by product or by month. You can do this by just switching your dimensions with no risk of measures being incorrectly repeated at the wrong level in your result set. To see why, consider how your result set changes as you comment/uncomment the commented lines below:

SELECT 
  -- associated_account.name,
  -- DATE_TRUNC('month',COALESCE(pageviews.date,orders.date)),
  SUM(accounts.employees),
  COUNT(pageviews.id),
  SUM(orders.amount)
FROM accounts
FULL OUTER JOIN pageviews ON FALSE
FULL OUTER JOIN orders ON FALSE
LEFT JOIN associated_product ON COALESCE(...)
LEFT JOIN associated_account ON COALESCE(...)
-- GROUP BY 1, 2 --(Group by whichever lines are uncommented in the select)

Click these to see how the final result sets look:

With no grouping

With account grouping

With month grouping. Notice that employee count is not associated with any month

With both. Notice that measures aren't repeated as incompatible grouping is kept on separate rows
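To make the “incompatible groupings stay on separate rows” behavior concrete, here is the month grouping in runnable form (dates and amounts are hypothetical, and DATE_TRUNC is approximated with SQLite’s strftime). Notice the employee total lands on a NULL-month row instead of being repeated into every month:

```python
import sqlite3

# Hypothetical data: one account plus dated pageview and order facts.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accounts (id INTEGER, name TEXT, employees INTEGER);
CREATE TABLE pageviews (id INTEGER, date TEXT);
CREATE TABLE orders (id INTEGER, date TEXT, amount INTEGER);
INSERT INTO accounts VALUES (1,'Acme',80);
INSERT INTO pageviews VALUES (1,'2017-01-05'),(2,'2017-01-07'),(3,'2017-02-02');
INSERT INTO orders VALUES (1,'2017-02-10',40);
""")

rows = con.execute("""
    WITH base AS (  -- emulated FULL OUTER JOIN ... ON FALSE
      SELECT employees, NULL AS pv_id, NULL AS pv_date,
             NULL AS amount, NULL AS order_date
      FROM accounts
      UNION ALL
      SELECT NULL, id, date, NULL, NULL FROM pageviews
      UNION ALL
      SELECT NULL, NULL, NULL, amount, date FROM orders
    )
    SELECT strftime('%Y-%m', COALESCE(pv_date, order_date)) AS month,
           SUM(employees), COUNT(pv_id), SUM(amount)
    FROM base
    GROUP BY 1
    ORDER BY 1
""").fetchall()
# Employees have no month, so they group to a NULL-month row; nothing is
# double counted no matter which dimensions you pick.
print(rows)
# [(None, 80, 0, None), ('2017-01', None, 2, None), ('2017-02', None, 1, 40)]
```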

Flexibility in the “base” of the explore - Consider what happens if you structure your query like so:

SELECT ...
FROM (SELECT NULL) as base
FULL OUTER JOIN accounts ON FALSE
FULL OUTER JOIN products ON FALSE
FULL OUTER JOIN pageviews ON FALSE
FULL OUTER JOIN orders ON FALSE
...

…and then you leverage Looker’s query writing to choose which joins to include based on the user’s selection in the explore UI. Suddenly, you have one explore that selectively queries from any of the four tables without requiring that you start from another specific table.

Show me the model!

The model has a lot of interdependencies and as a result, it is rather difficult to maintain manually. So, I made a generator that will build this model for you. Put in a few tables to see how it can work with your schema.

Update 10/6/17: I am in the process of majorly overhauling the generator including significant performance optimizations. Hope to share more soon!

Update 11/20/17: You may also be interested in my “one-to-many multiple join paths” approach which is a bit less general, but much easier to implement: A cure for the one_to_many blues

Appendix: Dialect considerations

  • BigQuery Standard :smiley:
  • Redshift :slight_smile: Redshift inherited a limitation from Postgres that only merge-joinable conditions can be full outer joined. Although ‘ON FALSE’ is trivially merge joinable, Redshift doesn’t detect that. However, we can trigger a merge join by adding superfluous equality constraints between the distribution and sort keys of all our “base” tables. See the commented model code for examples.
  • MySQL :’( No support for full outer joins
  • Postgres ?? - Similarly to Redshift, Postgres only allows full outer joins if the tables are merge joinable. Testing would be required to see if this could be consistently enabled.
  • Other dialects ?? - Test it out and let us know in the comments

(lloyd tabb) #2

Wow, what an interesting pattern. Essentially a wide UNION!


(Kevin Marr) #3

Wow. This enables us to avoid rolling up fact/event tables into DTs before being able to join them. Fabio, congratulations on melting my face! :thumbsup:


(fabio) #4

With the official release of _in_query in 4.18, customers can now try out this pattern for themselves!