A simple explanation of Symmetric Aggregates or: Why On Earth Does My SQL Look Like That?

symmetric
lookml

(Daniel Mintz) #1

One of Looker’s most powerful features is what we call Symmetric Aggregates. But because they can look pretty gnarly and mostly happen behind the scenes, coming upon them can be a bit confusing. So here’s quick rundown of why they exist, how they work, and why they’re so…um…unattractive.

Why Do You Need Symmetric Aggregates?

SQL, the language of data analysis, is amazingly powerful. But with that great power comes great responsibility, and one of the responsibilities analysts have had to manage is not accidentally calculating incorrect aggregates (e.g. sums, averages, counts). (There are literally threads all over the internet about this problem.)

Doing these calculations incorrectly is surprisingly easy and probably the cause of more analyst heartburn than we’ll ever know. An example makes it clear how you can go wrong.

Imagine you have two tables (orders and order_items). The order_items table records one row for each item in an order, so the relationship between the tables is one-to-many (because one order can have many items, but each item can only be part of one order). Let’s say the orders table looks like this:

order_id | user_id  | total | order_date
-------------------------------------------
1        | 100      | 50.36 | 2017-12-01
2        | 101      | 24.12 | 2017-12-02
3        | 137      | 50.36 | 2017-12-02

If you wanted to get the total amount spent, that’d be easy. SUM(total) = 124.84. Now let’s say the order_items table contains six rows:

order_id | item_id  | quantity | unit_price
--------------------------------------------
1        | 50       | 1        | 23.00
1        | 63       | 2        | 13.68      
2        | 63       | 1        | 13.68
2        | 72       | 1        | 5.08
2        | 79       | 1        | 5.36
3        | 78       | 1        | 50.36

If you wanted to get the count of items ordered, that’d be easy, too. SUM(quantity) = 7. Now let’s say we joined these two tables using their shared column, order_id, which gives us:

order_id | user_id  | total | order_date | item_id | quantity | unit_price 
---------------------------------------------------------------------------
1        | 100      | 50.36 | 2017-12-01 | 50      | 1        | 23.00
1        | 100      | 50.36 | 2017-12-01 | 63      | 2        | 13.68      
2        | 101      | 24.12 | 2017-12-02 | 63      | 1        | 13.68
2        | 101      | 24.12 | 2017-12-02 | 72      | 1        | 5.08
2        | 101      | 24.12 | 2017-12-02 | 79      | 1        | 5.36
3        | 137      | 50.36 | 2017-12-02 | 78      | 1        | 50.36

Now we can see some new things, like that three items were ordered on Dec 1 and four items were ordered on Dec 2. And some of our previous calculations, like SUM(quantity) will continue to work fine. But what if we try to get the total spent? Now we’ve got a problem.

Because now, if we use our previous calculation, SUM(total), the 50.36 total for order_id = 1 is going to be counted twice because it contained two different items, and the 24.12 total for order_id = 2 is going to be counted three times! So now SUM(total) will give us 223.44, instead of the correct answer, 124.84.

This is bad. Because while it’s easy to avoid this kind of mistake when you have two tiny example tables, things get way more complicated in real life, with lots of tables and lots of data. And this is exactly the kind of miscalculation someone could make without even realizing.

This is the problem Symmetric Aggregates solves.

How Symmetric Aggregates Work

Symmetric Aggregates prevent analysts—and anyone using Looker—from accidentally miscalculating these types of aggregates. This is a huge burden off analysts’ shoulders, knowing that no one is going to charge ahead with incorrect data.

The way Symmetric Aggregates does this, in plain English, is by keeping track of what you’re calculating and making sure to count each fact in the calculation the correct number of times. So, in the above example, Symmetric Aggregates realizes that total is a property of orders (not order_items) and so it needs to count each order’s total only once to get the correct answer.

It does this using a unique primary key that analysts define in Looker. That means when Looker is doing calculations on the joined table:

order_id | user_id  | total | order_date | item_id | quantity | unit_price 
---------------------------------------------------------------------------
1        | 100      | 50.36 | 2017-12-01 | 50      | 1        | 23.00
1        | 100      | 50.36 | 2017-12-01 | 63      | 2        | 13.68      
2        | 101      | 24.12 | 2017-12-02 | 63      | 1        | 13.68
2        | 101      | 24.12 | 2017-12-02 | 72      | 1        | 5.08
2        | 101      | 24.12 | 2017-12-02 | 79      | 1        | 5.36
3        | 137      | 50.36 | 2017-12-02 | 78      | 1        | 50.36

it says to itself, “Hey, even though there are two rows with order_id = 1, I shouldn’t double count the total, because I’ve already included it in my calculation. And for the three rows with order_id = 2, I should ignore the total in the second and third rows because they’re repeats.”

Now, it’s worth noting that Symmetric Aggregates depend on a unique primary key and the correct join relationship being specified in the model. So if the results you’re getting look wrong, talk to an analyst to make sure that’s all set up right.

Why Do Symmetric Aggregates Look That Way:

If you haven’t seen Symmetric Aggregates in the wild, they can be a bit unsettling. Because when they kick in, Looker goes from writing nice, well-behaved SQL like:

SELECT 
	order_items.order_id  AS "order_items.order_id",
	order_items.sale_price  AS "order_items.sale_price"
FROM order_items  AS order_items

GROUP BY 1,2
ORDER BY 1 
LIMIT 500

to wild, crazy SQL like:

SELECT 
	order_items.order_id  AS "order_items.order_id",
	order_items.sale_price  AS "order_items.sale_price",
	(COALESCE(CAST( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(users.age ,0)
	*(1000000*1.0)) AS DECIMAL(38,0))) + 
	CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
	* 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0)) ) 
	- SUM(DISTINCT CAST(STRTOL(LEFT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))
	* 1.0e8 + CAST(STRTOL(RIGHT(MD5(CONVERT(VARCHAR,users.id )),15),16) AS DECIMAL(38,0))) )  
	AS DOUBLE PRECISION) 
	/ CAST((1000000*1.0) AS DOUBLE PRECISION), 0) 
	/ NULLIF(COUNT(DISTINCT CASE WHEN  users.age  IS NOT NULL THEN users.id  
	ELSE NULL END), 0)) AS "users.average_age"
FROM order_items  AS order_items
LEFT JOIN users  AS users ON order_items.user_id = users.id 

GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 500

The exact format that Symmetric Aggregates takes depends on the dialect of SQL that Looker is writing, but they’re all doing the same basic thing: If multiple rows have the same primary key, Symmetric Aggregates only counts them one time. It does this by using the little-known SUM DISTINCT and AVG DISTINCT functions that are part of the SQL standard.

To see how this happens, let’s take the calculation we did above and work it through with Symmetric Aggregates. Of the seven columns in the joined tables, we only need two—the one we’re aggregating (total) and the unique primary key for orders (order_id).

order_id | total
-----------------
1        | 50.36
1        | 50.36
2        | 24.12
2        | 24.12
2        | 24.12
3        | 50.36

Symmetric Aggregates takes the primary key—order_id in this case—and creates a very large number for each that is guaranteed to be unique but always give the same output for the same input. (It generally does this with a hashing function, the details of which are beyond the scope of this article.) So that’d look something like this:

big_unique_number        | total
---------------------------------
802959190063912          | 50.36
802959190063912          | 50.36
917651724816292          | 24.12
917651724816292          | 24.12
917651724816292          | 24.12
110506994770727          | 50.36

Then for each row, Looker does this:

SUM(DISTINCT big_unique_number + total) - SUM(DISTINCT big_unique_number)

And guess what. That works.

It reliably gives you the correctly aggregated totals, counting each total exactly the right number of times. It isn’t fooled by repeated rows or by multiple orders that have the same total. Don’t believe me? Do the math yourself and you’ll get a better feel for how it works.

There’s no question that the SQL required to do this, with CAST() and md5() and SUM(DISTINCT) and STRTOL(), isn’t the prettiest to look at. And you certainly wouldn’t want to write it by hand. But luckily, you don’t have to. Looker writes it for you and it just works. No wrong answers, no joins that “aren’t allowed”, and no headaches for analysts.

P.S. As of Looker 5.2, Symmetric Aggregates have gotten even smarter. Now, when an aggregation is going to work properly without needing Symmetric Aggregates, Looker detects this automatically and doesn’t bother with Symmetric Aggregates. Since Symmetric Aggregates impose some performance costs, this further optimizes the SQL that Looker generates to make it as efficient as possible, while still guaranteeing the right answer.


(Patrick Barry) #2

Great article Dan!

When you say Symmetric Aggregates have changed in Looker 5.4, can you elaborate? I don’t see anything in the release notes about that.


(Morgan Imel) #3

Hey @Patrick_Barry!

It was actually changed in 5.2, which you can see in the release notes here. I’ll let Daniel know that that’s a typo.
For a one-to-many relationship, there isn’t any risk of fanout, so there’s no need to apply symmetric aggregates, which we used to do. This change improves performance in that situation.


(Oliver Jones) #4

Wow. You guys are channeling Kurt Gödel. Very interesting stuff.