Top 5 and Other

normal_priority
open
reply

(Paul Wadsworth) #1

As part of a dashboard I’ve got a chart which is pivoted to show dates down the rows, and categories across the columns. Unfortunately, there are 20+ categories, and sorting them by volume the last 10+ are only showing a count of 1 or 2.

The resulting chart (stacked) looks really busy (even sorted descending) and the legend is especially busy.

What I’d like to do is group into a Top 5, and put anything else in “Other”. Preferably I’d like to show the Top 5 for each date period, so the legend might still be showing 10 categories + other.

I found something that used some sort of Row Order calculation, but I couldn’t get it to work, because I need the ordering over the columns not the rows, and there didn’t seem to be a Column equivalent. And if I switch the Date and Category in the pivot, I couldn’t get the stacked chart to still show dates on the bottom axis. Even if that had have worked, it wasn’t showing an “Other” anyway.

I’ve also come across references to using Derived Tables to get this to work, but honestly every time I try and look through the articles and work out how to apply it I get completely lost.

Is there a simple way to get a Top 5 and Other?


(Izzy) #2

The simplest way by far is to belay the derived table approach and instead use table calculations. Zev outlines a nice approach here: Other buckets using Table Calculations.

The calculations look a bit more complex than they are cause there’s comment blocks in them, but they’re actually pretty simple. Check it out!


(Paul Wadsworth) #4

Hi, the trouble I have with this solution is the same as I had with the Row Order stuff I found, this only seems to work on an unpivoted table of data, and only when I’m wanting the top 5 rows or similar.

What I have is:


And this is covering 36 categories, a lot of which might be empty on a particular date period, or only have a volume or 1 or 2.

The solution so far has been to hard code 5 categories into the LookML with a CASE, which obviously isn’t dynamic, this gives us:


While this looks similar, it makes a huge difference on the Legend at least, as it’s 6 categories rather than 36. And the managers are happy with this, but I’d imagine at some point it could do with being dynamic.

The problem with the linked solution is that it needs me to sort by volume. I can add in a total line, and limit the report to 5 columns, but I can’t sort by the total row, only by name it seems, so I end up with the first 5 or last 5 categories alphabetically.

If I switch the Axis’ in the data, I can add a row total, sort it by that and get my top 5 categories, but now the visualisation isn’t showing me the changes over time.

Like I’ve said in this post (not the original one) they seem happy with these 5 categories hard coded in the LookML right now, so it’s not a huge priority, just something I’d quite like to know going forward


(Hugo Selbie) #5

Hi @Paul_Wadsworth, the crux of what you’re asking is to create some kind of identifier that ranks your data. Typically this either exists in your data as part of the ETL process, or you can create a derived table using a rank window function and partition by whichever column you feel is appropriate to create the correct ranks you desire.

Once you have that, we can pretty easily call out your top ‘x’ patterns using parameters in LookML using this pattern (this is assuming you don’t have a rank in your data and have to create one) (redshift dialect).

view: top_5_brands_vs_other_ranked_by_user_selected_parameter {
  label: "Top 5 Brands (Select Metric to Rank By)"
  derived_table: {
    #parameter value specifies which of the rankings from the inner table to use
    sql:
    select brand,
    {% parameter brand_rank_criteria %} as brand_rank
    from
    (
      select brand,
      --metric to rank over is set by the user's brand_rank_criteria parameter selection
      rank() over(order by count(*) desc) as brandRankOrderItemCount,
      rank() over(order by sum(order_items.sale_price) desc) as brandRankSalePrice,
      rank() over(order by avg(order_items.sale_price) desc) as brandRankAvgSalePrice,
      rank() over(order by count(distinct user_id) desc) as brandRankDistinctUserCount,
      rank() over(order by sum(order_items.sale_price)-sum(inventory_items.cost) desc) as brandRankMargin


      FROM public.order_items  AS order_items
      LEFT JOIN public.inventory_items  AS inventory_items ON order_items.inventory_item_id = inventory_items.id
      LEFT JOIN public.products  AS products ON inventory_items.product_id = products.id
      group by brand
    )brand_summary
    ;;
  }

  dimension: brand {
    primary_key: yes
    hidden: yes
    type: string
    sql: ${TABLE}.brand ;;
  }

  parameter: brand_rank_criteria {
    description: "Specify which metric to rank brands by"
    type: unquoted
    #Set a default value so that the dynamic ranking still works even if the user doesn't use the parameter.
    #Parameter defualt values work better without underscores, otherwise they sometimes load as '' when added, rather than with the corresponding label.
    default_value: "brandRankOrderItemCount"
    allowed_value: {
      label: "Order Items Sold Count Rank"
      value: "brandRankOrderItemCount"
    }
    allowed_value: {
      label: "Order Items Average Sale Price Rank"
      value: "brandRankAvgSalePrice"
    }
    allowed_value: {
      label: "Distinct Users with Order Items Count Rank"
      value: "brandRankDistinctUserCount"
    }
    allowed_value: {
      label: "Total Margin Rank"
      value: "brandRankMargin"
    }
  }

#### This parameter will allow a user to select a Top N ranking limit for bucketing the brands, almost like parameterizing the Row Limit in the UI
  parameter: brand_rank_limit {
    type: unquoted
    default_value: "5"
    allowed_value: {
      label: "Top 5"
      value: "5"
    }
    allowed_value: {
      label: "Top 10"
      value: "10"
    }
    allowed_value: {
      label: "Top 20"
      value: "20"
    }
    allowed_value: {
      label: "Top 50"
      value: "50"
    }
  }

  dimension: brand_rank_top_5_brands {
    #Adjust the label that appears in visualization to match the ranking criteria
    label_from_parameter: brand_rank_criteria
    type: string
    sql: case when ${TABLE}.brand_rank<={% parameter brand_rank_limit %} then cast(${TABLE}.brand_rank as varchar) else 'other' end ;;
  }

  dimension: brand_name_top_5_brands {
    label: "Brand Name (Top 5)"
    type: string
    sql: case when ${TABLE}.brand_rank<={% parameter brand_rank_limit %} then ${TABLE}.brand else 'other' end ;;
  }
}

This can look confusing at first but we can walk through step by step.
Firstly the derived table sql.

select brand,
    {% parameter brand_rank_criteria %} as brand_rank
    from
    (
      select brand,
      --metric to rank over is set by the user's brand_rank_criteria parameter selection
      rank() over(order by count(*) desc) as brandRankOrderItemCount,
      rank() over(order by sum(order_items.sale_price) desc) as brandRankSalePrice,
      rank() over(order by avg(order_items.sale_price) desc) as brandRankAvgSalePrice,
      rank() over(order by count(distinct user_id) desc) as brandRankDistinctUserCount,
      rank() over(order by sum(order_items.sale_price)-sum(inventory_items.cost) desc) as brandRankMargin


      FROM public.order_items  AS order_items
      LEFT JOIN public.inventory_items  AS inventory_items ON order_items.inventory_item_id = inventory_items.id
      LEFT JOIN public.products  AS products ON inventory_items.product_id = products.id
      group by brand
    ) brand_summary

Here we’re using a subquery to create a brand rank by various parameters and then allowing the end user to select the appropriate ranking depending on what the end user is interested in using this parameter. This is formatted in the LookML as follows:

parameter: brand_rank_criteria {
    description: "Specify which metric to rank brands by"
    type: unquoted
    #Set a default value so that the dynamic ranking still works even if the user doesn't use the parameter.
    #Parameter defualt values work better without underscores, otherwise they sometimes load as '' when added, rather than with the corresponding label.
    default_value: "brandRankOrderItemCount"
    allowed_value: {
      label: "Order Items Sold Count Rank"
      value: "brandRankOrderItemCount"
    }
    allowed_value: {
      label: "Order Items Average Sale Price Rank"
      value: "brandRankAvgSalePrice"
    }
    allowed_value: {
      label: "Distinct Users with Order Items Count Rank"
      value: "brandRankDistinctUserCount"
    }
    allowed_value: {
      label: "Total Margin Rank"
      value: "brandRankMargin"
    }
  }

Now we’re set up for success with a rank of our data. We can take this to a logical next step with parameters and allow the end user to select the top ‘x’ rank that they desire like this. Creating dimension with a case when statement in the sql: parameter of the LookML that will bucket anything beneath the value parameter of brand_rank_limit as ‘other’ and show the top ‘x’. You can see this here ${TABLE}.brand_rank<={% parameter brand_rank_limit %}

#### This parameter will allow a user to select a Top N ranking limit for bucketing the brands, almost like parameterizing the Row Limit in the UI
  parameter: brand_rank_limit {
    type: unquoted
    default_value: "5"
    allowed_value: {
      label: "Top 5"
      value: "5"
    }
    allowed_value: {
      label: "Top 10"
      value: "10"
    }
    allowed_value: {
      label: "Top 20"
      value: "20"
    }
    allowed_value: {
      label: "Top 50"
      value: "50"
    }
  }

  dimension: brand_rank_top_5_brands {
    #Adjust the label that appears in visualization to match the ranking criteria
    label_from_parameter: brand_rank_criteria
    type: string
    sql: case when ${TABLE}.brand_rank<={% parameter brand_rank_limit %} then cast(${TABLE}.brand_rank as varchar) else 'other' end ;;
  }

  dimension: brand_name_top_5_brands {
    label: "Brand Name (Top 5)"
    type: string
    sql: case when ${TABLE}.brand_rank<={% parameter brand_rank_limit %} then ${TABLE}.brand else 'other' end ;;
  }
}

This will get you an explore that looks something like this.

With SQL generated like this.

WITH top_5_brands_vs_other_ranked_by_user_selected_parameter AS (select brand,
    brandRankOrderItemCount as brand_rank
    from
    (
      select brand,
      --metric to rank over is set by the user's brand_rank_criteria parameter selection
      rank() over(order by count(*) desc) as brandRankOrderItemCount,
      rank() over(order by sum(order_items.sale_price) desc) as brandRankSalePrice,
      rank() over(order by avg(order_items.sale_price) desc) as brandRankAvgSalePrice,
      rank() over(order by count(distinct user_id) desc) as brandRankDistinctUserCount,
      rank() over(order by sum(order_items.sale_price)-sum(inventory_items.cost) desc) as brandRankMargin


      FROM public.order_items  AS order_items
      LEFT JOIN public.inventory_items  AS inventory_items ON order_items.inventory_item_id = inventory_items.id
      LEFT JOIN public.products  AS products ON inventory_items.product_id = products.id
      group by brand
    )brand_summary
    )
SELECT 
	case when top_5_brands_vs_other_ranked_by_user_selected_parameter.brand_rank<=5 then cast(top_5_brands_vs_other_ranked_by_user_selected_parameter.brand_rank as varchar) else 'other' end  AS "top_5_brands_vs_other_ranked_by_user_selected_parameter.brand_rank_top_5_brands",
	case when top_5_brands_vs_other_ranked_by_user_selected_parameter.brand_rank<=5 then top_5_brands_vs_other_ranked_by_user_selected_parameter.brand else 'other' end  AS "top_5_brands_vs_other_ranked_by_user_selected_parameter.brand_name_top_5_brands",
	COALESCE(COALESCE(CAST( ( SUM(DISTINCT (CAST(FLOOR(COALESCE(order_items.sale_price ,0)*(CAST(1000000 AS DOUBLE PRECISION)*1.0)) AS DECIMAL(38,0))) + CAST(STRTOL(LEFT(MD5(CAST(order_items.id  AS VARCHAR)),15),16) AS DECIMAL(38,0))* 1.0e8 + CAST(STRTOL(RIGHT(MD5(CAST(order_items.id  AS VARCHAR)),15),16) AS DECIMAL(38,0)) ) - SUM(DISTINCT CAST(STRTOL(LEFT(MD5(CAST(order_items.id  AS VARCHAR)),15),16) AS DECIMAL(38,0))* 1.0e8 + CAST(STRTOL(RIGHT(MD5(CAST(order_items.id  AS VARCHAR)),15),16) AS DECIMAL(38,0))) )  AS DOUBLE PRECISION) / CAST((CAST(1000000 AS DOUBLE PRECISION)) AS DOUBLE PRECISION), 0), 0) AS "order_items.total_sales_price"
FROM public.order_items  AS order_items
LEFT JOIN public.inventory_items  AS inventory_items ON order_items.inventory_item_id = inventory_items.id 
LEFT JOIN public.products  AS products ON inventory_items.product_id = products.id 
LEFT JOIN top_5_brands_vs_other_ranked_by_user_selected_parameter ON top_5_brands_vs_other_ranked_by_user_selected_parameter.brand=products.brand  

GROUP BY 1,2
ORDER BY 1 
LIMIT 500

Happy Bucketing :slight_smile:


(Paul Wadsworth) #6

I wasn’t sure where to put this, but since it starts with “view” I guessed it needed to be a new View.

I went through and put this code in, but I thought I’d test it just with top 5 hard coded and only the COUNT(*) metric. About half way through I thought of 2 problems I’d have, the first being that if I’m defining “Other” in this view, then it just wouldn’t work on the join into the model, the second being I couldn’t work out how the filters of any Explore/Dashboard would change the top 5.

I was surprised when I tested it, that the “brand_name_top_5_brands” joined to “brand” actually put anything that didn’t match under “other”. I don’t understand how that works, but I’m glad it does

However the Top 5 I’m getting doesn’t care about any filters on the report. Since it’s querying the database for the top 5 instead of the filtered data, I’m not surprised, but I figured I needed to put this into the Model rather than being a separate view, but Looker tells me a Derived Table isn’t a valid property for a join.

Thanks for this detailed explanation that has helped me understand Derived Tables more than I did, but the way this is set up suggests I need to hard code any filters straight in there, like if someone wants to know the Top 5 Orders by only looking at Saturdays.

For example, while I’ve successfully got the Top 5 and other using your help, the chart I have now is essentially 2 categories, since the other 4 have 0 in them, since the Top 5 over the last 3 years is different to the last 5 over the last 12 weeks.