Running Total of Distinct Values

calculations
done
low_priority
visualizations
reply

(Andrew H) #1

Hello,

I’m looking to create a cumulative line graph of distinct values. I’m aware of the Running_total() function as well as Count_distinct() function.

Example:
My table have 4 days

Day 1 - A,B,C
Day 2 - A,B,D
Day 3 - B,C
Day 4 - E

My results should be

Day 1: 3
Day 2: 4
Day 3: 4
Day 4: 5

I’ve nested Count_distinct within running_total but its returning:

Day 1: 3
Day 2: 6
Day 3: 8
Day 4: 9

The Table Calculation: running_total(id_count})
The Measure:
measure: id_count {
type: count_distinct
sql: ${TABLE}.“ID” ;;
}

Any suggestions? Much appreciated!

Cheers,
Andrew


The Podium — March 20th, 2019
(Fabio) #2

If you have 5,000 distinct values or less in your dataset of interest, and your model has a MIN(date) measure, you can do this in the explore (otherwise you’ll need to use similar logic, but in the SQL of a derived table)

First, dimension by your entities / distinct values, measure them by their min (earliest) date, and sort by the min date. Using your example data, you’ll get a table like:

| Thing | Min Date |
--------------------
|A      | 1        |
|B      | 1        |
|C      | 1        |
|D      | 2        |
|E      | 4        |

Then add two table calcs, one for a running total, and one to only show the last row of each Min Date value.

  • Running Count: row() + if(is_null(${min_date}),0,0)
    ^ We add the +if(...) to reference the measure and ensure this goes to the right of it, for visualization purposes
  • Show row - ${min_date} != offset(${min_date},1)
    ^ We’ll show only the last row (running count) within each value of min date
| Thing | Min Date | RC | Show? |
---------------------------------
|A      | 1        | 1  | No    |
|B      | 1        | 2  | No    |
|C      | 1        | 3  | Yes   |
|D      | 2        | 4  | Yes   |
|E      | 4        | 5  | Yes   |

Finally, hide the first column, and use “Hide No’s” on the last column, and pick a line vis.


(Andrew H) #3

Thank you Fabio, this approach makes sense. I think for our use case we’ll have to do this on the SQL side given the size of the data set. Much Appreciated!


(Fabio) #4

If you want to pre-calculate some of this in SQL and then allow users to pick their own timeframes, it will help to create a roll-up table with dimensions “date”, “prior date”, and measure “count”. Start by using a window function to partition by entity, order by date, and add a ‘lag’ function:

Entity | Date | Prior Date
--------------------------
A      | 1    | NULL
A      | 2    | 1
B      | 1    | NULL
B      | 2    | 1
B      | 3    | 2
C      | 1    | NULL
C      | 3    | 1
D      | 2    | NULL
E      | 4    | NULL

Then group and aggregate that to get this rollup table, which you would persist:

Date | Prior Date | Count
-------------------------
1    | NULL       | 3
2    | NULL       | 1
2    | 1          | 2
3    | 1          | 2
4    | NULL       | 1
5    | NULL       | 1

Then to get any cumulative distinct count, you can just group by date (or some function thereof, like week or month), filter on last date being outside of the range for which you want the measure to be distinct (because the ones where last date is within in that same range are not unique), and sum.