[Analytic Block] State or Status Data and Slowly Changing Dimensions

About This Block

State data, also known as slowly changing dimensions (SCDs), are a growing datatype found in modern data management and analytics. SCDs are specific types of data that can change over time at any moment. With the proper table schema, we want to establish a structure that can not only identify the current state of a record, but also provide historical record keeping of any previous states of that record.

You may find yourself interested in managing state data for the below reasons:

  • Desire to perform deep-dive historical analysis
  • Working with modern, “append-only” technologies such as BigQuery and Hadoop as your analytics datastore.

In this block, I will be exploring a way to leverage Looker to analyze this type of data (specifically slowly changing dimensions type 2)

Ideal Data Types

Below are some examples of data that are often considered state data:

  • Dimension tables containing randomly changing attributes (address, subscription, etc)
  • Milestone or progress tables
  • Historic tables

Expected Output

Let’s start with a sample state data table - I’m going to use a “User Account” table as an example. It is reasonable to expect that a given user’s account status may change over time. Therefore, the table captures attributes about each user, and the respective timeframes for each status (or “Account Status” in the picture below), prior to a change in status. Notice how a user’s current “Account Status” is indicated by specifying the “End” timestamp as 9999-12-13 23:59:59.

Explore Data in Full Screen

Given that we are populating this users table this way, how can we leverage Looker to answer current and historical questions about our users?

Current State: Users and statuses

While a simple COUNT might work on a standard users table, it will overreport users on a SCD table. This is due to the existence of a record for every account state of a user. Therefore, we will create a custom measure.

  - measure: user_count
    type: count_distinct
    sql: ${user_id}

We will now be looking at a count of distinct users within the table based on the user_id. We need to use a count_distinct because the same user can appear multiple time in this table. The output of this measure against the table above will be 10.

Being able to leverage count_distinct to get the unique user count from a state table is useful; however, a different approach is needed to only look at current records. Remember how all current records have an end timestamp of 9999-12-31 23:59:59? We can use the phrase after today in the matches filter type to exclude historial rows.

Simply apply the “after today” phrase into a matches filter on the end dimension. Now you can bring in all desired fields.

Historical State: User Count and Detail

The real power of the state table is being able “rewind” and gain state data at any previous point in time. Again, let’s start with a simple user count. In the current example, I created a custom measure with a filter that only pulled current records. Since we want the flexibility to look at any point in history, we will perform our analysis entirely in the Explore while leveraging filters there.

Pretend we want to see user account states back on November 19th, 2011. Simple specify a start and end date filter and bring in the fields. Remember to filter the “begin date” before the date of interest and the “end date” after the date of interest.

Notice how this is the current state for some users but a previous state for others. Comparing to the image above, you can see that certain users are at a lower status at this point in time. See here for an example.

Timeseries Data

User state data that evolves over time lends itself nicely to timeseries analysis. In our example, we have users whose statuses improve over time. We can setup custom measures to watch as the counts change over time.

First, we need to join our user accounts state table to a standardized date table. This is accomplished via a join that respects the user state. Notice how the join predicate is satisfied as long as the user’s state is applicable during that date of the date table.

From here, we can build things like filtered measures to watch when statuses are obtained by all our users, as well as how quickly those statuses are vacated (either by promotion or demotion).

Explore Data in Full Screen

Try it Yourself!

I’ve created an Explore for this sample user state table found here. If you’d like to perform some timeseries analysis, use a separate explore that has a join between the user state table and a standardized date table, found here.

Perhaps the most important part of implementing SCDs is seeing the relationships established in the model, found here. I wouldn’t focus too much on the views themselves as I am manipulating some demo data for the sake of example, but the join logic between the state tables’ timestamp columns and the standardized date table is important to see and understand.

3 Likes

Hi, how can I add end timestamp to the records in this case from dev mode? The table that I have contains date column, but it’s useless in this case and I am not going to use it.
I would need to compare “count” of user-id-s according specific dimension in a timeframe. So, have the counted dimension in a historical view. Can retrieve dates from calendar table but would need to join it with some “date” that I don’t have in a table. So, X axis will be calendar date and Y will be “count” (boolean dimensien). Is that possible? Thanks