[Analytic Block] Sessionization

(Anicia Santos) #1

About This Block

Event data is rich with behavioral indicators for application usage analysis. Unfortunately, raw event data isn’t structured in the best format for analytics, and it rarely captures arguably the most crucial element for app usage analysis: session start and end times for a particular user. Without these pieces, it’s difficult to understand how events relate to each other or to derive any meaningful analysis.

This block is designed to construct sessions out of individual user event streams in order to understand how users flow through and engage with a site. By assigning a unique Session ID to each event, we can explore and answer some interesting questions that help improve user retention and conversion.

A few questions we can start to answer with simple sessionization are:

  • What events cause bounces?
  • Which events end a session or trigger cessation of usage?
  • How long do users typically use the application?
  • Where do these users spend most of their time during usage?
  • What features and areas of the application are used and where in the workflow?
  • Which sessions will encourage or discourage retention?

Ideal Data Types

This block is built for event data that ideally includes at least a timestamp for event creation, a User ID, and a unique Event ID.

Expected Output

By assigning a Session ID to each event, we can also start to build out a fact table that defines which event was the start and end of a session, giving us a Session Landing Page and a Session Exit Page. We can use this to look at the pages customers most frequently exit from.
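For example, once the block below is built, a query along these lines against the session_facts output surfaces the most common exit pages (the table name here is just a stand-in for the generated PDT):

    SELECT
        session_exit_page
      , COUNT(*) AS sessions        -- number of sessions ending on each page
    FROM session_facts              -- stand-in for the generated session_facts PDT
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;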

In our session facts table, we can also include a data point for the total number of events in a session. Sessions that contain only a single event are bounces: the visitor landed on a page and left the site without navigating anywhere else.
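Since events_in_session is included in the session_facts PDT below, a bounce-rate calculation per landing page can be sketched like this (again purely illustrative, querying the generated PDT directly):

    SELECT
        session_landing_page
      , COUNT(*) AS sessions
      , SUM(CASE WHEN events_in_session = 1 THEN 1 ELSE 0 END) AS bounced_sessions
      , ROUND(100.0 * SUM(CASE WHEN events_in_session = 1 THEN 1 ELSE 0 END) / COUNT(*), 1) AS bounce_rate_pct
    FROM session_facts              -- stand-in for the generated session_facts PDT
    GROUP BY 1
    ORDER BY bounce_rate_pct DESC;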

Our derived session start and session end boundaries can be used to calculate a session length. This answers the question of how long users typically spend on the site in a single session.

If we add a measure based on session length for ‘Average Session Length’, we can visualize which page is best to direct customers to after login.
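The average_session_length_seconds measure defined later in session_facts handles this in LookML; as a rough standalone sketch, average session length by session landing page in SQL would look something like this:

    SELECT
        session_landing_page
      , AVG(DATEDIFF('sec', session_start, session_end)::FLOAT) AS average_session_length_seconds
    FROM session_facts              -- stand-in for the generated session_facts PDT
    GROUP BY 1
    ORDER BY average_session_length_seconds DESC;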

Try it Out!

The block below was created using Redshift. An events table that records all site page visits is used to build three derived tables: sessions, events_sessionized (the event-to-session mapping), and session_facts. The window functions used to create sessionization are not compatible with MySQL databases.

This block is relatively simple to implement once you have changed the derived table references to match those of your own event data.

First, we will need to look at the idle time between events for a given user and use an arbitrary threshold to determine when an event should be assigned a new Session ID. Essentially, for a given user ID, if more than the chosen idle threshold has elapsed since the previous event for that user ID (or if there is no previous event), that event will be recognized as the first event in a session and assigned a new unique Session ID. Thirty minutes is a common choice; the example code below uses 60 minutes.

The following example code is limited to sessionizing only the last 60 days of event data so as not to overload the database. If you have pages where users are likely to be much slower or much faster, you can adjust the idle time threshold to be greater or less than 60 minutes.

view: sessions {
  derived_table: {
    sql_trigger_value: SELECT DATE(CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', GETDATE())) ;;
    distribution: "user_id"
    sortkeys: ["session_start"]
    sql: WITH lag AS
        (SELECT
                  logs.created_at AS created_at
                , logs.user_id AS user_id
                , logs.ip_address AS ip_address
                , DATEDIFF(
                    minute,
                    LAG(logs.created_at) OVER ( PARTITION BY logs.user_id, logs.ip_address ORDER BY logs.created_at)
                  , logs.created_at) AS idle_time
              FROM public.events as logs
              WHERE ((logs.created_at) >= (DATEADD(day,-59, DATE_TRUNC('day',GETDATE()) ))
                    AND (logs.created_at) < (DATEADD(day,60, DATEADD(day,-59, DATE_TRUNC('day',GETDATE()) ) ))) -- optional limit of events table to only past 60 days
              )
        SELECT
          lag.created_at AS session_start
          , lag.idle_time AS idle_time
          , lag.user_id AS user_id
          , lag.ip_address AS ip_address
          , ROW_NUMBER () OVER (ORDER BY lag.created_at) AS unique_session_id
          , ROW_NUMBER () OVER (PARTITION BY COALESCE(lag.user_id::varchar, lag.ip_address) ORDER BY lag.created_at) AS session_sequence
          , COALESCE(
                LEAD(lag.created_at) OVER (PARTITION BY lag.user_id, lag.ip_address ORDER BY lag.created_at)
              , '6000-01-01') AS next_session_start
        FROM lag
        WHERE (lag.idle_time > 60 OR lag.idle_time IS NULL)  -- session threshold (currently set at 60 minutes)
       ;;
  }

  measure: count {
    type: count
    drill_fields: [detail*]
  }

  dimension_group: session_start_at {
    type: time
    hidden: yes
    convert_tz: no
    timeframes: [time, date, week, month]
    sql: ${TABLE}.session_start ;;
  }

  dimension: idle_time {
    type: number
    value_format: "0"
    sql: ${TABLE}.idle_time ;;
  }

  dimension: unique_session_id {
    type: number
    value_format_name: id
    primary_key: yes
    sql: ${TABLE}.unique_session_id ;;
  }

  dimension: session_sequence {
    type: number
    value_format_name: id
    sql: ${TABLE}.session_sequence ;;
  }

  dimension_group: next_session_start_at {
    type: time
    convert_tz: no
    timeframes: [time, date, week, month]
    sql: ${TABLE}.next_session_start ;;
  }

  measure: count_distinct_sessions {
    type: count_distinct
    sql: ${unique_session_id} ;;
  }

  set: detail {
    fields: [
      session_start_at_time,
      idle_time,
      unique_session_id,
      session_sequence,
      next_session_start_at_time
    ]
  }
}

Next, we need to create a table that maps the events table to the sessions PDT; we will call it ‘events_sessionized’. This PDT takes each event’s creation time for a given user and finds the session whose start time and ‘next session start’ bracket that event.

view: events_sessionized {
  view_label: "Events"

  derived_table: {
    sql_trigger_value: SELECT DATE(CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', GETDATE())) ;;
    distribution: "event_id"
    sortkeys: ["created_at"]
    sql: SELECT
      ROW_NUMBER() OVER (ORDER BY log.created_at) AS event_id
    , log.ip_address
    , log.user_id
    , log.os
    , log.uri
    , log.event_type
    , log.browser
    , log.traffic_source
    , log.created_at
    , sessions.unique_session_id
    , ROW_NUMBER () OVER (PARTITION BY unique_session_id ORDER BY log.created_at) AS event_sequence_within_session
    , ROW_NUMBER () OVER (PARTITION BY unique_session_id ORDER BY log.created_at desc) AS inverse_event_sequence_within_session
FROM public.events AS log
INNER JOIN ${sessions.SQL_TABLE_NAME} AS sessions
  ON log.user_id = sessions.user_id
  AND log.ip_address = sessions.ip_address
  AND log.created_at >= sessions.session_start
  AND log.created_at < sessions.next_session_start
WHERE
  ((log.created_at) >= (DATEADD(day,-59, DATE_TRUNC('day',GETDATE()) ))  AND (log.created_at) < (DATEADD(day,60, DATEADD(day,-59, DATE_TRUNC('day',GETDATE()) ) )))
 ;;
  }

  measure: count {
    type: count
    drill_fields: [detail*]
  }

  dimension: event_id {
    primary_key: yes
    type: number
    value_format_name: id
    sql: ${TABLE}.event_id ;;
  }

  dimension: user_id {
    type: number
    sql: ${TABLE}.user_id ;;
  }

  dimension: unique_session_id {
    type: number
    value_format_name: id
    hidden: yes
    sql: ${TABLE}.unique_session_id ;;
  }

  dimension: page_name {
    type: string
    sql: ${TABLE}.uri ;;
  }

  dimension: event_type {
    type: string
    sql: ${TABLE}.event_type ;;
  }

  dimension: traffic_source {
    type: string
    sql: ${TABLE}.traffic_source ;;
  }

  dimension: event_sequence_within_session {
    type: number
    value_format_name: id
    sql: ${TABLE}.event_sequence_within_session ;;
  }

  dimension: inverse_event_sequence_within_session {
    type: number
    value_format_name: id
    sql: ${TABLE}.inverse_event_sequence_within_session ;;
  }

  set: detail {
    fields: [
      event_id,
      #ip_address,
      user_id,
      #os,
      traffic_source,
      #event_time_time,
      unique_session_id,
      event_sequence_within_session,
      inverse_event_sequence_within_session,
      #user_first_session_time,
      #session_landing_page,
      #session_exit_page
    ]
  }
}

Finally, we can create a Session Facts PDT. This fact table uses the data in events_sessionized to find each session’s landing and exit pages as well as its start and end timestamps.

view: session_facts {
  derived_table: {
    sql_trigger_value: SELECT DATE(CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', GETDATE())) ;;
    distribution: "unique_session_id"
    sortkeys: ["session_start"]
    sql: WITH session_facts AS
        (
          SELECT
             unique_session_id
            , logs_with_session_info.created_at
            , user_id
            , ip_address
            , uri
            , event_id
            , event_type
            , COALESCE(user_id::varchar, ip_address) as identifier
            , FIRST_VALUE (created_at) OVER (PARTITION BY unique_session_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS session_start
            , LAST_VALUE (created_at) OVER (PARTITION BY unique_session_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS session_end
            , FIRST_VALUE (event_type) OVER (PARTITION BY unique_session_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS session_landing_page
            , LAST_VALUE  (event_type) OVER (PARTITION BY unique_session_id ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS session_exit_page
          FROM
              ${events_sessionized.SQL_TABLE_NAME} AS logs_with_session_info
          GROUP BY 1,2,3,4,5,6, 7
          ORDER BY unique_session_id asc
        )
      SELECT
        session_facts.unique_session_id
        , session_facts.identifier
        , session_facts.session_start
        , session_facts.session_end
        , session_landing_page
        , session_exit_page
        , ROW_NUMBER () OVER (PARTITION BY session_facts.identifier ORDER BY MIN(session_start)) AS session_sequence_for_user
        , ROW_NUMBER () OVER (PARTITION BY session_facts.identifier ORDER BY MIN(session_start) desc) AS inverse_session_sequence_for_user
        , count(1) as events_in_session
      FROM session_facts
      INNER JOIN
        ${events_sessionized.SQL_TABLE_NAME} AS logs_with_session_info
      ON
        logs_with_session_info.created_at = session_facts.session_start
        AND logs_with_session_info.unique_session_id = session_facts.unique_session_id
      GROUP BY 1,2,3,4,5,6
      ORDER BY session_start asc
       ;;
  }

  dimension: unique_session_id {
    hidden: yes
    primary_key: yes
    type: number
    value_format_name: id
    sql: ${TABLE}.unique_session_id ;;
  }

  dimension_group: session_start_at {
    type: time
    convert_tz: no
    timeframes: [time, date, week, month]
    sql: ${TABLE}.session_start ;;
  }

  dimension_group: session_end_at {
    type: time
    convert_tz: no
    timeframes: [time, date, week, month]
    sql: ${TABLE}.session_end ;;
  }

  dimension: session_sequence_for_user {
    type: number
    sql: ${TABLE}.session_sequence_for_user ;;
  }

  dimension: inverse_session_sequence_for_user {
    type: number
    sql: ${TABLE}.inverse_session_sequence_for_user ;;
  }

  dimension: number_of_events_in_session {
    type: number
    sql: ${TABLE}.events_in_session ;;
  }

  dimension: session_landing_page {
    type: string
    sql: ${TABLE}.session_landing_page ;;
  }

  dimension: session_exit_page {
    type: string
    sql: ${TABLE}.session_exit_page ;;
  }

  dimension: session_length_seconds {
    type: number
    sql: DATEDIFF('sec', ${TABLE}.session_start, ${TABLE}.session_end) ;;
  }

  dimension: session_length_seconds_tier {
    type: tier
    tiers: [
      0,
      15,
      30,
      45,
      60,
      75,
      100
    ]
    sql: ${session_length_seconds} ;;
  }

  measure: average_session_length_seconds {
    type: average
    sql: ${session_length_seconds} ;;
  }

  measure: session_facts_count {
    type: count
    drill_fields: [detail*]
  }

  set: detail {
    fields: [
      unique_session_id,
      session_start_at_time,
      session_end_at_time,
      session_sequence_for_user,
      inverse_session_sequence_for_user,
      number_of_events_in_session,
      session_landing_page,
      session_exit_page
    ]
  }
}

With the above views set up, we just need to expose the data in an Explore. If you already have an Explore set up for your event data, just add the joins to it. If not, add the full Explore below:

explore: events {
  join: events_sessionized {
    view_label: "Events"
    relationship: one_to_one
    type: inner
    sql_on: ${events.id} = ${events_sessionized.event_id} ;;
  }
  
  join: sessions {
    relationship: many_to_one
    type: left_outer
    sql_on: ${events_sessionized.unique_session_id} = ${sessions.unique_session_id} ;;
  }
  
  join: session_facts {
    relationship: many_to_one
    type: inner
    view_label: "Sessions"
    sql_on: ${sessions.unique_session_id} = ${session_facts.unique_session_id} ;;
  }
} 

Once you save your model changes, don’t forget to run the LookML validator and test out a few explorations before pushing your changes to production!

(Angelim) #2

Hi @anicia,
I’m trying to implement this block, but I found some problems. On your explore, isn’t it supposed to be like this? I don’t think people will have a unique_session_id on their events table.

- join: sessions
  relationship: many_to_one
  type: left_outer
  #sql_on: ${events.unique_session_id} = ${sessions.unique_session_id}
  sql_on: ${event_mapping.unique_session_id} = ${sessions.unique_session_id}

Aside from that, I’m facing a lot of other errors when generating the derived tables. event_mapping was complaining that I couldn’t define created_at as a sortkey if it wasn’t being selected. After adding that, I got: "events_with_session_info.created_at" must appear in the GROUP BY clause or be used in an aggregate function.
If I try to aggregate anything else on session_facts I get another error.

Is there an updated block for sessionization?

Thank you

(Dirty Looker) #3

Hey @angelim,

The unique session id is generated in the outer SELECT and is not a part of the underlying dataset:

, ROW_NUMBER () OVER (ORDER BY find_idle_time.created_at) AS unique_session_id

The error you are seeing seems to be caused by a missing field in the projection; thanks for catching that! We’ll update the example SQL soon.

Mike

(Angelim) #4

I’d really appreciate it if you could send me a working example of this block. There’s a lot of stuff I can’t figure out apart from what I’ve already mentioned. For instance, on session_facts you’re grouping by 5 columns, but columns 7 and 8 are also dimensions, right? Won’t Redshift complain about that?

(Anicia Santos) #5

Hi @angelim
Thanks for catching the error in the sql_on clause for event_mapping. I have updated the post to reflect that correction.

In regards to your other comments, the block is adapted from the code we used to create the sessionization example in learnbeta. The model in there uses log data that does not have a unique event_id recorded, so (as @mikhailxu pointed out) we use a window function to essentially assign a unique event_id in the PDT built on top of the events table. I added a comment in the event_mapping PDT to indicate as much and reduce confusion going forward.

Finally, for session_facts, we should actually only group on unique_session_id as columns 2 through 8 are window function aggregates.

I believe support was able to get you up and running with sessionization but if you have any other questions feel free to reach out!
Thanks,
Anicia

(Angelim) #6

Hi @anicia,

I got it working based on that very same example on learnbeta. Got help all the way from London :slightly_smiling:
Unfortunately I forgot the analyst’s name, but he gave me access to learnbeta and pointed me in the right direction.

Thank you once again for the excellent service.

Regards

(Sonny Rivera) #7

I"m having the same issue with generating the session_fact table. Redshift is saying that we need to group the date field (created_at) . Mine is called etl_tstamp.

Let me know if you can help.

Sonny

(Dirty Looker) #8

Hey @srivera,

It could be that in the window function the ORDER BY clause is still using “created_at”; this can be swapped out for etl_tstamp to match the fields in the underlying events table.
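For example, the idle-time calculation in the sessions PDT would become something like this (sketch only, with etl_tstamp substituted wherever created_at appears):

    , DATEDIFF(
        minute,
        LAG(logs.etl_tstamp) OVER (PARTITION BY logs.user_id, logs.ip_address ORDER BY logs.etl_tstamp)
      , logs.etl_tstamp) AS idle_time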

(Akshay Singh) #9

Hi @anicia,

Thanks for sharing this, really useful. I implemented it in our Looker instance. As a quick check, I compared the number of unique sessions to Google Analytics sessions and saw a vast difference (Segment has about 30% of GA sessions), even though Segment feeds data to GA as well. Any ideas on why there would be such a discrepancy? The session definition remains the same (<30 mins).

(Zachary Michel) #10

@akshay524 There was a bug in the old Segment block that assumed all users had an id. We’ve since adjusted the LookML to account for the possibility of an anonymous_id. You’ll have to reach out to your Account Analyst or help.looker.com to get the updated block.

(Bharat Ayyar) #11

Is there a reason why this WHERE clause specifies both a >= and a < date?

WHERE (date_created) >= (DATEADD(day,-59, DATE_TRUNC('day',GETDATE()) )) 
AND (log.created_at) < (DATEADD(day,60, DATEADD(day,-59, DATE_TRUNC('day',GETDATE()) ) )))

Seems like just the first part is sufficient to limit the dataset to 60 days. But I can’t tell if I’m missing something or if this makes the query more performant.

(Izzy) #12

Looks like in the first place that WHERE appears in the code, there’s a comment that says

 -- optional limit of events table to only past 60 days

which seems to answer that question :slight_smile: Guess it’s optional, and only there if you want to limit the events. I bet you’re right, though, that it also makes things faster.
