Seasonal Sales Forecasting

snowflake
redshift
lookml

(jesse.sherb) #1

Scenario

In a related post, I illustrated a pattern that allows for some lightweight linear forecasting of measures using LookML. Naturally, the next question I received was is there is a pattern to consider seasonality? In this post, I will dive a little deeper into a pattern that allows for seasonal forecasting, leveraging SQL and your historical data.

The philosophy for this pattern is as follows: In order to predict sales for a day in the future, we will look at that day’s sales in the prior year. We’ll then multiply the current YTD growth rate to that number to get our forecasted amount. This takes into consideration current growth velocity with seasonality of sales in the prior year. Additionally, this means projections will get stronger each day that new data is collected.

Please note, this article contains the use of Looker’s derived tables.

Goal

Let’s create a sustainable pattern, leveraging SQL and LookML to create seasonal forecasting for any measure, with no ongoing maintenance needed.


Step 1: Create a date table

In order to report a measure on future dates, we must first create a table that creates these future dates (and continues to create). Depending on the dialect, this could be a bit tricky to accomplish. For this step, you can learn more about the intricacies of the pattern by referring to this post. Let’s start with Redshift:

WITH DATE_TABLE AS (
    SELECT X.*, 
           EXTRACT(YEAR FROM DAY)  AS YEAR,
           EXTRACT(MONTH FROM DAY) AS MONTH 
    FROM (
    SELECT PRODUCTS.BRAND                                                AS BRAND,  
           DATE(DATEADD('DAY',F.NUMBER,'2013-12-31'))                    AS DAY
    FROM PUBLIC.ORDER_ITEMS
    LEFT JOIN PUBLIC.INVENTORY_ITEMS ON ORDER_ITEMS.INVENTORY_ITEM_ID = INVENTORY_ITEMS.ID
    LEFT JOIN PUBLIC.PRODUCTS ON INVENTORY_ITEMS.PRODUCT_ID = PRODUCTS.ID
    CROSS JOIN (SELECT 
                  ROW_NUMBER() OVER () as number
                  FROM 
                    (SELECT 0 as n UNION SELECT 1) p0,
                    (SELECT 0 as n UNION SELECT 1) p1,
                    (SELECT 0 as n UNION SELECT 1) p2,
                    (SELECT 0 as n UNION SELECT 1) p3,
                    (SELECT 0 as n UNION SELECT 1) p4,
                    (SELECT 0 as n UNION SELECT 1) p6,
                    (SELECT 0 as n UNION SELECT 1) p7,
                    (SELECT 0 as n UNION SELECT 1) p8,
                    (SELECT 0 as n UNION SELECT 1) p9,
                    (SELECT 0 as n UNION SELECT 1) p10,
                    (SELECT 0 as n UNION SELECT 1) p11,
                    (SELECT 0 as n UNION SELECT 1) p12) F
                  GROUP BY 1,2
                  ORDER BY 1,2) X)

A few notes:

  1. We use a cross join to join our table that consist of a generated series of numbers to the Nth degree. Table ‘F’ in this example will generate a large series of integers (1,2,3,4,5….N). Each number in this series is then added (using dateadd) to a hardcoded minimum date; in this case 12/31/2013. I like to use the minimum date of when your data begins. In the above example, Table ‘F’ will generate a max value of 4,096. This means it’ll add 4,096 days to my date table starting with 12/31/2013. This would give us dates into 2025.
  2. In the example above, we are grouping by products.brand. It is important in this step to include whichever dimension(s) you’d like to forecast by. This step is creating a table that appends every future date to every brand. For example:
  3. Lastly, we throw this into a subquery and extract the year and month for every date. This will come into play later in the query.
  4. For the same pattern using Snowflake, please see the bottom of the post.

Step 2: Find historical data at the daily level

According to the premise of this pattern stated above, we will need to first aggregate our sales data (or any measure you’re forecasting) at the daily level prior to applying our growth rate to project future sales. I’ll call this our DAILY_HISTORY table:

DAILY_HISTORY AS
  (     
           SELECT D.*, 
                  SUM(REVENUE) OVER(PARTITION BY BRAND,YEAR ORDER BY DAY ROWS UNBOUNDED PRECEDING)        AS YTD_REVENUE
           FROM (       
           SELECT   PRODUCTS.BRAND                                                                        AS BRAND,
                    EXTRACT(YEAR FROM ORDER_ITEMS.CREATED_AT)                                             AS YEAR,                      
                    DATE(ORDER_ITEMS.CREATED_AT)                                                          AS DAY,
                    SUM(SALE_PRICE)                                                                       AS REVENUE
           FROM     PUBLIC.ORDER_ITEMS
           LEFT JOIN PUBLIC.INVENTORY_ITEMS ON ORDER_ITEMS.INVENTORY_ITEM_ID = INVENTORY_ITEMS.ID
           LEFT JOIN PUBLIC.PRODUCTS ON INVENTORY_ITEMS.PRODUCT_ID = PRODUCTS.ID
           GROUP BY 1,2,3) D)

A few notes:

  1. The first part of this is quite simple; for each brand (or dimension(s) you’re grouping by), find the total sales price for every day in the past. Include the year extraction for the following steps.
  2. Using the above note as a subquery, it is then imperative we then find the rolling YTD sales, for each day and brand, using a SUM window function. This is how we will ultimately find our growth rate ((Future Year YTD Sales - Previous Year YTD Sales) / Previous Year YTD Sales).

Step 3: Bringing together our CTE’s and applying forecasting logic

To this point, we have a table that consists of each brand and each day in the future. We also have a table that consists of each brand and the total sales for each day in the past (along with our rolling YTD sales for each of those dates). It is now time to bring these two tables and apply our forecasting logic. To reiterate, the logic is as follows:

Forecasted Sales 1/12/2019 (Future Date) = 
Total Sales 1/12/2018 * (1.0+(YTD Sales 1/12/2019 - YTD Sales 1/12/2018) / YTD Sales 1/12/2018)

Here’s the SQL:

SELECT  ROW_NUMBER() OVER (ORDER BY DAY) AS PK,
        BRAND,
        YEAR,
        DAY,
        CASE WHEN DAY < CURRENT_DATE THEN REVENUE
              ELSE LAST_REVENUE*(1+LAST_VALUE(PERCENT_CHANGE IGNORE NULLS) OVER (PARTITION BY BRAND ORDER BY DAY ROWS UNBOUNDED PRECEDING))
              END                                                                         AS REVENUE,
        CASE WHEN DAY >= CURRENT_DATE THEN 'Yes' ELSE 'No' END AS PROJECTED
FROM (
SELECT Y.*, 
      1.0*(YTD_REVENUE - LAST_YTD_REVENUE)/NULLIF(LAST_YTD_REVENUE,0) AS PERCENT_CHANGE
FROM (
SELECT DT.BRAND                                                                                                                            AS BRAND,      
       DT.YEAR                                                                                                                             AS YEAR, 
       DT.DAY                                                                                                                              AS DAY, 
       DH.REVENUE                                                                                                                          AS REVENUE,
       DH.YTD_REVENUE                                                                                                                      AS YTD_REVENUE,
       LAG(DH.REVENUE,CASE WHEN MOD(DT.YEAR,4) = 0 AND DT.MONTH > 2 THEN 366 ELSE 365 END) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY)     AS LAST_REVENUE,
       LAG(DH.YTD_REVENUE,CASE WHEN MOD(DT.YEAR,4) = 0 AND DT.MONTH > 2 THEN 366 ELSE 365 END) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY) AS LAST_YTD_REVENUE
FROM DATE_TABLE DT
LEFT JOIN DAILY_HISTORY DH ON DATE(DT.DAY) = DATE(DH.DAY) AND DT.BRAND = DH.BRAND
ORDER BY 1,2,3) Y) YY 

A few notes:

  1. The first step here will be to join our two CTE’s; date_table and daily_history. This will give us a row for every brand, on every day, with the total sales for that day and the rolling YTD sales for that day. This will leave us will all days in the past containing values, with all days in the future as NULL. Next step will be to apply our forecasting logic to replace the NULLS.
  2. The way to find the revenue and YTD revenue from the date in the prior year is to use a LAG function to look back 365 or 366 days, in the case of a leap year. The MOD(DT.YEAR,4) = 0 AND DT.MONTH > 2 logic will determine leap years.
  3. Lastly, we can nest this all in a subquery and apply our forecasting logic:

CASE WHEN DAY < CURRENT_DATE THEN REVENUE ELSE LAST_REVENUE*(1+LAST_VALUE(PERCENT_CHANGE IGNORE NULLS) OVER (PARTITION BY BRAND ORDER BY DAY ROWS UNBOUNDED PRECEDING)) END AS REVENUE

This essentially will fill in all future dates with our forecasted sales logic for that day. Therefore, you can always predict one year in the future of your most recent sales date. This will leave us with a table that includes brand, date and revenue, with historical dates containing actual revenue and future dates containing projected revenue.

Here’s the full query in multiple dialects:

Redshift:

WITH DATE_TABLE AS (
        SELECT X.*, 
               EXTRACT(YEAR FROM DAY)  AS YEAR,
               EXTRACT(MONTH FROM DAY) AS MONTH 
        FROM (
        SELECT PRODUCTS.BRAND                                                AS BRAND,  
               DATE(DATEADD('DAY',F.NUMBER,'2013-12-31'))                    AS DAY
        FROM PUBLIC.ORDER_ITEMS
        LEFT JOIN PUBLIC.INVENTORY_ITEMS ON ORDER_ITEMS.INVENTORY_ITEM_ID = INVENTORY_ITEMS.ID
        LEFT JOIN PUBLIC.PRODUCTS ON INVENTORY_ITEMS.PRODUCT_ID = PRODUCTS.ID
        CROSS JOIN (SELECT 
                      ROW_NUMBER() OVER () as number
                      FROM 
                        (SELECT 0 as n UNION SELECT 1) p0,
                        (SELECT 0 as n UNION SELECT 1) p1,
                        (SELECT 0 as n UNION SELECT 1) p2,
                        (SELECT 0 as n UNION SELECT 1) p3,
                        (SELECT 0 as n UNION SELECT 1) p4,
                        (SELECT 0 as n UNION SELECT 1) p6,
                        (SELECT 0 as n UNION SELECT 1) p7,
                        (SELECT 0 as n UNION SELECT 1) p8,
                        (SELECT 0 as n UNION SELECT 1) p9,
                        (SELECT 0 as n UNION SELECT 1) p10,
                        (SELECT 0 as n UNION SELECT 1) p11,
                        (SELECT 0 as n UNION SELECT 1) p12) F
                      GROUP BY 1,2
                      ORDER BY 1,2) X),
DAILY_HISTORY AS
      (     
               SELECT D.*, 
                      SUM(REVENUE) OVER(PARTITION BY BRAND,YEAR ORDER BY DAY ROWS UNBOUNDED PRECEDING)        AS YTD_REVENUE
               FROM (       
               SELECT   PRODUCTS.BRAND                                                                        AS BRAND,
                        EXTRACT(YEAR FROM ORDER_ITEMS.CREATED_AT)                                             AS YEAR,                      
                        DATE(ORDER_ITEMS.CREATED_AT)                                                          AS DAY,
                        SUM(SALE_PRICE)                                                                       AS REVENUE
               FROM     PUBLIC.ORDER_ITEMS
               LEFT JOIN PUBLIC.INVENTORY_ITEMS ON ORDER_ITEMS.INVENTORY_ITEM_ID = INVENTORY_ITEMS.ID
               LEFT JOIN PUBLIC.PRODUCTS ON INVENTORY_ITEMS.PRODUCT_ID = PRODUCTS.ID
               GROUP BY 1,2,3) D)

SELECT  ROW_NUMBER() OVER (ORDER BY DAY) AS PK,
        BRAND,
        YEAR,
        DAY,
        CASE WHEN DAY < CURRENT_DATE THEN REVENUE
              ELSE LAST_REVENUE*(1+LAST_VALUE(PERCENT_CHANGE IGNORE NULLS) OVER (PARTITION BY BRAND ORDER BY DAY ROWS UNBOUNDED PRECEDING))
              END                                                                         AS REVENUE,
        CASE WHEN DAY >= CURRENT_DATE THEN 'Yes' ELSE 'No' END AS PROJECTED
FROM (
SELECT Y.*, 
      1.0*(YTD_REVENUE - LAST_YTD_REVENUE)/NULLIF(LAST_YTD_REVENUE,0) AS PERCENT_CHANGE
FROM (
SELECT DT.BRAND                                                                                                                            AS BRAND,      
       DT.YEAR                                                                                                                             AS YEAR, 
       DT.DAY                                                                                                                              AS DAY, 
       DH.REVENUE                                                                                                                          AS REVENUE,
       DH.YTD_REVENUE                                                                                                                      AS YTD_REVENUE,
       LAG(DH.REVENUE,CASE WHEN MOD(DT.YEAR,4) = 0 AND DT.MONTH > 2 THEN 366 ELSE 365 END) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY)     AS LAST_REVENUE,
       LAG(DH.YTD_REVENUE,CASE WHEN MOD(DT.YEAR,4) = 0 AND DT.MONTH > 2 THEN 366 ELSE 365 END) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY) AS LAST_YTD_REVENUE
FROM DATE_TABLE DT
LEFT JOIN DAILY_HISTORY DH ON DATE(DT.DAY) = DATE(DH.DAY) AND DT.BRAND = DH.BRAND
ORDER BY 1,2,3) Y) YY

Snowflake:

WITH DATE_TABLE AS (
        SELECT X.*, 
               EXTRACT(YEAR FROM DAY)  AS YEAR,
               EXTRACT(MONTH FROM DAY) AS MONTH 
        FROM (
        SELECT PRODUCTS.BRAND                                                AS BRAND,  
               TO_DATE(DATEADD('DAY',F.NUMBER,'2014-01-01'))                 AS DAY
        FROM ECOMM.ORDER_ITEMS
        LEFT JOIN ECOMM.INVENTORY_ITEMS ON ORDER_ITEMS.INVENTORY_ITEM_ID = INVENTORY_ITEMS.ID
        LEFT JOIN ECOMM.PRODUCTS ON INVENTORY_ITEMS.PRODUCT_ID = PRODUCTS.ID
        CROSS JOIN (SELECT SEQ8() AS NUMBER FROM TABLE(GENERATOR(ROWCOUNT => 3650))) F
        GROUP BY 1,2
        ORDER BY 1,2) 
        X),
DAILY_HISTORY AS
      (     
               SELECT D.*, 
                      SUM(REVENUE) OVER(PARTITION BY BRAND,YEAR ORDER BY DAY ROWS UNBOUNDED PRECEDING)        AS YTD_REVENUE
               FROM (       
               SELECT   PRODUCTS.BRAND                                                                        AS BRAND,
                        EXTRACT(YEAR FROM ORDER_ITEMS.CREATED_AT)                                             AS YEAR,                      
                        TO_DATE(ORDER_ITEMS.CREATED_AT)                                                       AS DAY,
                        SUM(SALE_PRICE)                                                                       AS REVENUE
               FROM     ECOMM.ORDER_ITEMS
               LEFT JOIN ECOMM.INVENTORY_ITEMS ON ORDER_ITEMS.INVENTORY_ITEM_ID = INVENTORY_ITEMS.ID
               LEFT JOIN ECOMM.PRODUCTS ON INVENTORY_ITEMS.PRODUCT_ID = PRODUCTS.ID
               GROUP BY 1,2,3) D)

SELECT  ROW_NUMBER() OVER (ORDER BY DAY) AS PK,
        BRAND,
        YEAR,
        DAY,
        CASE WHEN DAY < CURRENT_DATE() THEN REVENUE
              ELSE LAST_REVENUE*(1+LAST_VALUE(PERCENT_CHANGE IGNORE NULLS) OVER (PARTITION BY BRAND ORDER BY DAY ROWS UNBOUNDED PRECEDING))
              END                                                                         AS REVENUE,
        CASE WHEN DAY >= CURRENT_DATE() THEN 'Yes' ELSE 'No' END AS PROJECTED
FROM (
SELECT Y.*, 
      1.0*(YTD_REVENUE - LAST_YTD_REVENUE)/NULLIF(LAST_YTD_REVENUE,0) AS PERCENT_CHANGE
FROM (
SELECT DT.BRAND                                                                                                                            AS BRAND,      
       DT.YEAR                                                                                                                             AS YEAR, 
       DT.DAY                                                                                                                              AS DAY, 
       DH.REVENUE                                                                                                                          AS REVENUE,
       DH.YTD_REVENUE                                                                                                                      AS YTD_REVENUE,
       CASE WHEN (MOD(DT.YEAR,4) = 0 AND DT.MONTH > 2) 
            THEN LAG(DH.REVENUE,366) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY) 
            ELSE LAG(DH.REVENUE,365) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY) 
            END                                                                                                                            AS LAST_REVENUE,
      CASE WHEN (MOD(DT.YEAR,4) = 0 AND DT.MONTH > 2) 
            THEN LAG(DH.YTD_REVENUE,366) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY) 
            ELSE LAG(DH.YTD_REVENUE,365) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY) 
            END                                                                                                                            AS LAST_YTD_REVENUE
FROM DATE_TABLE DT
LEFT JOIN DAILY_HISTORY DH ON TO_DATE(DT.DAY) = TO_DATE(DH.DAY) AND DT.BRAND = DH.BRAND
ORDER BY 1,2,3) Y) YY

LookML example (Snowflake) using a derived table:

view: projected_revenue {
    derived_table: {
      sql: WITH DATE_TABLE AS (
        SELECT X.*,
               EXTRACT(YEAR FROM DAY)  AS YEAR,
               EXTRACT(MONTH FROM DAY) AS MONTH
        FROM (
        SELECT PRODUCTS.BRAND                                                AS BRAND,
               TO_DATE(DATEADD('DAY',F.NUMBER,'2014-01-01'))                 AS DAY
        FROM ECOMM.ORDER_ITEMS
        LEFT JOIN ECOMM.INVENTORY_ITEMS ON ORDER_ITEMS.INVENTORY_ITEM_ID = INVENTORY_ITEMS.ID
        LEFT JOIN ECOMM.PRODUCTS ON INVENTORY_ITEMS.PRODUCT_ID = PRODUCTS.ID
        CROSS JOIN (SELECT SEQ8() AS NUMBER FROM TABLE(GENERATOR(ROWCOUNT => 3650))) F
        GROUP BY 1,2
        ORDER BY 1,2)
        X),
DAILY_HISTORY AS
      (
               SELECT D.*,
                      SUM(REVENUE) OVER(PARTITION BY BRAND,YEAR ORDER BY DAY ROWS UNBOUNDED PRECEDING)        AS YTD_REVENUE
               FROM (
               SELECT   PRODUCTS.BRAND                                                                        AS BRAND,
                        EXTRACT(YEAR FROM ORDER_ITEMS.CREATED_AT)                                             AS YEAR,
                        TO_DATE(ORDER_ITEMS.CREATED_AT)                                                       AS DAY,
                        SUM(SALE_PRICE)                                                                       AS REVENUE
               FROM     ECOMM.ORDER_ITEMS
               LEFT JOIN ECOMM.INVENTORY_ITEMS ON ORDER_ITEMS.INVENTORY_ITEM_ID = INVENTORY_ITEMS.ID
               LEFT JOIN ECOMM.PRODUCTS ON INVENTORY_ITEMS.PRODUCT_ID = PRODUCTS.ID
               GROUP BY 1,2,3) D)

SELECT  ROW_NUMBER() OVER (ORDER BY DAY) AS PK,
        BRAND,
        YEAR,
        DAY,
        CASE WHEN DAY < CURRENT_DATE() THEN REVENUE
              ELSE LAST_REVENUE*(1+LAST_VALUE(PERCENT_CHANGE IGNORE NULLS) OVER (PARTITION BY BRAND ORDER BY DAY ROWS UNBOUNDED PRECEDING))
              END                                                                         AS REVENUE,
        CASE WHEN DAY >= CURRENT_DATE() THEN 'Yes' ELSE 'No' END AS PROJECTED
FROM (
SELECT Y.*,
      1.0*(YTD_REVENUE - LAST_YTD_REVENUE)/NULLIF(LAST_YTD_REVENUE,0) AS PERCENT_CHANGE
FROM (
SELECT DT.BRAND                                                                                                                            AS BRAND,
       DT.YEAR                                                                                                                             AS YEAR,
       DT.DAY                                                                                                                              AS DAY,
       DH.REVENUE                                                                                                                          AS REVENUE,
       DH.YTD_REVENUE                                                                                                                      AS YTD_REVENUE,
       CASE WHEN (MOD(DT.YEAR,4) = 0 AND DT.MONTH > 2)
            THEN LAG(DH.REVENUE,366) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY)
            ELSE LAG(DH.REVENUE,365) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY)
            END                                                                                                                            AS LAST_REVENUE,
      CASE WHEN (MOD(DT.YEAR,4) = 0 AND DT.MONTH > 2)
            THEN LAG(DH.YTD_REVENUE,366) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY)
            ELSE LAG(DH.YTD_REVENUE,365) OVER(PARTITION BY DT.BRAND ORDER BY DT.DAY)
            END                                                                                                                            AS LAST_YTD_REVENUE
FROM DATE_TABLE DT
LEFT JOIN DAILY_HISTORY DH ON TO_DATE(DT.DAY) = TO_DATE(DH.DAY) AND DT.BRAND = DH.BRAND
ORDER BY 1,2,3) Y) YY
             ;;
      sql_trigger_value: SELECT current_date() ;;
    }

    dimension: pk {
      primary_key: yes
      hidden: yes
      type: string
      sql: ${TABLE}.pk ;;
    }

    dimension: year {
      hidden: yes
      type: number
      sql: ${TABLE}.year ;;
    }

    dimension: brand {
      hidden: yes
      type: string
      sql: ${TABLE}.brand ;;
    }

    dimension_group: day {
      label: "Created"
      timeframes: [raw,date,day_of_month,week_of_year,month,quarter,year,day_of_year,month_name,month_num]
      type: time
      sql: TO_TIMESTAMP(${TABLE}.day) ;;
    }

    dimension: revenue {
      type: number
      sql: ${TABLE}.revenue ;;
      value_format_name: usd
    }

    measure: total_revenue {
      description: "This measure can only be grouped by brand and date formats"
      type: sum
      sql: ${revenue} ;;
      value_format_name: usd_0
      drill_fields: [day_month,actual_revenue,projected_revenue,total_revenue]
    }

    measure: projected_revenue {
      description:  "This measure can only be grouped by brand and date formats"
      type: sum
      sql: ${revenue} ;;
      filters: {
        field: projected
        value: "Yes"
      }
      value_format_name: usd_0
      drill_fields: [day_month,total_revenue]
    }

    measure: actual_revenue {
      description:  "This measure can only be grouped by brand and date formats"
      type: sum
      sql: ${revenue} ;;
      filters: {
        field: projected
        value: "No"
      }
      value_format_name: usd_0
      drill_fields: [day_month,total_revenue]
    }

    dimension: projected {
      type: yesno
      sql: ${TABLE}.projected = 'Yes' ;;
    }

  }