Filters only for dates that have data present in your DB

templatedfilter
etl

(Sean Higgins) #1

Sometimes ETL doesn’t go according to plan, or is just done irregularly. In these situations where you don’t have data for yesterday but don’t want to have to always alert your users every time something happens, you might want to have the date range filters used in your Explores only look at the dates that have data.

Example: A user searches for your customers/users that have been created in the last 14 days. ETL didn’t run properly this morning so you only have data from 2 days ago. You want to show data from 14 days before yesterday.

Pre-warning - this is a thin use case and has some snags (listed under Caveats below)

Step 1:
Create a derived table that explicitly returns the max(date) from an ETL jobs/log table. I don’t have an ETL table handy, so I’m going to just explicitly return an arbitrary date:

view: etl_checker {
  derived_table: {
    ## Change this to something like sql: SELECT max(etl_date) as etl_date FROM etl_logs ;;
    sql: SELECT (CURRENT_DATE -2) as etl_date ;;
  }

  dimension: etl_date {
    sql: ${TABLE}.etl_date ;;
  }
}

Step 2:
CROSS JOIN this table in an Explore with your base table so every row has this date:

explore: users {
  join: etl_checker {
    type: cross
    relationship: many_to_one
  }
}

Step 3:
Use this date to create a DATEDIFF between the date and the CURRENT_DATE . Here we make use of Templated Filters’ lesser used date_end to extract the end date value of a date_filter we’ve also created.

This date_filter uses the sql parameter and is self-referential so it inserts into the WHERE clause, offsets the filter by the amount of days between today and the ETL date.

view: users {
...
# Take a user's input and check if the date range they selected matches the offset date range in the base table, where the offset is the number of days between their chosen date and the max ETL date (below dimension)

  filter: date_filter {
    type: date
    sql:
    CASE WHEN
    DATEDIFF(day, ({% date_start date_filter %})::date,({% date_end date_filter %})::date) = 1 THEN
    ({% date_start date_filter %})::date = ${created_date}
    ELSE
    {% condition date_filter %} DATEADD(day,${etl_date_diff},${created_date}) {% endcondition %}
    END
;;
  }

# Get the difference in days between the max(date) from ETL and the date the user chose. 
  dimension: etl_date_diff{
    type: number
  sql: datediff(day,${etl_checker.etl_date},{% date_end date_filter %}) ;;
  }

  dimension_group: created {
    type: time
    timeframes: [
      raw,
      date,
    ]
    sql: ${TABLE}.CREATED_AT ;;
  }
}

Caveats:
As always, with these workarounds there are some subtle things to watch for. You should really only use this for Explores where there’s always a date range being filtered.
The purpose of this approach is to make sure you’re not showing NULLs for dates which haven’t been ETLd in your database yet and only showing the max dates which are present.

Edit:
I updated the filter so you can select single dates. Thanks @brecht for helping tinker with this.


(Sean Higgins) #2

Recently, I was asked how to only show data for dates where there was data present for a given country based on ETL load times. In this use case, data was loaded per region on different cadences. For example data in Canada loaded every 2 days, in Spain every day and in Ireland every week. A simple approach to this is using parameter, a derived table with templated filters and a conditional JOIN statement with Liquid:

view: max_order_date {
  derived_table: {
    sql:
SELECT max(created_at) as max_order_created_date
FROM schema.fact_orders as orders
JOIN dim_country ON orders.country_id = dim_country.id
-- user will select a country name in the Explore and the derived table will filter accordingly
WHERE ({% condition dim_country.country_name %} dim_country.country_name {% endcondition %})
;;
}
dimension: max_order_created_date {
  hidden: yes
  type: date
}
parameter: latest_or_all {
  type: unquoted
  description: "To limit the results to just the latest Order data for a region, choose 'Latest'. Otherwise choose 'All' for all dates"
  label: "Use Latest or All data"
  allowed_value: {value:"Latest"}
  allowed_value: {value:"All"}
  default_value: "All"
}
}

In the Explore, the retrieved Max Date for that country is exposed and used in an INNER JOIN if the user wants to limit the data to that latest Max Date, otherwise nothing happens.

explore: orders {
  label: "Orders"
  description: "Orders that have been made in the given time period"
  
  join: max_order_date {
    view_label: "Order Data Filter"
    type: inner
    sql_on: {% if max_order_date.latest_or_all._parameter_value == 'Latest' %}
    ${max_order_date.max_order_created_date} = ${orders.created_date}
    {% else %}
    1=1
    {% endif %}
    ;;
    relationship: many_to_one
  }

This can easily be adapted for any sort of dimensionality, have multiple Templated Filters in the Derived Table for different conditions, have layered if statements in the JOINs and have more options in the parameters.