PDTs in Looker - Useless outside of Looker?!

Hey Lucas! I agree: your option #1 is the way to go. I think it’s super-important to take the modeling logic that your analyst team builds out in SQL and make it accessible to all of your data consumers throughout the org, and the hashed PDT names are a big obstacle to that.

This is a super-common ask we hear from folks that implement Looker as a part of a larger data stack. For example, folks with analytics and data science teams want to be able to model their data not only for analytics use cases in Looker but also to consume via Python/R/etc. This is one of the major reasons we built data build tool (dbt). dbt is a data modeling layer very similar to Looker PDTs but with several advantages: incremental data models, no hashes, automatic dependency resolution, multi-threaded runs, etc. Instead of building your SQL transformation logic in a PDT, move it to dbt (which is open source and free to use!) and you’ll get all of these benefits out of the box :smiley:

1 Like

Hi Everyone,

I use Redshift and wanted to have only one place to maintain business-related aggregations, and for that Looker PDTs were the most convenient. However, I also ran into the issue of how to use the PDTs outside Looker.

I use a script (I did it in Python but there are other options too) that runs regularly and checks for the PDTs, such as looker.LC$4DEM41PGKN2KFC63Y8RTF_customer_fact and creates a view pdt.customer_fact as select * from the original PDT.

The script executes the outcome of the below query:

SELECT DISTINCT 'CREATE OR REPLACE VIEW pdt.' || ltrim(ltrim(tablename, split_part(tablename, '_', 1)), '_') || ' AS SELECT * FROM ' || tablename || ';' AS sql_needed
FROM pg_table_def
WHERE schemaname = 'looker'
  AND tablename NOT LIKE '%connection%';

There might be more elegant solutions out there, if you have one, please share.
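For illustration, the prefix-stripping step can also be sketched in Python. This is not D.'s actual script, just a minimal sketch; the `pdt`/`looker` schema names and the `LC$<hash>_` / `LR_<hash>_` prefix patterns are taken from the examples in this thread:

```python
import re

def view_ddl(pdt_table, scratch_schema="looker", view_schema="pdt"):
    """Turn a PDT name like LC$4DEM41PGKN2KFC63Y8RTF_customer_fact into
    a CREATE OR REPLACE VIEW statement with a friendly name.

    Assumes the name is '<prefix><hash>_<friendly_name>' with an
    uppercase/numeric hash; pg_table_def returns lowercased names, so
    adjust the pattern (or add re.IGNORECASE) for real catalog output.
    """
    m = re.match(r"^L[A-Z]\$?_?[0-9A-Z]+_(.+)$", pdt_table)
    if not m:
        raise ValueError(f"doesn't look like a PDT name: {pdt_table}")
    return (f"CREATE OR REPLACE VIEW {view_schema}.{m.group(1)} "
            f"AS SELECT * FROM {scratch_schema}.{pdt_table};")

print(view_ddl("LC$4DEM41PGKN2KFC63Y8RTF_customer_fact"))
```

The generated statements can then be executed against Redshift on a schedule, e.g. via psycopg2 and cron.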


D.

2 Likes

Hi @B_D
I want to caution you about a couple of issues with the solution you’ve implemented.

  1. The current PDT implementation is a drop and swap, i.e. CTAS tmp_name, DROP perm_name, RENAME tmp_name TO perm_name. Except in dialects which support late binding views, if a VIEW depends on perm_name the DROP will fail, and regeneration with it. This could lead to querying stale data both in Looker and wherever the view is being used outside of Looker.
  2. You may inadvertently create a view which is based on a development version of a PDT, not the production version. This could lead to querying the incorrect computation when using the view outside of Looker.

The engineering team is always looking for ways to improve Looker, and stable/friendly PDT names which could be safely accessed outside of Looker is under consideration.

Would the engineering team at Looker consider adopting @conrad 's idea, but create the views during the build process of the PDTs?

It would probably be the simplest possible solution.

@thalesmello I am one of the engineers who maintains the PDT module at Looker. Apologies for not being clear about that in my above comment.

Yes, creating views as part of PDT maintenance is one approach we have thought of for providing access to PDTs with friendly/stable names outside of Looker.

Unfortunately, creating the views during the build process could introduce new problems related to acquiring metadata locks that block the PDT regeneration pipeline. I do have ideas about how to get around this, but it is not planned for the immediate future.

Keep an eye out for updates on how you can safely implement a similar solution to what @B_D suggested using a combination of the looker API and late binding views.

@conrad Does Redshift support late binding views? I couldn’t find anything concrete about it, just feature requests.

Either way, if I were to create a view that references an existing table, wouldn’t it make the PDT build process fail?

To the best of my knowledge, Redshift does not currently support late binding views. A quick Google search confirms what you mentioned: there are feature requests out there.
I brought up late binding because I believe it could, depending on implementation, eliminate the regeneration failure I was referring to. Even with late binding views, a long-running query against said view could prevent Looker's regen process from acquiring the ExclusiveLock required at the DROP stage, but in theory that would just stall regeneration for the duration of the long query. So not a silver bullet, but certainly better than the regen failing altogether.

As you can see, there are tradeoffs when it comes to PDT usability. If it weren’t for our desire for the same PDT to be usable in both development and production mode if the computation is identical, we would use a friendly/stable name for production PDTs. We value the seamless transition from production to development and back above the potential benefits of a friendly stable name. That is not to discount the friendly name benefits, and I believe we will eventually implement a solution which provides both.

A great (and small?) change to the current PDT implementation would be for Looker to at least move the hash to the end of the PDT table name, i.e.:

INSTEAD OF:
looker.LR_Z5AIXY4U7S16SB1TCKJZB_JND_nativemobileSales

DO THIS:
looker.LR_JND_nativemobileSales_Z5AIXY4U7S16SB1TCKJZB

WHY:
It will make referencing these tables in BigQuery (using table suffixes) much easier.
For our Looker deployment we can then make views that do a simple:
SELECT * FROM `looker.LR_JND_nativemobileSales_*`

Because the alternative of selecting the PDT using
SELECT * FROM `looker.LR_*` WHERE _TABLE_SUFFIX LIKE '%_JND_nativemobileSales' does not work, as explained here on Stack Overflow.
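The suggested rename is mechanical enough to sketch in Python. Note this is the *proposed* naming, not something Looker does today; the helper assumes the hash is the uppercase/numeric token right after the LR_ prefix, as in the example above:

```python
import re

def hash_to_suffix(pdt_table):
    """Rewrite 'LR_<hash>_<name>' as 'LR_<name>_<hash>' so that BigQuery
    wildcard tables can match on a stable prefix.
    Assumes the hash is the uppercase/numeric token after 'LR_'."""
    m = re.match(r"^LR_([0-9A-Z]+)_(.+)$", pdt_table)
    if not m:
        raise ValueError(f"unexpected PDT name: {pdt_table}")
    hash_part, name = m.groups()
    return f"LR_{name}_{hash_part}"

print(hash_to_suffix("LR_Z5AIXY4U7S16SB1TCKJZB_JND_nativemobileSales"))
# LR_JND_nativemobileSales_Z5AIXY4U7S16SB1TCKJZB
```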

1 Like

That’s a good idea @Rogier_Werschkull, I’ve passed this on to our product folks here. Thanks for bringing this up.

1 Like

Looks like Redshift released late-binding views in September 2017.

https://aws.amazon.com/about-aws/whats-new/2017/09/amazon-redshift-now-supports-late-binding-views-referencing-amazon-redshift-and-redshift-spectrum-external-tables/

Keep an eye out for updates on how you can safely implement a similar solution to what @B_D suggested using a combination of the looker API and late binding views.

Sounds interesting.

Hi @B_D,

Thank you for passing that along. Currently, our engineering team is working towards better PDT renaming and an API endpoint that will let you reference the current production version of a given PDT. The approach of using VIEWs outside of Looker will remain fragile but should work; if you do use it, we highly recommend the new Redshift clause WITH NO SCHEMA BINDING. We also make no guarantees that our PDT build/regen process and naming conventions won't change, so there is still a risk of querying the wrong PDT and getting incorrect results.

Cheers,
Vincent
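As a sketch of what an external script could emit with Vincent's recommendation applied (the `pdt` and `looker` schema names are just the examples used earlier in this thread, not anything Looker mandates):

```python
def late_binding_view_ddl(friendly_name, pdt_table,
                          view_schema="pdt", scratch_schema="looker"):
    """Build a Redshift late-binding view over a PDT. WITH NO SCHEMA
    BINDING means a later DROP of the PDT won't fail because of this
    dependent view (the view errors at query time instead if the
    underlying PDT is gone)."""
    return (f"CREATE OR REPLACE VIEW {view_schema}.{friendly_name} AS "
            f"SELECT * FROM {scratch_schema}.{pdt_table} "
            f"WITH NO SCHEMA BINDING;")

print(late_binding_view_ddl("customer_fact",
                            "LC$4DEM41PGKN2KFC63Y8RTF_customer_fact"))
```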

For anybody still trying to figure out how to get the correct pdt name out of looker, you can use a view like this:

view: pdt_mapping {
  derived_table: {
    sql:
      SELECT 'pdt_company_metrics' name, '${pdt_company_metrics.SQL_TABLE_NAME}' internal_name
      UNION SELECT 'series_dates', '${series_dates.SQL_TABLE_NAME}'
      UNION SELECT 'funnel', '${funnel.SQL_TABLE_NAME}'
      -- add additional pdts here...
    ;;
  }
  dimension: name {
    sql: ${TABLE}.name ;;
  }
  dimension: internal_name {
    sql: ${TABLE}.internal_name ;;
  }
  dimension: view_sql {
    sql: 'CREATE OR REPLACE VIEW pdt.' || ${name} || ' AS '
    'SELECT * FROM ' || ${internal_name} || ' WITH NO SCHEMA BINDING'
    ;;
  }
}

This gives you an explore from which you can pull the View SQL dimension with an external script to rebuild the views.

While this approach requires that you add each pdt individually to the view, this list changes infrequently enough for us that this doesn’t seem to be an issue.
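The parsing side of that external script can be sketched like this. The `pdt_mapping.view_sql` field name comes from the LookML above; how you fetch the rows (the Looker API's run-query endpoints, or even a CSV download) is up to you:

```python
import json

def statements_from_rows(rows_json):
    """Pull the generated DDL out of a Looker query result serialized
    as JSON rows, i.e. a list of {"view.field": value} objects."""
    return [row["pdt_mapping.view_sql"] for row in json.loads(rows_json)]

# Hypothetical sample payload, shaped like a one-row query result.
rows = ('[{"pdt_mapping.name": "funnel", '
        '"pdt_mapping.view_sql": '
        '"CREATE OR REPLACE VIEW pdt.funnel AS SELECT 1"}]')
for stmt in statements_from_rows(rows):
    print(stmt)  # in practice, execute each statement against the warehouse
```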

1 Like

Is there any update on a more robust/reasonable way of accessing PDTs via standard SQL? Siloed data is the main factor preventing Looker adoption for us, and it seems to be a common issue judging by what people say here and in other places.

This is such an interesting thing to hear, as one of the big points that we always bring up as a benefit of Looker is breaking down silo’ed data. I never thought about the fact that if you have a big enough architecture set up with Looker just being one component, Looker becomes a silo itself.

Work is slated to begin on stable PDT names this quarter. It’s not planned to be finished during Q3, but I’ll drop in with an update if there’s anything exciting one way or the other.

Now it’s Q4, is there any update on this?

1 Like

We began work on this in Q3, but weren’t ever planning to finish it in-quarter. It’s under active development right now (building a feature that will allow us to build this feature :grin:) and looking like a probable Q4 release. We haven’t yet committed to a specific release for this work, but I’ll update here if that changes.

@izzy are you able to share how it will work, at least?

I think this has been shared elsewhere, but for folks looking in this thread, here’s a workaround we use right now:

view: your_pdt_name {
  derived_table: {
    create_process: {
      sql_step: DROP TABLE IF EXISTS looker_scratch.your_pdt_name CASCADE ;;
      sql_step:
        CREATE TABLE looker_scratch.your_pdt_name AS
        SELECT
          ... ;;
      sql_step: CREATE VIEW ${SQL_TABLE_NAME} AS
        SELECT * FROM looker_scratch.your_pdt_name ;;
    }

    sql_trigger_value: SELECT date_trunc('week', now() at time zone 'America/New_York') ;;
  }
}
1 Like

The upshot of the work is that PDT table names will be stable and readable, instead of including a long hash that changes often. Thus, they’ll be able to be referenced reliably outside of Looker.

Ohh that’s amazing. I’ve been using Looker’s PDTs in standard SQL every now and then. That will make my life much easier!

1 Like