PDTs in Looker - Useless outside of Looker?!

PDTs are great for Looker users, but our heavy data users/early company employees/founders are married to their own SQL queries where they can do all kinds of crazy ad-hoc reporting.

Since PDTs are stored with hashed names in our database, those users can’t reference Looker’s PDTs. The same people are worried that building out PDTs in Looker instead of our database makes us more dependent on Looker. Therefore, the executive team has mandated that we either

  1. do not hash the PDT names anymore or
  2. move all PDTs out of Looker and into our database.

Maintaining PDTs in Looker is easier for me. Can anyone help me figure out how I can go with solution 1?

Thanks,
Lucas

Is the concern with the hashes that the names are hard to find, or that the table names change too often?

The name is still somewhat readable despite the hashes. The concern is that the name changes and is impossible to remember with the hashes. Someone would have to look up the table name every time they want to reference it, making it more work and making saved queries or dependent views impossible.

In that case, I have some good news. The way the hashes are generated, they are unique to each connection, model, SQL, and persistence combination (caveat: if the code used to generate derived tables changes, this may cause a one-time shift).

This means if your derived tables are stable and these parameters aren’t changing then the naming will be stable enough to query. You will still need to map from these more complex table names to their colloquial use, but they should be consistent over time. Hopefully this clarifies the situation, but I know product is looking at possible solutions as well.
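
For instance, the mapping from colloquial names to hashed table names could live in a small hand-maintained lookup. A minimal sketch in Python (the hashed name below is made up for illustration):

```python
# Hand-maintained map from colloquial names to the stable hashed PDT
# names in the database (the hashed name here is illustrative only).
PDT_NAMES = {
    "customer_fact": "looker.LC$4DEM41PGKN2KFC63Y8RTF_customer_fact",
}

def pdt_query(colloquial, columns="*"):
    """Build a SELECT against a PDT by its colloquial name."""
    return "SELECT {} FROM {}".format(columns, PDT_NAMES[colloquial])
```

As long as the connection/model/SQL/persistence parameters don’t change, the map only needs updating on those one-time shifts.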

We discussed this at our data services meeting and don’t want to go down the road of having to tell everyone to update their table references whenever we make a change.

For now, we will use Looker only to prototype a PDT and then move it over into the database. If the information from the hash can be stored somewhere other than the name, then we can use PDTs fully in Looker.

As designed, PDTs aren’t easily usable outside of Looker.

Lucas joins a long list of people who would like to use PDTs, which are designed for one thing, to solve a different problem. Customers are imagining new uses for PDTs, and we are open to new ideas. There is an open issue about this very request.

The reason it has not been done yet is that the hash is in the name for important reasons, so we can’t simply take it out. Thinking through all the possible consequences of conflicts in multi-instance, multi-model configurations is tricky. I am still working these things out for normal PDTs, and I have been working on that for some time.

I think we could probably eventually do something to help with this particular case. The current favorite choice is some system for making views with stable names for the production version of PDTs. However, this solution hasn’t come close enough to implementation to be considered likely, thus we do not have a timeline on when it may be available.

@mtoy The reason stable names for PDTs with stable SQL are such an important feature is that some databases, e.g. Redshift, which we use, don’t support triggers, so implementing PDT-style refresh outside of Looker is not trivial. In fact, PDT automation was a big reason why we chose Looker. Alas, I just discovered that PDTs are essentially unusable outside of Looker. They can’t even be used in sql_always_where clauses in models.

Two questions for you:

  1. Would it help to restrict the problem to (a) production branch and (b) PDTs with static SQL? That would cover all our use cases.

  2. If you see (1) as too complicated from a support and customer-communication standpoint, what about exposing a subset of PDT capabilities directly to customers? For example, a new LookML “scheduled operation” with a trigger and some customer-provided SQL to be run when the trigger fires. There would be no PDT naming issues, but customers would still get the benefit of Looker’s 24/7 availability, scheduling, etc. This could be made slightly more flexible by allowing some templating to handle branches in developer mode, but that’s not required.

Hey @sim!

PDTs can be referenced in a sql_always_where clause by using the following syntax: ${pdt_name.field_name}, as long as the PDT is joined to the explore.

If this method is not sufficient for your use case, you can visit help.looker.com and we’d be happy to discuss this with you in more detail.

As to your two questions, we’ll be discussing this internally and will get back to you soon with a more detailed response.

Hi @sim!

I’ve spoken to the engineering team and have a response for you regarding your two questions.

The reason we haven’t implemented #1 is that it depends on the Looker setup: for example, if you have two instances pointing to the same DB, this solution will break.

#2 is a solution that we have been considering internally and may implement at some point. Thanks for your input!

Hey Lucas! I agree: your option #1 is the way to go. I think it’s super-important to take the modeling logic that your analyst team builds out in SQL and make it accessible to all of your data consumers throughout the org, and hashes are a big problem.

This is a super-common ask we hear from folks that implement Looker as a part of a larger data stack. For example, folks with analytics and data science teams want to be able to model their data not only for analytics use cases in Looker but also to consume via Python/R/etc. This is one of the major reasons we built data build tool (dbt). dbt is a data modeling layer very similar to Looker PDTs but with several advantages: incremental data models, no hashes, automatic dependency resolution, multi-threaded runs, etc. Instead of building your SQL transformation logic in a PDT, move it to dbt (which is open source and free to use!) and you’ll get all of these benefits out of the box :smiley:

Hi Everyone,

I use Redshift and wanted to have only one place to maintain business-related aggregations, and for this Looker PDTs were the most convenient. However, I also faced the issue of how to use the PDTs outside of Looker.

I use a script (I did it in Python but there are other options too) that runs regularly and checks for the PDTs, such as looker.LC$4DEM41PGKN2KFC63Y8RTF_customer_fact and creates a view pdt.customer_fact as select * from the original PDT.

The script executes the outcome of the below query:

SELECT DISTINCT 'CREATE OR REPLACE VIEW pdt.'
       || ltrim(ltrim(tablename, split_part(tablename, '_', 1)), '_')
       || ' AS SELECT * FROM ' || tablename || ';' AS sql_needed
  FROM pg_table_def
 WHERE schemaname = 'looker'
   AND tablename NOT LIKE '%connection%';
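
A minimal sketch of what the wrapper script could look like in Python (connection details are placeholders and psycopg2 is assumed; the generator query mirrors the one above):

```python
# Runs the generator query against Redshift, then executes each
# CREATE OR REPLACE VIEW statement it returns. Assumes the 'looker'
# schema is on the search_path, as in the original query.

GENERATOR_SQL = """
SELECT DISTINCT 'CREATE OR REPLACE VIEW pdt.'
       || ltrim(ltrim(tablename, split_part(tablename, '_', 1)), '_')
       || ' AS SELECT * FROM ' || tablename || ';' AS sql_needed
  FROM pg_table_def
 WHERE schemaname = 'looker'
   AND tablename NOT LIKE '%connection%'
"""

def refresh_pdt_views(conn):
    """Fetch the generated DDL statements and execute each one."""
    with conn.cursor() as cur:
        cur.execute(GENERATOR_SQL)
        statements = [row[0] for row in cur.fetchall()]
        for statement in statements:
            cur.execute(statement)
    conn.commit()
    return statements

if __name__ == "__main__":
    import psycopg2  # connection string below is a placeholder
    with psycopg2.connect("host=my-cluster dbname=analytics user=etl") as conn:
        for ddl in refresh_pdt_views(conn):
            print(ddl)
```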

There might be more elegant solutions out there, if you have one, please share.


D.

Hi @B_D
I want to caution you about a couple of issues with the solution you’ve implemented.

  1. The current PDT implementation is a drop and swap, e.g. CTAS tmp_name, DROP perm_name, RENAME tmp_name TO perm_name. Except in dialects that support late-binding views, if a VIEW depends on perm_name, the DROP will fail, and regeneration with it. This could lead to querying stale data both in Looker and wherever the view is used outside of Looker.
  2. You may inadvertently create a view which is based on a development version of a PDT, not the production version. This could lead to querying the incorrect computation when using the view outside of Looker.
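
For clarity, the drop-and-swap sequence in point 1 looks roughly like this (a sketch; table names and SELECT are placeholders):

```python
# Looker's drop-and-swap regeneration as described above. Without
# late-binding views, the DROP fails if any view depends on perm_name,
# and regeneration fails with it.
def swap_statements(tmp_name, perm_name, select_sql):
    """Return the three-statement swap sequence for a PDT rebuild."""
    return [
        "CREATE TABLE {} AS {}".format(tmp_name, select_sql),  # CTAS
        "DROP TABLE {}".format(perm_name),       # blocked by dependent views
        "ALTER TABLE {} RENAME TO {}".format(tmp_name, perm_name),
    ]
```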

The engineering team is always looking for ways to improve Looker, and stable/friendly PDT names which could be safely accessed outside of Looker is under consideration.

Would the engineering team at Looker consider adopting @conrad’s idea, but create the views during the build process of the PDTs?

It would probably be the simplest possible solution.

@thalesmello I am one of the engineers who maintains the PDT module at Looker. Apologies for not being clear about that in my above comment.

Yes, creating views as part of PDT maintenance is one approach we have thought of for providing access to PDTs with friendly/stable names outside of Looker.

Unfortunately, creating the views during the build process could introduce new problems related to acquiring metadata locks that block the PDT regeneration pipeline. I do have ideas about how to get around this; however, it is not planned for the immediate future.

Keep an eye out for updates on how you can safely implement a similar solution to what @B_D suggested using a combination of the looker API and late binding views.

@conrad Does Redshift support late-binding views? I couldn’t find anything concrete about it, just a feature request.

Either way, if I were to create a view that references an existing table, wouldn’t it make the PDT build process fail?

To the best of my knowledge, Redshift does not currently support late-binding views. A quick Google search confirms what you mentioned: there are feature requests out there.
I brought up late binding because I believe it could, depending on implementation, eliminate the regeneration failure I was referring to. Even with late-binding views, a long-running query against said view could prevent the Looker regen process from acquiring the ExclusiveLock required at the DROP stage, but in theory that would just stall regeneration for the duration of the long query. So not a silver bullet, but certainly better than the regen failing altogether.

As you can see, there are tradeoffs when it comes to PDT usability. If it weren’t for our desire for the same PDT to be usable in both development and production mode if the computation is identical, we would use a friendly/stable name for production PDTs. We value the seamless transition from production to development and back above the potential benefits of a friendly stable name. That is not to discount the friendly name benefits, and I believe we will eventually implement a solution which provides both.

A great (and small?) change to the current PDT implementation would be for Looker to move the hash to the end of the PDT table name, i.e.:

INSTEAD OF:
looker.LR_Z5AIXY4U7S16SB1TCKJZB_JND_nativemobileSales

DO THIS:
looker.LR_JND_nativemobileSales_Z5AIXY4U7S16SB1TCKJZB

WHY:
It will make referencing these tables in BigQuery (using table suffixes) much easier.
For our Looker deployment we can then make views that do a simple:
SELECT * FROM `looker.LR_JND_nativemobileSales_*`

Because the solution of selecting the PDT using
SELECT * FROM `looker.LR_*` WHERE _TABLE_SUFFIX LIKE '%_JND_nativemobileSales' does not work, as explained here in stackoverflow

That’s a good idea @Rogier_Werschkull, I’ve passed this on to our product folks here. Thanks for bringing this up.

Looks like Redshift released late-binding views in September 2017.

https://aws.amazon.com/about-aws/whats-new/2017/09/amazon-redshift-now-supports-late-binding-views-referencing-amazon-redshift-and-redshift-spectrum-external-tables/

Keep an eye out for updates on how you can safely implement a similar solution to what @B_D suggested using a combination of the looker API and late binding views.

Sounds interesting.

Hi @B_D,

Thank you for passing that along. Currently, our engineering team is working towards better PDT renaming and an API endpoint that will allow someone to reference the current production version of a given PDT. The approach of using VIEWs outside of Looker will remain fragile but should work; if you do use it, we highly recommend the new Redshift clause WITH NO SCHEMA BINDING. We also make no guarantees that our PDT build/regen process and naming conventions won’t change, so there is still a risk of querying the wrong PDT and getting incorrect results.
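
A minimal sketch of what that late-binding view DDL could look like (schema and table names are placeholders):

```python
def late_binding_view_ddl(friendly_name, pdt_table, schema="pdt"):
    """Build a Redshift late-binding view over a hashed PDT table.

    WITH NO SCHEMA BINDING lets Looker's drop-and-swap regeneration
    proceed even while this view exists.
    """
    return ("CREATE OR REPLACE VIEW {}.{} AS "
            "SELECT * FROM looker.{} "
            "WITH NO SCHEMA BINDING;").format(schema, friendly_name, pdt_table)
```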

Cheers,
Vincent