How to use regex filter on string fields?

done
low_priority
reply

(Ryan Tuck) #1

I’d like to filter on a string dimension where that dimension matches a particular regex.

For instance, in PostgreSQL, I would do it like so:

where
  my_field similar to '[0-9]{3}_[A-Z]{4}_[0-9]{3}'

(Greg Sanders) #3

Ryan,

Great question and thanks for your contributions to the forum! While I am no regex expert, I did recently work on a project that required using them to filter and extract portions of server log entry lines stored in BigQuery. Some of the design seems applicable here.

For the filtering use case, I built a derived table that used the BQ function REGEXP_CONTAINS to filter the results (probably regexp_matches in Postgres and you may have a different regex syntax as well). For a static search using a regex it would look something like:

 derived_table: {
    sql: SELECT … FROM ...
            WHERE REGEXP_CONTAINS(log_line, r".*/(.*\.jsp).*")  
;; }

This would filter for results containing .jsp in the log_line field. But, we want this to be a dynamic regex input that the user can type into a filter in the explore. To do so, we need to create a liquid parameter and then reference it in our WHERE clause in the derived table, like so:

 derived_table: {
    sql: SELECT … FROM ...
            WHERE REGEXP_CONTAINS(log_line, {{ regexp_search._parameter_value }})  
;; }
  parameter: regexp_search {
    type: string
  }

The issue with this solution is that Looker is going to wrap the input text in quotes and escape backslash characters with additional backslashes in the _parameter_value, so we need to “fix” that before using it in the regexp function. The following worked for me:

  derived_table: {
    sql:         
{% assign x = regexp_search._parameter_value.size | minus:2 %}
            SELECT ...            FROM ...
            WHERE REGEXP_CONTAINS(log_line, {{ regexp_search._parameter_value | slice: 1, x | replace: '\\', '\' }})
             ;; }

The first line of liquid sets the value of x to the string length minus two which is then used in the slice function to cut off the quotes that Looker adds to the regexp that the user inputs. Finally we replace the double backslashes with single backslashes and we’re good to go.

The unquoted parameter type would reduce the need for liquid, but that data type is restricted to alphanumeric characters to prevent SQL injection and thus doesn’t work for regex use cases like this. There may be more elegant solutions than this and one could certainly build something using the API that would similarly allow regex input and further tailor the UI for this purpose.

Let us know how this works for you and if you come up with anything else noteworthy in working with regex and Looker!

Best,
Greg

PS-I love the Warby Parker case study and reference it all the time!


(Ryan Tuck) #4

Seems like a pretty good solution to the problem Greg - thanks for the thoughtful response.

I guess I’d now consider this a feature request to create some sort of regex_match() function available to custom filters.

Will post here if we figure out a separate solution.