Filtering on hashed data

liquid
lookml

(Daniel Mintz) #1

Sometimes, sensitive data is stored in databases in encrypted form. Generally the input data is run through a one-way hashing algorithm (e.g. md5, sha256, bcrypt) so that if you know the input, you can find the corresponding hashed value, but you can’t get from a hashed value back to its unencrypted input.

In the case where you’re doing this with something like email address (to prevent bulk downloading of emails), but still want people to be able to search for specific users (whose email address they already have), you can easily set Looker up to enable this.

The prerequisites are that:

  • The hashed outputs are accessible in your database
  • Any “salts” or other values being included in the hash are in the database
  • Your database is capable of all the functions that are being used to hash the input

To show how this works with LookML, let’s assume that your company is hashing its users’ email addresses with SHA256 to pseudononymize the users in the database. If you want to allow Looker users to search for a user whose email address they already have, you can create a parameter, which is a filter-only field for them to input the email address in to, like so:

  parameter: user_email {
    type:string
  }

Then, you can use that user_email field in the sql_always_where at the Explore level. You perform whatever hashing functions were used to encrypt the original inputs identically on the user_email input. That way, if the Looker user enters an email address, it will get passed to the query in the WHERE clause, hashed via the identical process, and then get matched to the encrypted values in the database to find any that match.

That looks like this:

 sql_always_where: {% if user_email._is_filtered %}
                      ${hashed_email} = 
                      sha256(lower(trim({% parameter user_email %})))
                    {% else %}
                      1 = 1
                    {% endif %}

You’ll notice that I’ve used the _is_filtered liquid variable so that I only put this in the WHERE clause if the user has actually used the filter. Otherwise, I just pass 1=1, which always evaluates to TRUE to skip over this part of the code.

You may also notice that before applying the sha256() hashing function, I’ve run TRIM() and LOWER() on the input. This is because hashing functions are case-sensitive and sensitive to stray characters like space. That means that Daniel@email.com will return a different hash from daniel@email.com, which will also return a different hash than the same email with a space at the end, like "daniel@email.com ".