String edit distance (Levenshtein)


(Alexandre JOURNO) #1

Hi all,

Is there a simple manner to compute the edit distance between two strings ?
In the following example, string comparison is performed in order to check whether the name of a user in the db looks alike his email.
I computed the following proxy: the email contains the first name or the last name, but I would like for a more generic function, such as the Levenshtein distance : how many edits from one string to the other. ‘johnsmith’ is 1 step distant from ‘john_smith’ and 3 steps to ‘jsmith’, but 9 steps to ‘billbrown’.

dimension: is_email_consistent {
type: yesno
sql: ${email} like concat(’%’,lower(${firstname}),’%’) OR ${email} like concat(’%’,lower(${lastname}),’%’) ;;
}

Do you know any more clever way to perform such distance ?

Many thanks in advance


(jd) #2

What database are you using? Postgres natively implements a levenshtein() function


(Michael Jalkio) #3

We implemented Levenshtein distance on our Redshift cluster using a Python UDF (https://docs.aws.amazon.com/redshift/latest/dg/udf-creating-a-scalar-udf.html). User-defined functions might be available on your database as well.


(fabio) #4

I found that for the use case I cared about, Levenshtein distance was actually not a very helpful metric. I was using BigQuery, so was able to use a Javascript-based UDF. It basically looks for how many matching n-grams there are between the two strings, with the ability to configure stop words:


(Alexandre JOURNO) #5

Thanks, I used a simplified version of your code to score consistencies of emails with last name/first name, in order to raise a flag on fraudulent behaviours (among many other flags)