# [Analytic Block] Creating a Median Aggregate Function in BigQuery with a User Defined Function

BigQuery allows for the creation of User Defined SQL function written in either SQL or Javascript. This block shows how to create and use a MEDIAN function written in SQL. Below is a query runs a query and returns the median and average salaries for various jobs in the San Francisco City Government. Click on any of the numbers to see the individual Jobs.

[Explore from Here](https://blocks.looker.com/embed/explore/bigquery_median/sf_salary)

## How it works:

LookML explores have a parameter sql_preamble: that allow you to specify sql statements that execute before the query is run. BigQuery has a CREATE TEMP FUNCTION statement that lets you create user defined functions. We have written a function computes median and added it to the explore.

Every query that runs from this explore will have the median function inserted before the SQL code for the query.

``````explore: sf_salary {
sql_preamble:
CREATE TEMP FUNCTION MEDIAN(a_num ARRAY<FLOAT64>)
RETURNS FLOAT64 AS ((
SELECT
AVG(num)
FROM (
SELECT
row_number() OVER (ORDER BY num) -1 as rn
, num
FROM UNNEST(a_num) num
)
WHERE
rn = TRUNC(ARRAY_LENGTH(a_num)/2)
OR (
MOD(ARRAY_LENGTH(a_num), 2) = 0 AND
rn = TRUNC(ARRAY_LENGTH(a_num)/2)-1 )
));
;;
}
``````

## Calling the Function

Functions in BigQuery can take Array of any type (read about arrays in BigQuery here). We first collect all the numeric values into an array, useing the ARRAY_AGG() and then pass the array to the MEDIAN() user defined function which scans the array and looks for the median value.

``````  measure: average_base_pay {
type: average
sql: \${pay} ;;
value_format_name: "decimal_0"
}

measure: median_base_pay {
type: number
sql: MEDIAN(ARRAY_AGG(\${pay})) ;;
value_format_name: "decimal_0"
}
``````

## The LookML Code

``````connection: "bigquery_publicdata_standard_sql"

explore: sf_salary {
sql_preamble:
CREATE TEMP FUNCTION MEDIAN(a_num ARRAY<FLOAT64>)
RETURNS FLOAT64 AS ((
SELECT
AVG(num)
FROM (
SELECT
row_number() OVER (ORDER BY num) -1 as rn
, num
FROM UNNEST(a_num) num
)
WHERE
rn = TRUNC(ARRAY_LENGTH(a_num)/2)
OR (
MOD(ARRAY_LENGTH(a_num), 2) = 0 AND
rn = TRUNC(ARRAY_LENGTH(a_num)/2)-1 )
));
;;
}

view: sf_salary {
sql_table_name: `lookerdata.sfsalary.salaries` ;;

measure: count {
type: count
drill_fields: [detail*]
}

dimension: id {
type: number
sql: \${TABLE}.Id ;;
}

dimension: employee_name {
type: string
sql: \${TABLE}.EmployeeName ;;
}

dimension: job_title {
type: string
sql: \${TABLE}.JobTitle ;;
}

# Pay is stored as a string, safely convert it to a float.
dimension: pay {
type: string
sql: CASE WHEN
REGEXP_CONTAINS(\${TABLE}.TotalPay, r'^[\d\.]+\$')
THEN CAST(\${TABLE}.TotalPay AS FLOAT64)
END ;;
}

measure: average_base_pay {
type: average
sql: \${pay} ;;
value_format_name: "decimal_0"
}

measure: median_base_pay {
type: number
sql: MEDIAN(ARRAY_AGG(\${pay})) ;;
value_format_name: "decimal_0"
}

dimension: year {
type: string
sql: \${TABLE}.Year ;;
}

dimension: status {
type: string
sql: \${TABLE}.Status ;;
}

set: detail {
fields: [
id,
employee_name,
job_title,
pay,
year,
status
]
}
}
``````

The SQL for the Query

``````CREATE TEMP FUNCTION MEDIAN(a_num ARRAY<FLOAT64>)
RETURNS FLOAT64 AS ((
SELECT
AVG(num)
FROM (
SELECT
row_number() OVER (ORDER BY num) -1 as rn
, num
FROM UNNEST(a_num) num
)
WHERE
rn = TRUNC(ARRAY_LENGTH(a_num)/2)
OR (
MOD(ARRAY_LENGTH(a_num), 2) = 0 AND
rn = TRUNC(ARRAY_LENGTH(a_num)/2)-1 )
));

SELECT
sf_salary.JobTitle  AS sf_salary_job_title,
COUNT(*) AS sf_salary_count,
MEDIAN(ARRAY_AGG((CASE WHEN
REGEXP_CONTAINS(sf_salary.TotalPay, r'^[\d\.]+\$')
THEN CAST(sf_salary.TotalPay AS FLOAT64)
END)))  AS sf_salary_median_base_pay,
AVG((CASE WHEN
REGEXP_CONTAINS(sf_salary.TotalPay, r'^[\d\.]+\$')
THEN CAST(sf_salary.TotalPay AS FLOAT64)
END) ) AS sf_salary_average_base_pay
FROM `lookerdata.sfsalary.salaries`  AS sf_salary

GROUP BY 1
ORDER BY 2 DESC
LIMIT 500
``````
2 Likes

Hi Lloyd, this is the good option in order to include UDF in SQL. But in LookML it seems like this parameter (sql_preamble) was EXPERIMENTAL and LookML mentions that it will be removed in future release. Due to this I am getting LookML error saying “Invalid property of explore:sql_preamble” and I am not able to push code to Production.

Any workaround for this. I am trying to use JS UDF in google BQ.

@khaah7 - In 4.6, sql_preamble: is implemented (and the warning has been removed). We’ve been playing with it for a while and figured it was good enough. Most other dialects let you install UDFs, but because BigQuery is stateless, we decided to implement this as sql_preamble:

Feel free to use it knowing the warning will disappear next release.

Hi Lloyed, is it possible to do that in version 4.2 as that’s what we are currenly using or have to upgrade to 4.6 in order to use that. Currently I am getting error in 4.2 and without resolving this error, I can not push LookML to production.

Yes, you can turn off validation until you get to 4.6. Go into “project settings” and uncheck ‘Require LookML Validation to Commit’. You can turn it back on after you upgrade.

Hi Lloyd, this will be grate. I am not seeing option to save/update setting after unchecking the option in project setting. I have admin role.

We’ll reach out. I’m not sure why you are seeing this.

Hey @kshah7,

The project settings are located under the Projects drop-down arrow. If you scroll down on the page, there should be an option to “Update Project Settings.”

If the option still isn’t there, there might be something else going on. If so, can you please email help.looker.com so we can take a closer look!

Thanks!

Adding on to this. Here is something I recently did using Javascript to create custom SQL functions within BigQuery for use in Looker.

``````  sql_preamble:
CREATE TEMPORARY FUNCTION url_decode(enc STRING)
RETURNS STRING
LANGUAGE js AS """
try {
return decodeURI(enc)
} catch (e) { return null }
return null;
""";
;;
``````
2 Likes

for the measure, it might be useful to include `IGNORE NULLS` to ensure that the median is calculated as intended. For example, `sum(value)` and `avg(value)` ignore nulls. Similarly, it would make sense for `median(value)` to do the same.

``MEDIAN(ARRAY_AGG(\${pay}) IGNORE NULLS))``