Nested Data in BigQuery (Repeated Records)

bigquery
(lloyd tabb) #1

BigQuery supports nested records within tables. These records can be either a single record or contain repeated values.

In the example below each person has a single phone number, but may have lived in multiple cities.

Why Use Nested Records?

Nested records have a couple of advantages when scanning over a distributed dataset. First, they do not require joins. This means that computations can be faster and scan much less data than if you had to rejoin the extra data each time you use it.

Nested structures are essentially pre-joined tables. And, because data is stored columnarly, if you don’t reference the nested column, there is no added expense to the query. If you do reference the nested column, the logic is identical to a colocated join.

The other advantage that nested structures bring is that they avoid repeating data that would have to be repeated in a wide, denormalized table. In other words, for a person who’s lived in five cities, a wide denormalized table would contain all of their info in five rows (one for each of the cities they’d lived in). In a nested structure, the repeated information only takes one row, since the array of five cities can be contained in a single row and only unnested when needed.

Explore

First let’s play with the explorer for this data. It appears identical to an explorer written with a normal tables. There are three views, the person, the cities they’ve lived in and their phone numbers.

Output

Everything works as expected.

The Full Code

And the complete code for this model.

connection: "bigquery_publicdata_standard_sql"

explore: persons {
  # Repeated nested Object
  join: persons_cities_lived {
    view_label: "Persons: Cities Lived:"
    sql: LEFT JOIN UNNEST(persons.citiesLived) as persons_cities_lived ;;
    relationship: one_to_many
  }
  # Non repeated nested object
  join: persons_phone_number {
    view_label: "Persons: Phone:"
    sql: LEFT JOIN UNNEST([${persons.phoneNumber}]) as persons_phone_number ;;
    relationship: one_to_one
  }
}

view: persons {
  sql_table_name: `bigquery-samples.nested.persons_living` ;;

    dimension: id {
      primary_key: yes
      sql: ${TABLE}.fullName ;;
    }
    dimension: fullName {label: "Full Name"}
    dimension: kind {}
    dimension: age {type:number}
    dimension: gender {}

    measure: average_age {
      type: average
      sql: ${age} ;;
      drill_fields: [fullName,age]
    }

    measure: count {
      type: count
      drill_fields: [fullName, cities_lived.place_count, age]
    }

    dimension: citiesLived {hidden:yes}
    dimension: phoneNumber {hidden:yes}
  }


  view: persons_phone_number {
    dimension:  areaCode {label: "Area Code"}
    dimension: number {}
  }


  view: persons_cities_lived {
    dimension: id {
      primary_key: yes
      sql: CONCAT(CAST(${persons.fullName} AS STRING),'|', CAST(${place} AS STRING)) ;;
    }
    dimension: place {}
    dimension: numberOfYears {
      label: "Number Of Years"
      type: number
    }
    measure: place_count {
      type: count
      drill_fields: [place, persons.count]
    }
    measure: total_years {
      type: sum
      sql: ${numberOfYears} ;;
      drill_fields: [persons.fullName, persons.age, place, numberOfYears]
    }
  }

What to notice: Views

Each record is written as a view. The Phone Number view is very simple and just declares the dimensions that appear in the record.

 view: persons_phone_number {
    dimension:  areaCode {label: "Area Code"}
    dimension: number {}
  }

The Cities view is more complex. As shown above, we declare the dimensions that appear in the record (numberOfYears and place) but we also declare some measures. The measures and drills are declared as usual, as if this data were in its own table. The only real difference is that we declare ‘id’ as a primary key so that symmetric aggregates work.

view: persons_cities_lived {
    dimension: id {
      primary_key: yes
      sql: CONCAT(CAST(${persons.fullName} AS STRING),'|', CAST(${place} AS STRING)) ;;
    }
    dimension: place {}
    dimension: numberOfYears {
      label: "Number Of Years"
      type: number
    }
    measure: place_count {
      type: count
      drill_fields: [place, persons.count]
    }
    measure: total_years {
      type: sum
      sql: ${numberOfYears} ;;
      drill_fields: [persons.fullName, persons.age, place, numberOfYears]
    }
  }

What to Notice: Record Declarations

In the view that contains the subrecords (in this case persons), we need to declare the records. These will be used later when creating the joins. We hide them, because we won’t need them when exploring the data.

view: persons {
    ...
    dimension: citiesLived {hidden:yes}
    dimension: phoneNumber {hidden:yes}
   ...
}

What to Notice: Joining the Data

Nested records in BigQuery are Arrays of Structs. Instead of Joining with a sql_on: expression, the join relationship is built into the table. Other than that UNNESTing a ARRAY of STRUCTs is exactly like joining a table.

In the non-repeated case, we simply have a struct, we turn that into an array of structs by placing it in square brackets. While this may be a little weird to look at, there seems be be no performance penalty and this keeps things clean and simple.

explore: persons {
  # Repeated nested Object
  join: persons_cities_lived {
    view_label: "Persons: Cities Lived:"
    sql: LEFT JOIN UNNEST(persons.citiesLived) as persons_cities_lived ;;
    relationship: one_to_many
  }
  # Non repeated nested object
  join: persons_phone_number {
    view_label: "Persons: Phone:"
    sql: LEFT JOIN UNNEST([${persons.phoneNumber}]) as persons_phone_number ;;
    relationship: one_to_one
  }
}
6 Likes

Why Nesting Is So Cool
Nested aggregate and aggregate awareness: how to make it work?
(Maximilian Roos) #2

This is great, and thank you for such a complete example.

Is there a good way to autogen the LookML here? The SQL Query Runner seems to raise an error if nested data is returned.

Thanks

0 Likes

(steven) #3

Unfortunately, there are a few known bugs with the JDBC driver we use for BigQuery handling results with nested data. You should still be able to use the “Create View From Table” functionality to autogenerate this sort of LookML against a BigQuery table with nested and repeated records.

1 Like

(ernesto ongaro) #4

This is great help!

One additional note: I found for non-repeated nested objects, like areaCode, that using the dot “.” operator was sufficient without having to UNNEST

dimension: areaCode {
    sql: persons.phoneNumber.areaCode ;;
}
2 Likes

(Peggy Beard) #5

Another note, if the repeated nested object is a single field (not a structure), like “nicknames”, then the way to declare the view is:

view: nicknames {
  dimension: nicknames {
    sql: ${TABLE} ;;
  }
} 
2 Likes