Connecting to a Databricks Spark 2.x cluster (Professional and Enterprise Tier customers)

spark
(Todd Nemet) #1

Databricks offers secure and robust JDBC connectivity to Spark 2.x cluster for Professional and Enterprise Tier Databricks customers. (For Spark 1.x connectivity, see the resources section at the end of this article.)

There are several advantages of using this method for connecting to Databricks Spark clusters:

  • The connection is secured using SSL
  • There is no need to determine and specify the master node of the cluster each time it is started

###Steps for creating a connection

Additional detail can be found in the Databricks documentation

Step 1: Determine the Spark cluster

After creating the cluster, use the Databricks UI to find the JDBC/ODBC settings under Advanced settings.

It will look something like this:

Make a note of the following:

  • Server hostname
  • Port
  • HTTP path

Step 2: Fill out the connection settings

In Looker, navigate to Admin > Connections > New Database Connection

Fill out the fields in the connection page according to the instructions below and using the image below as an example:

  • Name
    Specify the name of the connection and how it will be referred to in LookML projects.

  • Dialect
    Select Apache Spark 2.0.

  • Host:Port
    Specify the server hostname and port from the previous step.

  • Database
    Specify the default database that you would like modeled. Other databases can be accessed, but Looker treats this database as the default database.

  • Username
    A Databricks username login.

  • Password
    The Databricks password associated with that login.

  • Persistent Derived Tables
    Check this and specify a Temp Database if you are going to use the Persistent Derived Tables feature in Looker.

  • Additional Params
    This is important for accurately specifying the Databricks connection. There are three parameters (plus ; as a delimiter) that need to be specified. Concatenate them together on the Additional Params line.

    • ;transportMode=http
    • ;ssl=true
    • ;httpPath=/the/http/path/from/step/1

####Step 3: Modify the following options as desired or leave them at their default values

  • Max connections
    By default, the max connections is set to 30. See this page for more information about the Max Connections field.

  • Connection Pool Timeout
    Specify the connection pool timeout. By default, the timeout is set to 120 seconds. See this page for more information on the Connection Pool Timeout field.

  • Database Time Zone
    Specify the timezone used by the dates in the database. Leave this field blank if you do not want timezone conversion. See this page for more information on the Database Time Zone field.

  • Query Time Zone
    The desired timezone for Looker queries. Leave this field blank if you do not want timezone conversion. See this page for more information about the Query Time Zone field.

####Step 4: Test the connection settings
Use the Test These Settings button to make sure that Looker can connect through Databricks’s infrastructure. If successful, it will display a JDBC URL similar to the one shown in the JDBC settings for your cluster.

####Step 5: Save the connection settings
Add Connections will create the connection with this configuration. It will also run additional tests against the Spark cluster.

If successful, it will display something similar to this:

###Resources:

1 Like