Note: Looker version 4.0 supports Spark 1.5.2, 1.6.2, 2.0.0, and 2.0.1. Support for Spark 1.4 has been dropped.
Looker is architected to connect to a database server via JDBC.
For Spark SQL 1.x, this means the Hive thrift server must be compiled into Spark. For Spark 2.x, JDBC via the thrift server is included in all builds.
By default, this thrift server listens on port 10000. However, if the cluster also has Hive installed, port 10000 may already be taken by the thrift server that serves Hive queries. In that case the Spark SQL thrift server will need to use another port, such as 10001 or 10010.
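To see whether something (such as a Hive thrift server) is already listening on port 10000 before picking a port, a quick check on the cluster node might look like this (assuming `netstat` is available and you can run it with sudo):

```shell
# Check whether any process is already listening on port 10000
# (a Hive thrift server often occupies it on clusters with Hive installed).
sudo netstat -plten | grep ':10000'
# If this prints a LISTEN line, choose another port (e.g. 10001) for Spark SQL.
```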
Building from source
If building Spark from source, see the Spark documentation for details on compiling in the Hive thrift server (the -Phive and -Phive-thriftserver build profiles).
Downloading from spark.apache.org
If downloading Spark from spark.apache.org, select one of the package types that include "pre-built for Hadoop..." to ensure that Spark has the thrift server compiled in.
Hosted Spark services
Fortunately, most hosted Spark services include versions of Spark with the thrift server built in. I have run the thrift server and connected Looker to the following:
- Amazon Elastic Map Reduce (EMR)
- Google Dataproc
Your Looker analyst can help you connect Looker to these services.
Unfortunately, getting the Spark thrift server running on Hadoop distributions has been spottier. Support will likely improve over time, so this information may be out of date.
- According to a webinar on 11/12, Spark SQL's thrift server and JDBC won't be officially supported in CDH 5.5.
- Cloudera's documentation states that Spark SQL and the thrift server are not supported in CDH 5.4.
- According to the MapR Spark documentation, running the Spark SQL Thrift Server on a secure cluster is not supported.
Starting and stopping the thrift server
See the Spark docs for more detail.
The thrift server is controlled by two scripts in the Spark sbin/ directory: start-thriftserver.sh and stop-thriftserver.sh.
To run on an alternate port and as an alternate user, use a command like this:
sudo -u spark /path/to/spark/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001
Depending on how Hadoop group permissions are set up, you may want to add something like HADOOP_USER_NAME=hadoop to the command as well.
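Putting these pieces together, a start command with the environment variable, and the matching stop command, might look like this (the Spark path and user names below are placeholders for your environment; whether sudo passes the variable through depends on the sudoers policy):

```shell
# Start the thrift server on port 10001 as the spark user,
# impersonating the hadoop user for HDFS permissions.
sudo -u spark HADOOP_USER_NAME=hadoop \
  /path/to/spark/sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=10001

# Stop it later with the companion script.
sudo -u spark /path/to/spark/sbin/stop-thriftserver.sh
```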
Verifying the JDBC connection
Spark comes with a JDBC client, beeline. Verify that the thrift server is running with a command like bin/beeline -u 'jdbc:hive2://localhost:10001/', or, to connect as a specific user, bin/beeline -u 'jdbc:hive2://localhost:10001/' -n user -p abc123.
Saving tables from a Spark script
See the Spark docs for more detail.
To write a table that will persist after the Spark script exits, and that Looker can query, it needs to be saved with saveAsTable.
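As a minimal sketch, assuming a Spark 2.x pyspark shell and a hypothetical JSON input path, a script that saves a DataFrame as a persistent table might look like this:

```shell
# Run a short script in the pyspark shell that saves a persistent table.
# The input path and table name here are hypothetical examples.
/path/to/spark/bin/pyspark <<'EOF'
df = spark.read.json("/data/events.json")  # spark = the Spark 2.x session
df.write.saveAsTable("events")             # persists to the metastore, so it
                                           # survives after the script exits
EOF
```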
Connecting Looker to Spark SQL
Select Admin -> Connections -> New Database Connection. Then select Dialect -> Apache Spark 1.5+ or Apache Spark 2.0.
Name: The name of the connection. This is how the connection will be referred to in the LookML model
Host:Port: The Spark SQL thrift server host and port (10000 by default).
Database: The default schema/database that will be modeled. When no database is specified for a table, this will be assumed.
Username: User to authenticate as
Password: Optional password for user
Persistent Derived Tables: Check this if you will be using PDTs with Looker
Temp Database: A temporary schema/database for storing persistent derived tables. Must be created beforehand with something like
CREATE SCHEMA looker_scratch;
SSL: Leave unchecked
Database Time Zone: The timezone of data stored in Spark. Usually can be left blank or set to GMT.
Query Time Zone: The timezone to display data queried in Looker.
Additional Params: Usually left blank, but you can add additional parameters to the JDBC connection string here, such as ;auth=noSasl.
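The Temp Database scratch schema mentioned above can be created through the thrift server itself, for example with beeline (the host, port, and user here are assumptions; adjust to your connection):

```shell
# Create the scratch schema Looker will use for persistent derived tables.
bin/beeline -u 'jdbc:hive2://localhost:10001/' -n hadoop \
  -e 'CREATE SCHEMA IF NOT EXISTS looker_scratch;'
```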
Test the connection to make sure that it is configured correctly. When you Add Connection, additional connection tests will be run.
Debugging the connection
After pressing "Test These Settings", you should see a green box that says "Can connect."
If not, here are some example errors along with their causes:
"Test These Settings" errors
Cannot connect: connection refused: Java::JavaSql::SQLException: Could not open connection to jdbc:hive2://ec2-52-20-232-248.compute-1.amazonaws.com:10001/imdb;user=hadoop: java.net.ConnectException: Operation timed out
In this case, the firewall settings are blocking Looker's attempts to connect to the Spark SQL thrift server and the TCP connection attempt is timing out. To resolve this, set up an SSH tunnel or adjust the firewall settings.
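If adjusting the firewall is not an option, an SSH tunnel from the Looker machine forwards the thrift port over SSH; afterwards Looker can connect to localhost instead. The hostname and user below are placeholders:

```shell
# Forward local port 10001 to port 10001 on the Spark thrift server node.
# -N: no remote command, just the tunnel; leave this running while connected.
ssh -N -L 10001:localhost:10001 hadoop@spark-master.example.com
```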
Cannot connect: connection refused: Java::JavaSql::SQLException: Could not open connection to jdbc:hive2://localhost:10004/imdb;user=hadoop: null
In this case Looker's attempts to reach the Spark SQL thrift server machine are successful, but either there is no thrift server running or the wrong port is configured. (If this is the case, the error message will return within a few seconds.)
Another possibility is that a different type of server (e.g., a web server) is listening on that port instead of the thrift server.
Yet another possibility is that the wrong authentication protocol is sent to the thrift server, for example when ;auth=noSasl is configured in Additional Params but SASL has not been configured on the thrift server.
(If one of the latter two possibilities is the case then the error message will return after more than a few seconds.)
Debugging the connection to the thrift server
Verify that the Spark thrift server is running and find its listening port
If you have access to the machine running the Spark thrift server, verify that it is running with a command like this:
$ ps -efwww | grep spark | grep thriftserver
spark 17495 1 7 Sep25 ? 3-15:22:04 /usr/lib/jvm/java-1.7.0-openjdk-18.104.22.168.x86_64/jre/bin/java -cp /usr/lib/hadoop/*:/usr/lib/hadoop/../hadoop-hdfs/*:/usr/lib/hadoop/../hadoop-mapreduce/*:/usr/lib/hadoop/../hadoop-yarn/*:/etc/hive/conf/:/usr/lib/hadoop/../hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf/:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/lib/spark/sbin/../conf/:/usr/lib/spark/lib/spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar:/usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/lib/spark/lib/datanucleus-core-3.2.10.jar:/etc/hadoop/conf/ -Xms512m -Xmx512m -Dspark.driver.log.level=INFO -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=512M org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal
Then you can verify the port number for that pid with a command like this:
$ sudo netstat -plten | grep 17495
tcp 0 0 :::43120 :::* LISTEN 492 96763982 17495/java
tcp 0 0 :::10001 :::* LISTEN 492 96762603 17495/java
tcp 0 0 ::ffff:10.55.0.168:45222 :::* LISTEN 492 96762998 17495/java
tcp 0 0 :::4040 :::* LISTEN 492 96762551 17495/java
tcp 0 0 :::52968 :::* LISTEN 492 96752416 17495/java
In this case the listening port is 10001.
Note that there may also be a Hive thrift server running on the same node.
Verify the connection from the Looker machine to the Spark thrift server
Note, I have an SSH tunnel set up to the Spark thrift server, so localhost is the host that I use to connect to it.
Use telnet to verify that the client machine is able to connect on the appropriate port:
A successful connection will look like this:
$ telnet localhost 10001    (to exit telnet, press CTRL+] and type quit)
Connected to localhost.
Escape character is '^]'.
An unsuccessful connection will look like this:
$ telnet localhost 10004
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.
Verify the JDBC connection string
Once you have verified the connection, determine the correct JDBC string for connecting to the Spark thrift server. It may differ based on the authentication method and allowed usernames.
Spark comes with a client called beeline (see above for more detail).
Use beeline to verify the correct JDBC URL for connecting to the thrift server. For example:
$ bin/beeline --color=yes -u 'jdbc:hive2://localhost:10001/imdb;username=hadoop'
scan complete in 2ms
Connecting to jdbc:hive2://localhost:10001/imdb;username=hadoop
Connected to: Spark SQL (version 1.4.1)
Driver: Spark Project Core (version 1.4.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.4.1 by Apache Hive
0: jdbc:hive2://localhost:10001/imdb> show tables;
| tableName | isTemporary |
| accidents2 | false |
| aka_title | false |
| cast_info | false |
| char_name | false |
| company_name | false |
| complete_cast | false |
| keyword | false |
| movie_companies | false |
| movie_info | false |
| movie_keyword | false |
| movie_link | false |
| name | false |
| person_info | false |
| states | false |
| title | false |
| title2 | false |
16 rows selected (0.186 seconds)
From this I know that to connect I need to use localhost as the host (after setting up the SSH tunnel), 10001 as the port, hadoop as the username (no password), imdb as the default database, and no additional parameters.
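Those pieces assemble into the JDBC URL that the connection will effectively use. As a sketch, with the values from the example above:

```shell
# Assemble the JDBC URL from the connection parameters discovered above.
HOST=localhost   # via the SSH tunnel
PORT=10001       # thrift server port
DB=imdb          # default database
USER=hadoop      # username (no password)
echo "jdbc:hive2://${HOST}:${PORT}/${DB};user=${USER}"
# → jdbc:hive2://localhost:10001/imdb;user=hadoop
```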
Another popular tool for testing JDBC connections is SQuirreL SQL.
Looker’s ability to provide some features depends on whether the database dialect supports the required functionality.
In the current Looker release, Spark 1.5+ supports the following Looker features:
In the current Looker release, Spark 2.0 supports the following Looker features: