Testing and troubleshooting socket based query killing


(Ian Ross) #1

Background

When a user runs a query in Looker, a connection is opened from the user’s browser to the Looker server, and from the Looker server to the database to execute that query. If a user closes that browser tab, there is a mechanism that detects this and then in turn initiates a cancel or kill on that query on the database.

The way this works is when a query is run from a dashboard, look, or explore page, there is a connection opened to the server that stays open for the duration of the query or queries depending on it is coming from a single query page, such as a look, or a page that runs multiple queries such as a dashboard.

Problems can occur if either:

  1. Some network device such as a reverse proxy or firewall is timing out and closing that connection prematurely causing queries to be undesirably cancelled before completion
  2. Some network device is not properly forwarding the closing connection to the looker server causing a query a user stopped or abandoned from the front end to continue running on the database

More commonly issues arise from (1)

Testing

In order to test to make sure you are not going to run into issues with network devices prematurely cancelling queries. A simple test you could set up would be to have a query that runs for a known amount of time, run that from the frontend and ensure it runs to completion. Here is an example of a way you could set that up:

explore: query_kill_test{
#make sure the query never comes back from cache
persist_for: 0 seconds
}
view: query_kill_test {
  derived_table: (select sleep(1800) as "sleep") ;;
#the number of seconds you let this sleep will be the duration of the query 
  dimension: sleep {
    type: number
    sql: ${TABLE}.sleep ;;
  }

The nice thing about using a function like sleep to test this, is that it will run for a known amount of time, every time.

You can then run the above explore from the explore page and ensure it runs to completion. If it does not, and it gets prematurely cancelled, it is likely due to a network device closing the connection.

Troubleshooting queries being prematurely killed

The first thing you should do is before digging in on the network side, make sure there is no global timeout set for queries on the database that is setting a max runtime for any query. That would be the most obvious culprit, and likely be solved by adjusting that setting.

Some browsers also have client side connection timeouts, for example this article discusses how to change that timeout in Firefox. We have also seen instances where a timeout is enforced by local aniti-virus software as well.

If there is no such setting on the database or on the client machine, the next step is to figure out what device is prematurely closing the connection, it is a good idea to first get an idea of the network landscape between the browser and the Looker instance to get an idea of where to dig in. Devices that will commonly cause this would be load balancers or firewalls into the network hosting the Looker server, or possibly a reverse proxy that is in front of the Looker server.

Looker hosted instances for example have a timeout of 3600 seconds set on the load balancer into the network and thus no query run from the browser can run for longer than that.

An example of such a timeout would be our reference to a 3600 second read timeout, in our nginx reverse proxy configuration example.

Once you have dug in on the looker server network side of things, if there is still no apparent timeout, the next thing to do would be to move onto the client side network. Does the behavior change if you run the query to looker from varying networks? If yes, you should look into any relevant timeouts on the outbound network devices from that network such as a firewall.

Troubleshooting queries not being killed

The first thing to check if queries are not being killed is to verify that the the “can cancel queries” test is passing for that particular connection the query is running against in the admin connection panel. If it is not, verify the permissions on the database side of things for the looker database user.

If that test is passing, the next thing to do is determine the specific scenario where queries are not being cancelled:

  1. Is it that when you leave the explore page when a query is running that does not go and properly cancel the query?
  2. Does hitting the stop button from the explore page not properly cancel the query?
  3. Does hitting stop from the admin query panel not kill the query ?

Usually if the “can cancel query” test in the admin panel is passing, 3 above should also work. Problems can arise with 1 and 2 as they rely on the connection from the browser to the looker server to close in order for looker to determine it should kill that query.

We have identified specifically an issue with Apache reverse proxies as we mention here where connections do not get properly forwarded to looker causing these sockets to remain open and query killing to not work.

The other thing that can cause these connections to not get closed properly is network device configuration where the network device is configured to not forward connections to looker. For example, if running in AWS, your ELB should be configured as a layer 4 device in order to not terminate the connection on the ELB which can cause issues with sockets getting closed on the Looker server.

Troubleshooting this is generally fairly specific to the network and the devices involved, for further troubleshooting here, reach out to looker support at support@looker.com .


[Retired] Query Timeouts and Queueing
(Andrew Kozhokaru) #2

ELB default idle timeout setting is 60s. You can change it up to 4000s if you have a need to run queries this long.

More to read here.