Split-brain issue using version 5?


(Alex Gray) #1

We have three Looker instances on AWS behind an ELB whose health check is simply TCP:9999.

We had to roll the servers today, one at a time.

When all the servers were up and running and the ELB was happily sending requests to all 3 instances, we started noticing odd behaviors.

The issue was that, from Looker’s point of view, there were TWO instances in the cluster, not three. The “rogue” instance’s Looker java process was still running (and thus had port 9999 open), so it was happily taking requests from the ELB.

It looks like we just bumped into the classic “split-brain” case, where we had isolated Looker instances that were not part of the cluster.

How do we:

  1. Programmatically detect this?
  2. Prevent this from happening?

We are running version 5 of Looker.


(Ryan Dunlavy) #2

Hi @Alex_Gray

Could you tell me a bit about how that node ended up running by itself?

If it was due to missing startup flags, I think adding some checks on each of the Looker instances to confirm that the flags were set correctly would help catch that in the future. For example, the following would return all of the Looker flags set on the java process:

ps aux | sed -n -e 's/^.*looker.jar start //p' | head -n1
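
A minimal sketch of such a check, assuming your clustered nodes are always started with the --clustered flag (a hypothetical script; wire it into whatever monitoring or cron setup you already use):

#!/bin/bash
# Fail if the running Looker process is missing the --clustered flag.
ARGS=$(ps aux | sed -n -e 's/^.*looker.jar start //p' | head -n1)
case "$ARGS" in
  *--clustered*) echo "OK: node started with --clustered" ;;
  *) echo "WARN: --clustered missing; node may be running standalone" >&2
     exit 1 ;;
esac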

Another option would be to use i__looker to check on the number of nodes with any queries run over the past hour. You could check the number of nodes that have been used (and are in the cluster) either by scheduling this report or by querying it regularly via the API. For example:

[your looker url]/explore/i__looker/history?fields=history.node_id,history.query_run_count&f[history.created_hour]=3+hours&sorts=history.query_run_count+desc&limit=500&query_timezone=America%2FLos_Angeles&vis=%7B%7D&filter_config=%7B%22history.created_hour%22%3A%5B%7B%22type%22%3A%22past%22%2C%22values%22%3A%5B%7B%22constant%22%3A%223%22%2C%22unit%22%3A%22hr%22%7D%2C%7B%7D%5D%2C%22id%22%3A0%2C%22error%22%3Afalse%7D%5D%7D
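
The same query can be run through the API. Here is a minimal sketch, assuming API3 credentials and the default API port 19999 (your.looker.host, CLIENT_ID, and CLIENT_SECRET are placeholders, and older releases may only expose API 3.0 rather than 3.1). Each distinct history.node_id in the result is a node that has actually served queries:

#!/bin/bash
# Log in and extract an access token (Looker API 3.x login endpoint).
TOKEN=$(curl -s -d "client_id=${CLIENT_ID}&client_secret=${CLIENT_SECRET}" \
  https://your.looker.host:19999/api/3.1/login \
  | sed -n 's/.*"access_token" *: *"\([^"]*\)".*/\1/p')

# Run the same i__looker history query inline and return JSON.
curl -s -H "Authorization: token ${TOKEN}" -H "Content-Type: application/json" \
  -d '{"model":"i__looker","view":"history","fields":["history.node_id","history.query_run_count"],"filters":{"history.created_hour":"3 hours"}}' \
  https://your.looker.host:19999/api/3.1/queries/run/json

If the number of distinct node IDs comes back lower than the number of instances behind the ELB, a node has dropped out of the cluster.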

As for ways to prevent this, making sure the startup flags are set the same way each time is the best way to ensure all nodes join the cluster. Adding a lookerstart.cfg is one option, and the host IP can be set automatically as well:

# Resolve this node's IP automatically and hand it to Looker's clustering flags
MY_HOST=$(curl -s ifconfig.me | awk '{print $1}')
LOOKERARGS="-d looker-db.yml --clustered -H ${MY_HOST}"
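
If your nodes should cluster over their private VPC addresses rather than a public IP (an assumption about your network layout), the EC2 instance metadata service is a more dependable source than an external lookup:

# Use the instance's private IP from EC2 metadata instead of ifconfig.me
MY_HOST=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)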

Hope this helps!

Ryan