Looker not starting in clustered configuration - migration already applied


(Matt Doller) #1

I have a two-node looker cluster that was running on looker 5.2 yesterday. After attempting to upgrade to 6.0, we’ve run into significant problems when trying to start looker. Here’s the order of events:

  1. Stop node 2

  2. Stop node 1

  3. Upgrade node 1 by downloading newest 6.0 jar, and starting looker

  4. The upgraded looker jar runs a migration on startup - 20171012154251_add_subtotals_to_query.rb - which adds a “subtotals” column to the table “query”

  5. Looker hangs, and fails to start

  6. The upgrade process is abandoned, and looker is started again on node 2

  7. Looker on node 2 doesn’t run properly because of the migration that was applied

  8. Restarting looker (6.0) on node 1 fails, as it tries to apply the migration again:

    2018-11-27 11:57:01.927 +0000 [ERROR|007d0|db:looker] :: Java::JavaSql::SQLSyntaxErrorException: Duplicate column name ‘subtotals’: ALTER TABLE query ADD COLUMN subtotals TEXT(16777216)

    2018-11-27 11:57:01.935 +0000 [INFO|007d0|db:looker] :: (0.003333s) ROLLBACK

    2018-11-27 11:57:01.937 +0000 [ERROR|007d0|main] :: Error starting Looker: Java::JavaSql::SQLSyntaxErrorException: Duplicate column name ‘subtotals’

Now we’re stuck in a case where looker 5.2 doesn’t run because of the extra column, and looker 6.0 won’t start because it tries to apply the migration again.

Is there a way to start looker through a jar and skip the migrations completely? Or should we manually drop that column and try again from scratch?

Thanks,
Matt

UPDATE: After connecting to the mysql db, i looked through the schema_migrations table and saw that the 20171012154251_add_subtotals_to_query.rb filename had not been added to the list of migrations that had run, when it was apparent that it had, since that column exists on the query table. After adding a row for that file name, looker 6.0 did indeed startup. Now to pour through logs to see why exactly a migration ran but did not update the schema_migrations table…


(david.hughes) #2

Hey Matt,

What is the current status of your instance?

Just to answer your original question.

This error comes from the migration job attempting to add a column to a table in the internal DB, and since it’s already there, the migration job falls over. This means that once a migration fails, all nodes and the internal DB must be restored to backup from the most recent functional version.

So the next steps here are:

  1. Rollback all nodes to most recent functional backup
  2. Rollback the internal DB to most recent functional backup
  3. Boot internal DB
  4. Boot one node for migration
  5. Wait for migration to complete
  6. Boot the rest of the nodes

Regards,
Dave


(Matt Doller) #3

The cluster is running on looker 6.0 now. I hand-inserted the filename of the migration that was causing the failure, and started looker on node 1 without issue, and then I started node 2. I presume that the DDL change occurred, but the schema_migrations table wasn’t updated - perhaps because mysql doesn’t support DDL changes in a transaction?


(molly.lippsett) #4

Hi @mattd,

Glad to hear you’re up and running! I have a theory here - the write to the schema_migrations table doesn’t happen until the migration is completely done. It’s possible that the column was added before the hang happened on node 1, and when the process was abandoned on that node, it prevented the process from finishing, and didn’t get to the stage where it would write to schema_migrations. My hypothesis is that if we rolled back completely and got the migration to finish completely on node 1 before starting up node 2, the writes would happen.

Best,
Molly