Scheduled plan stuck

(Todd) #1

A bunch, not all, of our scheduled plans have stopped running. Logs indicate it’s stuck with this error:

“Scheduled plan “PLAN_NAME”[589] is still in stage ‘execute’ - skipping”

is there a way to address this to remove whatever lock was placed on them? this started happening a week ago; we just noticed it today.

Thank you,

  • Todd

(Todd) #2

When the previous job failed, it looks like looker did not update the scheduled_job_stage record to set the completed_time field. I’m testing this out by populating that record with a completed_time and seeing if the next scheduled plan runs. if so, that was the issue and then I need to write a query to do the same for the other jobs that are stuck in the same state. Perhaps this was fixed in a newer version of Looker (We’re stuck on Looker 5.0.30 at the moment)

(Todd) #3

That was indeed the problem; now off to write a query to remediate it.

(molly.lippsett) #4

Hi @tfoley

Nice find on the completed_time, that indeed fixes the stuck schedule - we do see this behavior less frequently in more recent versions of Looker, but it is still conceivable that a query could hang and get stuck. In this case, you’ve found one solution, and another might be to find the associated queries and kill them (from the queries panel). That theoretically should unstick everything. I will also let the engineering team know that you are experiencing this on version 5.0.30.


(Todd) #5

Thanks for the reply. The query was not running, it was at least a week since the error happened; being the holidays it went under the radar for us. There was nothing in looker that would allow me to address the stuck schedule outside of deleting it and creating a new one. (which is fine but doesn’t scale well).

(Todd) #6

Hi Everyone,

We’ve addressed this issue on our end and have our schedules up and running again. here’s some info for anyone that might run into this or if looker wants to send off to their devs for a bug report.

Issue Details:

  • Some scheduled plans showed an error in their history ( unhandled errno: Too many open files in system)
  • Each of those scheduled plans had not run after that error, which happened a week before discovery.
  • The looker log had a warning when trying to run the plan on schedule: Scheduled plan PLAN_NAME[123] is still in stage 'execute' - skipping
  • The matching scheduled plan data in the Looker database showed that for all the failed scheduled plans, there was an empty column “scheduled_job_stage.completed_at”
  • Updating the column “scheduled_job_stage.completed_at” for the corrupted plans with a valid date resolved the issue and the schedules picked back up. NOTE: they don’t go backwards and re-run everything that’s happened since it failed, it’s a forward only fix.

Initial cause of the issue:
So what likely happened is that there was an abnormal set of plans running at the same time which caused the looker task to crash due to hitting the linux max open files limit. This resulted in a corrupted state of the scheduled plans. We’ve increased our limits so that specific error is not likely to happen again.

Looker Fix
When looker is trying to determine if it can run a scheduled plan, the place where the error output was “…is still in stage ‘execute’”, add some logic to determine if it’s really running; as in our case it was a failed query that had not run in a week. Or possibly have a task that audits the scheduled plans and if any are stuck in this state, fix them or even better if you can find a way to better handle that failure so there’s a cleanup step which makes sure the scheduled plans don’t get left in a broken state.

Fix the aftermath and get schedules running again.
Okay, so here’s the query to repair the corrupted schedules; I added a date conditional to ensure I’m only targeting one’s that failed during the incident and so the one’s that are currently running don’t break; which is the most important portion of this. I used the started_at date since it was available and makes this easier to run if we hit it again.

UPDATE looker.scheduled_job_stage sjs
SET sjs.completed_at = sjs.started_at
WHERE sjs.completed_at IS NULL
AND sjs.started_at < "2019-01-01 00:00:00"
AND sjs.started_at > "2018-12-18 00:00:00"

(Izzy) #7

Awesome sleuth work Todd! You’re totally right that that’s the way to fix this rare issue, and thanks a bunch for the detailed report-- it’s getting passed on to the engineers.

Just hopping in to add a disclaimer that in general we NEVER advise customers to directly alter their internal database without checking with us first-- It just can’t hurt to reach out to Looker support and confirm the situation + the fix.

If anyone else runs into this, that UPDATE command is a viable solution, but it’s specific to that internal database-- It’s always dangerous to tinker with the internal db unassisted, and I strongly encourage you to reach out to support instead.

But @tfoley does get major bravery points and a special one-copy-only “Internal DB Doctor” badge granted :wink: