We’ve addressed this issue on our end and have our schedules up and running again. here’s some info for anyone that might run into this or if looker wants to send off to their devs for a bug report.
- Some scheduled plans showed an error in their history (java.io.IOException: unhandled errno: Too many open files in system)
- Each of those scheduled plans had not run after that error, which happened a week before discovery.
- The looker log had a warning when trying to run the plan on schedule:
Scheduled plan PLAN_NAME is still in stage 'execute' - skipping
- The matching scheduled plan data in the Looker database showed that for all the failed scheduled plans, there was an empty column “scheduled_job_stage.completed_at”
- Updating the column “scheduled_job_stage.completed_at” for the corrupted plans with a valid date resolved the issue and the schedules picked back up. NOTE: they don’t go backwards and re-run everything that’s happened since it failed, it’s a forward only fix.
Initial cause of the issue:
So what likely happened is that there was an abnormal set of plans running at the same time which caused the looker task to crash due to hitting the linux max open files limit. This resulted in a corrupted state of the scheduled plans. We’ve increased our limits so that specific error is not likely to happen again.
When looker is trying to determine if it can run a scheduled plan, the place where the error output was “…is still in stage ‘execute’”, add some logic to determine if it’s really running; as in our case it was a failed query that had not run in a week. Or possibly have a task that audits the scheduled plans and if any are stuck in this state, fix them or even better if you can find a way to better handle that failure so there’s a cleanup step which makes sure the scheduled plans don’t get left in a broken state.
Fix the aftermath and get schedules running again.
Okay, so here’s the query to repair the corrupted schedules; I added a date conditional to ensure I’m only targeting one’s that failed during the incident and so the one’s that are currently running don’t break; which is the most important portion of this. I used the started_at date since it was available and makes this easier to run if we hit it again.
UPDATE looker.scheduled_job_stage sjs
SET sjs.completed_at = sjs.started_at
WHERE sjs.completed_at IS NULL
AND sjs.started_at < "2019-01-01 00:00:00"
AND sjs.started_at > "2018-12-18 00:00:00"