Ephemeral runner hangs indefinitely after GitHub outage and does not recover #4076

@arianvp

Describe the bug

We spawn an ephemeral runner for each GitHub job we receive.
During the recent outage, ephemeral runners got stuck: they did not pick up jobs even though
they were registered with GitHub and had printed that they were listening for jobs.

There is no way for us to tell whether a runner is just waiting for a job or is genuinely stuck, so it is not safe
for us to kill it: there is a small chance that it picks up a job just before we kill it.

Basically, once a runner is registered and listening, there is no safe way for us to shut it down without race conditions.
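
To make the race concrete, here is a minimal sketch of the kind of external idle watchdog we would have to write ourselves. It is an illustration under assumptions, not how the runner works: `run_with_idle_timeout` is a hypothetical helper, and the "Running job" marker is an assumed stand-in for whatever the runner prints when it starts a job.

```python
import subprocess
import sys
import threading

# Assumption: text the runner prints when it starts a job (illustrative only).
JOB_STARTED_MARKER = "Running job"


def run_with_idle_timeout(cmd, idle_seconds, marker=JOB_STARTED_MARKER):
    """Run cmd and terminate it if `marker` has not appeared in its output
    within idle_seconds. Returns (exit_code, killed_for_idleness)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    job_seen = threading.Event()

    def watch_output():
        # Scan the child's stdout for the job-started marker.
        for line in proc.stdout:
            if marker in line:
                job_seen.set()

    reader = threading.Thread(target=watch_output, daemon=True)
    reader.start()

    killed = False
    try:
        proc.wait(timeout=idle_seconds)
    except subprocess.TimeoutExpired:
        # RACE: the runner can accept a job between this check and
        # terminate(); an outside observer cannot close that window.
        if not job_seen.is_set():
            proc.terminate()
            killed = True
    exit_code = proc.wait()
    reader.join(timeout=1)
    return exit_code, killed


if __name__ == "__main__":
    # Stand-in for a stuck runner: prints a listening message, then idles.
    fake_runner = [sys.executable, "-c",
                   "import time; print('listening', flush=True); time.sleep(30)"]
    print(run_with_idle_timeout(fake_runner, idle_seconds=1))
```

The check and the kill are two separate steps performed from outside the process, so this wrapper can still kill a runner that has just picked up a job. That is why we would prefer the runner itself to decide when to exit.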

To Reproduce

  1. Have a GitHub outage
  2. Wait for GitHub to recover from the outage
  3. Notice that all our runners are stuck

Expected behavior

The runner should eventually exit with a non-zero exit code if it fails to pick up jobs. An --exit-after-idle <hours-idle> flag, or something similar, would already help us.

Labels: bug