CI: Add better/longer retries to AIO workflow TF #1939

Alex-Welsh · 2025-10-14T10:45:10Z

Previously, terraform apply was attempted up to 5 times, with a fixed 60-second wait after each failure.
All AIOs are triggered at the same time, so they might conflict with each other.

Now it is attempted up to 6 times. After 3 failed attempts, it waits for 2 hours (the cloud is probably at capacity, so this gives other jobs time to finish).
All other waits are randomised between 1 and 3 minutes.
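
The retry schedule described above can be sketched roughly as follows. This is an illustration in Python, not the shell step in the workflow itself. It assumes the 2-hour wait happens once, after the third failed attempt, and that every other wait is drawn uniformly from 60 to 180 seconds. The names run_with_retries, wait_after_failure and terraform_apply are hypothetical, as is the exact terraform command line.

```python
import random
import subprocess
import time

MAX_ATTEMPTS = 6                  # total tries, per the description above
LONG_WAIT_AFTER = 3               # assumed: the long wait follows the 3rd failure only
LONG_WAIT_SECONDS = 2 * 60 * 60   # 2 hours
SHORT_WAIT_RANGE = (60, 180)      # 1 to 3 minutes, inclusive


def wait_after_failure(attempt, rng=random):
    """Return the seconds to wait after failed attempt `attempt` (1-based)."""
    if attempt == LONG_WAIT_AFTER:
        return LONG_WAIT_SECONDS
    return rng.randint(*SHORT_WAIT_RANGE)


def run_with_retries(run_once, sleep=time.sleep, rng=random):
    """Call run_once() until it returns True or the attempts run out."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if run_once():
            return True
        # No point in waiting after the final attempt.
        if attempt < MAX_ATTEMPTS:
            sleep(wait_after_failure(attempt, rng))
    return False


def terraform_apply():
    """Hypothetical wrapper; the real workflow step may use other arguments."""
    result = subprocess.run(["terraform", "apply", "-auto-approve"])
    return result.returncode == 0
```

Calling run_with_retries(terraform_apply) would then make at most six attempts. The random short waits are what keep AIO jobs that started together from all retrying at the same moment.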

gemini-code-assist · 2025-10-14T10:45:15Z

Note: Gemini is unable to generate a review for this pull request because the file types involved are not currently supported.

Alex-Welsh · 2025-10-14T13:24:59Z

The one time I want to test a CI failure, every AIO works perfectly first time 🙃

.github/workflows/stackhpc-all-in-one.yml

priteau

The jump from 60-180 seconds to 7200 seconds is huge (an exponential backoff is more conventional), but let's see how it works. We can keep tuning it.
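
For comparison, the exponential backoff mentioned here might look something like the sketch below. It is not what the PR implements. The base of 60 seconds and the cap of 7200 seconds are hypothetical values chosen to match the numbers in this thread, and "full jitter" (a uniform draw between zero and the current ceiling) is only one common variant.

```python
import random


def backoff_ceilings(attempts, base=60, cap=7200):
    """Upper bound on the wait after each failed attempt: doubles, then caps."""
    return [min(cap, base * 2 ** (n - 1)) for n in range(1, attempts + 1)]


def backoff_delay(attempt, base=60, cap=7200, rng=random):
    """Seconds to wait after failed attempt `attempt` (1-based), with full jitter."""
    ceiling = min(cap, base * 2 ** (attempt - 1))
    # The random draw spreads out jobs that all failed at the same moment.
    return rng.uniform(0, ceiling)
```

With these values the ceilings for the five waits between six attempts are 60, 120, 240, 480 and 960 seconds, so the schedule grows gradually and would not reach the 2-hour cap within six attempts.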

Alex-Welsh requested a review from a team as a code owner October 14, 2025 10:45

Alex-Welsh temporarily deployed to SMS Lab October 14, 2025 10:51 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to Leafcloud October 14, 2025 10:51 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 14, 2025 10:51 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to Leafcloud October 14, 2025 10:51 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 14, 2025 10:52 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to Leafcloud October 14, 2025 10:52 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 14, 2025 10:52 — with GitHub Actions Inactive

Alex-Welsh force-pushed the better-aio-retries branch from b2ecf4f to c5fc4fa Compare October 21, 2025 12:35

Alex-Welsh temporarily deployed to Leafcloud October 21, 2025 12:42 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 21, 2025 12:42 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to Leafcloud October 21, 2025 12:42 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 21, 2025 12:42 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to Leafcloud October 21, 2025 12:42 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 21, 2025 12:42 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to Leafcloud October 21, 2025 12:43 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 21, 2025 12:43 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to Leafcloud October 21, 2025 12:43 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 21, 2025 12:43 — with GitHub Actions Inactive

priteau requested changes Oct 22, 2025

.github/workflows/stackhpc-all-in-one.yml (outdated, resolved)

Alex-Welsh force-pushed the better-aio-retries branch from c5fc4fa to 6ba3f40 Compare October 22, 2025 13:55

ci: Add better/longer retries to AIO workflow TF

62e53e0

Alex-Welsh force-pushed the better-aio-retries branch from 6ba3f40 to 62e53e0 Compare October 22, 2025 13:56

Alex-Welsh temporarily deployed to Leafcloud October 22, 2025 14:37 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 22, 2025 14:37 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to Leafcloud October 22, 2025 14:37 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 22, 2025 14:37 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 22, 2025 14:38 — with GitHub Actions Inactive

Alex-Welsh had a problem deploying to SMS Lab October 22, 2025 14:38 — with GitHub Actions Failure

Alex-Welsh had a problem deploying to Leafcloud October 22, 2025 14:38 — with GitHub Actions Failure

Alex-Welsh temporarily deployed to SMS Lab October 22, 2025 14:38 — with GitHub Actions Inactive

Alex-Welsh had a problem deploying to Leafcloud October 22, 2025 14:38 — with GitHub Actions Failure

Alex-Welsh temporarily deployed to Leafcloud October 22, 2025 14:38 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to SMS Lab October 22, 2025 16:35 — with GitHub Actions Inactive

Alex-Welsh temporarily deployed to Leafcloud October 22, 2025 16:35 — with GitHub Actions Inactive

priteau approved these changes Oct 23, 2025

Alex-Welsh merged commit b57e27d into stackhpc/2025.1 Oct 23, 2025
52 of 56 checks passed

Alex-Welsh deleted the better-aio-retries branch October 23, 2025 16:05

4 participants