PATTERN Cited by 1 source

Eject failing PR, keep queue running¶

Pattern¶

When a merge queue's future-state pipeline fails for a queued PR, remove only that PR from the queue, leave it open with the failure results attached, and let the queue re-evaluate and continue so that the PRs behind it can proceed. Never stall the whole queue on one failing PR.

Canonical articulation — Atlassian Bitbucket Merge Queues, 2026-04-29:

"If a merge-queue build fails, that PR is removed and left open. The queue re-evaluates; others can proceed." (Source: sources/2026-04-29-atlassian-inside-atlassians-merge-queues)

Why it matters¶

Without this discipline, a single broken PR stalls every queued PR behind it. The merge queue loses its throughput advantage, merge latency spikes, and the queue ends up behaving like the serial merging it was meant to replace. Eject-on-failure is what lets a merge queue run at realistic throughput (Atlassian Jira: 300+ merges/day, build-concurrency = 14).

(1) Eject the PR, not the queue¶

The failed PR is removed from the queue and left open in the normal PR state.
The queue re-evaluates: the next PR in line becomes the top of the queue, and its future-state branch is re-materialised without the ejected PR.
In-flight pipelines for PRs that were downstream of the failing PR may have to re-run against the new future state. This is a throughput cost, but one the pattern accepts in preference to stalling.
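
The three steps above can be sketched as a small simulation. This is a minimal illustration, not Bitbucket's implementation: `PullRequest`, `run_queue`, and the `pipeline` callback are hypothetical names, and pipelines are evaluated one after another here, whereas a real queue runs them concurrently. The `invalidated` counter stands in for downstream in-flight pipelines that would have to re-run.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class PullRequest:
    pr_id: str
    state: str = "queued"               # "queued" | "open" | "merged"
    failure_log: Optional[str] = None   # failure results stay on the PR itself


def run_queue(
    queue: List[PullRequest],
    pipeline: Callable[[List[str]], Optional[str]],
) -> Tuple[List[str], int]:
    """Eject failing PRs and keep going.

    `pipeline` (hypothetical) receives a future state of main as a list of
    PR ids and returns None on success or an error message on failure.
    Returns (ids merged into main, count of invalidated downstream runs).
    """
    main: List[str] = []
    pending = list(queue)
    invalidated = 0
    while pending:
        candidate = list(main)   # future state being built up
        failed_at = None
        for i, pr in enumerate(pending):
            error = pipeline(candidate + [pr.pr_id])
            if error is not None:
                # Eject the PR, not the queue: leave it open with its results.
                pr.state = "open"
                pr.failure_log = error
                failed_at = i
                break
            candidate.append(pr.pr_id)
        merged_count = len(pending) if failed_at is None else failed_at
        for pr in pending[:merged_count]:
            pr.state = "merged"
        main = candidate
        if failed_at is None:
            pending = []
        else:
            # PRs behind the ejected one are re-validated without it.
            invalidated += len(pending) - failed_at - 1
            pending = pending[failed_at + 1:]
    return main, invalidated
```

With four queued PRs where only the second fails, the other three still merge in order, and the two PRs behind the failure are counted as invalidated runs.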

(2) Keep the failure results on the PR itself¶

The pipeline output is attached to the PR, not to a shared dashboard.
Authorship stays visible: the author sees that their own PR failed, sees why, and knows what to fix.

This second design choice is a social-coordination improvement, not just a technical one. A Jira tech lead on the outcome:

"The best part is what I don't see anymore: long Slack threads trying to figure out which PR broke master." (Source: sources/2026-04-29-atlassian-inside-atlassians-merge-queues)

The debugging thread that used to happen after a semantic merge conflict — "whose PR broke it?" — is pre-empted because the failing PR never landed.

Trade-offs¶

Upsides:

Queue throughput stays stable under failure. A single failing PR doesn't block the rest.
Authorship is preserved. The PR author sees their own failure; no shared rollback / revert thread.
Revert machinery is unused. main was never touched; no post-merge revert is needed.
Social coordination cost drops. No "which PR broke main?" Slack threads.

Downsides:

Downstream PRs may re-run. PRs that were queued behind the ejected PR have their future-state branches re-materialised and re-validated against the new head of the queue. This is a throughput cost; at high queue depth it can produce a cascade of re-runs.
Ejected-then-fixed PRs must re-enter the queue. The author fixes the issue, pushes a new revision, and re-adds the PR to the merge queue. The extra queue wait and round trip after a fix become part of the author's merge latency.
Observability is required. Queue operators need dashboards on eject rate, queue depth, and per-PR merge latency; otherwise a cascade of failures can go undetected.
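
The three operator metrics named above could be derived from per-PR queue records roughly as follows. This is a sketch under assumptions: the `QueueRecord` shape, the `"merged"` / `"ejected"` outcome labels, and the use of the median for latency are illustrative choices, not something the source specifies.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class QueueRecord:
    pr_id: str
    enqueued_at: float             # seconds, any consistent clock
    finished_at: Optional[float]   # None while the PR is still queued
    outcome: Optional[str]         # "merged", "ejected", or None if queued


def queue_metrics(records: List[QueueRecord]) -> Dict[str, Optional[float]]:
    """Compute queue depth, eject rate, and median merge latency."""
    finished = [r for r in records if r.outcome is not None]
    ejected = [r for r in finished if r.outcome == "ejected"]
    merged = [r for r in finished if r.outcome == "merged"]
    latencies = [r.finished_at - r.enqueued_at for r in merged]
    return {
        # PRs currently waiting in the queue.
        "queue_depth": float(len(records) - len(finished)),
        # Share of finished queue entries that ended in an eject.
        "eject_rate": len(ejected) / len(finished) if finished else 0.0,
        # Enqueue-to-merge time for PRs that actually landed.
        "median_merge_latency_s": median(latencies) if latencies else None,
    }
```

A sustained rise in `eject_rate` alongside growing `queue_depth` is the kind of signal that would point to a re-run cascade.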

Admin override: reorder + drain + deactivate¶

For hot-fix and emergency flows, admin controls let operators sidestep the default eject semantics:

Reorder — move a hot-fix PR to the top of the queue.
Drain — stop accepting new PRs into the queue while in-flight PRs complete.
Deactivate — pause merges entirely (e.g., during an incident where the main pipeline is broken independently of the queue).

These are escape hatches, not the default path. The default path is eject-and-continue; the escape hatches cover cases where the queue's own assumptions aren't valid. (Source: sources/2026-04-29-atlassian-inside-atlassians-merge-queues)
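
One way to picture the three controls is as operations on a simple queue object. The class and method names below (`MergeQueue`, `reorder_to_front`, `drain`, `deactivate`, `next_to_merge`) are hypothetical and only model the semantics described above; they are not Bitbucket's API.

```python
from collections import deque
from typing import Deque, Optional


class MergeQueue:
    """Toy model of a merge queue with admin escape hatches."""

    def __init__(self) -> None:
        self._prs: Deque[str] = deque()
        self.accepting = True   # False once the queue is draining
        self.active = True      # False once merges are paused

    def enqueue(self, pr_id: str) -> bool:
        """Add a PR; refused while the queue is draining."""
        if not self.accepting:
            return False
        self._prs.append(pr_id)
        return True

    def reorder_to_front(self, pr_id: str) -> None:
        """Reorder: move a hot-fix PR to the top of the queue."""
        self._prs.remove(pr_id)       # raises ValueError if not queued
        self._prs.appendleft(pr_id)

    def drain(self) -> None:
        """Drain: stop accepting new PRs; queued PRs still complete."""
        self.accepting = False

    def deactivate(self) -> None:
        """Deactivate: pause merges entirely."""
        self.active = False

    def next_to_merge(self) -> Optional[str]:
        """Return the next PR to merge, or None if paused or empty."""
        if not self.active or not self._prs:
            return None
        return self._prs.popleft()
```

Keeping drain and deactivate as separate flags mirrors the distinction in the list above: draining still lets queued PRs merge, while deactivating stops merges even if PRs are waiting.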

Composes with¶

patterns/validate-against-future-state-of-main — this pattern is the failure-recovery half of the validate-against-future-state-of-main discipline. Inseparable in practice.
patterns/parent-child-pipelines-for-ci-parallelism — the merge-queue pipeline can itself run multiple parallel parent-child pipelines; eject semantics apply at the merge-queue-pipeline level regardless of internal parallelism.

Generalises to¶

The same "fail-one-unit, continue-the-pipeline" shape shows up in unrelated systems:

Batch job runners: a failing task in a batch doesn't halt the batch; it's marked failed and retried/reported.
Deploy pipelines: a failing canary on one shard doesn't halt the other shards' deploys.
Data-pipeline stream processors: a single bad record is dead-lettered rather than stopping the whole stream.

The shared insight: a queue / batch / pipeline's throughput is protected by isolating failures to their smallest unit, not by halting the whole stream.
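
The shared shape can be written generically. The helper below is an illustrative sketch with hypothetical names (`process_isolating_failures`, `handler`): each unit is handled on its own, and a failure is set aside with its error instead of halting the rest, in the spirit of a dead-letter queue.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_isolating_failures(
    items: Iterable[T],
    handler: Callable[[T], R],
) -> Tuple[List[R], List[Tuple[T, str]]]:
    """Apply `handler` to each item, isolating failures per item.

    Returns (results for items that succeeded, dead letters as
    (item, error description) pairs for items that failed).
    """
    results: List[R] = []
    dead_letters: List[Tuple[T, str]] = []
    for item in items:
        try:
            results.append(handler(item))
        except Exception as exc:
            # Keep the failure next to the unit that caused it and move on.
            dead_letters.append((item, repr(exc)))
    return results, dead_letters
```

For example, parsing `["1", "x", "3"]` with `int` as the handler yields two results and one dead letter for `"x"`, and the third item is still processed.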

Seen in¶

sources/2026-04-29-atlassian-inside-atlassians-merge-queues — Atlassian Bitbucket Merge Queues eject-on-failure; PR stays open with results; queue re-evaluates and other PRs proceed.