GitLab CI on_failure Execution Logic

The orchestration of complex CI/CD pipelines requires a granular understanding of how GitLab handles job failures and subsequent triggers. Specifically, the when: on_failure keyword is designed to provide a mechanism for failure recovery, cleanup operations, or conditional branching when a preceding stage or job does not complete successfully. However, the interaction between when: on_failure, the allow_failure flag, and the rules syntax creates a non-intuitive environment where the expected behavior of the pipeline often diverges from the actual execution. Understanding these dynamics is critical for engineers who need to implement robust error handling without inadvertently blocking the merging of code or skipping vital cleanup processes.

The Mechanics of when: on_failure

The when: on_failure directive tells GitLab to run a job only if at least one job in an earlier stage has failed. This is fundamentally a reactive trigger. In a standard pipeline flow, if stage A contains a job that fails, any job in a later stage B configured with when: on_failure will be triggered. The decision is made by the GitLab server (the coordinator) when it schedules jobs, not by the Runner, which only executes the jobs it is handed.
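As a minimal sketch of this trigger, the following two-stage pipeline forces a failure and reacts to it. The stage and job names (build, recover, build_job, recover_job) are illustrative assumptions, not taken from any real project.

```yaml
stages:
  - build
  - recover

# Hypothetical job that fails on purpose to demonstrate the trigger.
build_job:
  stage: build
  script:
    - echo "simulating a failure"
    - exit 1

# Runs only when at least one job in an earlier stage has failed.
# If build_job succeeds, this job is skipped.
recover_job:
  stage: recover
  when: on_failure
  script:
    - echo "a job in an earlier stage failed, running recovery"
```

Removing the exit 1 line should cause recover_job to be skipped, which makes the reactive nature of the keyword easy to observe.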

This mechanism is frequently utilized in scenarios where a developer needs to perform a specific action following a crash or a failed test suite. For instance, a user attempting to clone a repository to a remote server may find that the branch already exists or the process exits with an error. In such a case, using when: on_failure allows the pipeline to transition to a recovery job—such as creating a branch based on the default branch—only when the initial attempt fails. This prevents the duplication of logic within a single script and leverages the pipeline's native state management to handle branching logic.

The Interaction Between allow_failure and on_failure

A critical point of failure in pipeline configuration is the interaction between allow_failure and when: on_failure. The allow_failure keyword determines whether a job's failure should result in the entire pipeline being marked as failed.

When allow_failure is set to true, GitLab treats the job failure as an acceptable outcome: the job is flagged with a warning, the pipeline keeps running, and it can still finish as passed. Because when: on_failure triggers only when an earlier job counts as failed, and GitLab counts a failed job with allow_failure: true as successful for this purpose, setting allow_failure: true on a job in stage A prevents a job in stage B configured with when: on_failure from ever executing. The system essentially views the failure as "allowed," and therefore not a "failure" in the context of triggering recovery jobs.

This creates a paradox for users who want manual jobs that do not block merging (which typically requires allow_failure: true) but still want to trigger cleanup jobs if those manual jobs are executed and fail. If the manual job is allowed to fail, the subsequent cleanup job tied to on_failure is ignored. To resolve this, users must set allow_failure: false to ensure the failure state is propagated, though this may result in the pipeline being marked as failed and potentially blocking the merge request.

| Configuration Combination | Effect on Subsequent on_failure Jobs | Pipeline Status Impact |
| --- | --- | --- |
| allow_failure: false + job fails | Triggered | Pipeline marked as Failed |
| allow_failure: true + job fails | Not Triggered | Pipeline marked as Success/Warning |
| allow_failure: false + job succeeds | Not Triggered | Pipeline marked as Success |
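The first two combinations can be compared side by side in a small sketch. The job names and the run_tests.sh and lint.sh scripts are hypothetical placeholders used only for illustration.

```yaml
stages:
  - test
  - cleanup

# A failure here counts as a real failure, so on_failure jobs
# in later stages are triggered and the pipeline is marked failed.
strict_tests:
  stage: test
  allow_failure: false   # the default for non-manual jobs, listed for clarity
  script:
    - ./run_tests.sh     # hypothetical script

# A failure here is shown as a warning and is treated as a success
# when GitLab decides whether to run on_failure jobs.
optional_lint:
  stage: test
  allow_failure: true
  script:
    - ./lint.sh          # hypothetical script

# Runs if strict_tests fails; a failure of optional_lint alone
# does not trigger it.
cleanup_on_failure:
  stage: cleanup
  when: on_failure
  script:
    - echo "cleaning up after a failed test stage"
```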

Conditional Logic and the Rules Syntax

The implementation of when keywords differs significantly when used as a standalone keyword versus when used within a rules block. This inconsistency is a primary source of confusion for DevOps engineers.

When when: on_failure is used on its own, it functions as a global trigger for the job based on the status of previous stages. However, when integrated into a rules block, the behavior changes. A common failure pattern occurs when users attempt to use rules to assign different variables based on the success or failure of the pipeline.

For example, a user may define a cleanup job with the following logic:

```yaml
cleanup_rules_job:
  stage: cleanup
  rules:
    - when: on_success
      variables:
        PIPELINE_STATUS: succeeded
    - when: on_failure
      variables:
        PIPELINE_STATUS: failed
  script:
    - echo "cleaning up"
    - echo "The pipeline has ${PIPELINE_STATUS}!"
```

In this scenario, the expected behavior is that the job runs regardless of the outcome, but with a variable indicating the status. In practice, cleanup_rules_job does not behave this way, and in some GitLab versions it has been reported not to run after a failure at all. A likely explanation is that rules are evaluated in order and the first matching rule wins: a rule with no if, changes, or exists clause always matches, so the on_success rule is always selected and the on_failure rule is never reached. This forces users to adopt an "ugly workaround" by splitting the cleanup into two separate jobs: one for success and one for failure.

```yaml
cleanup_job_on_success:
  stage: cleanup
  when: on_success
  variables:
    PIPELINE_STATUS: succeeded
  script:
    - echo "cleaning up"
    - echo "The pipeline has ${PIPELINE_STATUS}!"

cleanup_job_on_failure:
  extends: cleanup_job_on_success
  when: on_failure
  variables:
    PIPELINE_STATUS: failed
```

This duplication is inefficient but ensures that the cleanup logic is executed regardless of the pipeline status, allowing the script to access the state of the pipeline via the defined variables.
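When the cleanup script does not need to know whether the pipeline succeeded or failed, a single job with when: always may be enough. The sketch below assumes that case; the job name cleanup_job is illustrative, and the job deliberately does not set PIPELINE_STATUS because nothing in this configuration can supply it.

```yaml
# Runs after earlier stages finish, whether they succeeded or failed.
# Trade-off: the script cannot tell which outcome occurred.
cleanup_job:
  stage: cleanup
  when: always
  script:
    - echo "cleaning up"
```

This trades the status variable for a single job definition, so it suits unconditional cleanup such as removing temporary resources, not status reporting.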

Dependency Mapping with Needs and on_failure

The needs keyword allows for a directed acyclic graph (DAG) approach to pipeline execution, bypassing traditional stage ordering. When combined with when: on_failure, it allows for highly specific recovery paths.

A complex use case involves a pipeline where a job in stage three should only run if a specific manual job in stage two fails. If the manual job is configured with allow_failure: true, it may not trigger the on_failure job. However, if the manual job is explicitly marked as allow_failure: false, the pipeline will recognize the failure and trigger the subsequent job.

Consider a configuration where job2.1 is a manual job:

```yaml
job2.1:
  stage: two
  script:
    - echo "job used to ignore stage two failures, allow_failure is true by default but listing here for clarity"
  when: manual
  allow_failure: true

job3.0:
  stage: three
  script:
    - echo "runs only if a job from stage 2 fails and pauses execution of the pipeline as it is marked as manual non-optional"
  when: on_failure
  needs:
    - job: job2.1
  allow_failure: false
```

In this architecture, needs creates a direct dependency of job3.0 on job2.1. As listed, job2.1 keeps allow_failure: true, so its failure is not counted and job3.0 does not start; if job2.1 is instead set to allow_failure: false and then fails, job3.0 will start. However, a significant side effect occurs: once job3.0 starts due to the failure of a preceding job, all subsequent stages and jobs may still be marked as skipped if the overall pipeline state remains "failed." This means that while the failure-handling job executes, the "happy path" of the pipeline is severed, and the pipeline will not automatically recover to run later successful stages unless specifically configured to do so.
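The variant in which the manual job is not allowed to fail could look like the sketch below. It reuses the job names from the example above; the scripts are placeholders, and the behavior noted in the comments follows the description in this section and has not been checked against a specific GitLab version.

```yaml
stages:
  - two
  - three

# Manual job whose failure is propagated. With allow_failure: false
# a manual job is blocking: the pipeline waits until someone runs it.
job2.1:
  stage: two
  when: manual
  allow_failure: false
  script:
    - echo "simulating a failed manual step"
    - exit 1

# Depends directly on job2.1 and is expected to start only if it fails.
job3.0:
  stage: three
  when: on_failure
  needs:
    - job: job2.1
  script:
    - echo "job2.1 failed, running the recovery path"
```

The cost of this design is the one discussed earlier: because job2.1 is blocking and its failure counts, the pipeline cannot finish as passed until the job is run successfully.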

Practical Application: Failure-Specific Branching

The use of on_failure is often a solution to issues where shell scripts within a job cannot accurately detect failure states due to how GitLab runners handle exit codes. Some users have reported that if statements within a script block are entered even when a process exits with 0 or when a folder is correctly created.

By moving the fallback logic into a separate job with when: on_failure, the developer shifts the responsibility of failure detection from the shell script to the GitLab CI coordinator. This ensures that the recovery action—such as creating a branch on GitHub from a GitLab runner—only occurs when the primary job truly fails. This architectural shift reduces the complexity of the script and increases the reliability of the pipeline's conditional execution.
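A sketch of this pattern for the branch scenario follows. The REMOTE_URL value, the job names, and the work directory are hypothetical, and the sketch assumes the runner already has credentials that allow pushing to the remote. CI_COMMIT_REF_NAME is GitLab's predefined variable holding the branch or tag name.

```yaml
stages:
  - fetch
  - recover

variables:
  REMOTE_URL: "https://example.com/group/project.git"  # hypothetical remote

# Primary attempt: git exits non-zero if the branch is missing on the
# remote, so the job fails without any if statement in the script.
clone_branch:
  stage: fetch
  script:
    - git clone --branch "$CI_COMMIT_REF_NAME" --single-branch "$REMOTE_URL" work

# Fallback: create the branch from the remote's default branch.
# Runs only when a job in the fetch stage has failed.
create_branch:
  stage: recover
  when: on_failure
  script:
    - git clone "$REMOTE_URL" work
    - cd work
    - git checkout -b "$CI_COMMIT_REF_NAME"
    - git push origin "$CI_COMMIT_REF_NAME"
```

Note that create_branch reacts to any failure of clone_branch, including network errors, so a production version would likely need to verify the cause before pushing.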

Analysis of Failure Patterns

The implementation of on_failure reveals a fundamental tension in GitLab CI between "Pipeline Status" and "Job Execution."

The primary issue is that on_failure is not a trigger for "any error," but a trigger for a "Pipeline Failure State." When a user employs allow_failure: true, they are explicitly telling the coordinator that the error should not change the pipeline state. Consequently, any job waiting for an on_failure trigger is waiting for a state change that will never happen.

Furthermore, the inconsistency between when as a top-level keyword and when within rules reflects a difference in how GitLab processes job eligibility. With a top-level when, GitLab decides whether to run the job from the status of the jobs in earlier stages. With rules, GitLab walks the rule list in order and applies the first rule that matches, and a rule containing only when always matches. When a user tries to use rules to simulate an always behavior (by listing both on_success and on_failure), the second rule is therefore never applied: the system treats the rules as a set of mutually exclusive conditions for the job's existence, rather than a set of triggers for its execution.

To achieve a truly robust failure handling system, the following architectural principles should be applied:

Use allow_failure: false for any job whose failure must trigger a subsequent recovery or cleanup job.
Avoid relying on rules for status-based variables if the goal is to ensure a job runs in all scenarios; instead, use when: always or separate jobs for success and failure.
Utilize needs to create direct dependencies for failure handling, but be aware that this does not automatically clear the "failed" status of the pipeline for subsequent stages.
Shift complex conditional shell logic (e.g., checking for directory existence) into separate on_failure jobs to utilize the CI coordinator's native exit code tracking.