Orchestrating Pipeline Determinism via GitLab CI/CD on_success Logic

The fundamental architecture of a Continuous Integration and Continuous Deployment (CI/CD) pipeline relies on the predictable execution of job sequences based on the outcome of preceding tasks. Within the GitLab CI/CD ecosystem, the when keyword serves as the primary mechanism for defining the conditional execution of jobs. Among the various strategies available to DevOps engineers, on_success represents the default operational state, ensuring that downstream processes only initiate if the prerequisite stages have concluded without error. However, achieving true pipeline determinism requires a nuanced understanding of how on_success interacts with other execution policies such as on_failure, always, and manual. Misconfigurations in these logical gates can lead to non-deterministic pipeline behavior, where jobs are unexpectedly skipped or executed in an order that defies the engineer's mental model of the workflow. This article examines the technical intricacies of the on_success policy, its relationship with job rules, the implications of stage-based sequencing, and the complex edge cases encountered when attempting to combine conditional logic with manual intervention.

The Mechanics of the on_success Execution Policy

In GitLab CI/CD, when: on_success is the implicit default. When a job is defined without a when clause, GitLab schedules it only if no job in the preceding stages has failed. The decision is made by the GitLab server when it processes the pipeline, not by the runner, which simply executes the jobs it is handed.

The execution logic follows a strict hierarchical dependency:

Direct Fact: on_success executes a job only when no job in an earlier stage has failed. Jobs marked allow_failure: true are not counted as failures for this purpose.
Impact Layer: This prevents a "cascade of errors" in which a deployment job pushes a broken build to production even though an earlier testing stage failed. Enforcing this gate preserves the integrity of the deployment target.
Contextual Layer: This dependency is tied to the concept of "stages." Because jobs within a single stage run in parallel, the on_success condition for a job in Stage B is met only once every job in Stage A has finished, and each has either exited with code 0 or been allowed to fail. A job that declares needs is the exception: it starts as soon as the jobs it lists have finished, without waiting for the rest of the earlier stage.
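As a minimal sketch of the default gate described above, the fragment below leaves when unset on the deploy job. The stage names and the three scripts (run_tests.sh, run_lint.sh, deploy.sh) are placeholders chosen for illustration, not part of any real project.

```yaml
stages:
  - test
  - deploy

unit_tests:
  stage: test
  script:
    - ./run_tests.sh     # hypothetical script; a non-zero exit fails the job

lint:
  stage: test
  script:
    - ./run_lint.sh      # hypothetical script
  allow_failure: true    # a failure here does not block later stages

deploy:
  stage: deploy
  # No `when` key: GitLab applies `when: on_success` by default,
  # so this job waits for the whole `test` stage to finish.
  script:
    - ./deploy.sh        # hypothetical script
```

In this sketch, deploy would be skipped if unit_tests fails, but would still run if only lint fails.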

The following table delineates the primary when policies available for job configuration:

Policy	Functional Requirement	Execution Trigger
on_success	All prior jobs must exit with code 0	Standard sequential progression
on_failure	At least one job in a prior stage must fail	Error handling and cleanup tasks
always	Regardless of prior job outcomes	Logging, telemetry, or mandatory cleanup
manual	Requires human interaction via the UI	Controlled deployments or gatekeeping
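The four policies in the table can be placed side by side in one configuration. The following is an illustrative sketch only; the job names, stage names, and scripts are assumptions made for the example.

```yaml
stages:
  - build
  - report
  - release

compile:
  stage: build
  when: on_success       # explicit form of the default
  script:
    - ./build.sh         # hypothetical script

notify_failure:
  stage: report
  when: on_failure       # runs only if a job in an earlier stage failed
  script:
    - ./notify.sh        # hypothetical script

collect_logs:
  stage: report
  when: always           # runs whether earlier jobs passed or failed
  script:
    - ./collect_logs.sh  # hypothetical script

release:
  stage: release
  when: manual           # waits for someone to start it from the UI
  script:
    - ./release.sh       # hypothetical script
```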

Navigating the Complexity of Mixed Execution Policies

A significant challenge arises when engineers attempt to blend on_success with on_failure to create complex recovery workflows. A common scenario involves using an on_failure job to perform a recovery action—such as creating a branch on a remote server like GitHub—and then expecting a subsequent on_success job to continue the pipeline.

One documented issue involves the perceived breakdown of the on_success logic when an on_failure job is triggered. In a specific workflow involving jobs A, B, C, and D:

If Job A fails, Job B (set to on_failure) executes to attempt a recovery.
If Job B succeeds in its recovery attempt, the pipeline logic regarding Job C and Job D becomes ambiguous.
In some configurations, the pipeline might execute A, skip B (if B was supposed to be on_success), execute C, and skip D.
In other cases, if B is on_failure and succeeds, the pipeline might execute A, execute B, execute C, and then execute D.

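The confusion often stems from the definition of "success" in a pipeline. If a job is explicitly designed to run on_failure, its successful execution is still a response to a failure in a prior stage. This raises a question for downstream on_success jobs: does the "success" of a recovery job count as a successful stage, or does the original failure still invalidate the on_success requirement? The documented definition of on_success (run only when no job in earlier stages has failed, ignoring jobs marked allow_failure: true) points to the latter, because the failed job still sits in an earlier stage. Engineers who expect the recovery job to "repair" the pipeline are therefore likely to see downstream jobs skipped.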
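The four-job scenario can be written down as a configuration to make the question concrete. This is a sketch under assumptions: the stage names and scripts are invented for illustration, and the comments describe what the documented definition of on_success implies rather than a behavior verified on every GitLab version.

```yaml
stages:
  - fetch
  - recover
  - build
  - publish

job_a:
  stage: fetch
  script:
    - ./fetch_branch.sh    # hypothetical; assume it exits non-zero here

job_b:
  stage: recover
  when: on_failure         # runs because job_a failed
  script:
    - ./create_branch.sh   # hypothetical recovery step

job_c:
  stage: build
  # Default on_success: considers every earlier stage, including `fetch`,
  # where job_a is still recorded as failed.
  script:
    - ./build.sh           # hypothetical script

job_d:
  stage: publish
  script:
    - ./publish.sh         # hypothetical script
```

It is worth running a throwaway pipeline like this one against the GitLab version in use before relying on either outcome.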

To resolve such issues, experts suggest that instead of relying on the when: on_failure trigger to bridge the gap between failure and success, engineers should implement logic within a single job to check for preconditions. For instance, a job can attempt to clone a branch and, using shell logic (e.g., if statements), decide whether to create a new branch or proceed with the existing one. This consolidates the logic into a single exit code, making the downstream on_success behavior predictable.
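One way to sketch the single-job approach is shown below. REMOTE_URL and TARGET_BRANCH are hypothetical variables introduced for this example, and the URL is a placeholder. The fragment assumes git is installed on the runner and that credentials for pushing are already configured.

```yaml
prepare_branch:
  variables:
    REMOTE_URL: "https://example.com/group/project.git"   # placeholder URL
    TARGET_BRANCH: "sync/$CI_COMMIT_REF_SLUG"               # hypothetical naming scheme
  script:
    - |
      # Check the precondition inside the job instead of relying on on_failure.
      if git ls-remote --exit-code "$REMOTE_URL" "refs/heads/$TARGET_BRANCH" >/dev/null; then
        echo "Branch exists, cloning it"
        git clone --branch "$TARGET_BRANCH" "$REMOTE_URL" work
      else
        echo "Branch not found, creating it"
        git clone "$REMOTE_URL" work
        git -C work checkout -b "$TARGET_BRANCH"
        git -C work push --set-upstream origin "$TARGET_BRANCH"
      fi
```

Both branches of the if statement end in the same job result, so downstream on_success jobs see a single exit code. Note that ls-remote also returns non-zero on network errors; in that case the else branch attempts a clone, which fails and marks the job as failed.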

Rules and Workflow Logic in Pipeline Configuration

The transition from the legacy only/except syntax to the modern rules syntax has introduced new complexities in how on_success is applied. The rules keyword provides a much more granular way to determine job inclusion, but it requires strict adherence to specific patterns to avoid "double pipelines" and conflicting logic.

The implementation of rules allows for conditional execution based on predefined variables such as CI_PIPELINE_SOURCE.

Preventing Duplicate Pipelines

A common error in GitLab CI configuration is defining rules for a job without utilizing workflow: rules. This often results in the creation of duplicate pipelines—one for a branch push and one for a merge request event.

To prevent this, the following structure is recommended:

```yaml
workflow:
  rules:
    # Run merge request pipelines.
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Skip branch pipelines when an open merge request already covers the branch.
    - if: $CI_COMMIT_BRANCH && $CI_OPEN_MERGE_REQUESTS
      when: never
    # Otherwise run branch pipelines.
    - if: $CI_COMMIT_BRANCH

job:
  script: echo "Executing with optimized rules"
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: on_success
    - if: $CI_PIPELINE_SOURCE == "schedule"
      when: never
    - when: on_success
```

Failure to implement this correctly can lead to a scenario where a single commit triggers two separate pipelines, consuming double the runner resources and causing confusion in the pipeline UI.

The Danger of Mixing Syntax

Another critical rule for maintainability is to avoid mixing only/except and rules within the same pipeline configuration. Within a single job the combination is rejected outright as a configuration error. Across different jobs in the same pipeline it is accepted without a syntax error, but the default behaviors of the two systems are fundamentally different.

only/except focuses on the existence of certain conditions (like a branch name).
rules focuses on the evaluation of logical expressions and the assignment of when clauses.

Mixing them can lead to "silent" failures where a job is skipped because the two logic engines are operating under different assumptions about the pipeline state, making troubleshooting exceptionally difficult.
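When migrating a job, a like-for-like translation keeps the whole file on one syntax. The sketch below is an approximation rather than an exact equivalence (for example, only: main also matches a tag named main), and the job name and script are placeholders.

```yaml
# Legacy form, shown as comments so the fragment stays valid:
# deploy:
#   only:
#     - main
#   script:
#     - ./deploy.sh

# Roughly equivalent form using rules:
deploy:
  script:
    - ./deploy.sh          # hypothetical script
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: on_success
    # No further rule: the job is left out of pipelines that do not match.
```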

Interaction Between Manual Actions and Automated Success

The when: manual policy is often used as a final gate for production deployments. However, a significant limitation in the current GitLab UI is the inability to dynamically change a manual job to an automated job once the pipeline has been instantiated.

The Manual-to-Success Constraint

By design, a when: manual job is gated in much the same way as an on_success job: the manual action typically becomes available only once the previous stages have succeeded. The result is a rigid workflow in which the pipeline reaches the gate automatically but a human must intervene to move it forward.

A common requirement among DevOps professionals is the ability to "toggle" a deployment from manual to automatic for specific runs. For example, an engineer might want to run a standard pipeline where deployment is manual, but for a specific hotfix, they want to enable auto-deployment if all tests pass.

Currently, there is no UI button or checkbox in the GitLab pipeline interface to switch a manual job to on_success for a single execution.

Implementing the Variable-Based Workaround

To achieve this level of flexibility, engineers must utilize a combination of custom variables and rules. The most effective pattern involves defining a variable (e.g., AUTO_DEPLOY) and using it to override the when clause.

```yaml
deploy_job:
  stage: deploy
  script:
    - ./deploy.sh
  rules:
    - if: '$AUTO_DEPLOY == "true"'
      when: on_success
    - when: manual
```

In this configuration, the pipeline behavior is determined at the moment of creation:

If the user triggers the pipeline via the "Run pipeline" screen and sets AUTO_DEPLOY to true, the job becomes an on_success job.
If the variable is not set or is set to any other value, the job defaults to manual mode.

This workaround requires the engineer to actively manage variables during the pipeline trigger phase, as the pipeline UI does not allow for post-creation modification of these rules.
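To make the variable easier to manage at trigger time, it can be declared at the top level with a default value and a description, so that it appears pre-filled on the "Run pipeline" form. The fragment below is a sketch; the description text is illustrative wording.

```yaml
variables:
  AUTO_DEPLOY:
    value: "false"   # default keeps the deploy job manual
    description: "Set to true to deploy automatically once earlier stages pass."
```

With this default in place, pipelines created by an ordinary push fall through to the manual rule, and only a run that overrides the value with true takes the on_success path.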

Advanced Variable Contexts for Rule Evaluation

To use on_success effectively within complex rules structures, one must understand the values taken by the predefined variable CI_PIPELINE_SOURCE. These values let the pipeline distinguish between different triggers, ensuring that on_success jobs run only in appropriate contexts.

The following table lists commonly used values of CI_PIPELINE_SOURCE and where each applies:

Variable	Value	Contextual Application
`CI_PIPELINE_SOURCE`	`api`	Triggered via GitLab's API; useful for external automation.
`CI_PIPELINE_SOURCE`	`chat`	Triggered via ChatOps; allows for conversational CI/CD.
`CI_PIPELINE_SOURCE`	`external`	Triggered by non-GitLab CI services.
`CI_PIPELINE_SOURCE`	`merge_request_event`	Specifically for MR pipelines; critical for preventing double pipelines.
`CI_PIPELINE_SOURCE`	`push`	Standard branch push events.
`CI_PIPELINE_SOURCE`	`schedule`	Triggered by scheduled cron jobs; allows for periodic testing.

Understanding these sources is vital when defining rules. For example, an engineer might want an on_success job to run during a push event but remain manual or never during a merge_request_event to save resources.
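The push-versus-merge-request split described above might be sketched as follows. The job name and script are placeholders, and the fragment assumes a workflow: rules block elsewhere in the file already prevents duplicate pipelines.

```yaml
integration_tests:
  script:
    - ./run_integration.sh   # hypothetical script
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: manual
      allow_failure: true    # keeps the MR pipeline from blocking on the manual job
    - if: $CI_PIPELINE_SOURCE == "schedule"
      when: never
    - if: $CI_PIPELINE_SOURCE == "push"
      when: on_success
```

Rules are evaluated top to bottom and the first match wins, so the order of the entries matters.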

Architectural Considerations for Shell Runners

A common misconception regarding on_success and on_failure involves the lifecycle of the runner itself, particularly when using shell executors. There is often confusion regarding whether an on_failure job can effectively "clean up" after a failed job.

When a job fails on a shell runner, the environment is not automatically wiped. If an on_failure job is used for cleanup (e.g., removing large build artifacts or clearing disk space), it is essential to understand the following:

Sequence and Parallelism: Jobs within the same stage run in parallel, and on_failure reacts to failures in earlier stages. An on_failure job placed in the same stage as the job it is meant to clean up after will therefore not behave as a sequential cleanup task; it belongs in a later stage.
Artifact Persistence: By default (artifacts:when set to on_success), artifacts are archived and uploaded to the GitLab server only when the job exits successfully. A failed job's files are not available to subsequent on_failure jobs unless the failing job sets artifacts:when to on_failure or always.
Runner Isolation: There is no guarantee that an on_failure job will be scheduled on the same runner where the failure occurred, especially in a distributed runner environment. If the cleanup task requires access to local files created during the failed job, the on_failure job may fail to find them.
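The artifact and isolation points above can be addressed partly in configuration. The sketch below uses a hypothetical runner tag (shell-builder-01), hypothetical scripts, and an assumed log directory. Pinning both jobs to one tag narrows where they are scheduled, but does not by itself guarantee the same working directory.

```yaml
stages:
  - build
  - cleanup

build:
  stage: build
  tags:
    - shell-builder-01     # hypothetical tag, assumed to belong to one runner
  script:
    - ./build.sh           # hypothetical script
  artifacts:
    when: on_failure       # upload logs only when the job fails
    paths:
      - build/logs/        # assumed log location
    expire_in: 1 day

cleanup_after_failure:
  stage: cleanup
  when: on_failure
  tags:
    - shell-builder-01
  script:
    # Artifacts from earlier stages are fetched before this script runs.
    - ls build/logs/ || true
```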

For robust cleanup, it is often better to use the always policy or incorporate cleanup logic directly into the script block of the primary job using shell trap commands or try/catch logic, rather than relying on a separate on_failure job.
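A minimal sketch of in-job cleanup with trap follows. It assumes a POSIX-compatible shell on the runner; build.sh and its --workdir flag are hypothetical stand-ins for whatever the job actually runs.

```yaml
build:
  script:
    - |
      # Create a scratch directory and register its removal for shell exit,
      # whether the build step below succeeds or fails.
      SCRATCH_DIR="$(mktemp -d)"
      trap 'rm -rf "$SCRATCH_DIR"' EXIT
      ./build.sh --workdir "$SCRATCH_DIR"   # hypothetical script and flag
  after_script:
    # after_script also runs when script fails, but in a separate shell,
    # so variables such as SCRATCH_DIR are not visible here.
    - echo "Build step finished"
```

Keeping the trap and the command it protects in the same multi-line block avoids depending on how the runner joins separate script entries.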

Technical Analysis of Pipeline Determinism

The implementation of on_success is more than a simple "if successful" check; it is the anchor of pipeline stability. However, as demonstrated through the various failure modes and configuration complexities, it is highly sensitive to the surrounding logic. The transition from simple job sequences to complex, rule-based pipelines requires a shift in perspective: from viewing jobs as isolated tasks to viewing them as parts of a state machine.

The primary takeaway for senior DevOps engineers is that the when keyword is a declaration of intent, while the rules keyword is the implementation of logic. When the two are misaligned, for example when an on_failure job is used to fix a state that a subsequent on_success job expects to be pristine, the pipeline stops behaving the way its author expects. For reliable results, engineers should favor explicit logic (via shell scripts and variables) over implicit pipeline transitions (via on_failure and on_success triggers) whenever the workflow requires complex conditional branching.