In continuous integration and continuous deployment (CI/CD), the default paradigm assumes success. Workflows are typically designed on the expectation that every step succeeds: when a step fails, GitHub Actions skips the remaining steps by default, marks the job as failed, and skips any jobs that depend on it. However, modern software engineering practices frequently demand more nuanced error handling. Engineers often need to execute cleanup tasks, roll back failed deployments, or create notification tickets when a specific step encounters an error. While platforms like Azure DevOps pipelines offer built-in control options to implement "run on failure" logic, GitHub Actions requires a different approach to achieve the same goals. Understanding how to control workflow execution flow, handle step-level failures without stopping the pipeline, and re-run failed components is essential for building resilient and maintainable automation infrastructure.
Implementing Failure-Dependent Steps with Conditional Logic
The primary mechanism for executing a step only after a failure in GitHub Actions is a conditional expression in the step configuration. Unlike Azure DevOps, which provides explicit control options in its YAML pipeline definitions for running tasks on failure, GitHub Actions relies on status check functions. To trigger a step when something earlier has failed, developers use the if: ${{ failure() }} condition. This expression evaluates to true when any previous step in the same job has failed, not only the immediately preceding one. Steps without an explicit status check run only while everything before them has succeeded, which is why ordinary steps are skipped after a failure while a step guarded by failure() still runs.
This capability is particularly useful for automation tasks that must react to errors. For instance, a workflow might need to create a GitHub issue automatically to alert the team when a deployment fails. In such a scenario, the workflow is structured to include a step designed to fail, followed by a step configured to run only upon that failure.
The following example demonstrates a workflow where a step intentionally fails by exiting with code 1. The subsequent step is conditioned to run only if the previous step failed, using the GitHub API to create an issue in the repository:
```yaml
on: [push]

jobs:
  FailJobIssueDemo:
    runs-on: ubuntu-latest
    # Needed to create issues when the repository's default token is read-only
    permissions:
      issues: write
    steps:
      - name: Step is going to pass
        run: echo Passing step
      - name: Step is going to fail
        run: exit 1   # a non-zero exit code marks the step as failed
      - name: Step To run on failure
        if: ${{ failure() }}
        run: |
          curl --request POST \
            --url https://api.github.com/repos/${{ github.repository }}/issues \
            --header 'authorization: Bearer ${{ secrets.GITHUB_TOKEN }}' \
            --header 'content-type: application/json' \
            --data '{
              "title": "Issue created due to workflow failure: ${{ github.run_id }}",
              "body": "This issue was automatically created by the GitHub Action workflow **${{ github.workflow }}** due to failure in run: _${{ github.run_id }}_."
            }'
```
In this configuration, the Step To run on failure is skipped when every earlier step succeeds. Because failure() reacts to a failure in any earlier step of the job, not just the immediate predecessor, a handler that should respond to one particular step needs an additional check on that step's result. A step guarded by failure() runs even though the job is already failing, so cleanup, rollback, or notification logic executes without any further configuration. The job is still reported as failed afterwards.
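Since failure() is true when any earlier step in the job has failed, a handler can be narrowed to a single step by also checking that step's result through the steps context. The following is a minimal sketch rather than a definitive pattern. The step id deploy and the scripts ./deploy.sh and ./rollback.sh are hypothetical names introduced for illustration.

```yaml
on: [push]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy
        id: deploy                # hypothetical id, referenced below
        run: ./deploy.sh          # hypothetical script
      - name: Roll back
        # failure() alone is true if ANY earlier step failed; the second
        # clause narrows the handler to the Deploy step.
        if: ${{ failure() && steps.deploy.conclusion == 'failure' }}
        run: ./rollback.sh        # hypothetical script
      - name: Clean up
        # always() runs the step whether earlier steps passed, failed,
        # or the run was cancelled.
        if: ${{ always() }}
        run: rm -rf ./tmp-artifacts
```

If the checkout step fails instead, the Deploy step is skipped, so the rollback condition is false and only the cleanup step runs.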
Managing Workflow Continuation with Continue-On-Error
Conditional steps such as if: ${{ failure() }} react to a failure, but the job is still marked as failed and every later step without a status check is skipped. When a step should be allowed to fail without failing the job, the continue-on-error property is used instead. It tells GitHub Actions to treat the failure of a specific step or job as non-blocking, so the remaining steps run normally. A step that fails under continue-on-error: true is reported to later steps as successful, which means failure() does not return true for it. The unmodified result remains available as steps.<step_id>.outcome.
The continue-on-error property can be applied at either the step level or the job level, and its effect differs with the scope. At the step level, it allows a specific task to fail without failing the job. At the job level, it prevents a failing job from failing the workflow run, which is commonly combined with a matrix to mark experimental configurations as non-blocking.
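As an illustration of the job-level form, the sketch below marks one matrix entry as experimental so that its failure does not fail the workflow run. The Node.js versions and the npm test command are assumptions chosen for the example, not requirements of the feature.

```yaml
on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    # A failure in an experimental matrix entry does not fail the workflow run
    continue-on-error: ${{ matrix.experimental }}
    strategy:
      fail-fast: false            # keep other matrix jobs running after a failure
      matrix:
        node: [18, 20]
        experimental: [false]
        include:
          - node: 22              # assumed "experimental" version
            experimental: true
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm test             # assumes the repository defines a test script
```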
This feature serves two primary purposes in advanced CI/CD strategies:
- Resilience and Rollback: Workflows can be designed to identify a deployment failure based on a failing step and then execute a rollback process in a subsequent step. A rollback step guarded by if: ${{ failure() }} already runs without continue-on-error, but the job then ends as failed. With continue-on-error: true on the deployment step, the rollback step checks steps.<step_id>.outcome instead, and the workflow decides for itself how the job should finish.
- Negative Testing: In testing scenarios, engineers may intentionally expect a step to fail to verify that the system correctly handles error conditions. By setting continue-on-error: true on the test step, the workflow can proceed to assert that the expected failure occurred (for example, by checking that the step's outcome is failure), rather than marking the entire test suite as a failure.
Using continue-on-error effectively decouples the failure of a specific operation from the termination of the workflow, allowing for complex error-handling logic such as cleanup routines or compensating transactions to execute reliably.
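The rollback case might look like the following sketch. It relies on the steps context exposing two results for each step: outcome is the result before continue-on-error is applied, and conclusion is the result after. The step id and the scripts ./deploy.sh and ./rollback.sh are hypothetical.

```yaml
on: [push]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy
        id: deploy                # hypothetical id, referenced below
        continue-on-error: true   # a failure here does not fail the job
        run: ./deploy.sh          # hypothetical script
      - name: Roll back
        # conclusion is 'success' because of continue-on-error, so the
        # raw result has to be read from outcome.
        if: ${{ steps.deploy.outcome == 'failure' }}
        run: ./rollback.sh        # hypothetical script
      - name: Report the failed deployment
        # Optional: fail the job explicitly once the rollback has run
        if: ${{ steps.deploy.outcome == 'failure' }}
        run: exit 1
```

A negative test would invert the final check: the last step would exit with a non-zero code when steps.<step_id>.outcome is not failure.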
Re-running Failed Workflows and Jobs via GitHub CLI
When a workflow fails due to transient issues or external dependencies, it is often necessary to re-run the failed components rather than the entire pipeline. GitHub provides robust mechanisms for re-running workflows and specific jobs through the GitHub Command Line Interface (CLI). The gh run rerun subcommand is the primary tool for this operation.
To re-run a failed workflow run, the gh run rerun subcommand is used, replacing RUN_ID with the ID of the failed run that needs to be executed again. If the run-id is not specified, the GitHub CLI returns an interactive menu, allowing the user to select a recent failed run from the history.
```bash
gh run rerun RUN_ID
```
For more targeted re-runs, specific flags can be employed. To re-run only the failed jobs within a specific workflow run, the --failed flag is used. This ensures that successful jobs from the previous run are not re-executed, saving time and resources.
```bash
gh run rerun RUN_ID --failed
```
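To re-run a specific job within a workflow run, the --job flag is used with the ID of the job in place of the run ID, allowing for granular control over which components are re-executed. The job ID is the job's numeric database ID, which can be listed with gh run view RUN_ID --json jobs. It is not necessarily the number shown in the job's browser URL.

```bash
gh run rerun --job JOB_ID
```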
Additionally, diagnostic information is crucial when troubleshooting re-runs. To enable runner diagnostic logging and step debug logging for the re-run, the --debug flag can be appended to any of the above commands. This provides deeper insight into the execution environment and step behavior during the retry.
```bash
gh run rerun RUN_ID --failed --debug
```
To monitor the progress of the re-run in real-time, the gh run watch subcommand can be used, which allows the user to select the run from an interactive list and view its status as it executes.
```bash
gh run watch
```
Re-running Failed Workflows and Jobs via the GitHub Web Interface
For users who prefer a graphical interface or are not comfortable with command-line tools, GitHub provides a comprehensive web interface for managing workflow runs. The process begins by navigating to the main page of the repository and selecting the Actions tab. From the left sidebar, the specific workflow of interest is selected, followed by clicking on the name of the specific run from the list of workflow runs to view the summary.
The re-run options are presented in the upper-right corner of the workflow view. The actions available depend on the state of the jobs within the run:
- Re-run failed jobs: If any jobs in the workflow run failed, the user can select the Re-run jobs dropdown menu and click Re-run failed jobs. This action only re-executes the jobs that did not complete successfully.
- Re-run all jobs: If no jobs failed, or if the user wishes to re-execute the entire workflow, the Re-run all jobs button is available. This can be selected directly if no failures are present, or via the dropdown menu if failures exist but a full re-run is desired.
Regardless of the re-run type chosen, users have the option to enable additional logging. By selecting Enable debug logging before clicking Re-run jobs, runner diagnostic logging and step debug logging are activated for the re-run. This feature is invaluable for diagnosing intermittent issues that may not be apparent in standard output logs.
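A workflow can also react to debug logging itself. The runner context sets runner.debug to 1 only when debug logging is enabled, so extra diagnostics can be limited to debug re-runs. The sketch below assumes a Linux runner, and the diagnostic commands are examples only.

```yaml
on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Dump extra diagnostics
        # runner.debug is set (to 1) only when debug logging is enabled
        if: ${{ runner.debug == '1' }}
        run: |
          printenv | sort         # environment as seen by the step
          df -h                   # free disk space on the runner
```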
Technical Constraints and Execution Context of Re-runs
Understanding the underlying mechanics of workflow re-runs is critical for managing permissions and ensuring consistency. When a workflow is re-run, it does not adopt the privileges of the actor who initiated the re-run. Instead, it uses the privileges of the actor who originally triggered the workflow. This security model ensures that re-runs maintain the original context and permissions, preventing privilege escalation through retry mechanisms.
Furthermore, the re-run preserves the original git context. The workflow uses the same GITHUB_SHA (commit SHA) and GITHUB_REF (git ref) of the original event that triggered the initial workflow run. This ensures that the code being tested or deployed remains consistent between the failed attempt and the retry, avoiding version drift issues that could complicate debugging.
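The preserved context can be observed from inside a run through the default environment variables. In the sketch below, the job and step names are arbitrary. On a re-run, the attempt number increases while the commit SHA and ref stay the same, and the triggering actor may differ from the original actor.

```yaml
on: [push]

jobs:
  show-context:
    runs-on: ubuntu-latest
    steps:
      - name: Print re-run context
        run: |
          echo "Attempt:          $GITHUB_RUN_ATTEMPT"
          echo "Commit SHA:       $GITHUB_SHA"
          echo "Git ref:          $GITHUB_REF"
          echo "Original actor:   $GITHUB_ACTOR"
          echo "Re-run requester: $GITHUB_TRIGGERING_ACTOR"
```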
There are also strict limits on the number of times a workflow can be re-run. A single workflow run can be re-run a maximum of 50 times. This limit applies to the aggregate of all re-run actions, including both full re-runs and re-runs of a subset of jobs. This constraint prevents infinite loop scenarios and manages resource consumption on GitHub's hosted runners.
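Given the cap on re-runs, any automated retry should carry its own, much lower bound. The following is one possible sketch of a helper workflow that re-runs the failed jobs of another workflow, not a recommended configuration. The watched workflow name "CI" and the threshold of three attempts are assumptions, and the sketch relies on the workflow_run payload exposing run_attempt and on the token having actions: write permission.

```yaml
name: Retry failed runs           # hypothetical helper workflow
on:
  workflow_run:
    workflows: ["CI"]             # assumed name of the workflow to watch
    types: [completed]

jobs:
  rerun:
    # Retry only failures, and stop once the third attempt has failed
    if: ${{ github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.run_attempt < 3 }}
    runs-on: ubuntu-latest
    permissions:
      actions: write              # required to request a re-run
    steps:
      - name: Re-run failed jobs
        env:
          GH_TOKEN: ${{ github.token }}        # the gh CLI reads its token from GH_TOKEN
          GH_REPO: ${{ github.repository }}    # no checkout, so name the repository explicitly
        run: gh run rerun ${{ github.event.workflow_run.id }} --failed
```

Automatic retries suit transient failures such as network timeouts. For deterministic failures they only consume runner time, so the attempt guard should stay small.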
| Re-run Method | Command/Action | Description |
|---|---|---|
| CLI Full Re-run | gh run rerun RUN_ID | Re-runs the entire workflow run. Interactive menu if ID is omitted. |
| CLI Failed Jobs | gh run rerun RUN_ID --failed | Re-runs only the jobs that failed in the specified run. |
| CLI Specific Job | gh run rerun --job JOB_ID | Re-runs a specific job, identified by its job ID instead of a run ID. |
| CLI Debug Mode | --debug flag | Enables runner diagnostic and step debug logging for the re-run. |
| Web Failed Jobs | Re-run failed jobs | UI option to re-run only failed jobs in the upper-right corner. |
| Web All Jobs | Re-run all jobs | UI option to re-run the entire workflow, regardless of previous status. |
Conclusion
Effective error handling in GitHub Actions requires a combination of conditional logic, workflow continuity features, and strategic re-run capabilities. By leveraging if: ${{ failure() }}, developers can implement precise failure-dependent steps, such as automated issue creation or cleanup tasks, mirroring the control flow capabilities found in other CI/CD platforms like Azure DevOps. The continue-on-error feature further enhances resilience, allowing workflows to proceed through expected failures or to execute rollback procedures after deployment errors. Finally, the robust re-run mechanisms available via both the GitHub CLI and the web interface provide efficient ways to retry failed operations, complete with detailed debug logging and strict privilege preservation. Together, these tools enable the creation of sophisticated, self-healing CI/CD pipelines that can manage complex deployment scenarios and testing requirements with precision and reliability.