Orchestrating Data Flow in GitHub Actions: Variables, Artifacts, and Matrix Strategies

The architecture of GitHub Actions workflows presents a fundamental structural challenge for DevOps engineers and software developers: the isolation of execution environments. By design, GitHub Actions workflows are segmented into discrete units of work, creating boundaries that prevent the natural sharing of state between different stages of a pipeline. When a shell variable is set inside a step, the smallest unit of work within a workflow, that variable remains accessible only within that step. Once the next step begins, the variable is gone. Passing it to a subsequent step in the same job is possible through specific mechanisms, but passing it to another job, which may execute on an entirely different builder or runner, does not happen by default. This isolation is a deliberate security and scalability feature, but it creates significant friction when complex DevOps projects need to propagate dynamic data, such as environment values populated with different names, prefixes, or suffixes, across the workflow lifecycle.

Understanding the Hierarchy of Execution

To effectively manage data flow, one must first understand the hierarchical structure of GitHub Actions terminology, as these terms define the scope and availability of variables. A GitHub Actions workflow is a YAML-encoded file that resides within the same repository where the work is performed, under the .github/workflows directory. The logic defined in these files is flexible, though the mechanism for passing information can be unintuitive. The workflow definition specifies the name of the workflow, the list of triggers that initiate it, and the list of jobs to be executed.

Within this structure, a workflow defines one or more jobs. Each job is a list of steps, and jobs can be configured to execute concurrently or in series. Crucially, jobs can be assigned to different builders or to the same builder. Steps are the individual tasks executed by the job. While a visual representation might show a job with only one step, a typical job contains many. Steps are often run blocks that execute command-line commands, such as bash scripts on an ubuntu-latest system, or they may call other actions from the GitHub Actions Marketplace. The marketplace hosts numerous pre-built actions from creators and tech companies that perform complex operations with minimal inputs. Filtering for "Verified Creators" indicates that an action comes from a verified publisher, typically a large tech company with code testing and bug programs, but it does not guarantee trustworthiness, and for simple tasks, building custom actions is often preferable.
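As a rough illustration of this hierarchy, the following sketch shows a workflow with a name, a trigger, one job, and two steps: one that calls a published action and one that runs a shell command. The file name, workflow name, job name, and step names are hypothetical, and actions/checkout appears only as a familiar example of a published action.

```yaml
# Hypothetical file: .github/workflows/example.yml
# Workflow level: name and trigger
name: Hierarchy example
on: workflow_dispatch
jobs:
  # Job level: "build" is a hypothetical job name
  build:
    # The runner (builder) this job is assigned to
    runs-on: ubuntu-latest
    steps:
      # Step that calls a published action
      - name: Check out the repository
        uses: actions/checkout@v4
      # Step that runs a command-line command
      - name: Print a message
        run: echo "Hello from a run step"
```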

At the most granular level is the variable. In programming terms, a variable is a named container that stores a value. For example, declaring FOO=BAR creates a variable named FOO. When a command such as echo "$FOO is the best" is executed, the value BAR is substituted for $FOO, resulting in the output BAR is the best. Understanding this basic concept is essential, as the behavior of variables changes drastically depending on whether they are being shared within a step, across steps in the same job, or between distinct jobs.
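The step-level scope of a plain shell variable can be seen in a small sketch. The workflow and job names below are hypothetical. Each run step starts its own shell process, so FOO exists in the first step and is unset in the second.

```yaml
name: Shell variable scope
on: workflow_dispatch
jobs:
  scope-demo:
    runs-on: ubuntu-latest
    steps:
      - name: Set and use FOO in one step
        run: |
          FOO=BAR
          echo "$FOO is the best"
      - name: Try to read FOO in the next step
        run: |
          # A new shell process runs this step, so FOO is unset here
          # and the quotes below print as empty.
          echo "FOO is: '${FOO:-}'"
```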

Sharing Data Within a Job: The GITHUB_ENV Mechanism

The default behavior of GitHub Actions allows data to flow freely between steps within the same job because these steps operate on the same file system. This shared file system enables the use of special GitHub "environment files" to pass variables between steps. This method is particularly useful when environment values are dynamic and need to be populated with varying configurations, such as different prefixes or suffixes.

To pass a variable between steps, the GITHUB_ENV environment file is utilized. This file allows a step to write key-value pairs that become available to subsequent steps in the same job. The process involves appending the variable assignment to the $GITHUB_ENV file. For instance, to set an ENVIRONMENT variable to dev, a step would execute a bash command that appends the string ENVIRONMENT=dev to the file. It is critical to use the append operator (>>) rather than the overwrite operator (>). The append operator ensures that the new variable is added to the file without destroying previously written variables, allowing multiple variables to be written to the file as long as they are appended correctly.
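A minimal sketch of writing more than one variable follows. It is a fragment of a job's steps list, not a complete workflow, and NAME_PREFIX and its value are hypothetical names chosen for illustration.

```yaml
steps:
  - name: Write several variables
    run: |
      # >> appends, so both lines end up in the environment file.
      echo "ENVIRONMENT=dev" >> "$GITHUB_ENV"
      echo "NAME_PREFIX=team-a" >> "$GITHUB_ENV"
      # Using > on the second line instead would discard the line
      # written just before it in this step.
  - name: Read them in a later step
    run: |
      echo "Environment: $ENVIRONMENT"
      echo "Prefix: $NAME_PREFIX"
```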

The following code snippet demonstrates this mechanism in a workflow named "Passing variable Fixed". The workflow is triggered by workflow_dispatch and contains a single job named pass-var running on ubuntu-latest.

```yaml
name: Passing variable Fixed
on: workflow_dispatch
jobs:
  pass-var:
    runs-on: ubuntu-latest
    steps:
      - name: Set deploy location
        id: set-var
        run: |
          ENVIRONMENT=dev
          echo "The ENVIRONMENT var is: $ENVIRONMENT"
          echo "ENVIRONMENT=$ENVIRONMENT" >> $GITHUB_ENV
      - name: Read deploy location
        id: read-var
        run: |
          if [ -z "$ENVIRONMENT" ]; then
            echo "No ENVIRONMENT var set, exiting"
            exit 1
          else
            echo "The ENVIRONMENT var is set to: $ENVIRONMENT"
          fi
```

In this example, the first step, Set deploy location, sets the local variable ENVIRONMENT to dev, prints its value, and then appends ENVIRONMENT=dev to the $GITHUB_ENV file. The second step, Read deploy location, checks if the ENVIRONMENT variable is empty. If it is, the step exits with an error code. If it is not empty, it prints the value. Because the first step wrote to $GITHUB_ENV, the second step can successfully read the ENVIRONMENT variable, demonstrating that variables set in one step are accessible to subsequent steps within the same job. This approach is significantly cleaner than the complex model of using step outputs to feed job outputs and then task inputs, which is another valid but more cumbersome method provided by GitHub.
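For comparison, the step-output alternative, restricted here to a single job, might look like the following sketch. It assumes the GITHUB_OUTPUT environment file and the steps context; the output name environment and the workflow name are hypothetical.

```yaml
name: Passing variable with step outputs
on: workflow_dispatch
jobs:
  pass-var:
    runs-on: ubuntu-latest
    steps:
      - name: Set deploy location
        id: set-var
        run: |
          # Write a step output named "environment" (hypothetical name).
          echo "environment=dev" >> "$GITHUB_OUTPUT"
      - name: Read deploy location
        env:
          # Map the step output into this step's environment.
          ENVIRONMENT: ${{ steps.set-var.outputs.environment }}
        run: |
          echo "The ENVIRONMENT var is set to: $ENVIRONMENT"
```

Every consuming step has to reference the output explicitly, which is part of why the environment-file approach reads as cleaner when many steps need the same value.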

The Isolation Barrier: Passing Data Between Jobs

While sharing data within a job is straightforward due to the shared file system, passing data between jobs presents a more complex challenge. Jobs run in parallel by default, executing as soon as a suitable runner is found. If multiple runners with matching labels are available, the jobs in the workflow will execute simultaneously. More importantly, jobs can be assigned to different builders. This means that a job running on one runner has no direct access to the file system or environment variables of another job running on a different runner. The isolation barrier prevents the direct propagation of environment variables from one job to another.

To illustrate this limitation, consider a workflow with two jobs. The first job, pass-var, sets the ENVIRONMENT variable using the $GITHUB_ENV method described above. A second job, dependent on the first, attempts to read this variable. If the second job simply tries to access $ENVIRONMENT, it will fail because the variable was never written to the $GITHUB_ENV file of the second job's runner. The state is not automatically transferred. This is the default behavior, and it requires specific mechanisms to overcome.
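The failing scenario could be sketched as follows. The second job's name, read-var, is an assumption; the source only describes it as a job that depends on the first.

```yaml
name: Passing variable between jobs (fails)
on: workflow_dispatch
jobs:
  pass-var:
    runs-on: ubuntu-latest
    steps:
      - name: Set deploy location
        run: |
          echo "ENVIRONMENT=dev" >> "$GITHUB_ENV"
  read-var:
    # needs makes this job wait for pass-var,
    # but it does not copy that job's environment.
    needs: pass-var
    runs-on: ubuntu-latest
    steps:
      - name: Read deploy location
        run: |
          if [ -z "$ENVIRONMENT" ]; then
            echo "No ENVIRONMENT var set, exiting"
            exit 1
          else
            echo "The ENVIRONMENT var is set to: $ENVIRONMENT"
          fi
```

In this sketch the read-var job takes the first branch and exits with an error, because ENVIRONMENT was only ever defined for the pass-var job.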

GitHub's recommended approach combines outputs with the dependency relationship between jobs. A step defines an output, the job exposes it as a job output, and a downstream job that declares a dependency on the first job reads the value through the needs context. This creates a chain of dependency and data flow. However, this method can become unwieldy, especially when dealing with multiple variables or when the data is complex, such as a map of environment values.
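A minimal sketch of this chain, assuming the GITHUB_OUTPUT environment file, the job-level outputs key, and the needs context, might look like this. The job names and the output name environment are hypothetical.

```yaml
name: Passing variable with job outputs
on: workflow_dispatch
jobs:
  set-var:
    runs-on: ubuntu-latest
    # Expose the step output as a job output.
    outputs:
      environment: ${{ steps.set-var.outputs.environment }}
    steps:
      - name: Set deploy location
        id: set-var
        run: |
          echo "environment=dev" >> "$GITHUB_OUTPUT"
  read-var:
    # Declare the dependency so the needs context is populated.
    needs: set-var
    runs-on: ubuntu-latest
    steps:
      - name: Read deploy location
        env:
          ENVIRONMENT: ${{ needs.set-var.outputs.environment }}
        run: |
          echo "The ENVIRONMENT var is set to: $ENVIRONMENT"
```

Each additional variable needs its own entry under outputs and its own reference in the consuming job, which is where the approach starts to feel heavy.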

Alternative Strategies: Artifacts and Third-Party Actions

To address the limitations of the native output/input model, developers often turn to alternative strategies, including the use of artifacts and third-party actions. One effective method is to store dynamic environment values in a file and upload that file as an artifact. Artifacts have a special meaning in CI/CD pipelines: they are files, typically compiled binaries or other build products, that are produced by automation and made accessible to downstream automation. This method is particularly useful when environment values are dynamic and need to be passed between jobs that may run on different builders.

In this approach, a step writes the environment variables to a literal file on the builder's disk, such as env.vars. The name of this file is arbitrary. Multiple variables can be written to this file by piping each assignment through tee with the append flag (tee -a). Once the file is populated, the actions/upload-artifact@v3 action from the GitHub Marketplace (or a newer major version such as v4, since v3 has been deprecated) is used to upload the file as an artifact. This action, published by GitHub under the name actions, ensures that the file is stored in a location accessible to subsequent jobs. Downstream jobs can then use the matching actions/download-artifact action to retrieve the file and parse the variables back into their environment. This method bypasses the complexity of nested outputs and inputs and provides a clean, file-based mechanism for sharing state.
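The artifact round trip could be sketched as follows. The job names, the artifact name env-vars, and the NAME_PREFIX variable are hypothetical, and the sketch pins the v4 major versions of both actions as an assumption about what is current; adjust the version to whatever the workflow actually pins. Loading the file by appending it to $GITHUB_ENV is one simple way to parse it back and assumes every line is a single-line KEY=value pair.

```yaml
name: Passing variables with an artifact
on: workflow_dispatch
jobs:
  write-vars:
    runs-on: ubuntu-latest
    steps:
      - name: Write variables to a file
        run: |
          # tee -a appends each line to env.vars and echoes it to the log.
          echo "ENVIRONMENT=dev" | tee -a env.vars
          echo "NAME_PREFIX=team-a" | tee -a env.vars
      - name: Upload the file as an artifact
        uses: actions/upload-artifact@v4
        with:
          name: env-vars
          path: env.vars
  read-vars:
    needs: write-vars
    runs-on: ubuntu-latest
    steps:
      - name: Download the artifact
        uses: actions/download-artifact@v4
        with:
          name: env-vars
      - name: Load the variables into this job's environment
        run: |
          cat env.vars >> "$GITHUB_ENV"
      - name: Use the variables
        run: |
          echo "The ENVIRONMENT var is set to: $ENVIRONMENT"
```

Because the log echoes each line, this pattern is only appropriate for non-secret values.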

Another strategy involves using third-party actions designed specifically for persisting data between jobs. For example, the nick-fields/persist-action-data@v1 action allows data to be shared between jobs and accessed via environment variables and step outputs. This action provides a simplified interface for persisting and retrieving data. It supports optional parameters for the data to persist, the variable name to use for access in other jobs, and a comma-delimited list of variables to load into a job.

The following configuration demonstrates the usage of the persist-action-data action:

```yaml
- uses: nick-fields/persist-action-data@v1
  with:
    data: ${{ steps.some-step.outputs.some-output }}
    variable: SOME_STEP_OUTPUT
```

To retrieve the data in a subsequent job, the action can be used with the retrieve_variables parameter:

```yaml
- uses: nick-fields/persist-action-data@v1
  with:
    data: ${{ steps.some-step.outputs.some-output }}
    retrieve_variables: SOME_STEP_OUTPUT, SOME_OTHER_STEP_OUTPUT
```

The retrieved variables can then be accessed in a subsequent step using standard environment variable syntax:

```yaml
- run: echo $SOME_STEP_OUTPUT
```

Alternatively, the action can be used with an id to access the outputs directly:

```yaml
- uses: nick-fields/persist-action-data@v1
  id: global-data
  with:
    data: ${{ steps.some-step.outputs.some-output }}
    retrieve_variables: SOME_STEP_OUTPUT, SOME_OTHER_STEP_OUTPUT
- run: echo ${{ steps.global-data.outputs.SOME_STEP_OUTPUT }}
```

It is worth noting that the ownership of the persist-action-data project was transferred from the work account nick-invision to the personal account nick-fields in February 2022 due to the author leaving InVision. The author remains the primary maintainer, and the transfer is handled seamlessly by GitHub, ensuring that existing workflow references continue to function without immediate action required from users.

Conclusion

The ability to pass data between steps and jobs is a critical requirement for building robust and flexible GitHub Actions workflows. While the default isolation of jobs ensures security and scalability, it necessitates the use of specific mechanisms to share state. For data sharing within a job, the $GITHUB_ENV file provides a simple and effective solution, leveraging the shared file system of the runner. For data sharing between jobs, which may execute on different runners, more sophisticated strategies are required. These include the native output/input model, the use of artifacts to store and retrieve files containing environment variables, and third-party actions designed for data persistence. Each method has its own advantages and trade-offs, and the choice depends on the complexity of the data, the structure of the workflow, and the specific requirements of the project. Understanding these mechanisms and their underlying architecture is essential for DevOps engineers and developers who seek to build efficient and reliable CI/CD pipelines.