Orchestrating Data Flow: Strategies for Sharing Files and State Between GitHub Actions Jobs

In modern continuous integration and continuous delivery (CI/CD) pipelines, passing data between isolated execution environments is a core requirement. GitHub Actions breaks a workflow into individual jobs, each running on its own runner instance, often with a different operating system or configuration. Because each job is isolated, environment variables set in one job are not automatically available to others, and file systems do not persist across job boundaries. Consequently, developers must use specific mechanisms to share data, whether that data consists of build artifacts, test results, or intermediate state variables. The two primary methods for achieving this data persistence are artifacts and the caching system. Understanding the distinct roles, performance implications, and implementation details of these mechanisms is essential for designing efficient and reliable GitHub Actions workflows.

The Isolation Problem and Job Dependencies

A fundamental constraint in GitHub Actions is that jobs run in isolated environments. When a job completes, the runner is typically destroyed or reset, meaning any files created or environment variables set during that job are lost unless explicitly preserved. A common misconception is that environment variables can be declared in one job and accessed globally by subsequent jobs. While workflow-level environment variables exist, they cannot be dynamically modified by a step in one job and then read by a step in a dependent job without the use of specific output mechanisms or external storage.

To facilitate data sharing, jobs must be structured with explicit dependencies. The needs keyword defines the execution order and ensures that a job only runs after its prerequisite jobs have completed successfully. For example, if job_2 requires data produced by job_1, the workflow definition must include needs: job_1 within the job_2 configuration. This dependency ensures that the artifact or cache produced by job_1 is available for job_2 to consume. Without this dependency, jobs may run in parallel, leading to race conditions where a downstream job attempts to download data that has not yet been uploaded.
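The ordering described above can be sketched as a minimal workflow. The workflow name, trigger, and step contents are hypothetical; only the job IDs job_1 and job_2 and the needs keyword come from the text.

```yaml
# Hypothetical workflow illustrating job ordering with `needs`.
name: needs-example
on: push

jobs:
  job_1:
    runs-on: ubuntu-latest
    steps:
      - name: Produce data
        run: echo "produced by job_1"

  job_2:
    # job_2 starts only after job_1 has finished successfully.
    needs: job_1
    runs-on: ubuntu-latest
    steps:
      - name: Consume data
        run: echo "job_1 has completed"
```

Note that needs only controls ordering. It does not copy any files or variables from job_1 to job_2; that still requires one of the sharing mechanisms discussed in this article.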

Using Artifacts for Persistent Data Sharing

Artifacts are the standard mechanism for sharing files between jobs in a workflow. An artifact is a file or set of files that is uploaded when the upload step runs (not automatically at the end of the job) and can then be downloaded by subsequent jobs. Once uploaded, an artifact is immutable. The actions/upload-artifact and actions/download-artifact actions are the core tools for this process. Artifacts are particularly well-suited for storing build outputs, test reports, logs, and other data that must be preserved after the job runner is terminated.

The process begins with the upload step. In a given job, after the necessary files are generated, the actions/upload-artifact action is invoked. This action requires a path parameter specifying which files or directories to include. The name parameter, which identifies the artifact, is optional: if no name is specified, the artifact defaults to the name artifact. The immutability of artifacts ensures that once uploaded, the data remains unchanged, providing a reliable source of truth for downstream processes.

```yaml
- name: Upload math result for job 1
  uses: actions/upload-artifact@v4
  with:
    name: homework_pre
    path: math-homework.txt
```

In a downstream job, the actions/download-artifact action retrieves the data. The name parameter must match the name specified during upload. Upon download, the files are restored to the working directory. If the goal is to download all artifacts generated in a workflow run, the name parameter can be omitted. In this case, the action creates a directory for each artifact using its name, allowing the workflow to process multiple distinct data sets.

```yaml
- name: Download math result for job 1
  uses: actions/download-artifact@v5
  with:
    name: homework_pre
```
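For the download-everything case, a fragment of a job's steps might look like the following. This is a sketch only: the all-artifacts directory is a hypothetical name, and the version tag is an assumption chosen to pair with actions/upload-artifact@v4.

```yaml
# Fragment of a job definition; belongs under jobs.<job_id>.
steps:
  - name: Download all artifacts from this run
    uses: actions/download-artifact@v4
    # No `name` input is given, so every artifact in the run is downloaded.
    with:
      path: all-artifacts   # hypothetical target directory
  - name: List what was downloaded
    # Each artifact should appear in its own subdirectory, e.g. all-artifacts/homework_pre/
    run: ls -R all-artifacts
```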

This mechanism supports cross-platform data sharing. For instance, a job running on ubuntu-latest can generate a file, upload it as an artifact, and a subsequent job running on macos-latest can download and process that same file. This capability is vital for multi-OS testing strategies where a build generated on Linux must be tested on macOS.
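A minimal sketch of this cross-platform handoff, reusing the homework_pre artifact and math-homework.txt file from the snippets above. The workflow name, trigger, and the command that generates the file are assumptions for illustration.

```yaml
name: cross-os-artifact
on: push

jobs:
  job_1:
    runs-on: ubuntu-latest
    steps:
      - name: Generate the file on Linux
        run: expr 1 + 1 > math-homework.txt
      - name: Upload math result for job 1
        uses: actions/upload-artifact@v4
        with:
          name: homework_pre
          path: math-homework.txt

  job_2:
    # Wait for the upload to finish before attempting the download.
    needs: job_1
    runs-on: macos-latest
    steps:
      - name: Download math result for job 1
        uses: actions/download-artifact@v4
        with:
          name: homework_pre
      - name: Read the file on macOS
        run: cat math-homework.txt
```

Plain data files transfer cleanly this way. Compiled binaries are a different matter, since a Linux executable will not run on macOS regardless of how it was delivered.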

Leveraging Caches for Performance Optimization

While artifacts are designed for data persistence across jobs, caching is optimized for speed and reusability across workflow runs or within a single run. The actions/cache action allows developers to store large directories, such as node_modules or build artifacts, in a remote storage system. Subsequent jobs or workflow runs can retrieve this data without regenerating it from scratch.

Caching is often faster than passing the same data through artifacts. Artifacts must be uploaded to GitHub's storage service and then downloaded again, a process that can take several minutes for large files. A cache is looked up by key and, on a hit, can typically be restored in seconds, although actual timings depend on the size of the data and on network conditions. This performance difference becomes critical in workflows with large dependencies or complex build processes.

The actions/cache action requires two primary parameters: path and key. The path specifies the files or directories to cache, while the key is a unique identifier used to store and retrieve the cache. The key is often constructed using dynamic expressions, such as the runner operating system, a custom cache name, and a hash of dependency files like package-lock.json. This ensures that the cache is invalidated only when the dependencies change, preventing stale data issues.

```yaml
- name: Cache node modules
  uses: actions/cache@v4
  env:
    cache-name: cache-node-modules
  with:
    path: ~/.npm
    key: ${{ runner.os }}-build-${{ env.cache-name }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-build-${{ env.cache-name }}-
      ${{ runner.os }}-build-
      ${{ runner.os }}-
```

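The restore-keys parameter allows for partial cache hits. If an exact match for the key is not found, GitHub Actions searches for existing caches whose keys begin with one of the prefixes listed in restore-keys, trying the prefixes in order. This fallback mechanism ensures that even if the exact dependency hash does not match, a recent cache can be restored. Because the exact key was not matched, a fresh cache is then saved under the new key when the job completes, which further reduces build times on later runs.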
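Whether the lookup produced an exact hit can be checked through the action's cache-hit output. The fragment below is a sketch: the step id cache-npm and the echo message are hypothetical, and the key is a shortened variant of the one shown earlier.

```yaml
# Fragment of a job definition; belongs under jobs.<job_id>.
steps:
  - uses: actions/checkout@v4
  - name: Cache node modules
    id: cache-npm
    uses: actions/cache@v4
    with:
      path: ~/.npm
      key: ${{ runner.os }}-build-${{ hashFiles('**/package-lock.json') }}
      restore-keys: |
        ${{ runner.os }}-build-
  - name: Report a cache miss
    # cache-hit is 'true' only when the exact key matched.
    if: steps.cache-npm.outputs.cache-hit != 'true'
    run: echo "No exact cache match; some dependencies may be downloaded again"
  - name: Install dependencies
    run: npm ci
```

The install step runs unconditionally here because ~/.npm holds npm's download cache rather than an installed node_modules tree, so npm ci is still needed after a hit.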

Comparative Analysis: Artifacts vs. Caches

Choosing between artifacts and caches depends on the specific requirements of the data being shared. The primary distinction lies in their intended use cases and performance characteristics.

Artifacts are immutable and designed for data that must be preserved for the duration of the workflow or for post-execution analysis. They are ideal for storing test results, build logs, and final release packages. Because artifacts are uploaded to a persistent storage service, they can be accessed after the workflow completes, which is useful for debugging or compliance purposes. However, the overhead of uploading and downloading artifacts can be significant. For large files, upload times can exceed five minutes, and download times can take two minutes or more.

Caches, on the other hand, are designed for performance optimization rather than preservation. An individual cache entry cannot be modified once it has been saved under a key, but the cache as a whole is disposable: it is ideal for reusing dependencies, build intermediates, or other data that can be regenerated if the cache is missing. Caches are not intended for long-term storage; entries that have not been accessed for 7 days are automatically evicted. The speed advantage of caching is substantial, with retrieval times often measured in seconds.

A hybrid approach can be employed to leverage the strengths of both methods. For example, a workflow can use caching to quickly restore dependencies for a build job, then upload the final build output as an artifact. This ensures that the build process is fast, while the final result is preserved for subsequent jobs or for release.

```yaml
- uses: actions/checkout@v4
- uses: actions/download-artifact@v4
  with:
    name: my-artifact
```
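A fuller sketch of the hybrid pattern, showing both the cached build and the artifact handoff. The job names, the dist/ output directory, and the npm build script are assumptions about the project, not details from the text above.

```yaml
name: hybrid-example
on: push

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Cache: speeds up dependency installation, safe to lose.
      - name: Restore npm cache
        uses: actions/cache@v4
        with:
          path: ~/.npm
          key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
      - name: Build
        run: |
          npm ci
          npm run build   # assumes a `build` script that writes to dist/
      # Artifact: the result that downstream jobs depend on.
      - name: Upload build output
        uses: actions/upload-artifact@v4
        with:
          name: my-artifact
          path: dist/

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: my-artifact
          path: dist/
      - name: Inspect the downloaded build
        run: ls -R dist/
```

The division of labor is deliberate: the workflow still succeeds if the cache is missing, but the deploy job cannot proceed without the artifact.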

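In scenarios where a job needs to share a version string or another small value, neither mechanism is a good fit: a cache entry is not guaranteed to exist when the downstream job runs, and an artifact adds file-transfer overhead for a few bytes of data. Job outputs are better suited to that case. For complex data structures or files that must be audited, artifacts provide the more robust solution.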

Advanced Data Sharing Techniques

Beyond the standard artifact and cache mechanisms, there are additional techniques for sharing data in GitHub Actions. One such method involves using job outputs. The jobs.<job_id>.outputs syntax allows a job to declare outputs that can be consumed by dependent jobs. This is particularly useful for passing small pieces of data, such as version numbers or release URLs, without the overhead of file uploads.

For example, a step can append a name=value line to the file whose path is stored in the GITHUB_OUTPUT environment variable. This sets a step output, which the job must then map to a job output under jobs.<job_id>.outputs. A subsequent job that lists the producing job in needs can access this output using the ${{ needs.<job_id>.outputs.<output_name> }} syntax. This method is efficient for passing simple strings or numbers between jobs.
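The output flow can be sketched as follows. The step id set_version, the output name version, and the value 1.2.3 are hypothetical placeholders.

```yaml
name: outputs-example
on: push

jobs:
  job_1:
    runs-on: ubuntu-latest
    # Map the step output to a job output so dependent jobs can read it.
    outputs:
      version: ${{ steps.set_version.outputs.version }}
    steps:
      - id: set_version
        # Append a name=value line to the file that GITHUB_OUTPUT points to.
        run: echo "version=1.2.3" >> "$GITHUB_OUTPUT"

  job_2:
    needs: job_1
    runs-on: ubuntu-latest
    steps:
      - name: Use the value from job_1
        env:
          VERSION: ${{ needs.job_1.outputs.version }}
        run: echo "Version from job_1 is $VERSION"
```

Passing the value through env, rather than interpolating the expression directly into the script, keeps the shell command free of untrusted text substitution.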

Another technique involves third-party actions. For instance, the dsaltares/fetch-gh-release-asset action can be used to fetch files from GitHub Releases. This is useful when data needs to be shared across different workflows or repositories. However, such actions may have limitations, such as only running on Linux, which necessitates the use of artifacts to bridge the gap between different operating systems.

Conclusion

Sharing data between jobs in GitHub Actions requires a careful balance of performance, reliability, and flexibility. Artifacts provide a robust, immutable solution for preserving build outputs and test results, ensuring that data is available for downstream processing and post-execution analysis. Caches offer a high-performance alternative for reusing dependencies and intermediates, significantly reducing build times. By understanding the strengths and limitations of each approach, developers can design workflows that are both efficient and resilient. Whether passing a simple version string or a complex build artifact, the choice of mechanism should align with the specific requirements of the workflow and the constraints of the underlying infrastructure.