The production of academic papers, technical reports, and assignments using LaTeX often descends into a manual, fragmented process. In the traditional workflow the author runs a series of commands by hand to generate bibliographies, tables of contents, and cross-references, and the result is often a document that is neither exercisable nor complete. Such an approach makes it virtually impossible to reproduce a document from its raw code and data, which violates the core principles of reproducible research. A Continuous Integration (CI) pipeline built on GitLab CI turns document creation from a manual chore into an automated, version-controlled engineering process. By treating a LaTeX document like a software project, authors can ensure that every commit to the repository triggers an automated build, which validates that the document still compiles and meets strict open science standards.
The fundamental objective of this architecture is to create a repository that is fully reproducible down to the last pixel. This means that any third party, given the repository, should be able to execute a single script and generate the exact same PDF output. This level of rigor is essential for open science, allowing other researchers to analyze and criticize the work based on the actual data and code used to generate the final output. By utilizing GitLab CI, the burden of compilation is shifted from the local machine to a GitLab Runner—a specialized agent that executes the defined build scripts in an isolated environment. This ensures that "it works on my machine" is no longer a valid excuse, as the document is validated in a clean, standardized Docker container every time a change is pushed.
The Architecture of GitLab Continuous Integration for LaTeX
At its core, GitLab CI operates based on a simple trigger mechanism. When a user pushes a commit to a repository hosted on GitLab, the system automatically scans the root directory for a specific configuration file named .gitlab-ci.yml. This file serves as the blueprint for the entire automation process, containing the precise commands that the GitLab Runner must execute.
The GitLab Runner is the engine of the CI pipeline. Runners vary widely in implementation, from a dedicated physical server in a user's room to short-lived virtual machines deployed in the cloud. For users on the free tier, GitLab provides access to shared runners hosted on Amazon Web Services (AWS). These free runners require zero local configuration, but they are subject to a compute time restriction, given here as 2000 minutes per month; GitLab has revised this quota over time, so the current figure should be checked. For extremely large documents or those with heavy figure generation requirements, users must therefore be mindful of their total build time to avoid exhausting their monthly quota.
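As a rough sketch of how build time might be bounded in the configuration itself, the fragment below uses GitLab CI's interruptible and timeout keywords. The 15-minute value is an assumption chosen for illustration and would need tuning to the document's real build time.

```yaml
# Illustrative defaults applied to every job in the pipeline.
default:
  # Allow running jobs to be cancelled when a newer pipeline starts on the
  # same branch, so superseded commits stop consuming compute minutes.
  # This relies on the project's auto-cancel setting for redundant pipelines.
  interruptible: true
  # Abort any job that runs longer than this (assumed value).
  timeout: 15 minutes
```

Both keywords can also be set on individual jobs, which may be preferable if only the figures job is expensive.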
To maintain a clean repository, it is highly recommended to utilize a .gitignore file based on the GitLab TeX template. This prevents the repository from being cluttered with auxiliary files generated during the LaTeX compilation process, such as .aux, .log, .out, and .toc files, ensuring that only the source code and essential data are versioned.
Multi-Stage Pipeline Design for Document Production
A robust LaTeX pipeline is not a single step but a series of orchestrated stages. By dividing the process into stages, developers can isolate different failure points and optimize the build speed through caching. A comprehensive pipeline typically consists of three distinct phases: figures, build, and test.
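The three phases can be declared with the top-level stages keyword of .gitlab-ci.yml. This is a minimal sketch; the stage names are taken from the text, and every job then refers to one of them through its stage key.

```yaml
# Stages run in the order listed. By default, the jobs of a stage start
# only after all jobs of the previous stage have succeeded.
stages:
  - figures
  - build
  - test
```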
The Figures Stage: Data-Driven Visualization
The first stage of the pipeline focuses on the generation of plots and figures. In a reproducible research workflow, figures should not be static images uploaded to the repository; instead, they should be generated from raw data using scripts.
In this specific implementation, the official python:3.8 Docker image is used to ensure a consistent Python environment. To prevent the pipeline from wasting time reinstalling dependencies on every single run, a caching strategy is employed. By defining the PIP_CACHE_DIR variable as $CI_PROJECT_DIR/.cache/pip and specifying the paths for .cache/pip and venv/, the pipeline can persist the virtual environment across different jobs. This is critical because installing heavy data science libraries can be time-consuming and can consume a significant portion of the 2000-minute free tier limit.
The execution flow for the figures stage is as follows:
- The Python version is verified using python -V.
- The virtualenv package is installed via pip.
- A virtual environment is created using virtualenv venv.
- The environment is activated via source venv/bin/activate.
- The figures are generated by executing make -C figures.
The output of this stage is saved as an artifact with untracked: true and an expiration period of one week. This ensures that the generated images are available for the subsequent build stage without needing to be re-generated.
The Build Stage: Compiling the LaTeX Document
Once the figures are ready, the pipeline moves to the build stage. This stage requires a Docker image that contains a full TeX Live distribution and the necessary build tools. The image martisak/texlive2020 is utilized here, providing a stable environment based on TeX Live 2020.
The primary tool used for compilation is latexmk. Unlike standard pdflatex commands, latexmk is a robust build automation tool that automatically determines how many times the document needs to be compiled to resolve all cross-references and bibliographies. While other alternatives like rubber or latexrun exist, latexmk is preferred due to its stability and its pre-installation in the chosen Docker image.
To trigger the build, the pipeline uses GNU make. The command make pdf is executed within the container, which in turn invokes latexmk. This stage lists the figures stage as a dependency, ensuring that the PDF is not compiled until the required images have been successfully generated. Like the previous stage, all untracked files (including log files) are preserved as artifacts for one week to assist in debugging if the build fails.
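For readers who do not have the project's Makefile, the job below sketches what the build stage might look like if latexmk were called directly instead of through make pdf. The file name main.tex and the chosen flags are assumptions; the source only states that make pdf invokes latexmk.

```yaml
# Sketch only: a variant of the compile job that bypasses make.
compile:
  image: martisak/texlive2020
  stage: build
  script:
    # -pdf builds with pdflatex; latexmk reruns it (and the bibliography
    # tool) until cross-references settle. main.tex is a hypothetical name.
    - latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
  dependencies:
    - figures               # fetch the figures produced by the earlier stage
```

Keeping the call behind a make target, as the pipeline described here does, means the same entry point can be used both locally and in CI.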
The Test Stage: PDF Unit Testing
The final stage of the pipeline is the test phase, which checks the quality and correctness of the final output. Rather than simply checking that the PDF exists, this stage runs unit tests on the compiled PDF document to verify it against a set of known requirements. This helps authors pass publisher PDF checks by catching errors, such as missing citations or broken links, automatically before the document is submitted.
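The source does not show the configuration of this stage, so the following is only a sketch of how such a job might be declared. The job name test_pdf, the choice of the python:3.8 image, and the make test target are all assumptions made for illustration.

```yaml
# Hypothetical test job; name, image, and make target are not from the source.
test_pdf:
  image: python:3.8         # assumption: the PDF tests are written in Python
  stage: test
  script:
    - make test             # assumed target that runs the PDF unit tests
  dependencies:
    - compile               # fetch the PDF artifact produced by the build stage
```

Because the job sits in the last stage, it only runs once the compile job has produced a PDF to inspect.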
Local Development and the Preview Workflow
While the GitLab CI pipeline provides the final validation, authors need a way to see changes in near real-time without committing every single typo fix to the git history. This is achieved through a local build system that mirrors the CI environment.
To achieve a "live preview" effect, latexmk can be run with the -pvc (preview and continuously update) flag. This puts the tool into a mode where it monitors the source files for changes and automatically recompiles the document the moment a file is saved. The command to initiate this locally is:
make clean render LATEXMK_OPTIONS_EXTRA=-pvc
For a fully reproducible local setup, the author can run a Docker container that mounts the local working directory and executes make pdf inside the container. This ensures that the local build environment is identical to the GitLab Runner environment, eliminating discrepancies between local and remote builds.
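One way to express this container setup is a Docker Compose file, sketched below. The file itself, the service name latex, and the /work mount point are assumptions; the sketch also assumes that the image does not define an entrypoint that interferes with the command.

```yaml
# docker-compose.yml (hypothetical file; the service name is illustrative)
services:
  latex:
    image: martisak/texlive2020   # same image as the CI build stage
    working_dir: /work            # assumed mount point inside the container
    volumes:
      - .:/work                   # mount the local working directory
    command: make pdf             # same target the CI compile job runs
```

With this file in the repository root, docker compose run --rm latex would start the build and remove the container afterwards.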
Technical Specification Summary
The following table details the technical components used in the automated LaTeX pipeline:
| Component | Tool/Image | Purpose |
|---|---|---|
| Version Control | Git / GitLab | Tracking changes and triggering CI |
| Figure Generation | python:3.8 | Creating publication-ready plots from data |
| Environment Management | virtualenv | Isolating Python dependencies so they can be cached |
| Build Automation | GNU make | Coordinating build steps and targets |
| LaTeX Compilation | latexmk | Automating multi-pass PDF compilation |
| Docker Image (TeX) | martisak/texlive2020 | Providing a consistent TeX Live 2020 environment |
| CI Config File | .gitlab-ci.yml | Defining the pipeline stages and scripts |
| Compute Resource | GitLab Runner (AWS) | Executing the pipeline in the cloud |
Implementation Configuration
The following code blocks represent the configuration required to implement this pipeline.
The .gitlab-ci.yml configuration for the figure generation stage:
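```yaml
# Assumes a stage named figures is declared under the top-level stages key.
variables:
  # Keep pip's download cache inside the project so GitLab can cache it.
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

cache:
  # One cache per stage and branch.
  key: "$CI_JOB_STAGE-$CI_COMMIT_REF_SLUG"
  paths:
    - .cache/pip
    - venv/

figures:
  image: python:3.8
  stage: figures
  before_script:
    - python -V               # print the Python version for debugging
    - pip install virtualenv
    - virtualenv venv
    - source venv/bin/activate
  script:
    - make -C figures
  artifacts:
    untracked: true           # upload all files that Git does not track
    expire_in: 1 week
```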
The .gitlab-ci.yml configuration for the compilation stage:
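```yaml
compile:
  image: martisak/texlive2020
  stage: build
  script:
    - make pdf
  dependencies:
    - figures                 # download the artifacts of the figures job
  artifacts:
    untracked: true
    expire_in: 1 week
    # on_success (the default) uploads artifacts only when the job succeeds;
    # when: always would also keep the logs of failed builds.
    when: on_success
```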
Analysis of Pipeline Efficacy and Reproducibility
The transition from manual LaTeX compilation to a GitLab CI-driven pipeline represents a paradigm shift in document preparation. By enforcing a strict separation between data generation (Python), document assembly (LaTeX), and validation (Unit Testing), the author creates a transparent chain of custody for the information presented in the final PDF.
One of the most significant advantages of this system is the elimination of "hidden" dependencies. In a manual workflow, a document might only compile because the author has a specific package installed in their local TeX distribution that is not documented. By using the martisak/texlive2020 image, the environment is codified. If the document fails to compile in the CI pipeline, it indicates that a dependency is missing from the environment, forcing the author to resolve the issue for all future users.
Furthermore, the use of latexmk combined with make provides a layer of abstraction that simplifies the user experience. Whether the build is happening on a local machine via make render or on a remote runner via make pdf, the underlying logic remains the same. This consistency is the cornerstone of the "exercisable and complete" repository.
The integration of caching for Python virtual environments demonstrates a sophisticated understanding of CI resource management. Given the 2000-minute limit on free GitLab runners, optimizing the before_script to avoid redundant pip install cycles is not just a matter of speed, but a necessity for project sustainability.
In conclusion, the synergy of GitLab CI, Docker, and latexmk transforms LaTeX from a markup language into a professional document engineering pipeline. This approach ensures that the final output is a direct and reproducible result of the source code and data, fulfilling the highest standards of open science and technical documentation.