Integrating GitLab Runners with Kubernetes moves Continuous Integration and Continuous Delivery (CI/CD) infrastructure away from static, virtualized environments and toward a dynamic, container-orchestrated model. In traditional setups, organizations often relied on a fixed number of Docker runners executing inside OpenStack virtual machines, provisioned with tools like docker-machine. Once docker-machine was deprecated, leaving only a GitLab-maintained fork, a more scalable and resilient architecture became necessary. For large organizations such as CERN, the surge in licensed users and the exponential growth in pipeline executions prompted a transition to the Kubernetes executor. This transition brings the core advantages of the Kubernetes ecosystem: reliability, rapid scalability, and high availability. With the Kubernetes executor, the runner creates a pod for each job on demand, so compute resources are used efficiently and the infrastructure can grow in step with user activity.
Architectural Transition from Legacy Docker Runners to Kubernetes
The migration from legacy virtual machine-based runners to a Kubernetes-native infrastructure is a complex process that requires a phased approach to ensure service continuity and stability. The transition strategy employed by large-scale deployments typically involves a multi-stage rollout designed to minimize disruption and gather operational intelligence.
The migration process begins with the establishment of the Kubernetes clusters and the registration of the runners within the GitLab instance. To facilitate a controlled transition, a temporary tag, such as k8s-default, is created. This allows administrators to offer an opt-in mechanism where a randomly selected group of users can test the new infrastructure. This phase is critical for troubleshooting the Kubernetes executor in a real-world environment and acquiring the necessary institutional know-how before a wider rollout.
As confidence in the stability of the Kubernetes runners increases, the strategy shifts toward accepting untagged jobs. In the GitLab CI/CD ecosystem, untagged jobs are those that do not specify a required runner tag in the .gitlab-ci.yml file. By allowing Kubernetes runners to pick up these jobs, the infrastructure begins to handle a larger portion of the total workload. During this intermediary phase, legacy Docker runners are maintained as a fallback mechanism. This ensures that if a user experiences an error while utilizing the new Kubernetes-based executor, they can revert to the Docker runners by explicitly using the docker tag in their configuration, thereby preventing a total blockage of the development pipeline.
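The three routing cases described above can be sketched in a `.gitlab-ci.yml` file. This is an illustrative fragment rather than the configuration of any particular deployment: the job names and the `make test` command are hypothetical, while `k8s-default` and `docker` are the tags named in the text.

```yaml
# Hypothetical jobs illustrating how tags route work during the migration.

# Opt-in phase: only runners registered with the temporary tag pick this up.
test-on-kubernetes:
  tags:
    - k8s-default
  script:
    - make test

# Intermediate phase: no tags, so any runner that accepts untagged jobs
# (now including the Kubernetes runners) may run it.
test-untagged:
  script:
    - make test

# Fallback: pin the job to the legacy Docker runners if it fails on Kubernetes.
test-on-legacy-docker:
  tags:
    - docker
  script:
    - make test
```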
The final phase of the migration is the decommissioning of the legacy infrastructure. Once the Kubernetes runners have proven their stability and the majority of the workload has shifted, the old runner tags are removed. This forces any remaining users who had not yet migrated to adopt the new infrastructure, effectively consolidating all CI/CD operations onto the Kubernetes cluster.
Infrastructure Scaling and Quality Assurance
To maintain a high standard of service reliability, sophisticated GitLab deployments implement a tiered infrastructure strategy. This includes the use of a dedicated QA infrastructure to verify and assure the quality of the service before any release is pushed to production.
The QA instance operates with its own set of runners. These runners serve as a testing ground for new configurations, updates, and executor versions. In emergency situations where the production environment faces extreme demand or failure, these QA runners can be registered to the production instance. This allows the GitLab infrastructure to be scaled up immediately, providing an elastic buffer that absorbs spikes in job submissions and keeps the CI/CD pipeline operational under very heavy load. When multiple clusters are running simultaneously, jobs are spread across them because each registered runner independently requests work from GitLab, which distributes the load across the available compute nodes.
Technical Configuration of the Kubernetes Executor
The configuration of a GitLab Runner for Kubernetes is primarily handled via TOML files or through the GitLab Runner Operator. The configuration defines how the runner interacts with the cluster, how it manages pod resources, and how it handles the lifecycle of the build containers.
Core Runner Configuration
The primary configuration involves defining the executor type and the specific parameters for the Kubernetes environment.
```toml
[[runners]]
name = "kubernetes-runner"
executor = "kubernetes"
[runners.kubernetes]
namespace = "gitlab-runner"
image = "alpine:latest"
pull_policy = ["if-not-present"]
privileged = false
```
In this configuration, the namespace defines where the job pods will be executed, and the image specifies the default container image used for jobs that do not define their own. The pull_policy of if-not-present optimizes performance by using cached images on the node when available. The privileged flag is set to false by default for security reasons; however, it must be enabled if Docker-in-Docker (DinD) is required, despite the inherent security risks associated with privileged containers.
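For the Docker-in-Docker case mentioned above, a job might look like the following sketch. It assumes the runner has been reconfigured with `privileged = true`; the job name, the image tags, and the image name `my-image` are placeholders, and TLS is disabled only to keep the example short.

```yaml
# Sketch of a Docker-in-Docker job; requires privileged = true on the runner.
build-image:
  image: docker:24.0.5
  services:
    - docker:24.0.5-dind
  variables:
    # Talk to the dind service over plain TCP (no TLS) for brevity.
    DOCKER_HOST: tcp://docker:2375
    DOCKER_TLS_CERTDIR: ""
  script:
    - docker info
    - docker build -t my-image .
```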
Resource Management and Pod Constraints
Proper resource allocation is vital to prevent a single runaway job from destabilizing the entire Kubernetes node. This is achieved through a combination of requests and limits for the runner manager, the job pods, the service containers, and the helper containers.
| Component | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Job Pod | 500m | 2 | 1Gi | 4Gi |
| Service Container | N/A | 1 | N/A | 1Gi |
| Helper Container | N/A | 500m | N/A | 256Mi |
| Runner Manager | 100m | 500m | 128Mi | 256Mi |
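
One way the values in the table could be expressed, assuming the runner is deployed with the GitLab Runner Helm chart: the chart's top-level `resources` key covers the runner manager, and the embedded TOML under `runners.config` covers the containers of each job pod. Treat it as a sketch; the figures are simply those from the table.

```yaml
# Helm values sketch (assumes the gitlab/gitlab-runner chart).
# Runner manager pod:
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi

runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        # Build container of each job pod
        cpu_request = "500m"
        cpu_limit = "2"
        memory_request = "1Gi"
        memory_limit = "4Gi"
        # Service containers (limits only, as in the table)
        service_cpu_limit = "1"
        service_memory_limit = "1Gi"
        # Helper container (limits only, as in the table)
        helper_cpu_limit = "500m"
        helper_memory_limit = "256Mi"
```
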
Beyond raw resource limits, advanced scheduling is handled via pod affinity and anti-affinity rules. To prevent the runner manager from being co-located with too many job pods on a single node, the following configuration can be used:
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: gitlab-runner
          topologyKey: "kubernetes.io/hostname"
```
Security Contexts and Identity Management
Managing user and group identities within a Kubernetes pod is critical for security and file permission management. GitLab provides a robust mechanism to override the default security context at multiple levels: the pod, the build container, the helper container, and the service container.
Administrators can define an allowlist of permitted users and groups to prevent unauthorized identity escalation.
```toml
[runners.kubernetes]
allowed_users = ["1000", "1001", "65534"]
allowed_groups = ["1001", "65534"]
```
To provide granular control, security contexts can be defined as follows:
```toml
# Pod-level default: applies to every container unless overridden below
[runners.kubernetes.pod_security_context]
run_as_user = 1500
run_as_group = 1500
# Per-container overrides
[runners.kubernetes.build_container_security_context]
run_as_user = 2000
run_as_group = 2001
[runners.kubernetes.helper_container_security_context]
run_as_user = 3000
run_as_group = 3001
[runners.kubernetes.service_container_security_context]
run_as_user = 4000
run_as_group = 4001
```
In this hierarchy, the pod_security_context sets the global default (1500:1500). However, the specific container contexts (build, helper, and service) override these defaults. A critical feature of this implementation is that users specified within these security contexts (e.g., 2000, 3000, 4000) bypass the allowed_users allowlist validation. This grants administrators unrestricted override control, ensuring that the runtime environment can be precisely tuned without being blocked by the general security policy.
For individual jobs that require a specific user, the .gitlab-ci.yml file can be configured to override the image settings:
```yaml
job:
  image:
    name: alpine:latest
    kubernetes:
      user: "1000"
  script:
    - whoami
    - id
```
Network Security and Secret Management
The dynamic nature of Kubernetes runners requires strict network policies to prevent lateral movement within the cluster and to secure the communication between the runner pods and the GitLab API.
Network Policy Implementation
A restrictive NetworkPolicy should be applied to job pods to ensure that they only communicate with necessary endpoints. The following configuration implements a "default deny" for inbound traffic and explicitly allows outbound traffic for DNS, the GitLab API, and the container registry.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gitlab-runner-jobs
  namespace: gitlab-runner
spec:
  podSelector:
    matchLabels:
      app: gitlab-runner-job
  policyTypes:
    - Ingress
    - Egress
  # Default deny: no inbound traffic is allowed
  ingress: []
  egress:
    # DNS lookups (any namespace)
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # HTTPS, used to reach the GitLab API
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
    # Container registry
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 5000
```
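
The policy only applies to pods carrying the `app: gitlab-runner-job` label, so the runner has to attach that label to the job pods it creates. A minimal sketch, assuming a Helm-based deployment and using the executor's `pod_labels` table:

```yaml
# Helm values sketch: label job pods so the NetworkPolicy selector matches them.
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "gitlab-runner"
        [runners.kubernetes.pod_labels]
          "app" = "gitlab-runner-job"
```
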
Advanced Secrets Handling
Passing secrets as environment variables is often discouraged due to the risk of leaking credentials in logs. Instead, Kubernetes runners can mount secrets as volumes or use projected volumes for complex credential management.
To mount a standard secret as a volume:
```toml
[[runners.kubernetes.volumes.secret]]
name = "ci-secrets"
mount_path = "/secrets"
read_only = true
secret_name = "gitlab-ci-secrets"
```
For scenarios requiring multiple configuration files, such as Docker and NPM credentials, projected volumes are used:
```toml
[[runners.kubernetes.volumes.projected]]
name = "credentials"
mount_path = "/credentials"
[[runners.kubernetes.volumes.projected.sources.secret]]
name = "docker-config"
items = [
  { key = "config.json", path = "docker/config.json" }
]
[[runners.kubernetes.volumes.projected.sources.secret]]
name = "npm-config"
items = [
  { key = ".npmrc", path = "npm/.npmrc" }
]
```
Deployment Methodologies: Operator vs. Helm
There are two primary ways to deploy GitLab Runners on Kubernetes: using the GitLab Runner Operator or using the Helm chart.
The GitLab Runner Operator
The Operator is ideal for users who want the lifecycle of the runner to be managed automatically by Kubernetes. It utilizes Custom Resource Definitions (CRDs) to define the desired state of the runner.
To install the operator:
```bash
helm repo add gitlab https://charts.gitlab.io
helm repo update
helm install gitlab-runner-operator gitlab/gitlab-runner-operator \
  --namespace gitlab-runner \
  --create-namespace
```
Once the operator is active, a Runner resource is created. This requires a secret containing the registration token, which is generated from the GitLab UI (Settings > CI/CD > Runners).
```bash
kubectl create secret generic gitlab-runner-secret \
  --namespace gitlab-runner \
  --from-literal=runner-registration-token="YOUR_REGISTRATION_TOKEN"
```
The Runner CRD configuration then links the operator to the GitLab instance:
```yaml
apiVersion: apps.gitlab.com/v1beta2
kind: Runner
metadata:
  name: gitlab-runner
  namespace: gitlab-runner
spec:
  gitlabUrl: https://gitlab.com
  token: gitlab-runner-secret
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "gitlab-runner"
        image = "alpine:latest"
```
Helm Deployment
While the Operator is efficient for lifecycle management, the Helm chart is preferred for production environments that require maximum control over the deployment parameters and a more traditional approach to configuration management.
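A minimal `values.yaml` for the Helm route might look like the sketch below. The token is a placeholder, the `concurrent` value is arbitrary, and the key names assume a recent version of the `gitlab/gitlab-runner` chart. It would typically be applied with something like `helm install gitlab-runner gitlab/gitlab-runner --namespace gitlab-runner -f values.yaml`.

```yaml
# values.yaml sketch for the gitlab/gitlab-runner chart (placeholder values).
gitlabUrl: https://gitlab.com
runnerToken: "YOUR_RUNNER_TOKEN"
# Maximum number of jobs the runner manager handles at once (arbitrary here).
concurrent: 10
rbac:
  create: true
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "gitlab-runner"
        image = "alpine:latest"
        privileged = false
```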
Operational Pitfalls and Troubleshooting
The migration to Kubernetes runners is not without its challenges. One of the most common issues encountered during large-scale transitions is the failure of network diagnostic tools. Specifically, ping is disabled by default in many Kubernetes environments due to the lack of required capabilities in the pod's security context. This can lead to job failures for pipelines that rely on network connectivity checks.
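If a pipeline genuinely needs `ping`, one possible workaround is to grant the job containers the `NET_RAW` capability through the executor's `cap_add` setting. The sketch below assumes a Helm-based deployment and that the cluster's pod security rules permit the capability; whether granting it is acceptable is a security decision for each site.

```yaml
# Helm values sketch: allow raw sockets in job containers so ping can run.
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        # Re-add the capability that ICMP raw sockets need.
        cap_add = ["NET_RAW"]
```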
Furthermore, the transition from tagged to untagged jobs requires careful monitoring. By allowing Kubernetes runners to accept untagged jobs while maintaining a legacy Docker runner for tagged jobs, administrators can gather insights into failure patterns. This phased approach allows the team to solve "tricky" problems before the final decommissioning of the legacy system, ensuring that the end-user experience remains seamless.
Conclusion
The shift to GitLab Runners on Kubernetes transforms CI/CD from a static resource allocation problem into a dynamic orchestration capability. By implementing a phased migration strategy—starting with opt-in tags, moving to untagged job acceptance, and finally decommissioning legacy Docker runners—organizations can mitigate risk and ensure stability. The technical depth of the Kubernetes executor, from its granular security contexts that bypass allowlist validation to its sophisticated resource limiting and network isolation policies, provides a secure and scalable foundation. Whether deployed via the Operator for automated lifecycle management or via Helm for maximum control, the Kubernetes executor allows for an elastic infrastructure that can scale up through dedicated QA clusters to meet the demands of a growing user base. The integration of projected volumes for secrets and strict NetworkPolicies ensures that this scalability does not come at the expense of security, resulting in a robust, industrial-grade CI/CD pipeline capable of supporting the most demanding research and development environments.