Distributed Tracing Architectures via Jaeger and OpenTelemetry on Kubernetes

Microservices observability has shifted from legacy, tool-specific tracing protocols toward unified telemetry standards. In a complex distributed system, such as a Spacelift installation running in a Kubernetes cluster, understanding the path a request takes through the server, drain, and scheduler components is essential for maintaining availability and performance. Jaeger, an open-source distributed tracing platform, is a widely used backend for receiving, storing, and visualizing these traces. Where older tracing setups often combined separate tools for storage, processing, and visualization, Jaeger provides them as one integrated system. This matters in Kubernetes environments, where ephemeral pods and rapid scaling call for an automated way to manage telemetry pipelines. The move from Jaeger v1 to Jaeger v2 is an architectural change: in v1 the OpenTelemetry Collector was a separate component that exported to Jaeger, whereas v2 is built natively on the OpenTelemetry Collector framework.

The Architecture of a Modern Telemetry Pipeline

A functional telemetry pipeline within a Kubernetes environment is not a single monolithic entity but a coordinated sequence of data transmission and processing stages. To achieve full observability, especially for critical infrastructure like Spacelift, the architecture must be structured into three distinct functional layers.

The first layer consists of the data producers, which in the context of a Spacelift deployment are the Spacelift applications themselves. These applications—specifically the server, drain, and scheduler components—are responsible for emitting trace data as requests move through the system. These traces capture the "spans" or individual units of work that constitute a complete trace, allowing engineers to see exactly where latency occurs or where a process fails.

The second layer is the processing engine, fulfilled by the OpenTelemetry Collector. The Collector acts as a high-performance relay. Its primary role is to receive incoming telemetry data from the applications via standardized protocols, process that data (which can include filtering, batching, or enriching the metadata), and then forward it to the appropriate backend. This decoupling is vital because it allows the application code to remain agnostic of the backend storage, ensuring that changing the tracing backend does not require a redeployment of the core business logic.

The final layer is the storage and visualization backend, provided by Jaeger. Jaeger serves the dual purpose of acting as the ingestion point for the Collector and providing a comprehensive web-based User Interface (UI) for engineers to query and inspect the traces.

Component	Primary Function	Role in Spacelift Context
Spacelift Applications	Trace Emission	Generates raw spans for server, drain, and scheduler
OpenTelemetry Collector	Data Relay & Processing	Receives traces from apps and forwards to Jaeger
Jaeger	Storage & Visualization	Persists traces and provides the Web UI
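
The middle layer can be made concrete with a Collector configuration. The following is a minimal sketch, not Spacelift's or Jaeger's shipped configuration: a shell helper that prints an OpenTelemetry Collector config which accepts OTLP from the applications, batches spans, and forwards them over OTLP to Jaeger. The function name render_collector_config, the default endpoint jaeger-collector.observability.svc:4317, and the plaintext connection are assumptions made for illustration.

```shell
#!/bin/sh
# Print a minimal OpenTelemetry Collector config: OTLP in, batch, OTLP out.
# $1 is the Jaeger OTLP gRPC endpoint. The default is a hypothetical
# in-cluster Service name, not something the article defines.
render_collector_config() (
  jaeger_endpoint="${1:-jaeger-collector.observability.svc:4317}"
  cat <<EOF
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}
exporters:
  otlp:
    endpoint: ${jaeger_endpoint}
    tls:
      insecure: true   # assumption: plaintext traffic inside the cluster
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
EOF
)
```

Calling render_collector_config with a different endpoint is the only change needed to point the applications at another backend, which is the decoupling the second layer is meant to provide.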

Deploying the Jaeger Operator in Kubernetes

Managing Jaeger manually in a dynamic Kubernetes environment is inefficient and error-prone. The Jaeger Operator simplifies this by implementing the Operator pattern, automating the deployment and lifecycle management of Jaeger instances.

Before beginning the installation of the Jaeger Operator, the underlying Kubernetes cluster must have an ingress-controller deployed. This is a prerequisite because the Jaeger UI is typically served through an Ingress resource to allow external access. For developers using minikube, this can be accomplished by enabling the built-in ingress add-on with the following command:

minikube start --addons=ingress

Once the ingress-controller is operational, the Jaeger Operator can be deployed into its designated namespace, typically the observability namespace. After the deployment is established and the jaeger-operator pod is in a Running state, users can define a Jaeger instance using a Custom Resource (CR).
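
The operator installation can be sketched as a small shell helper. This is an illustration under stated assumptions rather than the official procedure: the function name install_jaeger_operator is hypothetical, the manifest location is supplied by the caller (use the release that matches the cluster), and the Deployment is assumed to be named jaeger-operator. Recent operator releases may also expect cert-manager to be present, so check the release notes first.

```shell
#!/bin/sh
# Install the Jaeger Operator from a release manifest and wait for it.
# $1: URL or path of the operator manifest (caller-supplied).
# $2: namespace, defaulting to observability.
# KUBECTL can be overridden, e.g. KUBECTL="echo kubectl" for a dry run.
install_jaeger_operator() (
  manifest="${1:?usage: install_jaeger_operator <manifest-url-or-path> [namespace]}"
  namespace="${2:-observability}"
  kubectl_cmd="${KUBECTL:-kubectl}"

  # Create the namespace only if it does not exist yet.
  $kubectl_cmd get namespace "$namespace" >/dev/null 2>&1 ||
    $kubectl_cmd create namespace "$namespace" || exit 1

  # Apply the operator manifest, then block until the Deployment is rolled out.
  $kubectl_cmd apply -n "$namespace" -f "$manifest" || exit 1
  $kubectl_cmd rollout status deployment/jaeger-operator -n "$namespace" --timeout=180s
)
```

Running it with KUBECTL="echo kubectl" prints the commands instead of executing them, which is a cheap way to review the steps before touching a cluster.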

To create a simple instance named simplest in the observability namespace, the following YAML configuration is applied:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest

The command to apply this configuration is:

kubectl apply -n observability -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
EOF

Upon successful application, Kubernetes will provision the necessary pods. Users can verify the existence of the instance by querying the custom resources:

kubectl get -n observability jaegers

To investigate the underlying pods used by the simplest instance, the following command is utilized:

kubectl get pods -n observability -l app.kubernetes.io/instance=simplest

Managing Ingress and UI Access

For development and testing, the Jaeger UI can be accessed through an Ingress controller. When the simplest instance is running, the Ingress resource can be inspected using:

kubectl get -n observability ingress

The output will reveal the hostname and address, for example:

NAME             HOSTS   ADDRESS          PORTS   AGE
simplest-query   *       192.168.122.34   80      3m

In this scenario, the Jaeger UI is accessible via http://192.168.122.34. For production environments, however, a bare IP address over plain HTTP or a port-forward is insufficient. Production deployments should expose the Jaeger UI through a Kubernetes Ingress with a proper hostname, a LoadBalancer service, or a reverse proxy such as Nginx or Caddy. Whichever method is chosen should include TLS termination and authentication so that sensitive trace data remains secure.
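
A TLS-terminating Ingress for the UI might look like the sketch below. It is a starting point under assumptions, not a hardened setup: the helper name render_jaeger_ui_ingress, the Ingress name jaeger-ui, the hostname, and the TLS Secret name are hypothetical, and it assumes the operator exposes the UI through a Service named simplest-query on port 16686. Authentication is not covered here and would typically be added at the ingress controller or a proxy in front of it.

```shell
#!/bin/sh
# Print an Ingress manifest that serves the Jaeger UI over TLS.
# $1: public hostname, $2: name of an existing TLS Secret (both hypothetical).
# Assumes a Service named simplest-query listening on port 16686.
render_jaeger_ui_ingress() (
  host="${1:?usage: render_jaeger_ui_ingress <host> <tls-secret>}"
  tls_secret="${2:?usage: render_jaeger_ui_ingress <host> <tls-secret>}"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: jaeger-ui
  namespace: observability
spec:
  tls:
    - hosts: ["${host}"]
      secretName: ${tls_secret}
  rules:
    - host: ${host}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: simplest-query
                port:
                  number: 16686
EOF
)
```

The output can be reviewed and then piped to kubectl apply -f - once the hostname and Secret exist.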

Jaeger v2 and the OpenTelemetry Paradigm Shift

The transition from Jaeger v1 to Jaeger v2 marks a significant milestone in the evolution of distributed tracing. On November 12, 2024, Jaeger v2 was released as a major upgrade, specifically designed to leverage the OpenTelemetry Collector framework. This move was driven by the need for better interoperability and to benefit both the Jaeger and OpenTelemetry ecosystems.

The fundamental difference between versions lies in the internal architecture of the collector. In the Jaeger v1 architecture, the OpenTelemetry Collector was an external component that treated Jaeger as one of many possible export destinations. In contrast, Jaeger v2 is fully integrated with the OpenTelemetry Collector, meaning the collector logic is a core part of the Jaeger deployment.

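Jaeger v2 is designed to be deployed on Kubernetes using the OpenTelemetry Operator, providing a unified management experience for users of both technologies. The Jaeger Operator and the jaegertracing.io/v1 custom resource described earlier belong to the v1 deployment model.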

Configuration for In-Memory Storage

For rapid development, debugging, or quick demonstrations, Jaeger v2 supports an "all-in-one" in-memory storage mode. This mode is highly efficient for testing as it does not require an external database, but it comes with a significant caveat: all trace data is lost if the pod restarts.

To configure an in-memory instance, the Collector configuration must specify memstore as the trace storage. A snippet of the configuration logic would look like this:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  jaeger_storage_exporter:
    trace_storage: memstore
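
For context, the fragment above normally sits inside a larger resource managed by the OpenTelemetry Operator. The sketch below wraps it in an OpenTelemetryCollector custom resource. Several details are assumptions to verify against the operator and Jaeger versions in use: the API version opentelemetry.io/v1beta1, the image tag, the jaeger_query and jaeger_storage extension layout, the max_traces value, and the helper name render_inmemory_jaeger. The default resource name is chosen so that the operator's -collector suffix yields the jaeger-inmemory-instance-collector name used in the port-forward commands that follow.

```shell
#!/bin/sh
# Print an OpenTelemetryCollector resource that runs Jaeger v2 with
# in-memory storage. $1: resource name (defaults to jaeger-inmemory-instance).
# Field layout is an assumption based on the OpenTelemetry Operator pattern.
render_inmemory_jaeger() (
  name="${1:-jaeger-inmemory-instance}"
  cat <<EOF
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: ${name}
spec:
  image: jaegertracing/jaeger:latest
  ports:
    - name: jaeger
      port: 16686
  config:
    service:
      extensions: [jaeger_storage, jaeger_query]
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [jaeger_storage_exporter]
    extensions:
      jaeger_query:
        storage:
          traces: memstore
      jaeger_storage:
        backends:
          memstore:
            memory:
              max_traces: 100000
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      jaeger_storage_exporter:
        trace_storage: memstore
EOF
)
```

Piping the output to kubectl apply -f - would create the instance, provided the OpenTelemetry Operator is already installed in the cluster.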

To interact with the UI of an in-memory deployment, users must port-forward the deployment or the service to their local machine. The following command facilitates this:

kubectl port-forward deployment/jaeger-inmemory-instance-collector 8080:16686

Or, if using the service:

kubectl port-forward service/jaeger-inmemory-instance-collector 8080:16686

Once the port-forwarding is active, navigating to localhost:8080 in a web browser will present the Jaeger UI.

Persistent Storage and Database Integration

In production environments, in-memory storage is unsuitable because traces do not survive a pod restart. A production Jaeger v2 deployment therefore needs a supported database so that traces persist across pod restarts and cluster events.

The deployment process for persistent storage involves several mandatory steps:

1. Deploy the database (e.g., Cassandra or Elasticsearch) as a dedicated deployment within the Kubernetes cluster.
2. Ensure the database pods reach a Ready state.
3. Create a Kubernetes Service to expose the database pods, allowing the Jaeger pods to communicate with them. This can be done manually or via imperative commands:

   kubectl expose deployment <deployment-name> --port=<port-number> --name=<name-of-the-service>

4. Update the Jaeger configuration to include the database endpoint.

For a Cassandra backend, the configuration fragment would be:

jaeger_storage:
  backends:
    some_storage:
      cassandra:
        connection:
          servers: [<name-of-the-service>]

For an Elasticsearch backend, the structure is similar:

jaeger_storage:
  backends:
    some_storage:
      elasticsearch:
        servers: [<name-of-the-service>]
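
The steps above can be tied together in a short helper. This is a sketch under assumptions: the function name expose_storage_backend is hypothetical, all names and ports are supplied by the caller, and the database is assumed to run as a Deployment, as described in the first step. kubectl progress goes to standard error so that standard output contains only the storage fragment.

```shell
#!/bin/sh
# Wait for a storage Deployment, expose it as a Service, and print the
# matching jaeger_storage fragment.
# $1: deployment name, $2: port, $3: service name,
# $4: backend (cassandra or elasticsearch, default cassandra).
# KUBECTL can be overridden, e.g. KUBECTL="echo kubectl" for a dry run.
expose_storage_backend() (
  deployment="${1:?deployment name required}"
  port="${2:?port required}"
  service="${3:?service name required}"
  backend="${4:-cassandra}"
  kubectl_cmd="${KUBECTL:-kubectl}"

  # Block until the database pods are ready.
  $kubectl_cmd rollout status "deployment/${deployment}" --timeout=300s >&2 || exit 1
  # Expose the pods through a Service.
  $kubectl_cmd expose deployment "$deployment" --port="$port" --name="$service" >&2 || exit 1

  # Emit the storage fragment that points Jaeger at the Service.
  case "$backend" in
    cassandra)
      printf 'jaeger_storage:\n  backends:\n    some_storage:\n      cassandra:\n        connection:\n          servers: [%s]\n' "$service" ;;
    elasticsearch)
      printf 'jaeger_storage:\n  backends:\n    some_storage:\n      elasticsearch:\n        servers: [%s]\n' "$service" ;;
    *)
      echo "unsupported backend: $backend" >&2
      exit 1 ;;
  esac
)
```

The printed fragment still has to be merged into the Jaeger configuration by hand or by whatever templating the deployment already uses.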

Advanced Configuration and Troubleshooting

The Jaeger Operator is highly flexible, allowing configuration through multiple channels. It is critical for administrators to understand the order of precedence to avoid configuration conflicts. When a parameter is defined at multiple levels, the precedence is as follows:

Command-line parameters (flags) take the highest priority.
Environment variables follow.
Configuration files have the lowest priority.
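
The ordering can be illustrated with a few lines of shell. This is a toy model of the precedence rule, not the operator's actual implementation; the function name resolve_setting and the convention that an empty string means "not set at this level" are assumptions.

```shell
#!/bin/sh
# Pick the effective value of one setting:
# command-line flag beats environment variable beats configuration file.
# $1: flag value, $2: env value, $3: file value ("" means not set).
resolve_setting() (
  flag_value="$1"
  env_value="$2"
  file_value="$3"
  if [ -n "$flag_value" ]; then
    printf '%s\n' "$flag_value"
  elif [ -n "$env_value" ]; then
    printf '%s\n' "$env_value"
  else
    printf '%s\n' "$file_value"
  fi
)
```

For example, resolve_setting "" "debug" "info" prints debug, because no flag is given and the environment variable outranks the configuration file.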

Troubleshooting Trace Visibility Issues

When traces fail to appear in the Jaeger UI, the issue typically resides in one of three locations: the data generation stage, the transit stage, or the storage stage.

Traces are not reaching Jaeger: This often occurs due to incorrect configuration in the Spacelift applications. It is vital to check the Spacelift application logs. If the application cannot connect to the OpenTelemetry Collector, it will log specific connection errors.
Data loss via Pod Restart: If the system is using in-memory storage, a pod restart (due to node pressure, updates, or crashes) will immediately wipe all existing traces.
Retention Policies: Jaeger's built-in retention policies might have already purged the traces before they were queried.
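
The first two checks can be scripted roughly as follows. This is a sketch under assumptions: the function names, the 15-minute window, and the grep pattern are illustrative, and the real Deployment names and label selectors depend on the installation.

```shell
#!/bin/sh
# Two quick checks for missing traces.
# KUBECTL can be overridden, e.g. KUBECTL="echo kubectl" for a dry run.

# Show recent tracing-related log lines from one application Deployment.
# $1: namespace, $2: deployment name (for example the server, drain or
#     scheduler workload; actual names depend on the installation).
app_trace_errors() (
  ns="${1:?namespace required}"
  deploy="${2:?deployment required}"
  ${KUBECTL:-kubectl} logs -n "$ns" "deployment/${deploy}" --since=15m |
    grep -i -E 'otlp|otel|trace|connection refused' || echo "no matching log lines"
)

# List restart counts for pods matching a label selector. A non-zero count
# on an in-memory Jaeger pod means earlier traces are gone.
pod_restarts() (
  ns="${1:?namespace required}"
  selector="${2:?label selector required}"
  ${KUBECTL:-kubectl} get pods -n "$ns" -l "$selector" \
    -o 'jsonpath={range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
)
```

If the application logs are clean and no pod has restarted, the retention settings of the storage backend are the next place to look.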

The Jaeger UI provides the necessary tools to inspect the data once it has arrived. By selecting a service from the dropdown menu (such as server, drain, or scheduler) and clicking Find Traces, users can view recent traces. Clicking any specific trace reveals a detailed span timeline, showing the exact duration of each operation and any associated metadata, which is essential for identifying bottlenecks in a microservices architecture.

Analysis of Observability Implementations

Running Jaeger and OpenTelemetry together in a Kubernetes-native environment gives site reliability teams a practical foundation for tracing. By moving from Jaeger v1's external collector model to the integrated v2 model, the project has adopted a more cohesive approach to telemetry, which reduces the cognitive load on DevOps engineers and simplifies the deployment of observability stacks. However, managing persistent storage (Cassandra or Elasticsearch) alongside the Jaeger deployment still requires a solid understanding of Kubernetes service networking and resource lifecycle management. As distributed systems continue to scale, the ability to move from lightweight in-memory testing to persistent production observability on the OpenTelemetry Collector framework will remain central to scalable infrastructure design.