Protocol Convergence in Elastic APM: Architecting OTLP, gRPC, and Jaeger Integration

Distributed observability is moving away from proprietary, agent-specific formats toward unified, vendor-agnostic standards. Elastic APM Server sits at the center of this transition: it has evolved into a multi-protocol gateway capable of ingesting diverse telemetry streams. Central to this evolution is support for the OpenTelemetry Protocol (OTLP) delivered over gRPC, which itself runs on HTTP/2. This convergence lets organizations keep their existing OpenTelemetry-instrumented applications, collectors, and agents while using the indexing and visualization capabilities of the Elastic Stack. Understanding how APM Server handles gRPC requests, negotiates protocols, and ingests Jaeger-formatted spans is critical for engineers building resilient, high-scale observability pipelines. This analysis explores the internal architecture of the APM Server gRPC implementation, the complexities of protocol multiplexing on a shared port, and the configuration required for OpenTelemetry and Jaeger integration.

The Architecture of OTLP and gRPC Ingestion in APM Server

The Elastic APM Server is not merely a passive receiver; it is an active processor designed to handle complex telemetry payloads. Ingestion of traces, metrics, and logs via OTLP/gRPC is built on an internal framework within the server's codebase.

The gRPC server itself is instantiated within the beater package. This is a foundational component of the server's architecture, where the initial connection is established. Once a connection is made, the requests are not processed in isolation but flow through a sophisticated chain of gRPC interceptors. These interceptors are defined under the beater/interceptors directory and are responsible for critical middleware-style tasks.

The primary responsibilities of these interceptors include:

Authentication: Verifying the identity of the incoming client via API keys or secret tokens.
Logging: Recording the metadata of incoming requests for audit and debugging.
Rate Limiting: Protecting the server from being overwhelmed by excessive telemetry spikes, ensuring stability during traffic bursts.
Context Propagation: Managing the metadata required to trace a request across different service boundaries.
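
The chaining pattern behind these responsibilities can be sketched without any gRPC dependency. The example below is a minimal, stdlib-only model, not APM Server code: the `Request`, `Handler`, and `Interceptor` types, the `chain` helper, and the `secret-token` value are all hypothetical names chosen for illustration. It shows only authentication and a naive rate limiter; logging and context propagation would slot into the chain the same way.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical stand-in for an incoming gRPC call.
type Request struct {
	Method string
	Token  string
}

// Handler processes a request; Interceptor wraps a Handler with extra behavior.
type Handler func(req Request) (string, error)
type Interceptor func(next Handler) Handler

// chain composes interceptors so that the first one listed runs outermost.
func chain(h Handler, interceptors ...Interceptor) Handler {
	for i := len(interceptors) - 1; i >= 0; i-- {
		h = interceptors[i](h)
	}
	return h
}

// auth rejects requests whose token does not match the expected secret.
func auth(secret string) Interceptor {
	return func(next Handler) Handler {
		return func(req Request) (string, error) {
			if req.Token != secret {
				return "", errors.New("unauthenticated")
			}
			return next(req)
		}
	}
}

// rateLimit allows at most limit requests in total.
// It is a deliberately naive counter and is not safe for concurrent use.
func rateLimit(limit int) Interceptor {
	seen := 0
	return func(next Handler) Handler {
		return func(req Request) (string, error) {
			seen++
			if seen > limit {
				return "", errors.New("rate limit exceeded")
			}
			return next(req)
		}
	}
}

// call runs one request through the handler and reports the outcome as text.
func call(h Handler, token string) string {
	out, err := h(Request{Method: "Export", Token: token})
	if err != nil {
		return "error: " + err.Error()
	}
	return out
}

func main() {
	export := func(req Request) (string, error) { return "accepted " + req.Method, nil }
	h := chain(export, auth("secret-token"), rateLimit(2))

	fmt.Println(call(h, "wrong"))        // rejected by auth, never reaches the limiter
	fmt.Println(call(h, "secret-token")) // first permitted request
	fmt.Println(call(h, "secret-token")) // second permitted request
	fmt.Println(call(h, "secret-token")) // exceeds the limit of 2
}
```

Because authentication runs outermost, rejected requests never consume rate-limit budget. In grpc-go, the equivalent wiring for unary calls is done by passing interceptors to `grpc.ChainUnaryInterceptor` when the server is created.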

Within this infrastructure, the specific logic for OTLP service registration resides in the beater/otlp package. When the gRPC server is initialized, it registers the necessary OTLP services to listen for specific event types. The actual heavy lifting—the business logic required to interpret these incoming streams—is handled by the processor/otel.Consumer type.

The Consumer type performs a multi-stage transformation process:

Decoding: The raw, encoded OTLP events are first decoded from their ProtoBuf format.
Translation: Because OpenTelemetry and Elastic APM have distinct semantic conventions, the Consumer must translate the OTLP attributes into the Elastic APM data model. This is a delicate process, as any loss in fidelity during translation can lead to gaps in observability.
Pipeline Integration: Once translated, the events are passed to the BatchProcessor, which is the standard event processing pipeline model used for all incoming data.
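
The three stages above can be modeled in a few lines. This is a rough sketch under stated assumptions, not the actual `processor/otel` code: the `otlpSpan`, `apmEvent`, and `batchProcessor` types are hypothetical, JSON stands in for ProtoBuf so the example needs only the standard library, and the `knownFields` table and the label fallback (dots replaced by underscores) are illustrative rather than the real translation rules.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// otlpSpan is a simplified, hypothetical stand-in for a decoded OTLP span.
type otlpSpan struct {
	Name       string            `json:"name"`
	Attributes map[string]string `json:"attributes"`
}

// apmEvent is a simplified, hypothetical stand-in for an Elastic APM event.
type apmEvent struct {
	Name   string
	Fields map[string]string // attributes that have a known mapping
	Labels map[string]string // attributes without one, kept rather than dropped
}

// knownFields is an illustrative mapping table, not the real translation rules.
var knownFields = map[string]string{
	"service.name": "service.name",
	"http.method":  "http.request.method",
}

// decode parses the payload. JSON stands in for ProtoBuf in this sketch.
func decode(payload []byte) ([]otlpSpan, error) {
	var spans []otlpSpan
	err := json.Unmarshal(payload, &spans)
	return spans, err
}

// translate maps known attributes to fields and keeps the rest as labels.
func translate(s otlpSpan) apmEvent {
	ev := apmEvent{Name: s.Name, Fields: map[string]string{}, Labels: map[string]string{}}
	for k, v := range s.Attributes {
		if field, ok := knownFields[k]; ok {
			ev.Fields[field] = v
		} else {
			ev.Labels[strings.ReplaceAll(k, ".", "_")] = v
		}
	}
	return ev
}

// batchProcessor is the hypothetical sink that receives each translated batch.
type batchProcessor func(batch []apmEvent) error

// consume runs the three stages: decode, translate, hand off to the pipeline.
func consume(payload []byte, next batchProcessor) error {
	spans, err := decode(payload)
	if err != nil {
		return fmt.Errorf("decode: %w", err)
	}
	batch := make([]apmEvent, 0, len(spans))
	for _, s := range spans {
		batch = append(batch, translate(s))
	}
	return next(batch)
}

func main() {
	payload := []byte(`[{"name":"GET /cart","attributes":{"http.method":"GET","custom.tenant":"acme"}}]`)
	err := consume(payload, func(batch []apmEvent) error {
		for _, ev := range batch {
			fmt.Println(ev.Name, ev.Fields, ev.Labels)
		}
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

Keeping unmapped attributes in a separate bucket, instead of discarding them, is one way to limit the loss of fidelity that the translation step risks.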

Because the evolution of both OTLP and Elastic APM is continuous, the mapping between semantic conventions is a moving target. There are periods where a faithful, 1:1 translation of every OpenTelemetry attribute to an Elastic APM field may not be possible. In such scenarios, the architecture provides a safety net for developers: the ability to enable debug logging. By enabling the specific otel debug logging selector, administrators can view the original, un-translated OTLP payload. This allows for a direct comparison between the raw incoming data and the final indexed documents in Elasticsearch, which is indispensable for troubleshooting data loss or attribute misconfiguration.

Protocol Multiplexing and the Challenges of HTTP/2 and TLS

A significant engineering feat in the APM Server is its ability to service gRPC requests on the exact same network port used for standard HTTP requests from Elastic APM agents. This consolidation simplifies network configuration and firewall management but introduces substantial technical complexity regarding how the server handles different protocol types.

The fundamental reason this is challenging lies in the relationship between gRPC and HTTP/2. gRPC is built on top of HTTP/2, which provides features like multiplexing, header compression (HPACK), and stream prioritization. While the standard Go net/http package is capable of transparently negotiating and serving both HTTP/1.1 and HTTP/2, the grpc-go implementation—the core library used for gRPC—requires much deeper access to the lower-level HTTP/2 framing layer. It does not operate simply on top of the high-level net/http API, meaning the server must perform "gymnastics" to manage both traffic types on a single port.

The negotiation process typically relies on several layers of technology:

ALPN (Application-Layer Protocol Negotiation): When TLS (Transport Layer Security) is active, the client and server use ALPN during the TLS handshake. ALPN allows the parties to agree on the application protocol (such as h2 for HTTP/2 or http/1.1) before any application data is exchanged.
h2c (HTTP/2 Cleartext): In environments where TLS is not used (such as local testing or secure internal networks), the server must support h2c. This is an insecure mode of HTTP/2 that allows for the same multiplexing benefits without the overhead of encryption. Most modern gRPC clients support an "insecure" mode specifically for this purpose.
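
One common way to share a port, once the protocol has been negotiated, is to dispatch on the HTTP major version and the `Content-Type` header. The sketch below illustrates that general pattern using only the standard library; it is not a description of how APM Server implements multiplexing internally. The `isGRPC`, `mux`, and `route` helpers are hypothetical names, and the two handlers are placeholders for a real gRPC server and a real HTTP intake handler.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// isGRPC reports whether a request looks like gRPC:
// HTTP/2 framing plus a content type starting with application/grpc.
func isGRPC(protoMajor int, contentType string) bool {
	return protoMajor == 2 && strings.HasPrefix(contentType, "application/grpc")
}

// mux routes gRPC-looking requests to grpcHandler and everything else to httpHandler.
func mux(grpcHandler, httpHandler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isGRPC(r.ProtoMajor, r.Header.Get("Content-Type")) {
			grpcHandler.ServeHTTP(w, r)
			return
		}
		httpHandler.ServeHTTP(w, r)
	})
}

// route sends a synthetic request through the mux and returns
// the name of the placeholder handler that answered it.
func route(protoMajor int, contentType string) string {
	named := func(name string) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			fmt.Fprint(w, name)
		})
	}
	req := httptest.NewRequest(http.MethodPost, "/", nil)
	req.ProtoMajor = protoMajor
	req.Header.Set("Content-Type", contentType)
	rec := httptest.NewRecorder()
	mux(named("grpc"), named("http")).ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(route(2, "application/grpc+proto")) // HTTP/2 with a gRPC content type
	fmt.Println(route(1, "application/x-ndjson"))   // HTTP/1.1 agent-style request
	fmt.Println(route(2, "application/json"))       // HTTP/2 but not gRPC
}
```

The dispatch itself is simple; the harder part, as described above, is that the gRPC side needs HTTP/2 semantics on both TLS (via ALPN) and cleartext (via h2c) connections before this check is ever reached.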

The following table illustrates the differences in how these protocols are handled within the APM Server:

Temporal Synchronization and the Export Timestamp Mechanism

In distributed systems, time is a relative concept. When telemetry data is generated on a mobile device or a remote edge server, there is an inherent delay between the moment an event occurs and the moment it reaches the APM Server. This delay, known as network latency, can skew the perceived timing of traces and metrics.

To combat this, the Elastic APM iOS agent (and other compatible agents) utilizes a specific attribute known as the export timestamp. This attribute records the precise moment the client prepared the payload for transmission.

The APM Server implements a sophisticated temporal correction algorithm:

Payload Receipt: The server receives a payload at time $T_{received}$.
Attribute Extraction: The server extracts the export timestamp ($T_{export}$) from the payload.
Delta Calculation: The server calculates the difference: $\Delta T = T_{received} - T_{export}$.
Event Adjustment: The server iterates through every single event within that payload and adds $\Delta T$ to each event's timestamp.

For example, if a client-side event was timestamped and exported at 1:00 PM, but due to network congestion or low connectivity the APM Server did not receive the payload until 2:00 PM, the server will calculate a one-hour delta. It will then update the event timestamp to 2:00 PM.
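
The correction steps and the one-hour scenario can be expressed as a short sketch. The `Event` type and the `correctTimestamps` and `example` functions are hypothetical names used for illustration, and the calendar date is arbitrary; only the arithmetic follows the steps described above.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a hypothetical stand-in for one telemetry event in a payload.
type Event struct {
	Name      string
	Timestamp time.Time
}

// correctTimestamps shifts every event in place by delta = received - export
// and returns the delta that was applied.
func correctTimestamps(events []Event, export, received time.Time) time.Duration {
	delta := received.Sub(export)
	for i := range events {
		events[i].Timestamp = events[i].Timestamp.Add(delta)
	}
	return delta
}

// example replays a payload exported at 13:00 and received at 14:00,
// returning the corrected event times formatted as HH:MM:SS.
func example() []string {
	at := func(h, m, s int) time.Time { return time.Date(2024, 1, 1, h, m, s, 0, time.UTC) }
	events := []Event{
		{Name: "tap", Timestamp: at(12, 59, 58)},
		{Name: "http request", Timestamp: at(13, 0, 0)},
	}
	correctTimestamps(events, at(13, 0, 0), at(14, 0, 0))

	out := make([]string, len(events))
	for i, ev := range events {
		out[i] = ev.Timestamp.Format("15:04:05")
	}
	return out
}

func main() {
	fmt.Println(example()) // [13:59:58 14:00:00]
}
```

Note that the same delta is applied to every event, so the two-second gap between the events is preserved even though both absolute timestamps move.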

While this correction shifts every timestamp later by the transit delay, it is considered an acceptable and necessary tradeoff. The primary goal is to preserve the temporal relationship between each event and its arrival at the server, providing a more consistent view of the telemetry stream across a distributed landscape.

Integrating Legacy Jaeger Deployments

For organizations that have already invested heavily in Jaeger for distributed tracing, Elastic APM provides a seamless integration path. This feature, though experimental and not yet available on Elastic Cloud, allows users to redirect their existing Jaeger data streams to the APM Server without changing a single line of application code. This is achieved by reconfiguring the Jaeger infrastructure to point toward the APM Server's endpoints.

The integration can be implemented in two distinct architectural patterns depending on how the Jaeger infrastructure is deployed:

Agent-Based Architecture: In a typical setup, Jaeger Clients send spans to Jaeger Agents, which then forward them to a central Jaeger Collector. To support this, the APM Server must be configured to enable its gRPC endpoint via the following setting in apm-server.yml:
apm-server.jaeger.grpc.enabled: true
Collector-Direct Architecture: In more modern, streamlined setups, Clients may send spans directly to Collectors, bypassing the Agent layer. In this instance, the APM Server must enable its HTTP endpoint:
apm-server.jaeger.http.enabled: true

This integration enables a phased migration strategy. Users can maintain their existing Jaeger-based instrumentation while gaining the ability to store traces in Elasticsearch and visualize them within the highly integrated Elastic APM application.

Configuring the OpenTelemetry Collector for Elastic Observability

While it is possible to send data directly from an OpenTelemetry SDK to the Elastic APM Server, the industry-standard best practice is to use an OpenTelemetry Collector as an intermediary. The Collector acts as a powerful processing engine that can aggregate, filter, and transform telemetry before it ever reaches the Elastic endpoint.

When configuring the OpenTelemetry Collector, it is vital to use the otlp exporter. Using the elasticsearch exporter is explicitly discouraged because it bypasses the critical validation and processing logic performed by the APM Server. Furthermore, data sent directly to Elasticsearch via the elasticsearch exporter will not be correctly indexed for visibility within the Elastic Observability project.

A robust Collector configuration should utilize the memory_limiter and batch processors to ensure stability and efficiency. Below is a technical configuration example for a Collector designed to forward traces, metrics, and logs to an Elastic APM Server.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
  batch:

exporters:
  debug:
    verbosity: detailed
  otlp/elastic:
    # Note: Do not include the https:// prefix in the endpoint
    endpoint: "${env:ELASTIC_APM_SERVER_ENDPOINT}"
    headers:
      # Secret token shown; for an API key, use "ApiKey <key>" instead of "Bearer <token>"
      Authorization: "Bearer ${env:ELASTIC_APM_SECRET_TOKEN}"
  # Legacy logging exporter, not referenced by any pipeline below;
  # newer Collector releases replace it with the debug exporter
  logging:
    loglevel: warn

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug, otlp/elastic]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug, otlp/elastic]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug, otlp/elastic]
```

In the configuration above, the otlp/elastic exporter is configured to use environment variables for the endpoint and credentials. This is a critical security practice in DevOps workflows, preventing the accidental leakage of sensitive API keys in configuration files. For Elastic Stack versions 8.0 and higher, the logs pipeline must be explicitly declared to enable the ingestion of OpenTelemetry logs.

Advanced Context Propagation in Microservices

A significant challenge in modern microservices architecture occurs when a request enters the system via an HTTP-based gateway and is subsequently passed to a backend service via gRPC. In environments such as Go (Golang), developers must ensure that the tracing context (TraceID, SpanID, etc.) is not lost during this protocol transition.

The difficulty lies in the fact that the tracing context is embedded within the HTTP headers of the original request. When an API gateway or a reverse proxy receives this request, it must extract the context from the HTTP headers and manually inject it into the gRPC metadata (headers) of the downstream call.

This process involves:

Extraction: Using an HTTP middleware to parse the incoming traceparent or custom headers.
Mapping: Converting the extracted HTTP context into a format compatible with the gRPC metadata structure.
Injection: Using the gRPC library to attach this metadata to the outgoing context.Context object in Go.

Failure to perform this "context transfer" results in broken traces, where a single user request appears as multiple, disconnected traces in the APM dashboard, making it impossible to perform true end-to-end distributed tracing.
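
The extraction and mapping steps can be sketched with the standard library alone. This is a minimal illustration assuming W3C Trace Context headers (`traceparent`, `tracestate`); the `extract`, `toMetadata`, `traceID`, and `propagate` helpers are hypothetical names, and the final injection into a gRPC call is described in the note after the block rather than performed here, so the example stays free of external dependencies.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// extract reads the W3C trace context headers from an incoming HTTP header set.
func extract(h http.Header) map[string]string {
	ctx := map[string]string{}
	for _, name := range []string{"traceparent", "tracestate"} {
		if v := h.Get(name); v != "" {
			ctx[name] = v
		}
	}
	return ctx
}

// toMetadata maps the extracted values into the shape gRPC metadata uses:
// lowercase keys, each holding a list of values.
func toMetadata(ctx map[string]string) map[string][]string {
	md := map[string][]string{}
	for k, v := range ctx {
		md[strings.ToLower(k)] = []string{v}
	}
	return md
}

// traceID returns the trace-id field of a traceparent value
// (version-traceid-parentid-flags), or "" if the value is malformed.
func traceID(traceparent string) string {
	parts := strings.Split(traceparent, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return ""
	}
	return parts[1]
}

// propagate performs extraction and mapping for a raw traceparent header
// and returns the trace ID that the downstream call would carry.
func propagate(traceparent string) string {
	h := http.Header{}
	h.Set("traceparent", traceparent)
	md := toMetadata(extract(h))
	if vals := md["traceparent"]; len(vals) > 0 {
		return traceID(vals[0])
	}
	return ""
}

func main() {
	// Example header value taken from the W3C Trace Context specification.
	fmt.Println(propagate("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
}
```

In a real service, the map produced by `toMetadata` has the same underlying type as grpc-go's `metadata.MD`, so the injection step would typically wrap it with `metadata.NewOutgoingContext` before making the downstream call. Instrumentation libraries can often perform all three steps automatically, which is usually preferable to hand-written propagation.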

Analysis of Architectural Implications

The integration of gRPC and OTLP into the Elastic APM Server represents a move toward a "Unified Observability" model. By supporting both the legacy Elastic APM agent protocol and the modern OpenTelemetry standard over the same infrastructure, Elastic has mitigated the risk of vendor lock-in while providing a clear migration path for enterprises.

The architectural decision to multiplex these protocols on a single port demonstrates a sophisticated understanding of the operational constraints faced by SRE (Site Reliability Engineering) teams. Reducing the number of open ports and the complexity of firewall rules is a significant operational advantage. However, as analyzed, this comes at the cost of increased complexity in the server's internal networking stack, specifically regarding HTTP/2 frame management and ALPN negotiation.

From a data integrity perspective, the use of the export timestamp correction mechanism is a vital component for maintaining the temporal accuracy of distributed traces. Without this, the "clock skew" inherent in distributed systems would render the visualization of spans in the APM app nearly useless for latency-sensitive debugging.

Finally, the recommendation to use the OpenTelemetry Collector as a buffer is the most critical architectural takeaway for engineers. The Collector provides the necessary layer of abstraction that allows for complex processing (like the memory_limiter and batch processors) to occur outside the core APM Server, thereby protecting the primary ingestion engine from being overwhelmed by the very telemetry it is designed to process. This layered approach—SDK to Collector to APM Server—is the foundation of a scalable, production-grade observability strategy.