Architecting Global Scale: The Datadog Kafka Ecosystem and Observability Framework

The operational reality of modern data observability requires a backbone capable of sustaining massive throughput while maintaining sub-second latency. For Datadog, Apache Kafka is not merely a component within their stack; it is the central nervous system of their entire data ingestion and processing pipeline. As of February 2025, the scale of this infrastructure has reached unprecedented levels, involving hundreds of Kafka clusters, thousands of individual topics, and millions of partitions. This massive distribution facilitates the movement of terabytes of data every second, feeding into a platform that processes hundreds of trillions of observability events daily across logs, metrics, and traces. To manage this level of complexity, Datadog has moved beyond standard, off-the-shelf Kafka tooling. Their engineering teams have been forced to develop bespoke infrastructure, including a custom-built Rust client and a sophisticated control plane, to ensure that the movement of telemetry data from customer environments into their real-time processing pipelines remains uninterrupted.

The Architecture of High-Throughput Ingestion

The fundamental challenge for Datadog lies in the sheer volume of telemetry data that must be ingested, processed, and made queryable. The ingestion engine serves as the first point of contact for customer data, and Kafka acts as the buffer and distribution mechanism that allows for decoupling of the ingestion layer from the processing layer. Because Datadog operates at such a massive scale, the standard Kafka clients used by many enterprises eventually encounter performance bottlenecks or lack the specific tuning required for high-density environments. To mitigate this, Datadog engineers, led by contributors like Guillaume Bort, developed a custom Rust client. This decision was driven by the need for extreme performance and memory efficiency, ensuring that the ingestion layer can handle the velocity of data without becoming a bottleneck itself.

The data pipeline's evolution is marked by significant engineering milestones. For instance, the redesign of the data server pipeline by Kai Zong and William Yu represents a critical optimization in how live data is served. By implementing a mechanism where the live data server only activates 2-second collection intervals for hosts currently being actively viewed—and signaling this state to the intake service via Kafka at 1-second intervals—the team achieved a massive reduction in overhead. Specifically, peak throughput on the host subscriptions topic plummeted from 500,000 messages per second down to 5,000 messages per second. This optimization allowed the supporting infrastructure to be scaled down by an incredible 98%, demonstrating the profound impact that fine-tuned Kafka orchestration can have on resource utilization and operational cost.

Self-Managed Kubernetes Deployment and Redundancy

Unlike many organizations that opt for managed services such as Amazon MSK or Confluent Cloud, Datadog maintains total control over its Kafka lifecycle by running all Kafka clusters on self-managed Kubernetes. This deployment model utilizes Kubernetes StatefulSets and persistent volumes to ensure data persistence and stateful integrity across a multi-cloud environment. Operating dozens of self-managed Kubernetes clusters requires a deep mastery of container orchestration and storage management. By managing the infrastructure directly, Datadog can implement highly specific tuning for their brokers, which is essential for maintaining the strict latency requirements of an observability platform.

To combat the risks inherent in large-scale distributed systems, Datadog has implemented rigorous multi-region redundancy strategies. For their most critical pipelines, the architecture employs duplicate writes to both a primary and a secondary Kafka cluster. This design pattern was formalized as a response to historical data loss events that occurred in environments where out-of-sync replicas were elected as leaders—a risk that was particularly prevalent before the widespread adoption of exactly-once semantics in Kafka 0.11.0. By maintaining a secondary cluster, Datadog ensures that even if an unclean leader election occurs on the primary cluster, the data remains recoverable from the secondary site, providing a vital safety net for the trillions of datapoints being processed.

Advanced Commit Log and Offset Management

Standard Kafka consumer models present a fundamental trade-off for high-scale operators: the "head-of-line blocking" problem. In a default configuration, a consumer tracks a single pointer per partition. If a consumer group needs to process a backlog of historical data (a "replay") while simultaneously consuming live traffic, the single pointer creates a conflict where the consumer must choose between staying current or catching up. Datadog’s Streaming Platform has addressed this by extending the commit metadata to track multiple offsets or offset ranges within a single partition simultaneously.

This architectural advancement allows for several high-value operational capabilities:

Concurrent processing of live traffic and historical backlogs within the same partition.
Elimination of head-of-line blocking for specialized consumer groups.
Increased flexibility for data reprocessing tasks without impacting real-time visibility.
Enhanced ability to perform historical data analysis alongside real-time alerting.

Furthermore, Datadog has proactively adjusted consumer offset retention settings. While the default Kafka setting is often 1 day, Datadog extends this to 7 days. This extension is a critical safeguard, preventing the need for massive, expensive data reprocessing in scenarios where a consumer group might go offline for more than 24 hours.

Operational Tooling and the kafka-kit Ecosystem

Managing hundreds of clusters and millions of partitions manually is an impossibility. To address the operational overhead, the Datadog SRE team open-sourced kafka-kit in August 2018. This suite of tools has been maintained through significant versions, including v4.2.1 in mid-2023, and is designed to automate the most complex aspects of Kafka cluster management.

The kafka-kit suite provides several critical automated functions:

Topic mapping and management across various environments.
Partition rebalancing to ensure even data distribution.
Broker replacement orchestration during maintenance cycles.
Replication throttle management to prevent network saturation.

The toolset supports different deployment strategies for partition placement. The "count" strategy utilizes even partition distribution, while the "storage" strategy employs a bin-packing approach based on storage metrics. This level of automation is essential for managing the "ISR" (In-Sync Replica) health of the clusters.

The kafka-kit autothrottle tool is specifically designed to integrate with the Datadog metrics API. This integration allows the system to dynamically manage replication throttle rates during broker replacements, ensuring that the process of re-syncing data to a new broker does not consume all available bandwidth and impact the production traffic flowing through the cluster.

Data Streams Monitoring and APM Integration

For the end-user, the complexity of this backend architecture is abstracted into powerful observability features. Datadog's Data Streams Monitoring (DSM) provides granular, end-to-end visibility into streaming data pipelines. This goes beyond simple broker health, allowing users to track the health and performance of producers, consumers, and the specific services and queues that process the events.

The integration with Application Performance Monitoring (APM) allows for deep tracing of requests to and from Kafka clients. Because Datadog utilizes single-step instrumentation, it can automatically instrument popular languages and web frameworks, allowing developers to see the flow of a request through Kafka without having to manually modify their producer or consumer source code. This enables the creation of flame graphs that visualize latency and errors, allowing engineers to pinpoint exactly where a delay is occurring in a distributed transaction.

Monitoring Layer	Key Metrics and Focus	Datadog Implementation Detail
Cluster Level	Broker health, ISR status, throughput	MBean metrics like `MessagesInPerSec` and `BytesInPerSec`
Topic Level	Retention, segment lifecycle, partition count	Segment size and time tuning to align with retention
Consumer Level	Offset lag, consumer group activity	Custom offset tracking and DSM visibility
Application Level	Request latency, trace spans, error rates	APM auto-instrumentation and flame graphs

Configuration and Segment Tuning Optimization

At the extreme scale of Datadog's operations, default Kafka configurations are often insufficient and can lead to resource exhaustion or data retention issues. One specific area of focus is segment configuration tuning, particularly for low-throughput topics. In these scenarios, default Kafka segment settings can cause log retention to exceed the configured topic-level retention because the data is not being closed into a segment frequently enough.

To resolve this, Datadog engineers have implemented specific tuning for segment.ms and segment.bytes:

segment.ms is reduced to 43,200,000 ms (12 hours) to ensure timely segment rolls.
segment.bytes is tuned to approximately 100 MB to align the segment lifecycle with topic-level retention requirements.

Monitoring open file handles is also a critical part of this configuration management. Frequent segment rollouts, while necessary for data retention compliance, increase the number of open file handles on the host. Datadog continuously monitors these handles to ensure that the frequent rotation of segments does not exhaust the operating system's limits, which would otherwise lead to broker instability.

To ensure the stability of the entire system, all configuration changes undergo a rigorous validation process. Before any change is applied to the production environment, it is tested on replicated or mirrored clusters. During the rollout of these changes, engineers maintain continuous monitoring of ISR shrinks and expands to ensure the cluster remains healthy and that the new configurations do not cause unintended oscillations in the distributed state.

Change Data Capture and the CDC Pipeline

Beyond standard telemetry, Datadog has developed a specialized Change Data Capture (CDC) replication platform. This architecture is designed to move data from various database sources into the Kafka-based streaming ecosystem. The platform utilizes Debezium as the primary change capture connector, specifically targeting Postgres and Cassandra sources. This allows Datadog to turn static database state into a continuous stream of events, which can then be processed, transformed, and analyzed through the same high-scale Kafka pipelines used for infrastructure monitoring.

Advanced Observability and Troubleshooting

To verify the integrity of the Datadog-Kafka-ZooKeeper integration, administrators can utilize the Datadog Agent's status command. After ensuring that logs_enabled: true is set in the Agent's configuration and the logs section is uncommented in the Kafka conf.yaml, running the status command provides a diagnostic overview of the integration.

Effective monitoring requires specific configuration in the kafka_consumer/conf.yaml file:

The kafka_connect_str must be updated if the Kafka endpoint is not the default localhost:9092.
The consumer_groups value must be specified if monitoring only specific groups.
The monitor_unlisted_consumer_groups setting can be set to true to automatically fetch offset values for all consumer groups in the cluster.

This granular control over agent configuration allows operators to balance the depth of monitoring against the overhead of the collection process, ensuring that the observability of the Kafka cluster itself does not become a burden on the cluster's performance.

Conclusion

The Datadog Kafka architecture is a testament to the necessity of bespoke engineering in the face of extreme scale. By rejecting managed services in favor of self-managed Kubernetes deployments and developing custom Rust-based tooling, Datadog has built a pipeline capable of handling hundreds of trillions of events with high reliability and efficiency. The integration of advanced features—such as multi-offset tracking in the commit log, automated kafka-kit tooling, and sophisticated CDC pipelines—provides a blueprint for managing massive-scale distributed systems. Ultimately, the ability to maintain deep, end-to-end visibility through Data Streams Monitoring and APM is what enables Datadog to turn the overwhelming complexity of terabytes of per-second data into actionable, real-time intelligence for engineering teams worldwide.