Modern observability depends on processing massive streams of telemetry data with low latency and high reliability. For a platform of New Relic's size, this is a structural necessity rather than a preference. The company processes more than 15 million messages per second, an aggregate data rate approaching 1 Tbps. To handle this scale, New Relic has built a pipeline in which every telemetry data point, from the moment it arrives to the moment it becomes queryable in the database, is routed through Apache Kafka.
This is not a single deployment but a distributed system of more than 100 independent Kafka clusters. These clusters act as the central nervous system of the platform, supporting every stage of the data lifecycle: ingestion, transformation, aggregation, and eventual storage. The engineering challenge goes beyond throughput: the pipeline must absorb billions of data points per minute while maintaining isolation, so that a failure in one segment of the system does not cascade across the entire customer base. By scaling horizontally and moving from a centralized model to a cell-based architecture, New Relic has created a resilient data plane that supports the scale of modern distributed systems.
Evolution from Monolithic Clusters to Cell-Based Architecture
The historical trajectory of New Relic's relationship with Apache Kafka reveals a significant shift in distributed systems engineering. Prior to 2020, the platform’s operational model relied on a massive, single Kafka cluster housed within New Relic's own data center. This centralized cluster contained thousands of broker nodes, all tasked with processing the entirety of the platform's data stream. While powerful, this monolithic approach presented a critical ceiling for horizontal scalability. As the volume of incoming telemetry grew, the single cluster eventually reached a point where it could no longer scale effectively to meet the demands of the expanding user base.
To address the limitations of the monolithic design, New Relic's engineering team implemented a "cell-based" architecture. This transition involved decomposing the platform into independent, self-contained units known as cells. Each cell is a microcosm of the entire platform, containing its own dedicated Kafka cluster, an ingestion service, a suite of data pipeline services, and a specific NRDB cluster.
| Component | Architectural Role within a Cell | Impact of Cell Isolation |
|---|---|---|
| Kafka Cluster | Localized message backbone | Prevents cluster-wide congestion |
| Ingestion Service | Entry point for telemetry | Limits ingestion failure to a single cell |
| Pipeline Services | Transformation and aggregation | Isolates processing errors |
| NRDB Cluster | Proprietary time-series storage | Localizes storage outages |
Currently, in the US region, there are approximately 10 of these cells. The primary advantage of this decomposition is fault isolation. In the event of a cluster failure or a localized software error, the blast radius is restricted to the specific cell affected. In a 10-cell configuration, and assuming customers are spread roughly evenly across cells, a failure impacts about 10% of the customer base rather than causing a total platform outage. However, this architecture also introduces new operational complexities. A notable incident in 2021 demonstrated that if infrastructure automation is not strictly configured to respect cell boundaries, a single configuration change can propagate across all cells simultaneously, nullifying the benefits of the cell-based design.
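The blast-radius arithmetic can be sketched as a quick calculation. This is only an illustration: `blast_radius_percent` is a hypothetical helper, and it assumes customers are distributed evenly across cells, which the source does not state.

```bash
# Hypothetical helper, not a New Relic tool.
# Assumes customers are spread evenly across cells.
blast_radius_percent() {
  local cells="$1" failed="${2:-1}"
  # Integer percentage of the customer base affected when `failed` cells are down.
  echo $(( 100 * failed / cells ))
}

blast_radius_percent 10      # one of 10 cells fails: prints 10
blast_radius_percent 10 10   # a change reaches all 10 cells at once: prints 100
```

The second call mirrors the 2021 scenario: once a change crosses every cell boundary, the isolation benefit disappears.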
Migration to Amazon Managed Streaming for Apache Kafka (MSK)
As New Relic transitioned more of its core infrastructure to Amazon Web Services (AWS), the strategy for managing Kafka evolved. The company migrated its cell-based Kafka clusters to Amazon Managed Streaming for Apache Kafka (MSK). This move was part of a broader strategic push toward cloud-native infrastructure, which saw 95% of New Relic's data ingestion move to AWS in less than one year.
In this modern cloud-native environment, the architecture is highly containerized. Consumer services are hosted on Amazon Elastic Kubernetes Service (EKS) within each individual cell. This combination of Amazon MSK and Amazon EKS provides a robust, scalable, and managed approach to handling the heavy lifting of stream processing.
Data Partitioning and the NRDB Storage Model
A critical aspect of New Relic's ability to maintain high-speed queries while managing massive datasets is its approach to data partitioning and storage. The platform utilizes NRDB (New Relic Database), a proprietary, high-performance time-series store. NRDB is engineered to scan trillions of events with exceptional efficiency, maintaining a median query response time of approximately 60 milliseconds.
To ensure that data retrieval remains performant as the system grows, New Relic employs a strict partitioning model. Customer event data is partitioned based on the specific customer account to which the data belongs.
Customer-centric partitioning:
- Ensures efficient per-account data locality
- Optimizes query performance for individual users
- Facilitates rapid data retrieval within the NRDB layer
This partitioning strategy ensures that when a user queries their specific telemetry, the system can target the relevant data segments quickly, preventing the "noisy neighbor" effect where one user's massive data volume slows down queries for everyone else.
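As a rough illustration of account-keyed partitioning, the sketch below maps an account identifier to a partition with a stable hash. The function name, the use of the `cksum` CRC, and the partition count of 12 are all assumptions made for the example; the source does not describe the actual hashing scheme used by NRDB or the Kafka producers.

```bash
# Hypothetical account-keyed partitioner. The CRC hash and the partition
# count are assumptions for illustration, not New Relic's actual scheme.
partition_for_account() {
  local account_id="$1" partitions="$2" hash
  # cksum prints "<crc> <byte count>"; keep only the CRC.
  hash=$(printf '%s' "$account_id" | cksum | cut -d' ' -f1)
  echo $(( hash % partitions ))
}

# Events for the same account always map to the same partition.
partition_for_account "account-1001" 12
partition_for_account "account-1001" 12
partition_for_account "account-2002" 12
```

Because the hash depends only on the account identifier, all events for one account land together, which is the locality property described above.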
Deep Visibility via Kafka for APM and Java Instrumentation
Observability of the Kafka layer itself is achieved through a multi-layered instrumentation strategy. New Relic provides deep visibility into Kafka performance by combining infrastructure-level monitoring with Application Performance Monitoring (APM) through the use of the New Relic Java Agent.
The Java Agent allows engineers to instrument Kafka message queues, providing granular visibility into Kafka node metrics. This creates a seamless flow from high-level "Queues and Streams" capabilities into a dedicated Kafka for APM experience. Once an engineer identifies a specific topic entity that requires investigation, they can dive directly into node-level metrics to pinpoint the source of latency or error.
To implement this level of monitoring, several configuration steps must be executed:
- Install the Java Agent: The agent must be installed directly on the Kafka brokers to collect and integrate key metrics into the New Relic platform.
- Configure the Agent: The newrelic.yml configuration file must be modified to include Kafka instrumentation, allowing for custom transaction naming and error tracking.
- Enable Node Metrics: Kafka must be configured to output metrics in a JMX-compatible format. This involves specific Java options and JMX monitoring setup.
- Verify Collection: Users must validate that the metrics are accurately appearing on the New Relic dashboards to ensure the monitoring loop is closed.
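The install and JMX steps above might look roughly like the following on a broker host. This is a sketch under stated assumptions: the paths `/opt/newrelic/newrelic.jar` and `/opt/kafka` and the port 9999 are placeholders. `KAFKA_OPTS` and `JMX_PORT` are environment variables read by Kafka's stock launcher scripts, and `-javaagent` is the standard JVM flag for attaching a Java agent. The command is printed rather than executed.

```bash
# Placeholder paths and port (assumptions); adjust for the actual installation.
NEWRELIC_AGENT_JAR="/opt/newrelic/newrelic.jar"
KAFKA_HOME="/opt/kafka"

# Kafka's launcher scripts pass KAFKA_OPTS to the JVM, so the agent can be attached here.
export KAFKA_OPTS="-javaagent:${NEWRELIC_AGENT_JAR}"
# When JMX_PORT is set, kafka-run-class.sh enables remote JMX on that port.
export JMX_PORT=9999

# Print the start command instead of running it, so the sketch has no side effects.
echo "${KAFKA_HOME}/bin/kafka-server-start.sh ${KAFKA_HOME}/config/server.properties"
```

The newrelic.yml changes are not shown here because the source does not specify which settings are required.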
Infrastructure Integration and Metric Collection
For users who are not using the Java Agent for application-level instrumentation, New Relic offers a dedicated Kafka on-host integration. This integration is designed to report metrics and configuration data from the Kafka service by instrumenting all critical elements of the cluster.
The integration covers a comprehensive range of entities:
- Brokers (discovered through either ZooKeeper or a bootstrap broker)
- Producers
- Consumers
- Topics
The mechanism for data collection depends on the specific entity being monitored. Inventory data is primarily harvested from ZooKeeper nodes, whereas real-time performance metrics are collected through Java Management Extensions (JMX). Because of this, JMX must be enabled for Brokers, Java Producers, and Java Consumers to ensure full visibility.
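For Java producers and consumers, enabling JMX usually means passing the standard JVM remote-management properties at startup. The sketch below assembles those flags. The helper name `jmx_flags`, the port 9989, and `my-consumer.jar` are hypothetical, and disabling authentication and SSL is an assumption suitable only for a trusted network.

```bash
# Hypothetical helper that builds standard JVM flags for remote JMX.
# Authentication and SSL are disabled here for illustration only.
jmx_flags() {
  local port="$1"
  printf '%s %s %s %s\n' \
    "-Dcom.sun.management.jmxremote" \
    "-Dcom.sun.management.jmxremote.port=${port}" \
    "-Dcom.sun.management.jmxremote.authenticate=false" \
    "-Dcom.sun.management.jmxremote.ssl=false"
}

# my-consumer.jar is a placeholder; the command is printed, not executed.
echo "java $(jmx_flags 9989) -jar my-consumer.jar"
```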
Integration Compatibility and Technical Specifications
The on-host integration is designed to be flexible across different operating systems and Kafka versions, though it does have specific technical prerequisites.
| Feature | Requirement / Specification |
|---|---|
| Supported Kafka Versions | 0.8 or higher, up to and including 3.0 |
| Supported Operating Systems | Windows and Linux |
| Required Build Language | Golang (Version 1.11 or higher recommended) |
| Inventory Source | ZooKeeper nodes |
| Metric Source | JMX (Java Management Extensions) |
| Deployment Environments | Linux/Windows Host, Kubernetes, or Amazon ECS |
For those managing Kafka on Kubernetes or Amazon ECS, the infrastructure agent can be installed on a host that has the capability to remotely access the Kafka installation.
Building and Deploying the Kafka Integration
The New Relic Kafka integration is an open-source tool available via GitHub. It is built using Golang and requires the go-vendor tool to manage external dependencies. The build and deployment process follows a standardized command-line workflow.
To build the integration, the following sequence is typically used in a terminal:
```bash
# Clone the repository and enter the directory
git clone https://github.com/newrelic/nri-kafka.git
cd nri-kafka
# Execute tests and build the executable
make
```
Upon successful completion of the make command, an executable named nri-kafka is generated in the bin directory. To start the integration, run the binary from the repository root:
```bash
# Run the executable from the repository root
./bin/nri-kafka
```
For users requiring a detailed breakdown of available command-line options and parameters, the help command is provided:
```bash
# Access the help menu
./bin/nri-kafka -help
```
Strategic Importance of Kafka Monitoring in Real-Time Pipelines
Monitoring Apache Kafka is critical because it serves as the foundation for building real-time, fault-tolerant, and scalable data pipelines. Because Kafka supports native replication, it is possible to build complex streaming applications within production environments. However, without active monitoring, these systems can fail silently or experience degradation that is difficult to trace.
New Relic’s Kafka monitoring provides "instant observability" by tracking several high-level and low-level metrics.
- Broker Performance: Monitoring CPU usage and memory consumption of the broker nodes.
- Throughput: Tracking bytes in and out per second and the total number of client requests.
- Data Integrity: Observing replication status and alerting on issues that could lead to data loss.
- Consumer Health: Monitoring consumer lag, which is perhaps the most vital metric for identifying if downstream services are falling behind the data stream.
- Retention and Space: Tracking space and time retention settings to ensure data is stored according to organizational policy.
By utilizing dashboards that aggregate these metrics—such as messages per second, consumer lag, and broker bytes—engineers can move from reactive firefighting to proactive system optimization.
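Consumer lag is the gap between a partition's latest offset and the offset the consumer group has committed. The sketch below sums that gap across partitions. The `total_lag` helper and its three-column input format are assumptions for illustration, and the offsets are made-up sample values rather than measurements.

```bash
# Hypothetical helper: reads rows of "partition log_end_offset committed_offset"
# and prints the summed lag, treating negative differences as zero.
total_lag() {
  awk '{ lag = $2 - $3; if (lag < 0) lag = 0; total += lag } END { print total + 0 }'
}

# Illustrative offsets only, not data from a real cluster.
printf '%s\n' "0 1500 1450" "1 2000 2000" "2 980 900" | total_lag   # prints 130
```

A total that keeps growing over time is the signal that downstream services are falling behind the stream.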
Conclusion: The Interdependency of Scale and Observability
The relationship between New Relic and Apache Kafka is a symbiotic one, where the scale of the data necessitates the complexity of the architecture, and the complexity of the architecture necessitates the depth of the observability. The transition from a single, massive Kafka cluster to a distributed, cell-based architecture using Amazon MSK represents a sophisticated response to the challenges of horizontal scaling and fault isolation. This architectural evolution ensures that New Relic can continue to process 15 million messages per second without compromising the stability of the platform for individual customers.
Furthermore, the ability to achieve granular visibility—ranging from high-level Kafka topic trends to low-level JMX metrics and Java Agent instrumentation—allows for a holistic view of the data pipeline. Whether it is monitoring consumer lag to prevent data processing delays or verifying the health of a specific Kafka broker in a cell, the integration of New Relic’s observability tools with Kafka’s distributed messaging capabilities provides the necessary telemetry to maintain a robust, high-performance data ecosystem. The capability to monitor producers, consumers, and brokers in real-time transforms Kafka from a "black box" into a transparent, manageable component of the modern data stack.