The transition from monolithic application architectures to microservices-based infrastructures frequently necessitates a fundamental shift in observability strategy. When an organization runs a Proof of Concept (POC) to validate the migration of legacy monoliths into a Kubernetes (K8s) environment, a critical technical hurdle arises: integrating ephemeral, containerized metrics into an established monitoring stack. A common scenario is that a robust Prometheus and Grafana infrastructure is already operational, serving as the single source of truth for long-standing Linux and Windows server workloads. When the organization then introduces a three-node Ubuntu-based Kubernetes cluster, the primary architectural question becomes whether this new, dynamic environment can deliver its internal telemetry to the stable, external Prometheus instance without a complete, redundant overhaul of the monitoring stack.
The Dual-Path Architecture for Kubernetes Metric Acquisition
When an external Prometheus server—one hosted on a separate virtual machine or a physical server outside the Kubernetes cluster—needs to ingest data from Kubernetes pods, two primary architectural patterns emerge. Each path carries distinct implications for network topology, security configuration, and operational complexity.
The first method involves direct scraping from the Prometheus server residing outside the cluster. This approach requires the external Prometheus instance to reach into the cluster and pull metrics directly from the pods. The realization of this method hinges on two critical requirements: network reachability and API discovery.
First, the external Prometheus server must have direct network access to the internal cluster networking. Because Kubernetes pods typically reside on an overlay network that is not natively routable from an external corporate subnet, this connection usually requires a bridge, such as a service mesh overlay or a specialized VPN/tunneling mechanism. Without this, the external Prometheus server may still learn the IP addresses assigned to the transient pods through service discovery, but it cannot connect to them to scrape.
Second, the external server must use a kubernetes_sd_configs (Kubernetes Service Discovery) block within its prometheus.yml file. This mechanism lets Prometheus query the Kubernetes API server to dynamically discover pods, services, and endpoints as they are created or destroyed. Because the server runs outside the cluster, the block must name the API server address explicitly (or point to a kubeconfig file), and because the API server is a protected resource, Prometheus must present valid credentials. In practice this means configuring basic_auth, authorization, or oauth2 within the service discovery block, most commonly a ServiceAccount bearer token supplied through authorization, so that the external entity has the appropriate permissions to interact with the cluster's control plane.
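As a rough illustration of the second requirement, the fragment below sketches what such a scrape job might look like in the external server's prometheus.yml. The API server URL, the file paths, and the prometheus.io/scrape annotation convention are assumptions chosen for illustration and do not come from any specific environment.

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
        # Hypothetical API server address; required because Prometheus
        # runs outside the cluster.
        api_server: https://k8s-api.example.internal:6443
        authorization:
          type: Bearer
          # Assumed path to a ServiceAccount token copied from the cluster.
          credentials_file: /etc/prometheus/k8s/token
        tls_config:
          # Assumed path to the cluster CA certificate.
          ca_file: /etc/prometheus/k8s/ca.crt
    relabel_configs:
      # Keep only pods that opt in through an annotation (a common
      # convention, not a Kubernetes built-in).
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Carry namespace and pod name over as regular labels.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

The credentials here only authenticate the discovery calls to the API server. The scrapes themselves still go straight to the pod IPs, which is why the network reachability requirement remains.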
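The second method, often preferred for POC deployments because of its relative simplicity and lower barrier to entry, involves deploying a local collector within the Kubernetes cluster. This collector acts as an intermediary: it scrapes the local pods and forwards the data to the external Prometheus instance, typically via remote write. The external server must be configured to accept that traffic; for Prometheus this means starting it with the --web.enable-remote-write-receiver flag.
In this "agent" pattern (sometimes loosely called a "sidecar", although the collector runs as its own workload rather than inside each application pod), several implementation choices exist: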
- Running a full-fat Prometheus instance within the cluster to act as a local aggregator.
- Utilizing Prometheus in "agent mode," which is a streamlined, more resource-efficient version of the server designed specifically for remote writing without the overhead of full local storage.
- Leveraging the Prometheus Operator, which provides a highly scalable, Kubernetes-native way to manage scrape configurations via Custom Resource Definitions (CRDs) such as ServiceMonitor and PodMonitor.
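To make the agent-mode option more concrete, here is a minimal sketch of a prometheus.yml for an in-cluster collector. The cluster label value and the remote write URL are hypothetical. The sketch also assumes the receiving server accepts remote write on its standard /api/v1/write path.

```yaml
global:
  scrape_interval: 30s
  external_labels:
    # Hypothetical label so the central server can tell clusters apart.
    cluster: poc-k8s

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      # No api_server is set: inside the cluster, Prometheus discovers the
      # API server and uses its own ServiceAccount credentials.
      - role: pod

remote_write:
  # Hypothetical address of the existing external Prometheus server.
  - url: http://prometheus.example.internal:9090/api/v1/write
```

Only the collector initiates connections in this layout, so the external server never needs a route into the pod network.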
The choice between direct scraping and the remote-write pattern involves a trade-off between centralized control and operational simplicity. Direct scraping maintains a single point of management but requires complex networking and security plumbing. Remote writing simplifies the cluster setup but introduces an additional component (the agent) to maintain within the Kubernetes environment.
Implementing Service Discovery and RBAC for External Scrapers
When the decision is made to use kubernetes_sd_configs for an external Prometheus server, the administrator must carefully configure Role-Based Access Control (RBAC). With RBAC enabled, the API server denies requests by default; a service or external entity cannot simply "look" at the pods or endpoints without explicit permission.
To enable discovery, a ServiceAccount must be created within the cluster. This identity is then bound to a ClusterRole through a ClusterRoleBinding. This permission set must be granular to adhere to the principle of least privilege while providing enough visibility for the scraper to function.
The configuration requires specific permissions to allow the Prometheus server to "watch" the state of the cluster. A typical RBAC manifest for this purpose includes the following components:
- A ServiceAccount named prometheus.
- A ClusterRole that grants get, list, and watch permissions on critical resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  - ingresses
- Within the same ClusterRole, get permission on configmaps.
- Permission to access nonResourceURLs such as /metrics to allow the scraper to hit the API server's metrics endpoint.
- A ClusterRoleBinding that links the ServiceAccount to the ClusterRole.
| Resource Type | Verbs Required | Purpose in Discovery |
|---|---|---|
| Nodes | get, list, watch | To identify the physical/virtual machines running the pods. |
| Services | get, list, watch | To identify the stable entry points for various applications. |
| Endpoints | get, list, watch | To find the actual IP addresses of the pods behind a service. |
| Pods | get, list, watch | To identify individual container instances and their metadata. |
| ConfigMaps | get | To read configuration data required for service discovery. |
| Non-Resource URLs | get | To access the /metrics endpoint of the API server. |
Failure to correctly implement these RBAC settings causes the discovery requests made by kubernetes_sd_configs to be rejected as forbidden, so the target list stays empty despite the presence of active pods.
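The components described above could be expressed roughly as the following manifest. It is a sketch, not a definitive policy: the default namespace and the object names are assumptions. The final Secret is an additional assumption, shown as one way to obtain a long-lived token that an external server could present.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default          # assumed namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  # Core resources needed for discovery and node-level scraping.
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch"]
  # Access to the API server's own metrics endpoint.
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: default
---
# Assumption: a token Secret bound to the ServiceAccount, so the token can be
# copied to a Prometheus server running outside the cluster.
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-token      # hypothetical name
  namespace: default
  annotations:
    kubernetes.io/service-account.name: prometheus
type: kubernetes.io/service-account-token
```

An in-cluster collector would not need the Secret, because its pod receives a ServiceAccount token automatically.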
Deploying the Prometheus Operator for Local Collection
For organizations choosing the "internal collector" route, the Prometheus Operator is a widely adopted mechanism. The Operator pattern extends the Kubernetes API to handle complex applications like Prometheus, automating much of the heavy lifting involved in managing scrape jobs and service discovery.
The lifecycle of a Prometheus deployment via the Operator involves several stages, from the initial installation of the operator itself to the creation of custom resources that define how metrics are collected.
The Deployment Workflow
To deploy the Prometheus Operator within a Kubernetes cluster, the administrator typically follows these steps:
1. Create a dedicated workspace for the Kubernetes manifests:
bash
mkdir operator_k8s
cd operator_k8s
2. Define the RBAC requirements. This involves creating a manifest (e.g., prom_rbac.yaml) that defines the ServiceAccount, ClusterRole, and ClusterRoleBinding. This ensures that once the operator is running, it has the necessary authority to manage the Prometheus instance and discover targets.
3. Apply the manifests to the cluster:
bash
kubectl apply -f prom_rbac.yaml
4. Deploy the Prometheus instance. Depending on the deployment method, this might involve deploying a Prometheus custom resource. Once deployed, the status can be verified using:
bash
kubectl get prometheus
5. Verify the underlying pods are in a Running state:
bash
kubectl get pod
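For the step that deploys a Prometheus custom resource, a minimal manifest might look like the sketch below. It assumes the Operator and its CRDs are already installed. The team: frontend selector label, the memory request, and the remote write URL are all hypothetical values.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  # ServiceAccount defined in the RBAC manifest.
  serviceAccountName: prometheus
  # Only ServiceMonitor objects carrying this (hypothetical) label are used.
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  # Forward collected samples to the existing external server
  # (hypothetical address).
  remoteWrite:
    - url: http://prometheus.example.internal:9090/api/v1/write
  resources:
    requests:
      memory: 400Mi
```

Once the resource is applied, the Operator creates the underlying StatefulSet and pods, which is what the two verification commands above check.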
Exposing Prometheus via Kubernetes Services
Once the Prometheus pods are operational, they are not automatically accessible from outside the cluster. A Kubernetes Service must be created to provide a stable IP address and a way to reach the pods.
To create a service that allows users to access the Prometheus web interface, a manifest file (e.g., prom_svc.yaml) is used:
yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  labels:
    app: prometheus
spec:
  ports:
    - name: web
      port: 9090
      targetPort: web
  selector:
    app.kubernetes.io/name: prometheus
  sessionAffinity: ClientIP
This specific configuration achieves several objectives:
- It assigns the name prometheus to the service.
- It maps the service port 9090 to the port named web on the target pods.
- It uses a selector to ensure traffic is directed to pods labeled with app.kubernetes.io/name: prometheus.
- It employs sessionAffinity: ClientIP to ensure that a client's connection to the Prometheus web UI is consistently routed to the same Pod. This keeps query results consistent during troubleshooting when more than one replica is running, since each replica holds its own copy of the data.
After applying the service with kubectl apply -f prom_svc.yaml, the service can be accessed locally by forwarding a port from the local machine to the cluster:
bash
kubectl port-forward svc/prometheus 9090
The user can then open http://localhost:9090 to access the Prometheus interface and go to Status -> Targets to check the health of each scrape target.
Understanding Kubernetes System Component Metrics
Beyond application-level metrics, Kubernetes exposes critical system-level data that is essential for cluster health monitoring. These metrics are emitted by various components in a structured, plain-text Prometheus format, specifically designed for both human and machine readability.
The Nature of Kubernetes Metrics
Metrics are typically exposed via an HTTP server on a /metrics endpoint. For many components, this is active by default, but some may require the --bind-address flag to enable the endpoint.
The Kubernetes ecosystem categorizes metrics into several stages of maturity, which is a critical distinction for engineers building long-term dashboards and alerting rules:
- Alpha metrics: These have no stability guarantees. They can be modified or deleted at any time without notice.
- Beta metrics: These observe a looser API contract than stable metrics, offering more predictability but still lacking full stability.
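- Stable metrics: These are guaranteed not to change: they will not be renamed or deleted without first going through deprecation, and their type will not be modified. They are the safe basis for production dashboards and alerts.
- Deprecated, hidden, and deleted metrics: These represent the end of a metric's lifecycle. A deprecated metric is still exposed but slated for removal, a hidden metric is no longer exposed by default, and a deleted metric is no longer available at all.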
Key System Endpoints
Different components expose different subsets of metrics, and the kubelet serves several paths in addition to the standard one:
- /metrics: The standard metrics endpoint.
- /metrics/cadvisor: Provides container-level resource usage (CPU, memory, network) using cAdvisor.
- /metrics/resource: Specifically focused on resource utilization.
- /metrics/probes: Contains data regarding the status of liveness and readiness probes.
Note that metrics from /metrics/cadvisor, /metrics/resource, and /metrics/probes do not share the same lifecycle as standard system metrics, so the stability levels described above do not apply to them. Engineers should account for this when building long-lived dashboards and alerting rules on top of these endpoints.
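As an illustration of how one of these kubelet paths might be scraped, the fragment below sketches a job for /metrics/cadvisor as it could appear in an in-cluster collector's configuration. It assumes the collector runs with a ServiceAccount that has get access to nodes/metrics. The relaxed certificate check is an assumption that suits some POC clusters and should not be carried into production without review.

```yaml
scrape_configs:
  - job_name: kubelet-cadvisor
    scheme: https
    # Scrape the cAdvisor path instead of the default /metrics.
    metrics_path: /metrics/cadvisor
    kubernetes_sd_configs:
      # One target per cluster node, addressed at the kubelet port.
      - role: node
    authorization:
      # Standard in-cluster path of the pod's ServiceAccount token.
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      # Assumption: kubelet serving certificates in a POC cluster may not be
      # signed by the cluster CA, in which case verification is relaxed.
      insecure_skip_verify: true
```

The same pattern would apply to /metrics/resource and /metrics/probes by changing metrics_path and the job name.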
ServiceMonitors and the Prometheus Operator Paradigm
In a standard Prometheus installation, adding a new scrape job requires manual modification of the prometheus.yml configuration file and a reload of the Prometheus process. In a dynamic Kubernetes environment, this quickly becomes unmanageable.
The Prometheus Operator solves this by introducing the ServiceMonitor custom resource. A ServiceMonitor allows an administrator to define a set of selection criteria (labels) that Prometheus should use to automatically discover and scrape targets.
Instead of writing complex scrape_configs blocks, the administrator creates a ServiceMonitor object that says, in effect: "Find all services with the label app: my-app and scrape them on the port named http." The Operator then translates this custom resource into the appropriate Prometheus configuration and updates the running Prometheus instance, provided the ServiceMonitor itself is selected by the Prometheus resource's serviceMonitorSelector.
This approach transforms monitoring from a manual, configuration-driven task into a declarative, Kubernetes-native workflow, where a correctly labeled service is picked up for monitoring as soon as it appears in the cluster.
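A ServiceMonitor matching the description above might be sketched as follows. The app: my-app selector and the http port name come from the example in the text. The object name, the team: frontend label, and the 30s interval are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    # Hypothetical label; it must match the serviceMonitorSelector of the
    # Prometheus resource for this object to take effect.
    team: frontend
spec:
  # Select every Service labeled app: my-app.
  selector:
    matchLabels:
      app: my-app
  endpoints:
    # Name of the Service port to scrape, not a port number.
    - port: http
      interval: 30s
```

By default a ServiceMonitor only matches Services in its own namespace; spec.namespaceSelector can widen that scope if needed.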
Conclusion: Strategic Selection for Observability
The architecture chosen for Kubernetes metric ingestion fundamentally dictates the long-term operational overhead and the complexity of the networking stack. For organizations in the midst of a Proof of Concept, migrating from a monolith to microservices, the "Internal Collector" pattern using the Prometheus Operator and Remote Write is the most efficient path. It minimizes the need to expose complex cluster networks to external servers and leverages Kubernetes-native primitives like ServiceMonitor to automate the discovery of the highly transient pod lifecycles.
Conversely, for mature, large-scale environments where a single, centralized Prometheus instance is the strict organizational standard, the kubernetes_sd approach is the correct architectural choice. While it requires more rigorous upfront work in terms of RBAC configuration and network overlay management, it provides a unified, single-pane-of-glass view that avoids the fragmentation of having multiple Prometheus instances running across different layers of the infrastructure. Ultimately, the decision hinges on whether the organization prioritizes ease of cluster deployment or the centralized management of the global monitoring estate.