Orchestrating Observability: Deploying and Managing Custom Grafana Instances on OpenShift 4

Monitoring in the Red Hat OpenShift Container Platform (OCP) has gone through significant architectural changes, particularly in how its visualization layer is exposed. Historically, OpenShift shipped a prepackaged Grafana instance in the openshift-monitoring namespace. That instance was strictly read-only, so platform engineers and developers could not extend it with custom dashboards or additional data sources. With the release of OpenShift 4.11, this read-only version was deprecated, and by version 4.12 it was entirely removed from the standard distribution. Teams that want Grafana on current releases therefore need a self-managed deployment.

For modern OpenShift environments, including those running OpenShift Virtualization or operating in disconnected, air-gapped, or highly regulated settings, the ability to deploy a custom Grafana instance is essential. A community-supported, self-deployed Grafana Operator lets organizations work around the limitations of the default stack. This matters most for users who need to maintain data sovereignty and want to avoid sending metrics to external consoles such as console.redhat.com via Red Hat Insights. With the Grafana Operator, administrators can implement a GitOps-driven workflow, using tools like ArgoCD to manage dashboards, data sources, and configuration as code, so that observability scales with the underlying cluster infrastructure.

The Architecture of Custom Observability via Grafana Operator

The deployment of a custom Grafana stack on OpenShift is not merely about installing a single application; it is about orchestrating a series of Kubernetes custom resources (CRs) that interact with the cluster's existing Prometheus monitoring stack. The core of this architecture is the Grafana Operator, a Kubernetes-native controller designed to automate the lifecycle of Grafana instances.

The Grafana Operator works by synchronizing Kubernetes custom resources with the actual state of the Grafana instance. This synchronization covers several components:

Grafana Instances: The core deployment of the Grafana engine itself.
Dashboards: Managed via GrafanaDashboard custom resources, allowing for automated updates.
Data Sources: Managed via GrafanaDatasource resources, enabling seamless connection to Prometheus.
Folders: Organizing dashboards into logical groups for multi-tenant or multi-team environments.
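As a minimal sketch of the first item, a Grafana custom resource for the operator's v1beta1 API might look like the following. The instance name, the grafana namespace, and the placeholder admin credentials are assumptions for illustration only. The dashboards: grafana label is chosen to match the instanceSelector used by the data source shown later in this article, and the empty route spec is assumed to ask the operator to create an OpenShift Route.

```shell
#!/bin/sh
# Render a minimal Grafana custom resource (operator API v1beta1).
# Arguments: instance name and namespace (both hypothetical defaults).
render_grafana_instance() {
  name="${1:-grafana}"
  namespace="${2:-grafana}"
  cat <<EOF
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: ${name}
  namespace: ${namespace}
  labels:
    dashboards: grafana   # matched by instanceSelector in other resources
spec:
  route:
    spec: {}              # assumption: empty spec requests a default Route
  config:
    security:
      admin_user: admin          # placeholder credentials, replace before use
      admin_password: change-me
EOF
}

# Print the manifest; append "| oc apply -f -" to create it on a cluster.
render_grafana_instance grafana grafana
```

Rendering the manifest from a function keeps the sketch runnable without cluster access; in a GitOps repository the same YAML would simply be committed as a file.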

It is vital to distinguish between the official Red Hat-vetted components and community-driven operators. The Grafana Operator sourced from OperatorHub is a community operator. Because these operators are not vetted or verified by Red Hat, they carry an inherent level of uncertainty regarding stability and long-term support. Red Hat provides no official support for these community-driven implementations. However, using a self-supported Grafana operator is highly advantageous because it does not modify the existing Prometheus operator. This design choice preserves the integrity of the default OpenShift monitoring stack, leaving the cluster's primary metric collection mechanism in a supported, unmodified state.

Deployment Strategies: Manual Installation vs. GitOps Automation

When deploying the Grafana Operator, engineers generally choose between manual, imperative commands and a declarative GitOps approach. Manual commands are useful for rapid prototyping, while the GitOps approach using OpenShift GitOps (ArgoCD) is generally the better fit for production because every change is version-controlled and auditable.

Manual Deployment via CLI and OperatorHub

For administrators who prefer using the OpenShift web console or the oc command-line interface, the deployment process follows a structured sequence of operations.

The initial step involves the creation of a dedicated namespace to isolate the Grafana infrastructure. This prevents resource contention and simplifies the application of security policies.

Create a dedicated project for the Grafana deployment.
```bash
oc new-project grafana
```
Navigate to the OpenShift Web Console.
Access the Operators menu and select OperatorHub.
Search for "Grafana" within the OperatorHub catalog.
Select the Grafana operator package.
A notification will appear informing the user that this operator has not been vetted or tested by Red Hat; click "Continue" to proceed.
Configure the installation settings, accepting the default values for the operator deployment.
Click "Install" and wait for the operator status to transition to "Succeeded".
If the operator is not configured for automatic updates, you may be required to manually click "Approve" in the Installed Operators view.
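The console steps above correspond roughly to creating an OperatorGroup and a Subscription through Operator Lifecycle Manager. The sketch below renders such a pair. The channel (v5), the catalog source (community-operators), and the resource names are assumptions that should be checked against the catalog in your cluster, for example with oc get packagemanifests.

```shell
#!/bin/sh
# Render an OperatorGroup and Subscription for the community Grafana Operator.
# Argument: target namespace (defaults to the grafana project created above).
render_operator_subscription() {
  namespace="${1:-grafana}"
  cat <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: grafana-operator-group      # hypothetical name
  namespace: ${namespace}
spec:
  targetNamespaces:
    - ${namespace}
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: grafana-operator
  namespace: ${namespace}
spec:
  name: grafana-operator            # package name, assumed
  channel: v5                       # assumed channel
  source: community-operators       # assumed catalog source
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic    # use Manual to approve updates yourself
EOF
}

# Print both manifests; append "| oc apply -f -" to install the operator.
render_operator_subscription grafana
```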

Alternatively, for those utilizing Helm, the deployment can be executed through the following commands:

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade -i grafana-operator grafana/grafana-operator
```

The GitOps Approach with ArgoCD

For large-scale environments, deploying the Grafana application through OpenShift GitOps (ArgoCD) ensures that the entire observability stack is version-controlled. This method is particularly effective when integrating with the Managed OpenShift Black Belt helm repositories.

To deploy via ArgoCD, the application definition must be applied to the openshift-gitops namespace:

```bash
oc apply -f custom-grafana.application.yaml -n openshift-gitops
```

This approach allows the GrafanaDashboard and GrafanaDatasource resources to be reconciled automatically whenever the Git repository is updated, eliminating manual configuration drift.
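The contents of custom-grafana.application.yaml are not shown in this article. As a hedged sketch, an Argo CD Application that syncs a directory of Grafana resources could look like the following. The repository URL, the path, and the sync policy are hypothetical and would need to match your own Git layout.

```shell
#!/bin/sh
# Render an Argo CD Application that syncs Grafana resources from Git.
# Arguments: repository URL, path inside the repo, destination namespace.
render_argocd_application() {
  repo_url="$1"
  repo_path="${2:-grafana}"
  dest_ns="${3:-grafana}"
  cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: custom-grafana
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: ${repo_url}
    targetRevision: HEAD
    path: ${repo_path}
  destination:
    server: https://kubernetes.default.svc
    namespace: ${dest_ns}
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual changes made on the cluster
EOF
}

# Hypothetical repository URL, for illustration only.
render_argocd_application "https://git.example.com/platform/observability.git" grafana grafana
```

With automated sync enabled, a commit that changes a GrafanaDashboard file is applied without any further oc commands.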

RBAC Configuration and Security Context Constraints

A critical component of a successful Grafana deployment is the configuration of Role-Based Access Control (RBAC). Because Grafana must query the Prometheus instance located in the openshift-monitoring namespace, the service account used by Grafana must be granted specific cluster-wide permissions.

The grafana-sa service account, which is automatically created alongside the Grafana instance, requires elevated privileges to read cluster metrics. Without these permissions, the Prometheus data source will fail to authenticate, resulting in empty dashboards.

The following permissions must be granted to the grafana-sa service account:

cluster-monitoring-view: To allow reading of cluster-level metrics.
openshift-cluster-monitoring-view: To allow access to the specialized OpenShift monitoring metrics.
edit role within the specific Grafana namespace: To allow the service account to access secrets and manage internal resources.

Execute the following commands to establish these permissions:

```bash
# Run against the namespace that contains the Grafana instance (grafana here).
oc adm policy add-cluster-role-to-user cluster-monitoring-view -z grafana-sa -n grafana
oc adm policy add-cluster-role-to-user openshift-cluster-monitoring-view -z grafana-sa -n grafana
oc adm policy add-role-to-user edit -z grafana-sa -n grafana
```

Token Generation for Prometheus Authentication

Since the Prometheus instance is located in a different namespace (openshift-monitoring), authentication is handled via a Bearer token. This token must be retrieved and injected into the Grafana Data Source configuration. For long-term stability, a token with an extended duration should be generated:

```bash
oc create token grafana-sa --duration=8760h -n grafana
```

This command prints a token requested for 8760 hours (one year), although the cluster may enforce a shorter maximum lifetime. Use it as the Bearer value in the HTTP Authorization header of the Prometheus data source configuration.
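Before wiring the token into Grafana, it can help to confirm that it is accepted by the monitoring stack. The sketch below assumes that a route named thanos-querier exists in openshift-monitoring and that curl is available; the helper names are made up for this example, and check_prometheus_access is only defined here, not executed.

```shell
#!/bin/sh
# Build the instant-query URL for a route host and a simple PromQL expression.
# Expressions with special characters would need URL encoding first.
thanos_query_url() {
  printf 'https://%s/api/v1/query?query=%s\n' "$1" "$2"
}

# Query Thanos with a short-lived token for the service account.
# Requires a logged-in oc session; call it manually when needed.
check_prometheus_access() {
  sa="${1:-grafana-sa}"
  ns="${2:-grafana}"
  host="$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')"
  token="$(oc create token "$sa" -n "$ns" --duration=10m)"
  # -k skips TLS verification, which may be needed with self-signed certificates.
  curl -sk -H "Authorization: Bearer ${token}" "$(thanos_query_url "$host" up)"
}

# Show the URL that would be queried for a hypothetical cluster domain.
thanos_query_url thanos-querier-openshift-monitoring.apps.example.com up
```

A JSON response with "status":"success" would indicate that the role bindings and the token work; an authorization error points back at the RBAC step.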

Security Context Constraints (SCC) and Non-Root Execution

When deploying advanced observability agents like Grafana Alloy on OpenShift, security posture is a primary concern. OpenShift utilizes Security Context Constraints (SCC) to restrict the permissions of Pods. To ensure compliance with enterprise security policies, agents like Alloy must be configured to run as non-root users.

Administrators must review the rbac.yaml configuration to ensure that the verbs and permissions granted to the Alloy service account are appropriate for the local environment. Failure to apply the correct SCCs can lead to Pods being blocked from starting due to insufficient privileges.
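As an illustration of what non-root execution means at the container level, the sketch below renders a generic Kubernetes securityContext. It is not taken from the Alloy chart or its rbac.yaml, so where this fragment belongs in a given Helm values file is an assumption to verify. No runAsUser is set, which leaves OpenShift free to assign a UID from the namespace's allowed range.

```shell
#!/bin/sh
# Render a generic container-level securityContext for non-root execution.
# This is standard Kubernetes syntax, not the Alloy chart's own schema.
render_nonroot_security_context() {
  cat <<'EOF'
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  seccompProfile:
    type: RuntimeDefault
EOF
}

# Print the fragment so it can be pasted into a container spec.
render_nonroot_security_context
```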

Configuring the Prometheus Data Source and Dashboards

Once the operator is running and the service account has the necessary permissions, the final stage is the instantiation of the GrafanaDatasource and GrafanaDashboard resources.

Implementing the Prometheus Data Source

The GrafanaDatasource custom resource defines how Grafana connects to the Thanos-querier within the OpenShift monitoring stack. The configuration must include the specific URL of the Thanos service and the Bearer token generated previously.

Below is a detailed specification for a GrafanaDatasource resource:

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: grafanadatasource-prometheus
  namespace: monitoring
spec:
  allowCrossNamespaceImport: true
  datasource:
    access: proxy
    editable: true
    secureJsonData:
      httpHeaderValue1: 'Bearer <your_token_here>'
    name: prometheus
    url: 'https://thanos-querier.openshift-monitoring.svc.cluster.local:9091'
    jsonData:
      httpHeaderName1: Authorization
      timeInterval: 5s
      tlsSkipVerify: true
    basicAuth: false
    isDefault: true
    type: prometheus
  instanceSelector:
    matchLabels:
      dashboards: grafana
  plugins:
    - name: grafana-clock-panel
      version: 1.3.0
  resyncPeriod: 5m
```

In the httpHeaderValue1 field, the token from the oc create token command must be inserted immediately following the Bearer string. This configuration uses a proxy-based access model, which allows the Grafana backend to handle the authentication to the Prometheus endpoint.
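Pasting the token into the manifest means it ends up in Git in plain text. As an alternative sketch, the operator's v1beta1 API is understood to support a valuesFrom section that reads the value from a Secret. The Secret name and key below are hypothetical, and the substitution syntax should be checked against the operator version you run.

```shell
#!/bin/sh
# Render a GrafanaDatasource fragment that reads the token from a Secret.
# The quoted heredoc keeps ${PROMETHEUS_TOKEN} literal so that the operator,
# not the shell, performs the substitution.
render_datasource_values_from() {
  cat <<'EOF'
spec:
  valuesFrom:
    - targetPath: secureJsonData.httpHeaderValue1
      valueFrom:
        secretKeyRef:
          name: grafana-prometheus-token   # hypothetical Secret name
          key: PROMETHEUS_TOKEN            # hypothetical key
  datasource:
    secureJsonData:
      httpHeaderValue1: "Bearer ${PROMETHEUS_TOKEN}"
EOF
}

# The Secret itself could be created along these lines (not executed here):
#   oc create secret generic grafana-prometheus-token -n grafana \
#     --from-literal=PROMETHEUS_TOKEN="$(oc create token grafana-sa -n grafana)"
render_datasource_values_from
```

These keys would be merged into the full GrafanaDatasource specification, replacing the inline Bearer value.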

Deploying Dashboards

With the data source active, dashboards can be deployed using the GrafanaDashboard custom resource. The Grafana Operator watches for these resources and automatically injects them into the Grafana instance.

If the custom resource deployment is not feasible due to specific cluster restrictions, the JSON definitions for these dashboards can be imported manually via the Grafana UI. For automated deployments, use:

```bash
oc create -f <dashboard_definition_file>.yaml
```
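A dashboard definition file for the operator might look like the sketch below. The dashboard name and the nearly empty JSON body are placeholders; a real dashboard would carry the JSON exported from Grafana. The instanceSelector label is assumed to match the label on the Grafana instance.

```shell
#!/bin/sh
# Render a minimal GrafanaDashboard custom resource with inline JSON.
# Arguments: dashboard name and namespace (hypothetical defaults).
render_dashboard() {
  name="${1:-cluster-overview}"
  ns="${2:-grafana}"
  cat <<EOF
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: ${name}
  namespace: ${ns}
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana   # must match the label on the Grafana instance
  json: |
    {
      "title": "${name}",
      "panels": [],
      "time": { "from": "now-6h", "to": "now" }
    }
EOF
}

# Write the manifest to a file, or pipe it straight into "oc create -f -".
render_dashboard cluster-overview grafana
```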

To verify the deployment and access the interface, retrieve the route for the Grafana instance:

```bash
oc get route -n grafana
```

The output includes the route's host name, typically in the format https://grafana-route-grafana.apps.<yourclusterfqdn>. If the cluster uses self-signed certificates, your browser may show a certificate warning on first access.
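Reachability can also be checked from the command line. The sketch below builds the URL of Grafana's /api/health endpoint from a route host. The helper names are invented for this example, check_grafana_route assumes the first route in the namespace belongs to Grafana, and it is only defined here, not run.

```shell
#!/bin/sh
# Build the health-check URL for a given Grafana route host.
grafana_health_url() {
  printf 'https://%s/api/health\n' "$1"
}

# Look up the first route in the namespace and call the health endpoint.
# Requires a logged-in oc session and curl; call it manually when needed.
check_grafana_route() {
  ns="${1:-grafana}"
  host="$(oc get route -n "$ns" -o jsonpath='{.items[0].spec.host}')"
  # -k tolerates the self-signed certificate mentioned above.
  curl -sk "$(grafana_health_url "$host")"
}

# Show the URL for a hypothetical cluster domain.
grafana_health_url grafana-route-grafana.apps.example.com
```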

Analysis of Observability Scalability

The transition from a read-only, managed Grafana instance to a self-managed, operator-driven architecture represents a fundamental shift in how OpenShift administrators approach cluster observability. This architecture provides three distinct levels of operational maturity:

The first level is the establishment of a secure, authenticated connection to the existing Prometheus metrics. By utilizing the cluster-monitoring-view role, administrators ensure that the custom Grafana instance remains an extension of the existing monitoring stack rather than a competing entity. This maintains the "source of truth" within the standard OpenShift monitoring namespace.

The second level is the implementation of automation via the Grafana Operator and GitOps. By treating dashboards and data sources as code, organizations can replicate monitoring configurations across multiple clusters (e.g., dev, staging, production) with high fidelity. This reduces the risk of configuration drift and allows for rapid disaster recovery.

The third level is the expansion of observability into specialized domains, such as OpenShift Virtualization or complex microservices architectures. The ability to create custom billing dashboards or resource-utilization dashboards for VMs—without needing to modify the underlying cluster's core monitoring configuration—enables teams to build highly specialized, business-aligned observability layers. This is particularly vital in air-gapped environments where the lack of external connectivity prevents the use of cloud-based monitoring services.

Ultimately, the success of a custom Grafana deployment on OpenShift depends on the rigorous application of RBAC and the careful management of service account tokens. While the community-driven nature of the operator introduces a degree of responsibility regarding maintenance and updates, the flexibility and power gained in customizing the cluster's visibility far outweigh the risks for advanced platform engineering teams.