The landscape of modern observability is defined not merely by the ability to detect anomalies, but by the efficiency with which an organization can respond to them. As infrastructure grows in complexity, the gap between an alert firing and a human responder taking action becomes a critical vulnerability. Grafana OnCall, an open-source incident response and on-call management tool, serves as the bridge across this gap. Originally engineered by Amixr and subsequently acquired by Grafana Labs, this technology has evolved from a specialized tool into a core component of the broader Grafana ecosystem. It functions by managing alert routing, executing complex escalation policies, maintaining on-call schedules, and delivering high-priority notifications through a diverse array of communication channels including Slack, telephone calls, SMS, and email. While it possesses the capability to operate as a standalone incident management system, its true power is realized when it is tightly integrated with the Grafana ecosystem, allowing for a unified view of system health and operational response.
Deploying Grafana OnCall using Docker provides a robust, self-hosted alternative to commercial-grade platforms such as PagerDuty or OpsGenie. This approach grants engineering teams complete sovereignty over their incident data and notification logic, which is particularly vital for organizations with strict regulatory requirements or those operating in air-gapped environments. However, the transition from a standard monitoring setup to a fully functional on-call orchestration layer requires a deep understanding of the underlying containerized architecture, service dependencies, and environmental configurations.
Architectural Framework and Component Interconnectivity
The operational integrity of a Grafana OnCall deployment relies on a distributed architecture where multiple specialized services must work in perfect synchronicity. The system is not a single monolithic entity but a collection of interacting layers designed to handle high-throughput alert ingestion and reliable notification delivery.
The core of the deployment is the OnCall Engine. This component is built upon a Django-based API server, serving as the central brain of the operation. It is responsible for processing incoming webhooks, evaluating them against defined escalation policies, and determining the next logical step in the incident lifecycle. This engine interacts directly with several downstream and upstream components:
- Alertmanager: Acts as the primary ingestion point for Prometheus-style alerts.
- Grafana Alerts: Provides direct integration for alerts generated within the Grafana UI.
- Grafana UI: The plugin-based interface that allows users to manage schedules and view active incidents.
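The engine's job of walking an escalation policy can be illustrated with a minimal Python sketch. This is not OnCall's actual data model: `EscalationStep`, `plan_notifications`, and the field names are hypothetical, chosen only to show how "notify, wait, escalate" unfolds over time and stops once an alert is acknowledged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    """One hypothetical step: notify a target, then wait before escalating."""
    notify: str        # illustrative user or schedule name
    channel: str       # e.g. "slack", "sms", "phone"
    wait_minutes: int  # delay before the next step fires


def plan_notifications(steps, acknowledged_after=None):
    """Return (minute, target, channel) tuples for the steps that would fire.

    Escalation stops at the first step whose start time is at or after the
    minute the alert was acknowledged (None means never acknowledged).
    """
    plan = []
    minute = 0
    for step in steps:
        if acknowledged_after is not None and minute >= acknowledged_after:
            break
        plan.append((minute, step.notify, step.channel))
        minute += step.wait_minutes
    return plan


policy = [
    EscalationStep("primary-on-call", "slack", 5),
    EscalationStep("primary-on-call", "phone", 10),
    EscalationStep("secondary-on-call", "phone", 15),
]

# Acknowledged 7 minutes in: the steps at minutes 0 and 5 fire,
# the step scheduled for minute 15 does not.
print(plan_notifications(policy, acknowledged_after=7))
```

With no acknowledgement, all three steps fire; an alert acknowledged immediately produces an empty plan.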
To ensure that the engine remains responsive to new alerts, the architecture utilizes a Celery worker. This worker operates in the background, decoupling the heavy lifting of notification delivery from the API's request-handling cycle. When an escalation policy dictates that a user must be notified via a phone call or SMS, the engine pushes a task to a queue, and the Celery worker picks up this task to execute the external API call. This prevents a delay in one notification channel from bottlenecking the entire system.
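The decoupling described above can be sketched with nothing but the Python standard library. Here an in-process `queue.Queue` stands in for the Redis-backed Celery queue and a thread stands in for the Celery worker; `handle_alert` and `worker` are illustrative names, not OnCall functions.

```python
import queue
import threading

task_queue = queue.Queue()  # stands in for the Redis-backed Celery queue
delivered = []              # records what the worker "sent"


def handle_alert(alert_id, channel):
    """API side: enqueue the notification and return immediately."""
    task_queue.put((alert_id, channel))
    return "accepted"


def worker():
    """Worker side: process tasks until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            task_queue.task_done()
            break
        alert_id, channel = task
        # A real worker would call an external SMS or phone API here.
        delivered.append(f"{channel}:{alert_id}")
        task_queue.task_done()


thread = threading.Thread(target=worker)
thread.start()

status = handle_alert(101, "sms")
handle_alert(102, "phone")

task_queue.put(None)  # tell the worker to stop
task_queue.join()     # wait until every queued item has been processed
thread.join()
print(status, delivered)
```

The request handler never waits on the slow external call: it only pays the cost of a queue insert, which is the property the Celery worker provides in the real stack.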
The persistence and coordination layers are equally critical:
- Redis: Serves a dual purpose as a high-speed caching layer and a message broker for the Celery task queue. It manages the default, critical, long, slack, telegram, and webhook queues, ensuring that high-priority alerts are processed with minimal latency.
- Database (MySQL or PostgreSQL): Provides the relational storage necessary for managing complex on-call rotations, user identities, and historical incident logs. While a SQLite configuration is available for lightweight testing, production-grade deployments necessitate a robust RDBMS like MySQL or PostgreSQL to ensure data durability and ACID compliance.
Essential Deployment Prerequisites and Resource Allocation
Before initiating a Docker-based deployment, the host environment must meet specific technical requirements to avoid runtime failures or performance degradation. The complexity of managing background workers and task queues imposes a non-negligible load on system resources.
The minimum hardware and software specifications are as follows:
- Docker Engine: A functional installation of Docker is mandatory.
- Docker Compose: The orchestration of the multi-container stack requires Docker Compose to manage service dependencies and networking.
- System Memory: Grafana OnCall OSS requires a minimum of 2GB of RAM. Failure to allocate sufficient memory will lead to OOM (Out of Memory) kills of the Celery worker or the Django engine, causing silent failures in incident escalation.
- Verified Environment: Users should verify their installation using the following commands:
bash
docker --version
docker compose version
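Beyond the version checks, the memory requirement can also be verified programmatically. The following is a minimal sketch, assuming a Linux host where `/proc/meminfo` is available; `mem_total_kib` and `has_enough_ram` are hypothetical helper names, and the sample text stands in for the real file contents.

```python
import shutil

MIN_RAM_KIB = 2 * 1024 * 1024  # the 2 GB minimum, expressed in KiB


def mem_total_kib(meminfo_text):
    """Extract MemTotal (in KiB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def has_enough_ram(meminfo_text, minimum_kib=MIN_RAM_KIB):
    """True when the reported total memory meets the minimum."""
    return mem_total_kib(meminfo_text) >= minimum_kib


# Sample input; on a real host, read the text from /proc/meminfo instead.
sample = "MemTotal:        4039284 kB\nMemFree:          123456 kB\n"
print("docker on PATH:", shutil.which("docker") is not None)
print("enough RAM:", has_enough_ram(sample))
```

Note that the kernel reports slightly less than the installed RAM, so a host with exactly 2 GB installed may fail a strict comparison; lowering `minimum_kib` a little is a reasonable adjustment.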
Implementation of the Docker Compose Stack
The deployment process involves creating a structured environment where all services are networked together under a unified configuration. The following steps outline the procedure for setting up the official Grafana OnCall stack.
The first step in the deployment is retrieving the orchestration manifest. This file defines the images, volumes, and networks required for the stack.
bash
curl -fsSL https://raw.githubusercontent.com/grafana/oncall/dev/docker-compose.yml -o docker-compose.yml
Once the docker-compose.yml is present, the environment must be configured using a .env file or a specific configuration file like .env_hobby. This file contains the critical variables that dictate how the engine connects to its database, how it identifies itself to the outside world, and how it secures its internal communications.
A standard configuration for a hobby or local testing environment can be established with the following command:
bash
echo "DOMAIN=http://localhost:8080
SECRET_KEY=my_random_secret_must_be_more_than_32_characters_long
RABBITMQ_PASSWORD=rabbitmq_secret_pw
MYSQL_PASSWORD=mysql_secret_pw
COMPOSE_PROFILES=with_grafana
GRAFANA_USER=admin
GRAFANA_PASSWORD=admin" > .env_hobby
In this configuration, the DOMAIN variable defines the BASE_URL where the OnCall service is reachable, which is essential for webhook callbacks from external services. The SECRET_KEY is a high-entropy string used by the Django framework for cryptographic signing; it must be longer than 32 characters and kept strictly confidential.
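Rather than typing a placeholder, a suitable SECRET_KEY can be generated with Python's standard `secrets` module. This is a minimal sketch: `generate_secret_key` and `is_long_enough` are hypothetical helpers, and the choice of 48 random bytes is an assumption that comfortably clears the length requirement.

```python
import secrets


def generate_secret_key(nbytes=48):
    """Return a URL-safe random string; 48 bytes encode to 64 characters."""
    return secrets.token_urlsafe(nbytes)


def is_long_enough(key, minimum=32):
    """The key must be longer than 32 characters."""
    return len(key) > minimum


key = generate_secret_key()
print(len(key), is_long_enough(key))
```

The generated value can then be pasted into the SECRET_KEY line of the environment file in place of the sample string.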
The docker-compose.yml file itself defines several key services. The oncall-engine service is configured to run a startup script that automates the database migration and setup process before launching the uwsgi server.
yaml
oncall-engine:
  image: grafana/oncall:latest
  restart: unless-stopped
  command: >
    sh -c "python manage.py migrate &&
    python manage.py oncall_setup &&
    uwsgi --ini uwsgi.ini"
  environment: &oncall-env
    DATABASE_TYPE: sqlite3
    REDIS_URI: redis://redis:6379/0
    SECRET_KEY: your-secret_key_change_in_production
    BASE_URL: http://localhost:8080
    GRAFANA_API_URL: http://grafana:3000
    BROKER_TYPE: redis
    CELERY_WORKER_QUEUE: default,critical,long,slack,telegram,webhook
  volumes:
    - oncall-data:/var/lib/oncall
  depends_on:
    - redis
  ports:
    - "8080:8080"
  networks:
    - oncall-net
The oncall-celery service is the execution arm. It uses the same image but runs a different command that starts the Celery worker:
yaml
oncall-celery:
  image: grafana/oncall:latest
  restart: unless-stopped
  command: >
    sh -c "python manage.py migrate --run-syncdb &&
    celery -A engine worker -l info -c 4 -Q default,critical,long,slack,telegram,webhook"
  environment: *oncall-env
  depends_on:
    - redis
  networks:
    - oncall-net
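The queue list appears twice in these definitions: once in CELERY_WORKER_QUEUE and once in the worker's `-Q` argument. If the two drift apart, tasks routed to an unconsumed queue would simply wait. The following sketch shows one way to detect that; `parse_queues` and `missing_queues` are hypothetical helpers, not part of OnCall.

```python
def parse_queues(value):
    """Split a comma-separated queue list, ignoring blanks and whitespace."""
    return [name.strip() for name in value.split(",") if name.strip()]


def missing_queues(expected, worker_arg):
    """Return queues named in `expected` that the worker does not consume."""
    consumed = set(parse_queues(worker_arg))
    return [q for q in parse_queues(expected) if q not in consumed]


env_value = "default,critical,long,slack,telegram,webhook"
worker_q = "default,critical,long,slack,telegram,webhook"

print(missing_queues(env_value, worker_q))            # nothing missing
print(missing_queues(env_value, "default,critical"))  # four queues unconsumed
```

A check like this could run in a pre-deployment script after any edit to the Compose file.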
Advanced Configuration and Production Hardening
Moving from a local development setup to a production-ready environment requires significant changes to the configuration to ensure security, scalability, and reliability.
The following table compares the requirements for a local/hobby deployment versus a production-grade deployment:
| Feature | Hobby/Local Configuration | Production Configuration |
|---|---|---|
| Database Type | sqlite3 | mysql or postgresql |
| Secret Key | Simple string | High-entropy, rotated key |
| Cryptography | Low security | High security (HSM or Vault) |
| Reverse Proxy | None (Direct access) | Nginx/Traefik with HTTPS |
| Persistence | Local Docker Volumes | Managed Cloud DB / Persistent Network Storage |
| Backups | Manual | Automated snapshots for Redis and RDBMS |
Security is a paramount concern when managing incident response. A critical requirement for any production deployment is the implementation of a reverse proxy (such as Nginx, Traefik, or Apache) to terminate TLS/SSL connections. All traffic to the BASE_URL must be encrypted via HTTPS to prevent the interception of sensitive alert data and credentials. Furthermore, the SECRET_KEY must be a cryptographically strong value.
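A simple pre-flight check can catch a BASE_URL that was carried over from a local setup. The sketch below uses only `urllib.parse` from the standard library; `production_url_problems` is a hypothetical helper, and the two rules it applies (HTTPS scheme, no loopback host) are assumptions derived from the requirements above rather than checks OnCall performs itself.

```python
from urllib.parse import urlparse


def production_url_problems(base_url):
    """Return a list of human-readable problems; empty means the URL looks fine."""
    problems = []
    parsed = urlparse(base_url)
    if parsed.scheme != "https":
        problems.append("BASE_URL should use https")
    if parsed.hostname in ("localhost", "127.0.0.1"):
        problems.append("BASE_URL points at a loopback host")
    return problems


print(production_url_problems("http://localhost:8080"))
print(production_url_problems("https://oncall.example.com"))
```

The hostname `oncall.example.com` is a placeholder; substitute the address served by the reverse proxy.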
For those prioritizing security even further, an alternative exists in the form of the RapidFort optimized image. This version of the Grafana OnCall container is specifically hardened to reduce the software attack surface. While the runtime instructions remain identical to the official Grafana release, the RapidFort image undergoes an automated optimization process that removes unnecessary binaries and libraries, thereby minimizing the potential for exploitation through container escapes or lateral movement within the cluster.
```bash
# To use the official developer image:
docker pull grafana/oncall:dev
# To use the RapidFort hardened image:
docker pull rapidfort/oncall
```
Critical Operational Maintenance and Lifecycle Management
Running a long-lived containerized service requires disciplined operational procedures, particularly regarding data integrity and resource cleanup.
The management of the database and Redis state is the most vital aspect of maintenance. Because the engine relies on Redis for task queuing and the database for rotation schedules, any loss of these volumes results in the total loss of the on-call state. Administrators must implement rigorous backup schedules for both the RDBMS and the Redis data.
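A backup schedule also needs a naming and retention scheme. The following is a minimal sketch of one possible approach, not a prescribed procedure: `backup_name` and `to_prune` are hypothetical helpers, the `.bak` suffix is arbitrary, and the actual dump commands for the database and Redis are deliberately left out because they depend on the chosen backend.

```python
from datetime import datetime, timezone


def backup_name(service, now):
    """Build a fixed-width, sortable file name such as mysql-20250311T120000Z.bak."""
    return f"{service}-{now.strftime('%Y%m%dT%H%M%SZ')}.bak"


def to_prune(names, keep):
    """Return the backups to delete, keeping the `keep` newest.

    Names for a single service sort chronologically because the
    timestamp portion is fixed-width.
    """
    ordered = sorted(names)
    return ordered[:-keep] if keep > 0 else ordered


stamp = datetime(2025, 3, 11, 12, 0, 0, tzinfo=timezone.utc)
existing = [
    "mysql-20250309T120000Z.bak",
    "mysql-20250311T120000Z.bak",
    "mysql-20250310T120000Z.bak",
]
print(backup_name("mysql", stamp))
print(to_prune(existing, keep=2))
```

Pruning should only run after the newest backup has been verified as restorable.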
When updates are required or when a complete reconfiguration of the stack is necessary, the cleanup process must be handled with care. If the intention is to remove all associated data along with the containers, the following command is used:
bash
docker compose down -v
The -v flag is particularly significant as it instructs Docker to remove the named volumes associated with the services. In a production context, this command should be used with extreme caution, as it will permanently delete the oncall-data volume, effectively erasing all historical incident data and user configurations.
Analysis of the Transition to Maintenance Mode
It is imperative for engineers to acknowledge the current lifecycle status of the Grafana OnCall (OSS) engine. As of March 11, 2025, the Open Source Software (OSS) version of Grafana OnCall officially entered maintenance mode. This status was further solidified on March 24, 2026, when the repository was archived.
This transition signifies that while the existing Docker-based deployments will continue to function, no new features will be developed, and critical updates may be limited to security patches. For organizations requiring a modern, actively maintained, and fully supported incident response infrastructure, the strategic move is toward Grafana Cloud IRM (Incident Response Management). This managed service offers the same core capabilities—such as Slack integration, automatic escalations, and multi-channel notifications (Phone, SMS, Telegram)—but removes the operational overhead of managing the underlying Docker containers, Redis instances, and database backends.
The decision to remain on a self-hosted, archived OSS version versus migrating to a managed cloud service depends on the organization's specific tolerance for operational toil versus the cost of managed services. However, the architectural principles of alert routing, escalation policies, and decoupled notification delivery remain the foundational pillars of both the archived OSS engine and the modern Grafana Cloud IRM.