Mastering Checkmk Automation with Ansible: From Official Collections to Custom API Deployments

The integration of Checkmk and Ansible represents a critical intersection of enterprise monitoring and infrastructure-as-code (IaC). By leveraging the automation capabilities of Ansible, organizations can transition from manual agent installation and server configuration to a scalable, repeatable, and verifiable deployment pipeline. This convergence is primarily achieved through two distinct pathways: the utilization of the official Checkmk Ansible collection and the implementation of custom playbooks that interact directly with the Checkmk REST API. Whether the goal is to bootstrap a new monitoring environment or maintain thousands of agents across a heterogeneous fleet of Linux distributions, understanding the technical nuances of these methods is essential for ensuring monitoring consistency and minimizing technical debt.

The Architecture of the Checkmk Ansible Collection

The Checkmk Ansible collection is a comprehensive suite of automation tools designed specifically to bridge the gap between the Checkmk monitoring core and the managed nodes in a network. In the Ansible ecosystem, a collection serves as a distribution mechanism for content. In the case of Checkmk, that content is divided into two primary functional components: modules and roles.

Modules are the smallest atomic units of Ansible's execution. They are designed to perform a specific action on a target system. The Checkmk collection provides specialized modules that interact with the Checkmk REST API. This allows administrators to automate the creation of hosts, the adjustment of monitoring rules, and the general orchestration of the monitoring site without manually navigating the Checkmk GUI.
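As a rough illustration of what such a module call looks like, the following sketch registers one host through the collection's host module. The server URL, site name, host name, IP address, and the checkmk_automation_secret variable are placeholders invented for this example, and the parameter names should be checked against the documentation of the installed collection version.

```yaml
# Sketch only: every value below is a placeholder, not taken from a real setup.
- name: Register a host in Checkmk through the collection's host module
  checkmk.general.host:
    server_url: "https://monitor.example.com/"
    site: "mysite"
    automation_user: "automation"
    automation_secret: "{{ checkmk_automation_secret }}"  # hypothetical variable, ideally vaulted
    name: "web01.example.com"
    folder: "/"
    attributes:
      ipaddress: "192.0.2.10"
    state: present
  delegate_to: localhost  # the API call runs on the controller, not on the monitored host
```

A change made this way generally still has to be activated on the Checkmk site before it takes effect.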

Roles, conversely, provide a higher layer of abstraction. While modules perform single tasks, roles encapsulate a series of tasks, variables, and handlers to achieve a complex end-state. The Checkmk roles are specifically engineered to handle the deployment and configuration of the Checkmk server itself and the automated rollout of monitoring agents to managed endpoints.

The distribution of this collection is managed via Ansible Galaxy, the public hub for Ansible content. Galaxy provides the authoritative source for the latest releases, installation instructions, and version history. For developers and power users, the collection is hosted on GitHub, which serves as the primary development repository. This open-source approach allows for a community-driven feedback loop where users can report bugs through the Issue Tracker or contribute new features and optimizations via pull requests.
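A common way to pull the collection from Galaxy is a requirements file. The sketch below assumes the collection name checkmk.general and leaves the version unpinned; pinning a version is usually advisable but the right value depends on the environment.

```yaml
# requirements.yml
# Install with: ansible-galaxy collection install -r requirements.yml
collections:
  - name: checkmk.general
```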

Technical Implementation of Manual Agent Deployment

In certain organizational environments, constraints may prevent the use of the official Checkmk collection. These constraints often include strict security policies regarding third-party collections, the use of the Checkmk RAW edition, or specific requirements for minimal dependency footprints. In these scenarios, administrators must rely on "the Ansible way" using core modules to achieve agent deployment.

A manual deployment strategy typically revolves around the use of the ansible.builtin module suite. The process involves several critical phases: environment preparation, version discovery, package acquisition, and system-specific configuration.

The preparation phase involves ensuring that the target host has a designated landing zone for installation packages. This is often achieved using the ansible.builtin.file module to create a directory, such as /usr/local/install/packages, owned by root:root with mode 0775 so that the installation process has the access it needs. Note that 0775 is group-writable; environments with stricter policies may prefer a tighter mode such as 0755.
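A minimal sketch of that preparation task, using the path, ownership, and mode named above. The use of become is an assumption about how privileges are obtained on the target.

```yaml
- name: Create the landing directory for installation packages
  ansible.builtin.file:
    path: /usr/local/install/packages
    state: directory
    owner: root
    group: root
    mode: "0775"  # quoted so YAML does not read it as an integer
  become: true    # assumption: privilege escalation is needed and permitted
```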

Version discovery is a pivotal step to prevent the installation of outdated or incompatible agents. This can be achieved through two primary methods. The first is to run the omd versions command on the monitoring host via the ansible.builtin.shell module, delegating the task to the monitor server. The second, more modern approach queries the Checkmk REST API using the ansible.builtin.uri module. By calling the site's /check_mk/api/1.0/version endpoint with an Authorization header of the form Bearer <automation user> <secret>, Ansible can programmatically determine the exact version string of the running Checkmk instance.
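The REST API variant could look roughly like the task below. The variables monitor_host, monitor_site, automation_user, and automation_secret are assumed to be defined in inventory or a vault, and the URL assumes the site is served under its own path on the monitoring host.

```yaml
- name: Query the Checkmk REST API for the server version
  ansible.builtin.uri:
    url: "https://{{ monitor_host }}/{{ monitor_site }}/check_mk/api/1.0/version"
    method: GET
    headers:
      # Checkmk expects the user name and the secret, separated by a space
      Authorization: "Bearer {{ automation_user }} {{ automation_secret }}"
      Accept: application/json
    status_code: 200
  register: cmk_version   # the parsed body is then available as cmk_version.json
  delegate_to: localhost  # assumption: the controller can reach the monitoring server
  run_once: true          # one query is enough for the whole play
```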

Comparative Analysis of Deployment Methods

The choice between using the official collection and custom API-driven playbooks depends on the environment's constraints and the desired level of abstraction.

Feature               Official Ansible Collection       Custom API-Driven Playbook
Primary Interface     Specialized modules and roles     ansible.builtin modules (uri, get_url)
Management            Ansible Galaxy / GitHub           Local YAML playbooks
REST API Interaction  Abstracted through modules        Explicit via ansible.builtin.uri
Flexibility           Standardized for most users       High; tailored to specific OS versions
Maintenance           Maintained by Checkmk             Maintained by local administrator
Dependency            Requires collection installation  Requires only core Ansible modules

Deep Dive into Multi-Platform Agent Installation

Deploying Checkmk agents across a mixed environment requires logic that can differentiate between package managers. A robust playbook must handle APT (Debian/Ubuntu), YUM (CentOS/RHEL), and DNF (Fedora/Rocky/AlmaLinux) based systems.

For APT-based systems, the workflow begins with the ansible.builtin.get_url module, which fetches the .deb package from the Checkmk server's agent directory. The URL structure typically follows a pattern such as https://{{ monitor_host }}/{{ monitor_site }}/check_mk/agents/check-mk-agent_{{ version }}-1_all.deb. Once downloaded, the ansible.builtin.apt module is used to install the package. A critical parameter here is allow_downgrade: yes, which ensures that the agent can be updated even if the versioning logic triggers a perceived downgrade.
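The two APT steps might be sketched as follows. The variables monitor_host, monitor_site, and version follow the URL pattern above, download_path is an assumed variable pointing at the landing directory, and allow_downgrade requires a reasonably recent ansible-core release.

```yaml
- name: Download the agent .deb from the monitoring server
  ansible.builtin.get_url:
    url: "https://{{ monitor_host }}/{{ monitor_site }}/check_mk/agents/check-mk-agent_{{ version }}-1_all.deb"
    dest: "{{ download_path }}/check-mk-agent_{{ version }}-1_all.deb"
    mode: "0644"
  when: ansible_facts['pkg_mgr'] == 'apt'

- name: Install the downloaded agent package
  ansible.builtin.apt:
    deb: "{{ download_path }}/check-mk-agent_{{ version }}-1_all.deb"
    allow_downgrade: true
  become: true  # assumption: privilege escalation is needed and permitted
  when: ansible_facts['pkg_mgr'] == 'apt'
```

Guarding both tasks with the pkg_mgr fact lets the same play run against a mixed inventory without failing on non-APT hosts.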

For YUM and DNF-based systems, the process is similar but utilizes the .rpm package. The ansible.builtin.yum or ansible.builtin.dnf modules are employed to install the .noarch.rpm file. In these cases, disable_gpg_check: yes is often necessary because the package served by the internal monitoring server is not signed with a GPG key that all managed nodes trust.
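A comparable sketch for DNF-based hosts is shown below. The RPM filename pattern is an assumption modelled on the .deb pattern and should be checked against the files actually present in the server's agents directory; download_path is again an assumed variable.

```yaml
- name: Download the agent .rpm from the monitoring server
  ansible.builtin.get_url:
    # Filename pattern is an assumption; verify it on the monitoring server
    url: "https://{{ monitor_host }}/{{ monitor_site }}/check_mk/agents/check-mk-agent-{{ version }}-1.noarch.rpm"
    dest: "{{ download_path }}/check-mk-agent-{{ version }}-1.noarch.rpm"
    mode: "0644"
  when: ansible_facts['pkg_mgr'] == 'dnf'

- name: Install the downloaded agent package
  ansible.builtin.dnf:
    name: "{{ download_path }}/check-mk-agent-{{ version }}-1.noarch.rpm"
    state: present
    disable_gpg_check: true  # the locally served package is not signed with a trusted key
    allow_downgrade: true
  become: true
  when: ansible_facts['pkg_mgr'] == 'dnf'
```

On hosts where the pkg_mgr fact reports yum, the same parameters can be passed to ansible.builtin.yum instead.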

The technical challenge of agent deployment is not merely the installation of the package, but the configuration of the agent's behavior. For instance, on APT-based systems, the mk_apt plugin is often used to monitor pending package updates. To tune this, administrators use the ansible.builtin.lineinfile module to modify the mk_apt script, changing the UPGRADE variable to dist-upgrade (e.g., UPGRADE=dist-upgrade). With this setting the plugin reports the packages that a distribution upgrade would touch, rather than only those covered by a standard package upgrade.
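One way to express that edit is sketched below. The plugin path is an assumption (the plugin may live in a cache-interval subdirectory on some hosts), and backrefs is used so that the file is left untouched when no UPGRADE= line exists, instead of appending a line at the end of the script.

```yaml
- name: Make mk_apt evaluate dist-upgrade instead of upgrade
  ansible.builtin.lineinfile:
    path: /usr/lib/check_mk_agent/plugins/mk_apt  # assumed location of the plugin
    regexp: '^UPGRADE='
    line: 'UPGRADE=dist-upgrade'
    backrefs: true  # only replace an existing line; never append a new one
  become: true
  when: ansible_facts['pkg_mgr'] == 'apt'
```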

Advanced Troubleshooting and Edge Case Management

Automating agent deployment frequently reveals underlying system incompatibilities, particularly with legacy operating systems.

One notable failure point is SLES (SUSE Linux Enterprise Server) 12.x. Reports indicate that playbooks utilizing the zypper module or specific agent paths may fail on these legacy hosts, requiring exclusion from the general automation group or a specialized set of tasks tailored to the older kernel and package manager behavior.
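Excluding such hosts can be done in the inventory or, as sketched here, at the start of the play. The fact values compared below are assumptions about what Ansible reports on SLES 12.x and should be confirmed with a fact-gathering run against one of the affected hosts.

```yaml
- name: Skip the remaining tasks on SLES 12.x hosts
  ansible.builtin.meta: end_host
  when:
    # Assumed fact values; confirm with "ansible <host> -m ansible.builtin.setup"
    - ansible_facts['distribution'] == 'SLES'
    - ansible_facts['distribution_major_version'] == '12'
```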

Another critical failure involves the interaction between Ansible and the underlying Python environment on the controller. A known issue occurs when the Ansible controller runs on Rocky Linux (e.g., version 9.6) using a version of Ansible from the EPEL repository (such as version 2.14.18). In some instances, an error manifests as HTTPSConnection.__init__() got an unexpected keyword argument 'cert_file'. The message is typical of an Ansible release running under a Python interpreter newer than it expects: Python 3.12 removed the cert_file argument from http.client.HTTPSConnection, while older Ansible releases still pass it. The reported mitigation is to shift agent retrieval from direct file system access to the Checkmk REST API via the ansible.builtin.uri and ansible.builtin.get_url modules.

Furthermore, when automating the agent, it is often necessary to disable unwanted services to prevent conflicts or unnecessary resource consumption. The ansible.builtin.systemd module can be used to stop and disable the cmk-agent-ctl-daemon, ensuring that the agent operates in the desired mode (e.g., purely via SSH pull) without an interfering local daemon.
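A minimal sketch of that step follows. It assumes the unit is installed under the name cmk-agent-ctl-daemon.service, which holds only for agent versions that ship the agent controller.

```yaml
- name: Stop and disable the agent controller daemon
  ansible.builtin.systemd:
    name: cmk-agent-ctl-daemon.service  # assumed unit name
    state: stopped
    enabled: false
  become: true
```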

The Role of the REST API in Automation

The transition from simple shell scripts to API-driven automation represents a significant leap in reliability. The Checkmk REST API allows Ansible to treat the monitoring server as a programmable entity.

By using an automation user, a specialized account that authenticates with a secret, Ansible can send an Authorization header of the form Bearer <user> <secret>. This allows the playbook to:

  • Dynamically determine the current server version.
  • Validate the existence of the target site.
  • Programmatically fetch the correct agent package for the host's package format (the .deb and .rpm agent packages are architecture-independent).
  • Update existing plugins by scanning the current installation and comparing it with the available versions on the server.

The ansible.builtin.uri module returns the JSON response as structured data, which can be processed with Jinja2 expressions. For example, the version string can be extracted from the response and sliced (e.g., cmk_version.json.versions.checkmk[:-4]) to drop its last four characters, in practice a trailing edition suffix, so that the result matches the version used in the filename of the agent package stored on the server.
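The slicing can be tried in isolation with a small play that substitutes a hard-coded sample for the registered API result. The sample version string and the fact name cmk_agent_version are made up for this illustration.

```yaml
- name: Demonstrate trimming the version string
  hosts: localhost
  gather_facts: false
  vars:
    # Stand-in for the result registered by the uri task; the value is a sample
    cmk_version:
      json:
        versions:
          checkmk: "2.3.0p10.cre"
  tasks:
    - name: Drop the last four characters of the version string
      ansible.builtin.set_fact:
        cmk_agent_version: "{{ cmk_version.json.versions.checkmk[:-4] }}"

    - name: Show the trimmed version (2.3.0p10 for the sample above)
      ansible.builtin.debug:
        var: cmk_agent_version
```

A fixed-length slice is fragile if the suffix ever changes length; a regex_replace on the trailing suffix would be a more defensive alternative.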

Implementation Workflow for Custom Agent Updates

For those avoiding the official collection, the following technical workflow is recommended for updating agents and plugins:

  1. Directory Sanitization: Use ansible.builtin.file with state: absent on the download_path to remove stale packages.
  2. Directory Reconstruction: Use ansible.builtin.file with state: directory to ensure a clean environment.
  3. Version Query: Use ansible.builtin.uri to fetch the current version from the API.
  4. Targeted Download: Use ansible.builtin.get_url to pull the appropriate .deb or .rpm based on the ansible_pkg_mgr fact.
  5. Force Installation: Use the respective package manager module with allow_downgrade enabled.
  6. Plugin Synchronization: Identify existing plugins in /usr/lib/check_mk_agent/plugins and update them using ansible.builtin.get_url by iterating through the found files.
  7. Post-Install Configuration: Apply specific line edits to plugin scripts (like mk_apt) to align with organizational standards.
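Steps 1, 2, and 6 of this workflow could be sketched as below. The variables download_path, monitor_host, and monitor_site are assumed to be defined elsewhere, and the plugin download URL is an assumption about where the server publishes its plugin files. Plugins whose names do not exist on the server would make the download task fail, so some error handling is likely needed in practice.

```yaml
- name: Remove stale packages
  ansible.builtin.file:
    path: "{{ download_path }}"
    state: absent

- name: Recreate the download directory
  ansible.builtin.file:
    path: "{{ download_path }}"
    state: directory
    owner: root
    group: root
    mode: "0775"

- name: Find the plugins currently installed on the host
  ansible.builtin.find:
    paths: /usr/lib/check_mk_agent/plugins
    file_type: file
  register: installed_plugins

- name: Refresh each installed plugin from the monitoring server
  ansible.builtin.get_url:
    # Assumed URL layout; verify against the monitoring server
    url: "https://{{ monitor_host }}/{{ monitor_site }}/check_mk/agents/plugins/{{ item.path | basename }}"
    dest: "{{ item.path }}"
    mode: "0755"
    force: true  # overwrite the existing file even though it is already present
  loop: "{{ installed_plugins.files }}"
  loop_control:
    label: "{{ item.path | basename }}"
```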

Conclusion: Strategic Analysis of Automation Paths

The decision to use the official Checkmk Ansible collection versus a custom API-driven playbook is a trade-off between convenience and control. The official collection is the superior choice for the vast majority of users, as it abstracts the complexity of the REST API into reusable roles and modules, significantly reducing the amount of YAML code required to maintain a monitoring environment. It provides a supported path for server setup and agent deployment, backed by the Checkmk development team and a community of contributors.

However, the custom API-driven approach is an essential survival strategy for environments with extreme constraints. The ability to deploy agents using only ansible.builtin modules ensures that the automation is portable and independent of external collection dependencies. This method is particularly effective for the Checkmk RAW edition, where the overhead of a full collection might be unnecessary. The primary risk of the custom approach is the maintenance burden; the administrator becomes responsible for handling OS-specific quirks, such as the SLES 12.x incompatibilities or the Rocky Linux HTTPS connection errors.

Ultimately, the most resilient monitoring infrastructure is one that leverages the REST API. Whether that interaction happens through a high-level module in a collection or a low-level uri call in a playbook, the shift toward API-centric management helps keep the monitoring state synchronized with the actual state of the infrastructure.

