Orchestrating Distributed Storage: An Exhaustive Guide to cephadm-ansible

Administering large distributed storage clusters means balancing manual precision against automated scale. Within the Ceph ecosystem, cephadm is the primary tool for deploying and managing a cluster. It handles the lifecycle of the cluster itself, but several operational workflows, from initial host preparation to the cleanup of legacy environments, fall outside the scope of the cephadm binary. This is where cephadm-ansible comes in. It is a collection of Ansible playbooks and modules that wrap cephadm and ceph orch commands, turning them into idempotent, repeatable automation workflows. With Ansible, administrators can apply the same configuration across hundreds of nodes and reduce the risk of manual entry errors, keeping the underlying infrastructure aligned with the storage requirements.

Architectural Overview and Design Philosophy

The cephadm-ansible project is engineered to bridge the gap between the low-level orchestration capabilities of cephadm and the high-level configuration management provided by Ansible. Its primary objective is to simplify workflows that are not inherently covered by the standalone cephadm tool. This is achieved by providing a set of custom Ansible modules that act as a programmatic wrapper around the Ceph orchestrator.

The project has evolved into a sophisticated Ansible Collection, identified as ceph.cephadm. This transition was executed to align with modern Ansible standards, enabling easier distribution via Ansible Galaxy while maintaining a commitment to backward compatibility. For users who prefer a traditional approach, the project remains available as a Git repository that can be cloned and executed directly. This dual-support mechanism ensures that whether a user is operating in a modern CI/CD pipeline using collections or a legacy environment using local playbooks, the functionality remains identical.

The internal structure of the collection is meticulously organized to ensure modularity and maintainability. The plugins/modules/ directory contains the core logic for the custom modules, while plugins/module_utils/ houses shared utilities, such as ceph_common.py, which provide consistent helper functions across different modules. The roles/ directory contains the ceph_defaults role, which manages default variables and settings, ensuring that the environment is consistent across different deployment scenarios.

Deployment and Installation Methodologies

There are three primary methods for integrating cephadm-ansible into a management environment, depending on the administrator's needs for version control and distribution.

The most modern approach is via the Ansible Galaxy installation. This method allows the administrator to pull the collection directly into their Ansible environment, making the modules available globally across their playbooks.

```bash
ansible-galaxy collection install ceph.cephadm
```

Once installed, the modules are accessed using their fully qualified collection name (FQCN), such as ceph.cephadm.cephadm_bootstrap.
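For repeatable installs, for example in a CI job, the collection can also be declared in a requirements file instead of being named on the command line. The fragment below is a minimal sketch; the file name requirements.yml is a common convention rather than something the project mandates, and no version pin is shown because the appropriate version depends on the environment.

```yaml
# requirements.yml (conventional file name, assumed here)
# Declares the collection so every environment installs the same dependency.
collections:
  - name: ceph.cephadm
```

The file is then consumed with `ansible-galaxy collection install -r requirements.yml`.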

Alternatively, administrators can clone the repository directly from GitHub. This method is often preferred by developers or those who need to modify the underlying playbooks or modules for specific edge-case requirements.

```bash
git clone https://github.com/ceph/cephadm-ansible
cd cephadm-ansible
```

When using the cloned repository, playbooks are executed as local files. For example:

```bash
ansible-playbook -i hosts cephadm-preflight.yml
```

For Red Hat Ceph Storage users, the functionality is packaged as the cephadm-ansible package, which is installed on the Ansible administration node. This installation typically places the assets in /usr/share/cephadm-ansible, providing a standardized path for system-wide administration.

Comprehensive Analysis of Available Modules

The core of cephadm-ansible is its suite of custom modules. These modules encapsulate complex cephadm and ceph orch calls, reducing the need for the shell or command modules, which are generally avoided in Ansible due to their lack of idempotency.

The cephadm_bootstrap Module

This module is the entry point for any new Ceph cluster. It automates the initial bootstrapping process, which is the most critical phase of deployment.

Direct Fact: The module bootstraps a Ceph cluster using cephadm.
Technical Layer: It interacts with the cephadm binary to initialize the first monitor and manager, creating the initial cluster configuration and administrative keys.
Impact Layer: This removes the need for manual command-line execution on the bootstrap node, ensuring that the initial cluster state is consistent and reproducible.
Contextual Layer: This module is typically used after the cephadm-preflight playbook has ensured that the host is ready.

Example usage in a playbook:

```yaml
- name: Bootstrap Ceph cluster
  ceph.cephadm.cephadm_bootstrap:
    mon_ip: 192.168.1.10
```

The ceph_orch_host Module

Managing the membership of a cluster is a frequent task. This module provides a clean interface for host manipulation.

Direct Fact: This module is used to add or remove hosts from the cluster and can also apply labels to those hosts.
Technical Layer: It wraps the ceph orch host add and ceph orch host rm commands. Labels let administrators designate roles (e.g., mon, osd, mgr) for specific hardware.
Impact Layer: It allows for the dynamic scaling of the cluster. For instance, adding a new rack of servers becomes a matter of adding them to the Ansible inventory and running the playbook.
Contextual Layer: Host labeling is essential for the ceph_orch_apply module, as service specs often target hosts based on their labels.
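As an illustration of the host workflow described above, the following play is a sketch of adding every inventory host to the cluster with its labels. It assumes the collection is installed (hence the FQCN), that host01 is an existing admin host able to run ceph orch commands, and that a per-host labels variable is defined in the inventory; those names are placeholders, and the parameter names should be checked against the module documentation shipped with the installed version.

```yaml
# Sketch: add each inventory host to the cluster and apply its labels.
# Assumptions: "host01" is an admin host; "labels" is a list variable
# defined per host in the inventory (empty list if undefined).
- name: Add hosts to the cluster
  hosts: all
  become: true
  gather_facts: true
  tasks:
    - name: Add the host and apply its labels
      ceph.cephadm.ceph_orch_host:
        name: "{{ ansible_facts['hostname'] }}"
        address: "{{ ansible_facts['default_ipv4']['address'] }}"
        labels: "{{ labels | default([]) }}"
      # Run the orchestrator call on the admin host, not on the new host.
      delegate_to: host01
```

Delegating to the admin host keeps the task list driven by the inventory while the actual orchestrator calls run where the admin keyring is available.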

The ceph_config Module

Configuration management is a cornerstone of storage stability. The ceph_config module provides a programmatic way to manage the Ceph configuration database.

Direct Fact: This module is used to set or get Ceph configuration options.
Technical Layer: It utilizes the ceph config set and ceph config get commands to modify global or daemon-specific parameters.
Impact Layer: Administrators can apply cluster-wide options (like mon_allow_pool_delete) uniformly across all monitors without logging into each node.
Contextual Layer: This is often used in verification playbooks to confirm that a setting has been successfully applied after a change.

Example of a configuration change playbook:

```yaml
- name: set pool delete
  hosts: host01
  become: true
  gather_facts: false
  tasks:
    - name: set the allow pool delete option
      ceph_config:
        action: set
        who: mon
        option: mon_allow_pool_delete
        value: true

    - name: get the allow pool delete setting
      ceph_config:
        action: get
        who: mon
        option: mon_allow_pool_delete
      register: verify_mon_allow_pool_delete

    - name: print current mon_allow_pool_delete setting
      debug:
        msg: "the value of 'mon_allow_pool_delete' is {{ verify_mon_allow_pool_delete.stdout }}"
```

The ceph_orch_apply Module

The Ceph orchestrator uses service specifications (YAML files) to define how daemons should be deployed.

Direct Fact: This module applies a service spec to the cluster.
Technical Layer: It takes a specification file and feeds it into the ceph orch apply command, allowing the orchestrator to reconcile the current state of the cluster with the desired state.
Impact Layer: This enables "Infrastructure as Code" (IaC) for storage, where the entire daemon layout is version-controlled in Git and applied via Ansible.
Contextual Layer: This is used to deploy OSD services, as seen in the deploy_osd_service.yml playbook.
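A play that applies an OSD service spec might look like the sketch below. It assumes the module accepts the spec inline through a spec parameter, that host01 is an admin host, and that the target hosts carry a label named osds; the label and service_id are illustrative choices, not values prescribed by the project.

```yaml
# Sketch: apply an OSD service spec through the orchestrator.
# Assumptions: "host01" is an admin host; hosts meant to run OSDs
# carry the (hypothetical) label "osds".
- name: Deploy OSD service
  hosts: host01
  become: true
  gather_facts: false
  tasks:
    - name: Apply the OSD spec
      ceph.cephadm.ceph_orch_apply:
        spec: |
          service_type: osd
          service_id: osd
          placement:
            label: osds
          spec:
            data_devices:
              all: true
```

Because the spec lives in the playbook, a change to the daemon layout becomes a reviewed commit rather than an ad hoc command.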

The ceph_orch_daemon Module

Lifecycle management of individual daemons is handled through this module.

Direct Fact: This module is used to start, stop, or restart Ceph daemons.
Technical Layer: It wraps the ceph orch daemon commands, targeting specific daemon IDs across the cluster.
Impact Layer: This is critical for performing rolling restarts or troubleshooting specific failed services without manual intervention on the host.
Contextual Layer: It provides a safety mechanism for administrators to manage daemons across a large fleet from a single administration node.
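The following sketch restarts a single OSD daemon from the administration node. It assumes host01 is an admin host and that the daemon is osd.0; the state, daemon_id, and daemon_type parameter names reflect the module as commonly documented and are worth confirming against the installed version.

```yaml
# Sketch: restart one daemon by type and ID.
# Assumptions: "host01" is an admin host; osd.0 exists in the cluster.
- name: Restart a single daemon
  hosts: host01
  become: true
  gather_facts: false
  tasks:
    - name: Restart osd.0
      ceph.cephadm.ceph_orch_daemon:
        state: restarted
        daemon_id: "0"      # quoted so the ID is passed as a string
        daemon_type: osd
```

For a rolling restart, the same task could be looped over a list of daemon IDs with a pause between iterations.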

The cephadm_registry_login Module

Containerized deployments require access to image registries.

Direct Fact: This module allows the system to log in to a container registry.
Technical Layer: It handles the authentication process for the container engine (Podman or Docker), ensuring that the node has the necessary credentials to pull Ceph images.
Impact Layer: This prevents deployment failures caused by image pull authentication errors in secured environments where private registries are used.
Contextual Layer: This is a prerequisite for the cephadm_bootstrap and ceph_orch_apply modules.
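A registry login task could be sketched as follows. The registry URL is a placeholder, and registry_username and registry_password are hypothetical variables that would normally come from Ansible Vault or another secret store; the module parameter names are assumed to match the installed version.

```yaml
# Sketch: authenticate each host against a private registry.
# Assumptions: registry.example.com is a placeholder URL; the two
# credential variables are supplied securely (e.g. Ansible Vault).
- name: Log in to the container registry
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Log in to the private registry
      ceph.cephadm.cephadm_registry_login:
        state: login
        registry_url: registry.example.com
        registry_username: "{{ registry_username }}"
        registry_password: "{{ registry_password }}"
      no_log: true   # keep credentials out of task output
```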

Operational Playbooks and Workflows

Beyond the modules, cephadm-ansible provides a set of pre-defined playbooks that handle common operational scenarios.

Host Preflight and Initialization

Before a host can join a Ceph cluster, it must meet specific software and configuration requirements. The cephadm-preflight playbook automates this.

Workflow: The preflight playbook installs essential packages including podman, lvm2, and chrony, as well as the cephadm tool itself.
Execution: It can be run against a specific host using the --limit flag.

```bash
ansible-playbook -i hosts cephadm-preflight.yml --extra-vars "ceph_origin=rhcs" --limit host02
```

Impact: This ensures that every node in the cluster has a consistent base software version, preventing "drift" that could lead to unstable cluster behavior.

Cluster Lifecycle and Maintenance

Several other playbooks facilitate the ongoing management of the cluster:

Distribute SSH Key: Automates the copying of the SSH public key to remote hosts, which is a prerequisite for cephadm to manage the nodes.
Client Setup: Configures client hosts to communicate with the cluster, ensuring that the necessary keys and configuration files are present.
Purge: Provides a destructive but necessary workflow to completely remove a Ceph cluster from the hosts, cleaning up containers, configurations, and data.
RocksDB Resharding: Performs resharding for the RocksDB database of a given OSD, which is critical for maintaining performance as the number of objects grows.
Insecure Registry: Adds a specific registry to registries.conf as insecure, which is often required in internal lab environments where SSL certificates are not fully implemented.

Technical Prerequisites and Environment Configuration

To successfully utilize cephadm-ansible, certain environmental conditions must be met on the Ansible administration node and the target storage nodes.

Administration Node Requirements

The administration node is the central point of control. It must have the following:

Ansible Installation: The cephadm-ansible package or the ceph.cephadm collection must be installed.
SSH Access: The Ansible user must have sudo privileges and passwordless SSH access to all nodes in the storage cluster.
Inventory Management: An Ansible inventory file (usually named hosts) must be maintained, containing the IP addresses or hostnames of the cluster and admin hosts.
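The inventory itself can be written in INI or YAML. The YAML sketch below assumes three cluster hosts plus one admin host placed in an admin group; the host names are placeholders. Ansible's YAML inventory plugin by default expects a .yml, .yaml, or .json extension, so this variant would be saved as hosts.yml rather than hosts.

```yaml
# hosts.yml: minimal YAML inventory (host names are placeholders).
all:
  hosts:
    host02:
    host03:
    host04:
  children:
    admin:
      hosts:
        host01:
```

It is passed to playbooks in the usual way, for example `ansible-playbook -i hosts.yml cephadm-preflight.yml`.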

Cluster Host Requirements

The target nodes must be prepared to receive the orchestration commands:

Admin Host Designation: A host is considered an "admin" host when it possesses the admin keyring and the Ceph config file. In the Ceph orchestrator, this is achieved by adding the _admin label to the host. While the bootstrap host is typically the first admin host, additional admin hosts can be added for redundancy.
Container Engine: Since Ceph is deployed via containers, the hosts must have a compatible container engine (Podman is the default for Red Hat Ceph Storage).
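To illustrate the admin host designation, the sketch below adds a second admin host by asking the orchestrator to set the _admin label. It assumes the ceph_orch_host module exposes a set_admin_label parameter, that host01 is the existing admin host, and that host02 and its address are placeholders for the host being promoted.

```yaml
# Sketch: add a host and give it the _admin label for redundancy.
# Assumptions: "host01" is the current admin host; "host02" and its
# address are placeholders for the additional admin host.
- name: Add a second admin host
  hosts: host01
  become: true
  gather_facts: false
  tasks:
    - name: Add host02 with the _admin label
      ceph.cephadm.ceph_orch_host:
        name: host02
        address: 192.168.1.12
        set_admin_label: true
```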

Comparison of Workflow Methods

The following table provides a comparative analysis of the different ways to utilize the cephadm-ansible project.

| Method | Installation Command | Execution Example | Best For |
| --- | --- | --- | --- |
| Ansible Collection | `ansible-galaxy collection install ceph.cephadm` | `ansible-playbook la_playbook.yml` (using FQCN) | Modern CI/CD, Enterprise environments |
| Git Clone | `git clone https://github.com/ceph/cephadm-ansible` | `ansible-playbook -i hosts cephadm-preflight.yml` | Development, Customization, Quick starts |
| RHCS Package | `yum install cephadm-ansible` | `cd /usr/share/cephadm-ansible && ansible-playbook ...` | Red Hat Certified environments |

Detailed Analysis of Project Structure

The cephadm-ansible project is structured to ensure that it can function both as a standalone set of scripts and as a formal Ansible Collection. This is achieved through a clever mapping of directories.

Library and Module Utils: The original repository used a library/ and module_utils/ folder. In the collection format, these are moved to plugins/modules/ and plugins/module_utils/ respectively. This allows Ansible to locate the custom Python code that powers the modules.
Role Management: The ceph_defaults role is relocated to the roles/ directory. This role is critical because it provides the default variables that the playbooks rely on, ensuring a consistent baseline.
Playbook Organization: Playbooks are stored in the playbooks/ directory (in the collection format) or at the root (in the cloned repository format).
Testing and Validation: The tests/ and validate/ directories contain suites used to ensure that the modules behave as expected across different versions of Ceph and Ansible.
Configuration: The ansible.cfg file in the repository explicitly points to the local library and module utilities, which allows the cloned repository to function without the need for a formal Galaxy installation.

Conclusion

The cephadm-ansible framework is a valuable tool for administrators operating a Red Hat Ceph Storage or community Ceph cluster. By wrapping the cephadm and ceph orch command-line interfaces in idempotent Ansible modules, it turns cluster deployment and management from a manual, error-prone task into a repeatable, automated workflow.

The ability to perform "preflight" checks, bootstrap clusters, manage host labels, and tune configurations via code ensures that the infrastructure remains stable and scalable. The transition to a formal Ansible Collection (ceph.cephadm) further enhances its utility, allowing it to fit into modern DevOps pipelines while maintaining a bridge for those who prefer traditional repository-based workflows. Ultimately, the synergy between cephadm's orchestration and Ansible's configuration management provides a robust foundation for managing the lifecycle of distributed storage at scale.