The administration of large-scale distributed storage clusters requires a delicate balance between manual precision and automated scalability. Within the Ceph ecosystem, cephadm serves as the primary tool for deploying and managing the cluster. However, while cephadm is powerful for the lifecycle of the cluster itself, there are numerous operational workflows, ranging from initial host preparation to the cleanup of legacy environments, that fall outside the immediate scope of the cephadm binary. This is where cephadm-ansible becomes critical. It is a specialized collection of Ansible playbooks and modules designed to wrap cephadm and ceph orch commands, transforming them into idempotent, repeatable, and scalable automation workflows. By leveraging Ansible, administrators can manage complex configurations across hundreds of nodes with far less risk of manual entry errors, keeping host configuration consistent with the cluster's storage requirements.
Architectural Overview and Design Philosophy
The cephadm-ansible project is engineered to bridge the gap between the low-level orchestration capabilities of cephadm and the high-level configuration management provided by Ansible. Its primary objective is to simplify workflows that are not inherently covered by the standalone cephadm tool. This is achieved by providing a set of custom Ansible modules that act as a programmatic wrapper around the Ceph orchestrator.
The project has evolved into a sophisticated Ansible Collection, identified as ceph.cephadm. This transition was executed to align with modern Ansible standards, enabling easier distribution via Ansible Galaxy while maintaining a commitment to backward compatibility. For users who prefer a traditional approach, the project remains available as a Git repository that can be cloned and executed directly. This dual-support mechanism ensures that whether a user is operating in a modern CI/CD pipeline using collections or a legacy environment using local playbooks, the functionality remains identical.
The internal structure of the collection is meticulously organized to ensure modularity and maintainability. The plugins/modules/ directory contains the core logic for the custom modules, while plugins/module_utils/ houses shared utilities, such as ceph_common.py, which provide consistent helper functions across different modules. The roles/ directory contains the ceph_defaults role, which manages default variables and settings, ensuring that the environment is consistent across different deployment scenarios.
Deployment and Installation Methodologies
There are three primary methods for integrating cephadm-ansible into a management environment, depending on the administrator's needs for version control and distribution.
The most modern approach is via the Ansible Galaxy installation. This method allows the administrator to pull the collection directly into their Ansible environment, making the modules available globally across their playbooks.
```bash
ansible-galaxy collection install ceph.cephadm
```
Once installed, the modules are accessed using their fully qualified collection name (FQCN), such as ceph.cephadm.cephadm_bootstrap.
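For repeatable environments, the same collection can be declared in a requirements file instead of being installed ad hoc. The following is a minimal sketch: the file name `requirements.yml` is the usual Ansible convention rather than something this project mandates, and no version is pinned because none is named here.

```yaml
# requirements.yml (conventional file name, assumed here)
# Install with: ansible-galaxy collection install -r requirements.yml
collections:
  - name: ceph.cephadm
```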
Alternatively, administrators can clone the repository directly from GitHub. This method is often preferred by developers or those who need to modify the underlying playbooks or modules for specific edge-case requirements.
```bash
git clone https://github.com/ceph/cephadm-ansible
cd cephadm-ansible
```
When using the cloned repository, playbooks are executed as local files. For example:
```bash
ansible-playbook -i hosts cephadm-preflight.yml
```
For Red Hat Ceph Storage users, the functionality is packaged as the cephadm-ansible package, which is installed on the Ansible administration node. This installation typically places the assets in /usr/share/cephadm-ansible, providing a standardized path for system-wide administration.
Comprehensive Analysis of Available Modules
The core of cephadm-ansible is its suite of custom modules. These modules encapsulate complex cephadm and ceph orch calls, reducing the need for the shell or command modules, which are generally avoided in Ansible due to their lack of idempotency.
The cephadm_bootstrap Module
This module is the entry point for any new Ceph cluster. It automates the initial bootstrapping process, which is the most critical phase of deployment.
- Direct Fact: The module bootstraps a Ceph cluster using `cephadm`.
- Technical Layer: It interacts with the `cephadm` binary to initialize the first monitor and manager, creating the initial cluster configuration and administrative keys.
- Impact Layer: This removes the need for manual command-line execution on the bootstrap node, ensuring that the initial cluster state is consistent and reproducible.
- Contextual Layer: This module is typically used after the `cephadm-preflight` playbook has ensured that the host is ready.
Example usage in a playbook:
```yaml
- name: Bootstrap Ceph cluster
  ceph.cephadm.cephadm_bootstrap:
    mon_ip: 192.168.1.10
```
The ceph_orch_host Module
Managing the membership of a cluster is a frequent task. This module provides a clean interface for host manipulation.
- Direct Fact: This module is used to add or remove hosts from the cluster and can also apply labels to those hosts.
- Technical Layer: It wraps the `ceph orch host add` and `ceph orch host rm` commands. The labeling functionality allows administrators to designate specific roles (e.g., `mon`, `osd`, `mgr`) to specific hardware.
- Impact Layer: It allows for the dynamic scaling of the cluster. For instance, adding a new rack of servers becomes a matter of adding them to the Ansible inventory and running the playbook.
- Contextual Layer: Host labeling is essential for the `ceph_orch_apply` module, as service specs often target hosts based on their labels.
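A play that adds every inventory host to the cluster might look like the sketch below. It assumes the collection is installed (with a cloned repository the short name `ceph_orch_host` would be used instead), that `host01` is an existing admin host, and that each inventory host defines a `labels` list variable. Those names are illustrative, not fixed by the project.

```yaml
- name: Add hosts to the cluster
  hosts: all
  become: true
  gather_facts: true   # facts supply the hostname and IP address used below
  tasks:
    - name: Add each inventory host and apply its labels
      ceph.cephadm.ceph_orch_host:
        name: "{{ ansible_facts['hostname'] }}"
        address: "{{ ansible_facts['default_ipv4']['address'] }}"
        labels: "{{ labels }}"   # assumed per-host inventory variable (a list)
      delegate_to: host01        # assumed admin host that holds the admin keyring
```

Delegating to an admin host reflects that orchestrator commands need the admin keyring, which the new hosts do not yet have.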
The ceph_config Module
Configuration management is a cornerstone of storage stability. The ceph_config module provides a programmatic way to manage the Ceph configuration database.
- Direct Fact: This module is used to set or get Ceph configuration options.
- Technical Layer: It utilizes the `ceph config set` and `ceph config get` commands to modify global or daemon-specific parameters.
- Impact Layer: Administrators can ensure that cluster-wide options (such as `mon_allow_pool_delete`) are applied uniformly across all monitors without logging into each node.
- Contextual Layer: This is often used in verification playbooks to confirm that a setting has been successfully applied after a change.
Example of a configuration change playbook:
```yaml
- name: set pool delete
  hosts: host01
  become: true
  gather_facts: false
  tasks:
    - name: set the allow pool delete option
      ceph_config:
        action: set
        who: mon
        option: mon_allow_pool_delete
        value: true

    - name: get the allow pool delete setting
      ceph_config:
        action: get
        who: mon
        option: mon_allow_pool_delete
      register: verify_mon_allow_pool_delete

    - name: print current mon_allow_pool_delete setting
      debug:
        msg: "the value of 'mon_allow_pool_delete' is {{ verify_mon_allow_pool_delete.stdout }}"
```
The ceph_orch_apply Module
The Ceph orchestrator uses service specifications (YAML files) to define how daemons should be deployed.
- Direct Fact: This module applies a service spec to the cluster.
- Technical Layer: It takes a specification file and feeds it into the `ceph orch apply` command, allowing the orchestrator to reconcile the current state of the cluster with the desired state.
- Impact Layer: This enables "Infrastructure as Code" (IaC) for storage, where the entire daemon layout is version-controlled in Git and applied via Ansible.
- Contextual Layer: This is used to deploy OSD services, as seen in the `deploy_osd_service.yml` playbook.
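As an illustration of applying a spec from a playbook, the sketch below passes an OSD service spec inline. It assumes `host01` is an admin host and that the target hosts carry an `osd` label. The `all: true` device filter claims every available disk, so a real spec would likely be narrower.

```yaml
- name: Deploy OSD service
  hosts: host01          # assumed admin host
  become: true
  gather_facts: false
  tasks:
    - name: Apply an OSD service spec
      ceph.cephadm.ceph_orch_apply:
        spec: |
          service_type: osd
          service_id: osd
          placement:
            label: osd          # assumed host label
          spec:
            data_devices:
              all: true         # consumes all available devices on matching hosts
```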
The ceph_orch_daemon Module
Lifecycle management of individual daemons is handled through this module.
- Direct Fact: This module is used to start, stop, or restart Ceph daemons.
- Technical Layer: It wraps the `ceph orch daemon` commands, targeting specific daemon IDs across the cluster.
- Impact Layer: This is critical for performing rolling restarts or troubleshooting specific failed services without manual intervention on the host.
- Contextual Layer: It provides a safety mechanism for administrators to manage daemons across a large fleet from a single administration node.
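A single-daemon restart could be expressed as in the following sketch. The daemon `osd.0` and the admin host `host01` are placeholders chosen for illustration.

```yaml
- name: Restart a single OSD daemon
  hosts: host01          # assumed admin host
  become: true
  gather_facts: false
  tasks:
    - name: Restart osd.0
      ceph.cephadm.ceph_orch_daemon:
        state: restarted
        daemon_type: osd
        daemon_id: "0"   # placeholder daemon ID
```

For a rolling restart, the same task could be repeated over a list of daemon IDs, ideally with a health check between iterations.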
The cephadm_registry_login Module
Containerized deployments require access to image registries.
- Direct Fact: This module allows the system to log in to a container registry.
- Technical Layer: It handles the authentication process for the container engine (Podman or Docker), ensuring that the node has the necessary credentials to pull Ceph images.
- Impact Layer: This prevents deployment failures caused by image pull errors (for example, rejected credentials) in secured environments where private registries are used.
- Contextual Layer: When images come from a registry that requires authentication, this is a prerequisite for the `cephadm_bootstrap` and `ceph_orch_apply` modules.
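A login task placed ahead of bootstrap might look like the sketch below. The registry URL and the two variables are hypothetical. In practice the password would typically come from Ansible Vault rather than plain text.

```yaml
- name: Log in to the registry before bootstrapping
  hosts: host01          # assumed bootstrap host
  become: true
  gather_facts: false
  tasks:
    - name: Log in to the container registry
      ceph.cephadm.cephadm_registry_login:
        state: login
        registry_url: registry.example.com                  # hypothetical registry
        registry_username: "{{ registry_user }}"            # assumed variable
        registry_password: "{{ vault_registry_password }}"  # assumed vaulted variable
```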
Operational Playbooks and Workflows
Beyond the modules, cephadm-ansible provides a set of pre-defined playbooks that handle common operational scenarios.
Host Preflight and Initialization
Before a host can join a Ceph cluster, it must meet specific software and configuration requirements. The cephadm-preflight playbook automates this.
- Workflow: The preflight playbook installs essential packages including `podman`, `lvm2`, and `chrony`, as well as the `cephadm` tool itself.
- Execution: It can be run against a specific host using the `--limit` flag.

```bash
ansible-playbook -i hosts cephadm-preflight.yml --extra-vars "ceph_origin=rhcs" --limit host02
```

- Impact: This ensures that every node in the cluster has a consistent base software version, preventing "drift" that could lead to unstable cluster behavior.
Cluster Lifecycle and Maintenance
Several other playbooks facilitate the ongoing management of the cluster:
- Distribute SSH Key: Automates the copying of the SSH public key to remote hosts, which is a prerequisite for `cephadm` to manage the nodes.
- Client Setup: Configures client hosts to communicate with the cluster, ensuring that the necessary keys and configuration files are present.
- Purge: Provides a destructive but necessary workflow to completely remove a Ceph cluster from the hosts, cleaning up containers, configurations, and data.
- RocksDB Resharding: Performs resharding for the RocksDB database of a given OSD, which is critical for maintaining performance as the number of objects grows.
- Insecure Registry: Adds a specific registry to `registries.conf` as insecure, which is often required in internal lab environments where SSL certificates are not fully implemented.
Technical Prerequisites and Environment Configuration
To successfully utilize cephadm-ansible, certain environmental conditions must be met on the Ansible administration node and the target storage nodes.
Administration Node Requirements
The administration node is the central point of control. It must have the following:
- Ansible Installation: The `cephadm-ansible` package or the `ceph.cephadm` collection must be installed.
- SSH Access: The Ansible user must have sudo privileges and passwordless SSH access to all nodes in the storage cluster.
- Inventory Management: An Ansible inventory file (usually named `hosts`) must be maintained, containing the IP addresses or hostnames of the cluster and admin hosts.
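An inventory can be written in INI or YAML. The sketch below uses Ansible's YAML inventory format. The file name, host names, the `storage` group, and the per-host `labels` variable are all hypothetical. Only the idea of separating admin hosts from the rest of the cluster comes from the requirements above.

```yaml
# hosts.yml (hypothetical file name; pass it with: ansible-playbook -i hosts.yml ...)
all:
  children:
    admin:
      hosts:
        host01:
    storage:              # hypothetical group for the remaining cluster hosts
      hosts:
        host02:
          labels: ["mon", "mgr"]   # assumed variable a host-management task could consume
        host03:
          labels: ["osd"]
```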
Cluster Host Requirements
The target nodes must be prepared to receive the orchestration commands:
- Admin Host Designation: A host is considered an "admin" host when it possesses the admin keyring and the Ceph config file. In the Ceph orchestrator, this is achieved by adding the `_admin` label to the host. While the bootstrap host is typically the first admin host, additional admin hosts can be added for redundancy.
- Container Engine: Since Ceph is deployed via containers, the hosts must have a compatible container engine (Podman is the default for Red Hat Ceph Storage).
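Designating a second admin host from a playbook could be sketched as follows. The host name and address are placeholders, and the example assumes the host-management module's `set_admin_label` option is available in the installed version.

```yaml
- name: Designate an additional admin host
  hosts: host01          # assumed existing admin host
  become: true
  gather_facts: false
  tasks:
    - name: Add host02 with the _admin label
      ceph.cephadm.ceph_orch_host:
        name: host02
        address: 192.168.1.12    # hypothetical address
        set_admin_label: true    # asks the orchestrator to apply the _admin label
```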
Comparison of Workflow Methods
The following table provides a comparative analysis of the different ways to utilize the cephadm-ansible project.
| Method | Installation Command | Execution Example | Best For |
|---|---|---|---|
| Ansible Collection | `ansible-galaxy collection install ceph.cephadm` | `ansible-playbook la_playbook.yml` (using FQCN) | Modern CI/CD, Enterprise environments |
| Git Clone | `git clone https://github.com/ceph/cephadm-ansible` | `ansible-playbook -i hosts cephadm-preflight.yml` | Development, Customization, Quick starts |
| RHCS Package | `yum install cephadm-ansible` | `cd /usr/share/cephadm-ansible && ansible-playbook ...` | Red Hat Certified environments |
Detailed Analysis of Project Structure
The cephadm-ansible project is structured to ensure that it can function both as a standalone set of scripts and as a formal Ansible Collection. This is achieved through a clever mapping of directories.
- Library and Module Utils: The original repository used a `library/` and `module_utils/` folder. In the collection format, these are moved to `plugins/modules/` and `plugins/module_utils/` respectively. This allows Ansible to locate the custom Python code that powers the modules.
- Role Management: The `ceph_defaults` role is relocated to the `roles/` directory. This role is critical because it provides the default variables that the playbooks rely on, ensuring a consistent baseline.
- Playbook Organization: Playbooks are stored in the `playbooks/` directory (in the collection format) or at the root (in the cloned repository format).
- Testing and Validation: The `tests/` and `validate/` directories contain suites used to ensure that the modules behave as expected across different versions of Ceph and Ansible.
- Configuration: The `ansible.cfg` file in the repository explicitly points to the local library and module utilities, which allows the cloned repository to function without the need for a formal Galaxy installation.
Conclusion
The cephadm-ansible framework is an indispensable tool for any administrator operating a Red Hat Ceph Storage or community Ceph cluster. By abstracting the complexities of the cephadm and ceph orch command-line interfaces into idempotent Ansible modules, it transforms the process of cluster deployment and management from a manual, error-prone task into a streamlined, automated workflow.
The ability to perform "preflight" checks, bootstrap clusters, manage host labels, and tune configurations via code ensures that the infrastructure remains stable and scalable. The transition to a formal Ansible Collection (ceph.cephadm) further enhances its utility, allowing it to fit into modern DevOps pipelines while maintaining a bridge for those who prefer traditional repository-based workflows. Ultimately, the synergy between cephadm's orchestration and Ansible's configuration management provides a robust foundation for managing the lifecycle of distributed storage at scale.