Introduction
In modern infrastructure management, system uptime is more than a headline metric: it is an indicator of operational stability, security posture, and service reliability. For DevOps engineers, system administrators, and cloud architects, visibility into how long each server has been running matters in both directions. Very long uptime usually means that kernel and library updates requiring a reboot have not been applied, leaving known vulnerabilities unpatched, while frequent, unexplained reboots may signal hardware failures, software crashes, or unauthorized administrative access. Ansible turns these manual checks into scalable, repeatable automation. This guide details how to extract, categorize, report on, and act upon server uptime data using Ansible's command execution, fact gathering, and Jinja2 templating. It covers the arithmetic for time conversion, the conditional logic for status categorization, historical tracking logs, and integration with external monitoring tools such as Uptime Kuma. It is intended as a reference for teams standardizing their uptime audits and fleet health checks.
The Foundation: Ad-Hoc Checks and Basic Shell Commands
The fastest way to gather raw uptime data across a fleet is an Ansible ad-hoc command. Ad-hoc commands run a single module against remote nodes without the need for a full playbook. The basic uptime check uses the command module.
An ad-hoc check targets a specific group or the entire inventory. Here the ansible executable runs the command module with the -o (--one-line) flag, which condenses each host's output to a single line for quick scanning.
```bash
# Check uptime across all servers
ansible all -m command -a "uptime" -o
```
For environments where structured data is preferred over raw text output, the setup module provides a JSON-formatted return of system facts. By filtering specifically for the ansible_uptime_seconds variable, administrators can retrieve the raw integer value representing the total seconds the system has been running since its last boot.
```bash
# Using setup module for structured data
ansible all -m setup -a "filter=ansible_uptime_seconds" -o
```
Beyond these basic commands, specific system metadata is essential for comprehensive reporting. The who -b command reports the date and time of the most recent system boot, while uname -r prints the running kernel version. Both are run through the ansible.builtin.command module, and changed_when: false keeps these read-only checks from being reported as "changed" in the Ansible run output.
```yaml
# Get the last boot time
- name: Get last boot time
  ansible.builtin.command:
    cmd: "who -b"
  register: last_boot
  changed_when: false
```
```yaml
# Get current running kernel
- name: Get running kernel version
  ansible.builtin.command:
    cmd: uname -r
  register: kernel_version
  changed_when: false
```
The technical rationale behind setting changed_when: false is to ensure that these read-only operations do not interfere with the idempotency tracking of the Ansible run, clearly signaling that no modifications were made to the target infrastructure.
Advanced Metrics and Mathematical Calculations
Raw uptime data provided by the ansible_uptime_seconds fact is an integer representing total seconds. To render this data human-readable and operationally useful, precise mathematical conversion is required using Jinja2 filters. The calculation logic relies on standard time-division constants. Converting seconds to days involves dividing by 86,400, the number of seconds in a 24-hour period.
```jinja2
{{ (ansible_uptime_seconds | int / 86400) | round(1) }}
```
This formula utilizes the int filter to cast the value to an integer, performs the division, and then applies the round(1) filter to ensure the result is rounded to one decimal place, providing a precise measurement of system longevity. This same mathematical framework applies to calculating uptime in minutes (dividing by 60) and hours (dividing by 3,600).
```yaml
- name: Calculate uptime metrics
  ansible.builtin.set_fact:
    uptime_seconds: "{{ ansible_uptime_seconds }}"
    uptime_minutes: "{{ (ansible_uptime_seconds | int / 60) | round(1) }}"
    uptime_hours: "{{ (ansible_uptime_seconds | int / 3600) | round(1) }}"
    uptime_days: "{{ (ansible_uptime_seconds | int / 86400) | round(1) }}"
```
The operational impact of these calculations is significant. By establishing standardized units of measurement, administrators can apply uniform threshold logic across the entire fleet, allowing for automated categorization of server health.
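As a quick check of the arithmetic, the sketch below applies the same three conversions to a sample value of 1,000,000 seconds and asserts the rounded results. The play name and the sample_uptime_seconds variable are hypothetical and exist only for illustration; nothing here reads a real host's facts.

```yaml
# Hypothetical worked example: runs on the control node only.
- name: Worked example of the uptime conversions
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    sample_uptime_seconds: 1000000  # illustrative value, not a real fact
  tasks:
    - name: Check the rounded conversions for the sample value
      ansible.builtin.assert:
        that:
          # 1000000 / 60 = 16666.66... minutes
          - (sample_uptime_seconds | int / 60) | round(1) == 16666.7
          # 1000000 / 3600 = 277.77... hours
          - (sample_uptime_seconds | int / 3600) | round(1) == 277.8
          # 1000000 / 86400 = 11.57... days
          - (sample_uptime_seconds | int / 86400) | round(1) == 11.6
```

The round(1) filter returns a float, so the fractional part is preserved. Depending on the Ansible version and settings, set_fact may store the result as a string, which is why the later tasks cast the values with the float filter before comparing them.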
Categorization Logic and Uptime Thresholds
Once the uptime is calculated in days, the system must be categorized based on specific operational thresholds. This process uses Ansible facts and Jinja2 conditionals to assign a status label to each host. The categorization relies on defining three distinct thresholds: critical, warning, and recent reboot.
The primary threshold for a critical status is set at 180 days. A system running continuously for six months is highly likely to contain unpatched security vulnerabilities. The warning threshold is set at 90 days, signaling that the system is approaching a maintenance window. Finally, a check for recent reboots looks for uptime under 24 hours, triggering a notice for potential unexpected restarts.
```yaml
- name: Categorize uptime status
  ansible.builtin.set_fact:
    uptime_status: >-
      {% if uptime_days | float > critical_uptime_days %}CRITICAL - Uptime exceeds {{ critical_uptime_days }} days
      {% elif uptime_days | float > warning_uptime_days %}WARNING - Uptime exceeds {{ warning_uptime_days }} days
      {% elif uptime_hours | float < recent_reboot_hours %}NOTICE - Recently rebooted
      {% else %}OK{% endif %}
```
This categorization logic follows a strict flowchart path. First, the system checks if the uptime exceeds 180 days. If yes, it flags the server as CRITICAL, noting that it is likely unpatched. If no, it checks if the uptime exceeds 90 days, flagging it as WARNING, indicating a need for maintenance. If neither is true, it checks if the uptime is less than 24 hours, flagging it as NOTICE, indicating a recent reboot. If none of these conditions are met, the system is marked as OK, signifying a normal operational state.
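The categorization task references three variables that have to be defined somewhere, for example as play variables or in group_vars. The following is a minimal sketch using the threshold values stated above. The play name is a hypothetical placeholder, and the variable names are taken from the task.

```yaml
- name: Audit fleet uptime  # hypothetical play name
  hosts: all
  gather_facts: true  # needed for ansible_uptime_seconds
  vars:
    critical_uptime_days: 180  # CRITICAL above roughly six months of uptime
    warning_uptime_days: 90    # WARNING above roughly three months of uptime
    recent_reboot_hours: 24    # NOTICE when rebooted within the last day
  tasks:
    - name: Show the thresholds in effect
      ansible.builtin.debug:
        msg: >-
          critical above {{ critical_uptime_days }} days,
          warning above {{ warning_uptime_days }} days,
          recent reboot below {{ recent_reboot_hours }} hours
```

Keeping the thresholds in variables makes them overridable per environment, for instance with stricter values in a production group_vars file.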
Comprehensive Fleet Reporting and CSV Generation
To translate raw uptime data into actionable intelligence, Ansible playbooks can aggregate the gathered facts into structured reports. A standard report entry for a single host comprises the hostname, IP address, operating system distribution and version, kernel version, calculated uptime in days and hours, the exact last boot timestamp, and the categorized uptime status.
```yaml
- name: Build host report
  ansible.builtin.set_fact:
    host_report:
      hostname: "{{ inventory_hostname }}"
      ip: "{{ ansible_host | default(ansible_default_ipv4.address) }}"
      os: "{{ ansible_distribution }} {{ ansible_distribution_version }}"
      kernel: "{{ kernel_version.stdout }}"
      uptime_days: "{{ uptime_days }}"
      uptime_hours: "{{ uptime_hours }}"
      last_boot: "{{ last_boot.stdout | trim }}"
      status: "{{ uptime_status | trim }}"
```
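For fleet-wide visibility, a consolidated CSV report is generated. The report is written to /tmp with the current date in its filename, so each day's audit is kept as a separate file. The content is built with a Jinja2 for loop that iterates over every host in the inventory's all group and skips hosts for which uptime_days was never set. As written, the task runs on every targeted host, so each one receives its own copy of the report; adding run_once: true and delegate_to: localhost is a common way to write a single copy on the control node instead.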
```yaml
- name: Generate CSV uptime report
  ansible.builtin.copy:
    dest: "/tmp/uptime-report-{{ ansible_date_time.date }}.csv"
    mode: '0644'
    content: |
      hostname,ip,os,kernel,uptime_days,last_boot,status
      {% for host in groups['all'] %}
      {% if hostvars[host]['uptime_days'] is defined %}
      {{ host }},{{ hostvars[host]['ansible_host'] | default('') }},{{ hostvars[host]['ansible_distribution'] | default('') }} {{ hostvars[host]['ansible_distribution_version'] | default('') }},{{ hostvars[host]['kernel_version']['stdout'] | default('') }},{{ hostvars[host]['uptime_days'] }},{{ hostvars[host]['last_boot']['stdout'] | default('') | trim }},{{ hostvars[host]['uptime_status'] | default('') | trim }}
      {% endif %}
      {% endfor %}
```
This CSV file serves as a point-in-time record of the fleet, which supports compliance audits and historical trend analysis. Because it is written to /tmp, it should be copied to durable storage if it needs to be retained.
Dashboarding and Fleet Visualization
Beyond static files, dynamic in-console dashboards provide real-time feedback during the Ansible execution. This is achieved by counting the number of servers falling into the critical and warning categories and displaying the totals alongside the specific hostnames.
```yaml
- name: Display uptime dashboard
  ansible.builtin.debug:
    msg:
      - "========================================"
      - " FLEET UPTIME DASHBOARD "
      - "========================================"
      - "Total servers: {{ groups['all'] | length }}"
      - ""
      - "CRITICAL (>180 days): {{ critical_servers | length }}"
      - "{{ critical_servers }}"
      - ""
      - "WARNING (>90 days): {{ warning_servers | length }}"
      - "{{ warning_servers }}"
```
The critical_servers and warning_servers lists must be built before this task runs; the tasks shown so far do not define them. They can be derived with Jinja2 filters such as selectattr applied to the per-host data in hostvars. The critical list contains hosts where uptime_days is greater than 180, and the warning list contains hosts where uptime_days is greater than 90 but less than or equal to 180. The dashboard condenses the raw data into a short, actionable summary for the operations team.
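One way to build the two lists is sketched below, under the assumption that every audited host has already set the uptime_status fact in the categorization task. The task name is illustrative. Selecting hosts by the prefix of their status string, rather than comparing uptime_days numerically, is a choice made here to sidestep string-versus-number comparisons; it is not prescribed by the tasks above.

```yaml
- name: Build critical and warning host lists
  ansible.builtin.set_fact:
    # Look up each host's variables, keep hosts whose status starts with
    # the given label, and return just their inventory names.
    critical_servers: >-
      {{ groups['all']
         | map('extract', hostvars)
         | selectattr('uptime_status', 'defined')
         | selectattr('uptime_status', 'match', 'CRITICAL')
         | map(attribute='inventory_hostname')
         | list }}
    warning_servers: >-
      {{ groups['all']
         | map('extract', hostvars)
         | selectattr('uptime_status', 'defined')
         | selectattr('uptime_status', 'match', 'WARNING')
         | map(attribute='inventory_hostname')
         | list }}
  run_once: true
```

With the thresholds at 180 and 90 days, the WARNING label corresponds to uptime above 90 and up to 180 days, so the resulting lists match the ranges described above. Hosts that never set uptime_status, for example because they were unreachable, are left out of both lists.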
Service Health and Load Average Monitoring
Uptime alone provides an incomplete picture of system health. A server might have high uptime but suffer from severe performance degradation. Therefore, the Ansible playbook must also check the system load average by reading the /proc/loadavg file.
```yaml
- name: Get load average
  ansible.builtin.command:
    cmd: cat /proc/loadavg
  register: load_avg
  changed_when: false
```
To generate an alert, the 1-minute load average (the first field of /proc/loadavg) is compared against a threshold derived from the number of CPUs. The threshold is typically the number of virtual CPUs multiplied by a safety factor (commonly 2). If the load average exceeds this limit, a warning is printed. In the task below, default(2) supplies a fallback CPU count when the ansible_processor_vcpus fact is unavailable.
```yaml
- name: Alert on high load
  ansible.builtin.debug:
    msg: "WARNING: High load on {{ inventory_hostname }}: {{ load_avg.stdout }}"
  when: load_avg.stdout.split(' ')[0] | float > ansible_processor_vcpus | default(2) | int * 2
```
This integration ensures that uptime monitoring is contextualized with performance metrics, preventing administrators from being blind to thrashing systems that are technically "up" but operationally broken.
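To make the threshold comparison concrete, the sketch below evaluates the same expression against sample data instead of a live host. The play name, sample_loadavg, and sample_vcpus are hypothetical values chosen for illustration.

```yaml
# Hypothetical worked example: runs on the control node only.
- name: Worked example of the load threshold check
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    sample_loadavg: "9.50 4.20 3.10 2/345 6789"  # illustrative /proc/loadavg content
    sample_vcpus: 4                               # illustrative CPU count
  tasks:
    - name: Check that the sample load trips the alert condition
      ansible.builtin.assert:
        that:
          # The first field is the 1-minute load average: 9.5
          - sample_loadavg.split(' ')[0] | float == 9.5
          # The threshold is 4 * 2 = 8, so 9.5 exceeds it
          - sample_loadavg.split(' ')[0] | float > sample_vcpus | int * 2
```

Jinja2 applies filters before arithmetic and comparison operators, so the right-hand side is evaluated as the CPU count cast to an integer and then multiplied by 2.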
Historical Tracking and Reboot Detection
To establish a temporal dimension to uptime monitoring, historical tracking records are essential. This is accomplished by appending the current timestamp, raw uptime seconds, and calculated uptime days to a dedicated log file located at /var/log/uptime-history.log.
```yaml
- name: Record current uptime to a log file
  ansible.builtin.lineinfile:
    path: /var/log/uptime-history.log
    line: "{{ ansible_date_time.iso8601 }},{{ ansible_uptime_seconds }},{{ (ansible_uptime_seconds | int / 86400) | round(1) }}"
    create: true
    mode: '0644'
```
The system can detect unexpected reboots by comparing the current uptime against the previously recorded value. Because the current run has just appended its own entry as the last line, the previous uptime is read from the second-to-last line of the log file. If the current uptime is strictly less than the previously recorded uptime, the host must have rebooted in between, and an alert is triggered.
```yaml
- name: Read last recorded uptime
  ansible.builtin.shell:
    cmd: "tail -2 /var/log/uptime-history.log | head -1 | cut -d',' -f2"
  register: last_uptime
  changed_when: false
  failed_when: false

- name: Detect unexpected reboot
  ansible.builtin.debug:
    msg: "ALERT: {{ inventory_hostname }} appears to have rebooted since last check!"
  when:
    # Uptime lower than the previous log entry means a reboot happened in between
    - ansible_uptime_seconds | int < last_uptime.stdout | int
```
This mechanism provides an automated audit trail of system interruptions, which is critical for post-incident analysis and root cause investigation.
EC2 and Cloud Instance Auditing
For cloud infrastructure, specifically Amazon EC2 Linux instances, a simplified approach is often required to audit compliance. The primary use case is identifying machines that have not been rebooted for extended periods, ensuring they receive necessary security patches.
```yaml
- name: Get the list of all the nodes which are running over a month
  hosts: all
  gather_facts: false
  tasks:
    - name: Show hostname for sanity check
      command: hostname
      register: hostname

    # The first field of /proc/uptime is the number of seconds since boot
    - name: Check uptime prior reboot
      shell: cut -d ' ' -f1 /proc/uptime
      register: UPTIME_PRE_REBOOT

    - name: Setting fact for number of days
      set_fact:
        uptime_days: "{{ (UPTIME_PRE_REBOOT.stdout | int / 86400) | round(0) }}"

    - name: Hosts to be rebooted
      debug:
        msg: "{{ inventory_hostname }} has not been rebooted in {{ uptime_days }} days, which is older than a month"
      when: (uptime_days | int) > 30
```
This playbook reads /proc/uptime directly, so it works with fact gathering disabled. It takes the first field (seconds since boot), converts it to days rounded to the nearest whole day, and reports any host whose uptime exceeds 30 days. Any instance past that threshold is flagged as a candidate for patching and a maintenance reboot, which supports cloud security compliance.
Ecosystem Integration: Ansible and Uptime Kuma
For organizations utilizing dedicated monitoring platforms, Ansible offers seamless integration with Uptime Kuma. This requires the installation of the lucasheld.uptime_kuma collection, which relies on the uptime-kuma-api Python module for API communication.
```bash
pip install uptime-kuma-api
ansible-galaxy collection install lucasheld.uptime_kuma
```
The collection supports specific versions of Uptime Kuma and Ansible. The compatibility matrix dictates that Uptime Kuma versions 1.21.3 through 1.23.2 are supported by collection versions 1.0.0 through 1.2.0. Similarly, Uptime Kuma versions 1.17.0 through 1.21.2 correspond to collection versions 0.1.0 through 0.14.0. This ensures API stability and prevents integration failures due to schema changes.
| Uptime Kuma Version | Ansible Collection Version | API Module Version |
|---|---|---|
| 1.21.3 - 1.23.2 | 1.0.0 - 1.2.0 | 1.0.0+ |
| 1.17.0 - 1.21.2 | 0.1.0 - 0.14.0 | 0.1.0 - 0.13.0 |
Installation of the Python API module can be version-locked to ensure environment consistency across development and production environments.
```bash
pip install uptime-kuma-api==0.13.0
```
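Once the collection is installed, monitors can be managed from a playbook. The following is a sketch only: the play name, URL, credentials, and monitor details are hypothetical placeholders, and the parameter names are believed to match the examples in the collection's README, so they should be verified against the module documentation for the installed collection version.

```yaml
- name: Register an HTTP monitor in Uptime Kuma  # hypothetical play name
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kuma_url: "http://127.0.0.1:3001"  # placeholder Uptime Kuma address
    kuma_user: "admin"                 # placeholder; keep real credentials
    kuma_password: "change-me"         # in Ansible Vault, not in plain text
  tasks:
    - name: Ensure the monitor exists
      lucasheld.uptime_kuma.monitor:
        api_url: "{{ kuma_url }}"
        api_username: "{{ kuma_user }}"
        api_password: "{{ kuma_password }}"
        name: Example website
        type: http
        url: https://example.com
        state: present
```

Running the play from the control node against localhost keeps the API calls in one place, since the module talks to the Uptime Kuma API rather than to the managed hosts.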
Conclusion
The management of server uptime via Ansible represents a critical intersection of operational transparency, security compliance, and infrastructure automation. By leveraging Ansible's command execution, Jinja2 mathematical filters, and conditional logic, administrators can move beyond superficial checks to build robust, automated auditing frameworks. The ability to categorize systems into critical, warning, and normal states, combined with historical tracking and cloud-specific auditing, provides a holistic view of infrastructure health. Furthermore, integration with platforms like Uptime Kuma bridges the gap between configuration management and active monitoring. This comprehensive approach ensures that every server in the fleet is actively monitored, accurately categorized, and proactively maintained, fundamentally strengthening the organization's technical resilience.
Sources
- OneUptime Blog (https://oneuptime.com/blog/post/2026-02-21-ansible-check-server-uptime/view)
- Lucas Held - Ansible Uptime Kuma (https://github.com/lucasheld/ansible-uptime-kuma)
- IAM-J - Ansible Machine Uptime (https://iam-j.github.io/iac/ansible-machine-uptime/)