High Availability K3s Architecture with External PostgreSQL Datastore

Deploying a Kubernetes cluster in a high-availability (HA) configuration requires a reliable backend for state management. K3s defaults to an embedded SQLite database on a single server and offers embedded etcd for HA, but certain enterprise environments, such as those running on IBM Z mainframes, benefit from offloading the datastore to an external relational database. PostgreSQL is a supported alternative that lets administrators reuse existing database management expertise, backup tools, and high-availability patterns. In this HA deployment the K3s server nodes are effectively stateless: the entire cluster state resides in the external PostgreSQL instance. If a K3s server node fails, the remaining servers continue to serve the API without any loss of cluster metadata, provided the database remains reachable. Because the database now holds the cluster's core configuration, production connections should be protected with SSL, client certificate verification, and password authentication.

External Database Architecture and Load Balancing

In a standard HA deployment of K3s, multiple server nodes are utilized to ensure that the Kubernetes API remains available even if one or more servers crash. However, because the K3s servers themselves are stateless when using an external database, the critical point of failure shifts to the datastore and the method by which clients access the API.

To mitigate this, a Layer 4 (TCP) load balancer, such as Nginx, is typically deployed in front of the K3s servers. The load balancer provides a single stable IP address or DNS name that routes requests to the available server nodes; the same address must be listed as a TLS Subject Alternative Name on the K3s API certificate so that clients can validate it. If a server node goes down, the load balancer redirects traffic to the healthy nodes, keeping the API reachable. Note that the load balancer itself becomes a single point of failure unless it is also deployed in an HA configuration.

The flow of communication in this architecture is as follows:

  • Client requests hit the Nginx Load Balancer.
  • The Load Balancer forwards traffic to one of the K3s Server nodes.
  • The K3s Server node communicates with the external PostgreSQL database to read or write the cluster state.
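The flow above could be implemented with Nginx's stream module. The following is a minimal sketch rather than a tested production configuration: the function name render_k3s_stream_conf, the upstream name k3s_servers, and the server addresses are all illustrative assumptions, and port 6443 is assumed to be the K3s API port on every server.

```shell
#!/bin/sh
# Print an Nginx "stream" (Layer 4) configuration that balances TCP port 6443
# across the K3s servers passed as arguments.
# Assumption: the addresses are placeholders, not values from this article.
render_k3s_stream_conf() {
  echo "stream {"
  echo "    upstream k3s_servers {"
  for ip in "$@"; do
    # A server is skipped for 5s after 3 failed connection attempts
    echo "        server ${ip}:6443 max_fails=3 fail_timeout=5s;"
  done
  echo "    }"
  echo "    server {"
  echo "        listen 6443;"
  echo "        proxy_pass k3s_servers;"
  echo "    }"
  echo "}"
}
```

Calling render_k3s_stream_conf 10.0.0.11 10.0.0.12 prints a block that would sit at the top level of nginx.conf, outside any http section, since TLS is passed through to the K3s servers untouched.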

Secure Communication via SSL and Client Verification

When connecting K3s to an external PostgreSQL instance, relying on simple password authentication is insufficient for production-grade security. A mutual TLS (mTLS) approach is required, where both the K3s server (the client) and the PostgreSQL server verify each other's identities using certificates.

K3s Client Certificate Generation

The K3s server must present a certificate to PostgreSQL to prove its identity. This process involves creating a self-signed certificate and a corresponding private key on the first K3s server node.

The following command is used to generate these assets:

openssl req -new -x509 -days 365 -nodes -text -out K3s.crt -keyout K3s.key -subj "/CN=K3s" -addext "subjectAltName=DNS:K3s"

Once generated, the security of the private key is paramount. Permissions must be restricted to prevent other users on the system from accessing the key, which could allow them to impersonate the K3s cluster.

chmod 0600 K3s.key

The public certificate (K3s.crt) must then be distributed to the PostgreSQL virtual machine so that the database can verify the K3s server during the handshake. This is typically achieved via secure copy:

scp /home/sles/K3s.crt [email protected]:

Finally, both the certificate (K3s.crt) and its private key (K3s.key) must be copied to all other K3s server nodes in the HA cluster so that every server presents the same identity to PostgreSQL.
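After copying the files between nodes, it can be worth confirming that each node holds a certificate and key that actually belong together. The sketch below assumes openssl is installed; the function name cert_matches_key is hypothetical and not part of K3s or OpenSSL.

```shell
#!/bin/sh
# Return 0 when the private key ($2) belongs to the certificate ($1).
# The public key is extracted from each file and the two PEM blocks compared.
cert_matches_key() {
  crt_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 1
  key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 1
  # Guard against two empty strings comparing equal
  [ -n "$crt_pub" ] && [ "$crt_pub" = "$key_pub" ]
}
```

On a K3s node this might be run as cert_matches_key K3s.crt K3s.key && echo "pair OK".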

PostgreSQL Server Certificate Generation

Conversely, the PostgreSQL server must also identify itself to the K3s servers to prevent man-in-the-middle attacks. Certificates are created and stored within the PostgreSQL data directory.

openssl req -new -x509 -days 365 -nodes -text -out /var/lib/pgsql/data/postgres.crt -keyout /var/lib/pgsql/data/postgres.key -subj "/CN=postgres.rancher.rke2" -addext "subjectAltName=DNS:postgres.rancher.rke2"

Just as with the K3s keys, the PostgreSQL private key must have strictly limited permissions, or the PostgreSQL service will refuse to start due to security concerns.

chmod 0600 /var/lib/pgsql/data/postgres.key

chown postgres:postgres /var/lib/pgsql/data/postgres.key

The resulting public certificate (postgres.crt) is then distributed to all K3s server nodes:

scp /var/lib/pgsql/data/postgres.crt [email protected]:

scp /var/lib/pgsql/data/postgres.crt [email protected]:

PostgreSQL Access Control and Configuration

The pg_hba.conf (Host-Based Authentication) file is the primary mechanism PostgreSQL uses to control which hosts can connect and which authentication methods are required. To enforce the security model described, the configuration must be modified to mandate SSL and client certificate verification for remote connections.

The configuration should be updated as follows:

```conf
# TYPE  DATABASE  USER  ADDRESS       METHOD

# "local" is for Unix domain socket connections only
local   all       all                 peer

# IPv4 local connections:
host    all       all   127.0.0.1/32  ident
# Remote connections: SSL, password, and a verified client certificate
hostssl all       all   0.0.0.0/0     md5 clientcert=verify-full
```

This configuration establishes a tiered access policy:

  • Local connections via Unix sockets use the peer method, which is sufficient for local administration.
  • Local IPv4 connections (127.0.0.1) use the ident method.
  • All remote connections (0.0.0.0/0) are forced to use SSL (hostssl). They must provide a valid password (md5) and a client certificate that is signed by a CA the server trusts and whose Common Name matches the database user name or an applicable user-name mapping (clientcert=verify-full).
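Before involving K3s, the policy can be exercised from a K3s node with psql. The sketch below only builds a libpq connection string; the function name pg_mtls_conninfo, the database name, the user name, and the directory are assumptions for illustration. It also assumes the PostgreSQL server already has ssl = on and its ssl_cert_file, ssl_key_file, and ssl_ca_file settings pointing at postgres.crt, postgres.key, and the copied K3s.crt respectively.

```shell
#!/bin/sh
# Build a libpq connection string that presents the K3s client certificate
# and verifies the PostgreSQL server certificate (sslmode=verify-full).
# $1 host, $2 database, $3 user, $4 directory holding the three files
pg_mtls_conninfo() {
  printf 'host=%s port=5432 dbname=%s user=%s sslmode=verify-full sslrootcert=%s/postgres.crt sslcert=%s/K3s.crt sslkey=%s/K3s.key\n' \
    "$1" "$2" "$3" "$4" "$4" "$4"
}
```

For example, psql "$(pg_mtls_conninfo postgres.rancher.rke2 k3s K3s /home/sles)" -c 'SELECT 1;' should prompt for the password and return one row if both certificate checks pass. The user name K3s is chosen here only because it matches the certificate's CN.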

K3s Server Deployment and Configuration

When initializing the K3s server to use PostgreSQL, several critical flags must be passed to the binary to tell K3s how to find and authenticate with the database.

CLI Execution Flags

For a basic setup, the following flags are utilized:

  • --datastore-cafile: Points to the public key certificate of the PostgreSQL server. In a self-signed environment, the public certificate acts as its own CA.
  • --datastore-certfile: Points to the public certificate identifying the K3s cluster.
  • --datastore-keyfile: Points to the private key belonging to the K3s cluster.
  • --token: A secret password used to authenticate other servers or agents joining the cluster.
  • --tls-san: The IP address or DNS name of the load balancer.
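Put together, the flags listed above (plus --datastore-endpoint, which carries the connection string) might be passed as in the following sketch. The wrapper name start_k3s_server and the K3S_BIN override are hypothetical conveniences for dry runs; the file names follow the certificates generated earlier, and every argument value is a placeholder.

```shell
#!/bin/sh
# Start a K3s server against the external PostgreSQL datastore.
# $1 datastore endpoint, $2 certificate directory, $3 cluster token,
# $4 load balancer DNS name or IP
# K3S_BIN is a hypothetical override so the command can be previewed.
start_k3s_server() {
  "${K3S_BIN:-k3s}" server \
    --datastore-endpoint="$1" \
    --datastore-cafile="$2/postgres.crt" \
    --datastore-certfile="$2/K3s.crt" \
    --datastore-keyfile="$2/K3s.key" \
    --token="$3" \
    --tls-san="$4"
}
```

Running it with K3S_BIN=echo prints the resulting argument list instead of starting a server, which is a cheap way to review the values before the first real start.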

To ensure the K3s server can resolve the database hostname (e.g., postgres.rancher.rke2), an entry must be added to the /etc/hosts file:

10.161.129.212 postgres.rancher.rke2

Production-Grade Systemd Configuration

For production environments, using a configuration file is preferred over long CLI strings. This allows for cleaner management and easier updates.

First, create the configuration directory:

sudo mkdir -p /etc/rancher/k3s

Then, create the config.yaml file with the following structure:

```yaml
# K3s Server Configuration for PostgreSQL Backend
# This configuration enables a production-ready K3s server

# PostgreSQL connection string
# Use SSL mode 'require' for encrypted connections in production
datastore-endpoint: "postgres://k3s_user:[email protected]:5432/k3s?sslmode=require"

# TLS Subject Alternative Names for API server certificate
# Include all hostnames and IPs that will access the API
tls-san:
  - "k3s-api.example.com"
  - "10.0.0.10"
  - "192.168.1.100"

# Disable local storage provisioner if using external storage
disable:
  - local-storage

# Write kubeconfig with correct permissions
write-kubeconfig-mode: "0644"

# Enable secrets encryption at rest
secrets-encryption: true
```

Once the configuration is in place, the K3s server is installed using the official script:

curl -sfL https://get.k3s.io | sh -s - server

Database Connection String Options

The datastore-endpoint is the bridge between the K3s application and the PostgreSQL database. Depending on the environment, different connection strings are used.

| Security Level | Connection String Format | Recommended Use Case |
| --- | --- | --- |
| Low | postgres://user:password@host:5432/database | Local testing/Development only |
| Medium | postgres://user:password@host:5432/database?sslmode=require | Production (Encrypted) |
| High | postgres://user:password@host:5432/database?sslmode=verify-full | Production (Encrypted & Verified) |
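To avoid typos when switching between these levels, the endpoint could be assembled by a small helper. This is only a sketch: the function name datastore_endpoint is hypothetical, the port is fixed at 5432 as in the formats above, and the helper does not percent-encode the password, so characters such as @ or / would need encoding beforehand.

```shell
#!/bin/sh
# Build a K3s datastore endpoint for PostgreSQL.
# $1 user, $2 password, $3 host, $4 database, $5 sslmode (default: verify-full)
datastore_endpoint() {
  mode="${5:-verify-full}"
  case "$mode" in
    disable|require|verify-ca|verify-full) ;;
    *) echo "unsupported sslmode: $mode" >&2; return 1 ;;
  esac
  printf 'postgres://%s:%s@%s:5432/%s?sslmode=%s\n' "$1" "$2" "$3" "$4" "$mode"
}
```

Defaulting to verify-full means the weaker modes have to be requested explicitly, which matches the recommendation order in the table.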

Database Backup and Maintenance Strategy

Since the external database holds the entire state of the Kubernetes cluster, a failure of the database results in a total cluster outage. A robust backup strategy is mandatory.

Automated Backup Implementation

A Bash script can automate dumps of the K3s database. It should be scheduled via cron so that backups run at regular intervals.

The following configuration and script logic are employed:

```bash
#!/bin/bash
# Make a pg_dump failure fail the whole pipeline, not just gzip
set -o pipefail

# Configuration (DB_PASSWORD is expected to be set in the environment)
DB_HOST="postgres.example.com"
DB_PORT="5432"
DB_NAME="k3s"
DB_USER="k3s_user"
BACKUP_DIR="/var/backups/k3s"
RETENTION_DAYS=30

# Create backup directory if it doesn't exist
mkdir -p "${BACKUP_DIR}"

# Generate timestamp for backup filename
TIMESTAMP=$(date +%Y%m%d%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/k3s_backup_${TIMESTAMP}.sql.gz"

# Perform the backup
# --no-owner: Don't output commands to set object ownership
# --no-acl: Don't output commands to set access privileges
# Compress with gzip to save storage space
PGPASSWORD="${DB_PASSWORD}" pg_dump \
  -h "${DB_HOST}" \
  -p "${DB_PORT}" \
  -U "${DB_USER}" \
  -d "${DB_NAME}" \
  --no-owner \
  --no-acl \
  --format=plain \
  | gzip > "${BACKUP_FILE}"
DUMP_STATUS=$?

# Verify backup was created successfully
if [ "${DUMP_STATUS}" -eq 0 ] && [ -f "${BACKUP_FILE}" ] && [ -s "${BACKUP_FILE}" ]; then
  echo "Backup successful: ${BACKUP_FILE}"

  # Calculate and store checksum for integrity verification
  sha256sum "${BACKUP_FILE}" > "${BACKUP_FILE}.sha256"
else
  echo "Backup failed!"
  rm -f "${BACKUP_FILE}"
  exit 1
fi

# Clean up old backups beyond retention period
find "${BACKUP_DIR}" -name "k3s_backup_*.sql.gz" -mtime +"${RETENTION_DAYS}" -delete
find "${BACKUP_DIR}" -name "k3s_backup_*.sha256" -mtime +"${RETENTION_DAYS}" -delete
echo "Backup complete"
```

This script implements several critical safeguards:
- Compression: Using gzip reduces the storage footprint of the SQL dumps.
- Integrity: Storing a sha256sum alongside each dump makes it possible to detect later corruption of the backup file.
- Retention: The find command automatically deletes files older than 30 days to prevent disk exhaustion.
- Clean Dumps: The --no-owner and --no-acl flags ensure that the backup can be restored to a different database user or environment without ownership conflicts.
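The stored checksum is only useful if it is checked before a restore. A minimal sketch of that check follows; the function name verify_backup is hypothetical, and it assumes the .sha256 file sits next to the dump and records the same path the dump still has.

```shell
#!/bin/sh
# Verify a compressed dump before restoring it.
# $1: path to a .sql.gz dump with a sibling "$1.sha256" file
verify_backup() {
  # The recorded checksum must match the file on disk
  sha256sum -c "$1.sha256" >/dev/null 2>&1 || return 1
  # The gzip stream itself must be readable end to end
  gzip -t "$1" 2>/dev/null
}
```

A restore could then be gated on it, for example verify_backup "$f" && gunzip -c "$f" | psql -h postgres.example.com -U k3s_user -d k3s, where the host, user, and database names are the placeholders used in the backup script.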

Performance Degradation and Troubleshooting

Over time, K3s clusters utilizing an external PostgreSQL database may experience significant performance issues, particularly as the cluster ages or grows in complexity.

Identifying Slow SQL Queries

One common symptom is the occurrence of "Slow SQL queries" in the logs, which manifests as timeouts when executing kubectl commands. In clusters running for extended periods (e.g., over 3 years in homelab environments), the database can become a bottleneck.

The root cause is often identified by analyzing the execution plan of the slow queries using EXPLAIN in a client such as psql or pgcli. A frequent culprit is the "Seq Scan" (Sequential Scan).

A Sequential Scan occurs when the PostgreSQL engine must read every single row in a table to find the relevant records because no suitable index exists or the optimizer decided a scan was more efficient. In a K3s context, these scans often happen against the kine (the K3s database adapter) tables.

Environmental Factors in Performance Failure

Performance issues can depend on the environment, so it helps to record the exact versions involved. For example, slow queries have been reported in an environment with:
- K3s v1.28.5+k3s1.
- Fedora 39 (kernel 6.6.9-200.fc39.x86_64).
- PostgreSQL 15.4-1.fc39.x86_64.
- A cluster of 2 servers and 6 nodes.

When these symptoms appear, administrators should investigate the database indices and the kine configuration to determine if the sequential scans can be converted into index scans.
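A first pass at that investigation might gather the table size, the existing indexes, and vacuum statistics. The sketch below only prints the SQL; the function name kine_diagnostics_sql is hypothetical, and it assumes the K3s data lives in a table named kine in the connected database.

```shell
#!/bin/sh
# Print diagnostic queries for the kine table (assumed table name: kine).
kine_diagnostics_sql() {
  cat <<'SQL'
-- Total on-disk size of the table, including its indexes
SELECT pg_size_pretty(pg_total_relation_size('kine'));
-- Indexes currently defined on the table
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'kine';
-- Live and dead row counts, and when autovacuum last ran
SELECT n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables WHERE relname = 'kine';
SQL
}
```

The output can be piped into a client, for example kine_diagnostics_sql | psql -d k3s. A slow statement copied from the K3s log can then be prefixed with EXPLAIN (ANALYZE, BUFFERS) to see whether the planner chooses a Seq Scan despite the indexes listed.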

Post-Deployment Verification

Once the K3s servers are installed and pointed to the PostgreSQL database, it is necessary to verify that the connection is active and the cluster is healthy.

The first step is to set the KUBECONFIG environment variable to point to the generated configuration file:

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

Then, verify the status of the system pods:

kubectl get pods -A

A successful connection is indicated when the command returns a list of pods across all namespaces, with all core components (such as CoreDNS, Metrics Server, and Traefik) showing a Running status. If the pods are stuck in Pending or the command times out, it typically indicates a connectivity failure between the K3s server and the PostgreSQL backend, or an authentication failure in the SSL handshake.
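For scripted checks, the pod listing can be reduced to a single number. This is a rough sketch with a hypothetical function name, count_unhealthy_pods; it assumes the column layout of kubectl get pods -A --no-headers, where STATUS is the fourth field, and it treats Completed pods (such as finished install jobs) as healthy.

```shell
#!/bin/sh
# Count pods whose STATUS (4th column) is neither Running nor Completed.
# Reads "kubectl get pods -A --no-headers" output on stdin.
count_unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}
```

Typical use would be kubectl get pods -A --no-headers | count_unhealthy_pods, with a result of 0 indicating that every listed pod is either running or has finished.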

Conclusion

Moving K3s from its embedded datastore (SQLite or etcd) to an external PostgreSQL database provides a scalable path for high-availability deployments, particularly in specialized hardware environments like IBM Z mainframes. The architectural shift separates the control plane's state from the server nodes, enabling a stateless server model that is highly resilient to node failure. However, this flexibility introduces significant security requirements. The implementation of mutual TLS (mTLS) through the creation of distinct certificates for both the K3s cluster and the PostgreSQL server is not optional; it is the primary defense against unauthorized state manipulation.

Furthermore, the operational burden shifts toward database administration. The necessity of a strict pg_hba.conf policy, coupled with a rigorous backup and retention strategy using pg_dump and sha256sum, highlights that the reliability of the Kubernetes cluster is now directly tied to the reliability of the PostgreSQL instance. Finally, the potential for performance degradation—specifically the emergence of slow sequential scans in long-running clusters—suggests that ongoing database tuning and query analysis are essential parts of the K3s lifecycle. By combining a Layer 4 load balancer for API availability and a hardened PostgreSQL backend for state management, organizations can achieve a production-ready Kubernetes environment that is both flexible and robust.

