Orchestrating Machine Learning Workflows via Kubeflow and K3s Integration

Machine learning operations (MLOps) and lightweight container orchestration meet in the pairing of Kubeflow and K3s. Kubeflow is the Machine Learning Toolkit for Kubernetes, designed to simplify the deployment of machine learning workflows. Its goal is to make ML pipelines simple, portable, and scalable regardless of the underlying infrastructure. Rather than replacing existing best-of-breed open-source ML systems, Kubeflow provides a cohesive framework for deploying them anywhere Kubernetes runs.

When integrated with K3s, a highly optimized, lightweight distribution of Kubernetes, Kubeflow becomes accessible for a wider range of use cases, from rapid prototyping on local laptops to edge device deployments and cloud-based experimentation via platforms like Civo. This synergy allows data scientists and ML engineers to move from a "zero state" to a fully functional notebook server and pipeline environment with minimal friction. The use of K3s removes the heavy resource overhead typically associated with standard Kubernetes (K8s) distributions, making it an ideal candidate for local clusters, virtual machines, and specialized AI hardware.

The Architectural Foundation of K3s and Civo Kubernetes

Civo Kubernetes utilizes K3s as its underlying engine, which provides a streamlined approach to cluster management. K3s is engineered to be lightweight, stripping away legacy cloud provider integrations and unnecessary components to reduce the memory footprint and CPU usage. This architecture is particularly beneficial for ML workloads where the majority of system resources should be dedicated to model training and data processing rather than cluster orchestration overhead.

A key feature of Civo Kubernetes powered by K3s is that the local-path provisioner is enabled by default and creates Persistent Volumes (PVs) dynamically. When a user creates a notebook or another Kubeflow component that requests storage, the matching volume is provisioned automatically. This significantly reduces manual configuration: users do not need to define storage classes or bind volumes to claims by hand.

The following table details the typical storage allocation observed during a standard Kubeflow deployment on a K3s-based cluster:

Volume name      Capacity  Access mode  Reclaim policy  Storage class  Claim (namespace/name)
pvc-2a97f0fd...  10Gi      RWO          Delete          local-path     kubeflow/katib-mysql
pvc-d04ce1a7...  20Gi      RWO          Delete          local-path     kubeflow/minio-pvc
pvc-4eb317ea...  20Gi      RWO          Delete          local-path     kubeflow/mysql-pv-claim
pvc-db650934...  10Gi      RWO          Delete          local-path     istio-system/authservice-pvc
pvc-85d07563...  10Gi      RWO          Delete          local-path     kubeflow-user-example-com/workspace-demo
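
Because local-path volumes live on the node's own disk, it can help to know how much space a deployment like this claims in total. The following is a minimal sketch, not part of the original guide: sum_gi is a hypothetical helper that assumes the capacity is the second whitespace-separated field and is always expressed in Gi.

```shell
# Sum the capacity column (values such as "10Gi") from PV rows read on stdin.
# Assumption: capacity is field 2 and uses Gi units; other rows are ignored.
sum_gi() {
  awk '$2 ~ /^[0-9]+Gi$/ { sub(/Gi$/, "", $2); total += $2 }
       END { printf "%dGi\n", total }'
}

# The five rows from the table above add up to 70Gi.
sum_gi <<'EOF'
pvc-2a97f0fd... 10Gi RWO Delete local-path kubeflow/katib-mysql
pvc-d04ce1a7... 20Gi RWO Delete local-path kubeflow/minio-pvc
pvc-4eb317ea... 20Gi RWO Delete local-path kubeflow/mysql-pv-claim
pvc-db650934... 10Gi RWO Delete local-path istio-system/authservice-pvc
pvc-85d07563... 10Gi RWO Delete local-path kubeflow-user-example-com/workspace-demo
EOF
```

On a live cluster the same helper could be fed from kubectl get pv --no-headers, where the capacity is also the second column.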

While the local-path provisioner is sufficient for testing and experimentation, it is restricted to the local node. For production-oriented scenarios where data persistence across multiple nodes or high availability is required, exploring advanced storage options such as Longhorn is recommended. This transition allows for a more resilient infrastructure capable of handling production-grade ML pipelines.

Deploying Kubeflow Pipelines on Local Infrastructure

Deploying Kubeflow Pipelines (KFP) locally lets developers iterate on ML models without incurring cloud costs. Supported paths for local deployment include kind, K3s, K3s on Windows Subsystem for Linux (WSL), K3ai, and Docker Desktop. Because the Kubeflow manifests are applied with Kustomize, kubectl version 1.14 or higher is a prerequisite: that release added native Kustomize support through the -k flag.
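
The version prerequisite can be checked before any manifests are applied. The sketch below is illustrative only: supports_kustomize is a hypothetical helper that compares a version string such as v1.29.3 against the 1.14 minimum, and it assumes the usual major.minor.patch layout.

```shell
# Succeed if the given kubectl version is 1.14 or newer.
# Hypothetical helper; accepts strings like "v1.29.3" or "1.14.0".
supports_kustomize() {
  local v="${1#v}"              # drop a leading "v" if present
  local major="${v%%.*}"        # text before the first dot
  local rest="${v#*.}"          # text after the first dot
  local minor="${rest%%.*}"     # text before the next dot
  minor="${minor%%[!0-9]*}"     # strip suffixes such as "+" in "14+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 14 ]; }
}

# Usage against the installed client (commented out; needs kubectl):
# client=$(kubectl version --client -o yaml | awk '/gitVersion:/{print $2; exit}')
# supports_kustomize "$client" || echo "kubectl is too old for -k" >&2
```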

For users operating within a Windows environment, the Windows Subsystem for Linux (WSL) provides a viable path for running K3s. The process involves downloading the K3s binary and configuring the environment for execution.

The sequence for initializing K3s on WSL is as follows:

  • Navigate to the directory where the K3s binary was downloaded.
  • Grant execution permissions to the binary using the command chmod +x k3s.
  • Start the K3s server by executing sudo ./k3s server.

Once the server is running, the user must make the cluster inside WSL reachable from the Windows host. This is done by copying the configuration file at /etc/rancher/k3s/k3s.yaml to the local $HOME/.kube/config path. For the kubectl client to reach the cluster, the server URL in the copied file must then be updated: the default loopback URL (given here as https://localhost:6443; K3s may write it as https://127.0.0.1:6443) must be replaced with the actual IP address of the WSL instance, which can be found by running ip addr show dev eth0. An example of a corrected URL is https://192.168.170.170:6443.
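
The URL edit can be scripted rather than done by hand. This is a sketch under stated assumptions, not an official K3s procedure: rewrite_server is a hypothetical helper, and it assumes the server line uses either localhost or 127.0.0.1 on port 6443.

```shell
# Replace the loopback host in the kubeconfig "server:" line with the given IP.
# Hypothetical helper; keeps a backup of the original file as <file>.bak.
rewrite_server() {
  local kubeconfig="$1" ip="$2"
  sed -E -i.bak \
    "s#(server: https://)(localhost|127\.0\.0\.1)(:6443)#\1${ip}\3#" \
    "$kubeconfig"
}

# Usage inside WSL (commented out so the sketch has no side effects):
# mkdir -p "$HOME/.kube"
# sudo cat /etc/rancher/k3s/k3s.yaml > "$HOME/.kube/config"
# wsl_ip=$(ip -4 addr show dev eth0 | awk '/inet /{print $2}' | cut -d/ -f1)
# rewrite_server "$HOME/.kube/config" "$wsl_ip"
```

Matching both spellings of the loopback host keeps the helper usable whichever form the generated file contains.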

K3ai: Specialized Infrastructure for AI Acceleration

K3ai is an alpha-stage "infrastructure in a box" solution tailored to installing and configuring AI tools on portable hardware. It targets laptops and edge devices, enabling rapid experimentation with Kubeflow on a local cluster. Its main value proposition is the ability to deploy Kubernetes (K3s-based), Kubeflow Pipelines, NVIDIA GPU support, and TensorFlow Serving with a single command.

Depending on the available hardware, users can choose between two installation paths:

  • CPU-only support: curl -sfL https://get.k3ai.in | bash -s -- --cpu --plugin_kfpipelines
  • GPU support: curl -sfL https://get.k3ai.in | bash -s -- --gpu --plugin_kfpipelines

The inclusion of GPU support is critical for deep learning tasks, as it allows the Kubeflow environment to leverage NVIDIA hardware for accelerating model training and inference. K3ai simplifies the complexity of driver installation and device plugin configuration, which are typically the most challenging aspects of setting up a local AI environment. At the conclusion of the installation, K3ai automatically outputs the URL for the web UI, removing the need for manual port-forwarding.
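
Choosing between the two install paths can be wrapped in a small script. The sketch below only prints the command instead of running it. Both function names are hypothetical, and the lspci check is a rough heuristic assumed here for illustration, not something K3ai itself prescribes.

```shell
# Print (not run) the K3ai install command for the given mode.
k3ai_install_cmd() {
  local mode="$1"
  case "$mode" in
    cpu|gpu) ;;
    *) echo "usage: k3ai_install_cmd cpu|gpu" >&2; return 1 ;;
  esac
  printf 'curl -sfL https://get.k3ai.in | bash -s -- --%s --plugin_kfpipelines\n' "$mode"
}

# Guess the mode: "gpu" only if lspci lists an NVIDIA device (heuristic).
detect_mode() {
  if lspci 2>/dev/null | grep -qi nvidia; then
    echo gpu
  else
    echo cpu
  fi
}

# Usage: review the printed command before running it yourself.
# k3ai_install_cmd "$(detect_mode)"
```

Printing the command first gives the user a chance to inspect what will be piped into bash.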

Execution and Verification of Kubeflow Pipelines

The deployment of Kubeflow Pipelines often relies on Kustomize manifests. These manifests can be provided via local paths or HashiCorp go-getter URLs. The process involves waiting for the Custom Resource Definitions (CRDs) to be established before applying the actual pipeline configurations.

To deploy the pipelines, the following command sequence is utilized:

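export PIPELINE_VERSION=2.15.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"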

Note that kubectl apply -k accepts both local paths and paths formatted as HashiCorp go-getter URLs. The paths above look like URLs, but they are not valid web URLs; Kustomize resolves them itself. Deployment may take several minutes while the cluster pulls the necessary images and initializes the pods.
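
Since the reference is built from a repository path plus a ?ref= version, an unset version variable silently produces a malformed target. A minimal sketch of a guard follows; kfp_manifest is a hypothetical helper, not part of kubectl or Kustomize.

```shell
# Build the go-getter style Kustomize target for a Kubeflow Pipelines manifest.
# Hypothetical helper; refuses to run when either argument is empty.
kfp_manifest() {
  local subpath="$1" version="$2"
  if [ -z "$subpath" ] || [ -z "$version" ]; then
    echo "usage: kfp_manifest <subpath> <version>" >&2
    return 1
  fi
  printf 'github.com/kubeflow/pipelines/manifests/kustomize/%s?ref=%s\n' \
    "$subpath" "$version"
}

# Usage (commented out; needs cluster access):
# kubectl apply -k "$(kfp_manifest env/platform-agnostic "$PIPELINE_VERSION")"
```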

To verify that the Kubeflow Pipelines UI is accessible, the user must establish a network tunnel from the local machine to the cluster service:

kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80

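Following this command, the UI can be accessed via http://localhost:8080/. If the deployment is hosted on a virtual machine or a remote K3s cluster, the address becomes http://{YOUR_VM_IP_ADDRESS}:8080/. Because kubectl port-forward listens only on localhost by default, reaching the forwarded port from another machine also requires the --address flag (for example, --address 0.0.0.0).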
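
Because the pods can take several minutes to start, the UI may not answer on the first attempt. A generic retry loop is one way to wait for it. This is a sketch only: retry is a hypothetical helper, and the curl check in the usage comment is an assumed way to probe the UI.

```shell
# Run a command until it succeeds, up to <attempts> times,
# sleeping <delay> seconds between tries. Hypothetical helper.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Usage while the port-forward runs in another terminal (needs curl):
# retry 30 10 curl -sf -o /dev/null http://localhost:8080/ && echo "UI is up"
```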

Managing the Lifecycle: Deletion and Uninstallation

Maintaining a clean environment is essential, especially when experimenting with different versions of Kubeflow. The uninstallation process varies depending on how the manifests were originally applied.

If the installation was performed using a local manifest file, the removal is executed as follows:

kubectl delete -k {YOUR_MANIFEST_FILE}

If the installation used manifests directly from the Kubeflow Pipelines GitHub repository, two Kustomize targets must be deleted so that all cluster-scoped resources are removed as well. After exporting the version, delete the platform-agnostic environment first and the cluster-scoped resources second:

export PIPELINE_VERSION=2.15.0
kubectl delete -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"
kubectl delete -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"

For those using a local repository or file system for their manifests, the commands are:

kubectl delete -k manifests/kustomize/env/platform-agnostic
kubectl delete -k manifests/kustomize/cluster-scoped-resources
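
Before deleting anything, it can be useful to print the exact commands for review. The sketch below reproduces the remote-manifest sequence for a given version. kfp_delete_plan is a hypothetical helper that only prints and never touches the cluster.

```shell
# Print, in order, the delete commands for a remote-manifest install.
# Hypothetical helper; output is meant to be reviewed before being run.
kfp_delete_plan() {
  local version="$1"
  if [ -z "$version" ]; then
    echo "usage: kfp_delete_plan <version>" >&2
    return 1
  fi
  local base="github.com/kubeflow/pipelines/manifests/kustomize"
  # Same order as above: the environment first, then the cluster-scoped resources.
  printf 'kubectl delete -k "%s/env/platform-agnostic?ref=%s"\n' "$base" "$version"
  printf 'kubectl delete -k "%s/cluster-scoped-resources?ref=%s"\n' "$base" "$version"
}

# Usage: kfp_delete_plan 2.15.0
```

Requiring an explicit version avoids deleting against an empty ?ref= when PIPELINE_VERSION was never exported in the current shell.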

Tooling Alternatives and Compatibility

While K3s is a primary focus, other local orchestration tools are compatible with Kubeflow's architectural needs. Docker Desktop provides a robust, hybrid toolkit for building and running applications, offering an out-of-the-box containerization environment. For users on Windows, the Docker Desktop Installer.exe serves as the entry point for this environment.

Additionally, kind (Kubernetes in Docker) is frequently mentioned as a viable alternative. Unlike K3s, which is a lightweight distribution intended for production or edge use, kind is specifically designed for testing Kubernetes itself by running nodes as Docker containers. This makes it highly effective for CI/CD pipelines where a cluster needs to be spun up and torn down rapidly.

The evolution of Kubeflow is also evident in its versioning. While the legacy V1 installation guides are still referenced, the community has transitioned toward V2. The final release of the V1 SDK was kfp==1.8.22. While the V2 backend maintains the ability to run pipelines submitted by the V1 SDK, migrating to the V2 SDK is strongly recommended to take advantage of updated features and improved stability.
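
A quick way to see which SDK generation an environment uses is to look at the major version of the installed kfp package. The sketch below assumes that 1.x releases belong to the V1 SDK and 2.x releases to the V2 SDK. kfp_sdk_generation is a hypothetical helper.

```shell
# Map a kfp package version to its SDK generation.
# Assumption: 1.x is the V1 SDK (last release 1.8.22) and 2.x is the V2 SDK.
kfp_sdk_generation() {
  case "$1" in
    1.*) echo v1 ;;
    2.*) echo v2 ;;
    *)   echo unknown ;;
  esac
}

# Usage against the installed package (commented out; needs pip and kfp):
# kfp_sdk_generation "$(pip show kfp 2>/dev/null | awk '/^Version:/{print $2}')"
```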

Detailed Analysis of Kubeflow and K3s Synergy

The deployment of Kubeflow on K3s creates a powerful paradigm for "Edge AI" and localized development. By leveraging K3s, the operational overhead is shifted away from cluster maintenance and toward actual machine learning value creation. The integration of local-path storage in Civo's K3s implementation solves one of the most common pain points for newcomers: the complex configuration of StorageClasses. By automating the creation of PVCs for critical components like MinIO (used for artifact storage) and MySQL (used for metadata storage), the time-to-first-pipeline is drastically reduced.

K3ai accelerates this further by abstracting the infrastructure layer entirely. The ability to install a full AI stack, including GPU support, via a single curl command represents a shift toward "infrastructure as a utility" for data scientists. A model can be developed on a K3ai-powered laptop, tested on a Civo K3s cluster, and eventually deployed to a large production Kubernetes cluster without changing the underlying pipeline logic.

Security remains a consideration even in a "lightweight" setup: securing the Istio gateway with HTTPS keeps communication between the user and the Kubeflow components encrypted. Istio acts as the service mesh, managing traffic and providing the routing needed to reach Kubeflow components such as Jupyter Notebooks or the Pipelines UI.

Ultimately, the combination of Kubeflow and K3s broadens access to high-end ML orchestration. It removes the need for expensive cloud instances in the initial stages of the ML lifecycle, letting developers use the full power of Kubernetes, including its scalability and portability, on hardware that was previously considered too limited for such complex workloads.

Sources

  1. Get up and running with Kubeflow on Civo Kubernetes
  2. Kubeflow Pipelines Local Cluster Deployment
