Kubernetes: Replace dockershim with containerd and runc

If you followed my Kubernetes the not so hard way with Ansible blog series then you probably know that I’ve used the Docker container runtime for my Kubernetes worker nodes up until now. But support for it (or the dockershim, to be more precise) is deprecated and will be removed in Kubernetes v1.24. The removal was already planned for v1.22 but was postponed. I guess a lot of organizations out there still use the Docker runtime for Kubernetes…

I don’t want to go into detail about why and how this happened. But the TL;DR is: Docker as an underlying runtime is being deprecated in favor of runtimes that use the Container Runtime Interface (CRI) created for Kubernetes. Docker-produced images will continue to work in your cluster with all runtimes, as they always have.

If you need more information, the official Kubernetes blog posts Don’t Panic: Kubernetes and Docker and the Dockershim Deprecation FAQ are good resources.

If you still want to use the dockershim, this announcement may be for you: Mirantis to take over support of Kubernetes dockershim. But I’d really think about it before going down that road. All the community effort out there is going into containerd and CRI-O. E.g. it’s possible to use gVisor as a RuntimeClass in Kubernetes with containerd. gVisor is an application kernel, written in Go, that implements a substantial portion of the Linux system call interface. It provides an additional layer of isolation between running applications and the host operating system. gVisor includes an Open Container Initiative (OCI) runtime called runsc that makes it easy to work with existing container tooling. The runsc runtime integrates with Kubernetes, making it simple to run sandboxed containers. This is only one example. The containerd and CRI-O user base will keep growing over time, and so will the number of blog posts and articles about them. I guess the same won’t be true for the dockershim.
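
To give an idea what that looks like in practice, here is a minimal sketch of registering gVisor as a RuntimeClass (it assumes gVisor is installed on the node and containerd is configured with a runtime handler named runsc; the RuntimeClass name gvisor is arbitrary):

bash

# Minimal sketch: register gVisor's runsc as a RuntimeClass (assumes gVisor
# is installed and containerd has a runtime handler named "runsc").
cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
EOF

A Pod then opts into the sandboxed runtime by setting runtimeClassName: gvisor in its spec.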

One interesting fact is that technically the change itself isn’t that big at all. Unless you run a quite ancient Docker version, you already use containerd and runc. E.g. grepping for docker on one of my worker nodes shows the following result (only the important processes are shown):

bash

/usr/local/bin/dockerd --bip= --host=unix:///run/docker.sock --ip-masq=false --iptables=false --log-level=error --mtu=1472 --storage-driver=overlay2
containerd --config /var/run/docker/containerd/containerd.toml --log-level error
/usr/local/bin/kubelet --cloud-provider= --cni-bin-dir=/opt/cni/bin --cni-conf-dir=/etc/cni/net.d --config=/var/lib/kubelet/kubelet-config.yaml --container-runtime=docker --image-pull-progress-deadline=2m --kubeconfig=/var/lib/kubelet/kubeconfig --network-plugin=cni --node-ip=10.8.0.205 --register-node=true
/usr/local/bin/containerd-shim-runc-v2 -namespace moby -id 93aed22ecdeaad7bec4217325557bedab919cd466010144afd616db3aba2656a -address /var/run/docker/containerd/containerd.sock
/usr/local/bin/containerd-shim-runc-v2 -namespace moby -id e34afbec290f737834b463fbb3820a62b720210d10278f3df4ab72068ec61f45 -address /var/run/docker/containerd/containerd.sock

If I grep for the last ID in this list in the Docker process list I can see the container running:

bash

sudo docker ps | grep e34afbec290f

e34afbec290f   cbb49690db02                           "/app/cmd/webhook/we…"   4 hours ago   Up 4 hours             k8s_cert-manager_cert-manager-webhook-5b68d59578-b9ftk_cert-manager_d4cac7e3-d9d0-4404-be5f-e7f2edd32623_9

As you can see, dockerd already uses containerd and runc in the background, which gets even more visible with pstree:

bash

dockerd(748)─┬─containerd(963)─┬─{containerd}(968)
             │                 ├─{containerd}(969)
             │                 ├─{containerd}(970)
             │                 ├─{containerd}(971)
             │                 ├─{containerd}(972)
             │                 ├─{containerd}(1013)
             │                 ├─{containerd}(1014)
             │                 ├─{containerd}(1015)
             │                 ├─{containerd}(1017)
             │                 └─{containerd}(4981)

So containerd is a child process of dockerd. And containerd-shim-runc-v2 processes are children of the containerd process.
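
For reference, a process tree like the one above can be generated with something like this (the -p flag adds the PIDs):

bash

pstree -p $(pgrep -x dockerd)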

My goal is to remove Docker from all my Kubernetes worker nodes and replace it with containerd and runc. I’ll do this node by node. So my first step is to move all workloads (containers) off the first worker node; once that node is done I’ll continue with the other worker nodes.

Before starting you can optionally configure a PodDisruptionBudget to ensure that your workloads remain available during maintenance. If availability is important for any applications that run or could run on the node(s) you are draining, configure PodDisruptionBudgets first and then continue following this guide.
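
Such a PodDisruptionBudget could look roughly like this (a minimal sketch; the name important-app-pdb and the app: important-app label are made up and need to match your actual workload):

bash

# Hypothetical example: keep at least one replica of an "important-app"
# workload available while the node gets drained.
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: important-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: important-app
EOF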

You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node. Safe evictions allow the pods’ containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified. So let’s start:

bash

kubectl drain worker99 --ignore-daemonsets
node/worker99 already cordoned
WARNING: ignoring DaemonSet-managed Pods: cilium/cilium-g2xgj, traefik/traefik-4jsng
evicting pod cert-manager/cert-manager-webhook-5b68d59578-b9ftk
evicting pod cilium/cilium-operator-7f9745f9b6-r7llz
pod/server-755f675b4c-wjkxp evicted
pod/cert-manager-webhook-5b68d59578-b9ftk evicted
pod/cilium-operator-7f9745f9b6-r7llz evicted
node/worker99 evicted

As there are two pods running on the node that are managed by the DaemonSet controller, I have to include --ignore-daemonsets. If there are DaemonSet-managed pods, drain will not proceed without --ignore-daemonsets, and regardless it will not delete any DaemonSet-managed pods, because those pods would be immediately replaced by the DaemonSet controller, which ignores unschedulable markings.

Now it’s safe to do whatever we want with the node as no new pods will be scheduled there:

bash

kubectl get nodes worker99
NAME       STATUS                     ROLES    AGE    VERSION
worker99   Ready,SchedulingDisabled   <none>   516d   v1.21.4

Next I’ll shut down a few services. As I’m using Ansible to manage my Kubernetes cluster, I’ll use Ansible to make the changes now. You can of course use other tooling or just plain shell commands.

So first I’ll stop the kubelet, kube-proxy and docker services:

bash

for SERVICE in kubelet kube-proxy docker; do
  ansible -m systemd -a "name=${SERVICE} state=stopped" worker99
done
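
If you want to double check that the services are really stopped, an ad-hoc check like this will do (the || true just keeps Ansible from treating the non-zero exit code of inactive units as a failure):

bash

ansible -m shell -a 'systemctl is-active kubelet kube-proxy docker || true' worker99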

Next I’ll remove everything Docker related as it’s no longer needed. As I used my Ansible Docker role to install Docker, I’ll basically revert everything this role installed (if you installed Docker via a package manager, just remove or purge the Docker package accordingly now). I haven’t implemented an uninstall task in my Ansible role, so let’s do it manually (your paths and hostname most probably vary, so please adjust):

bash

ansible -m file -a 'path=/etc/systemd/system/docker.service state=absent' worker99
ansible -m file -a 'path=/etc/systemd/system/docker.socket state=absent' worker99
ansible -m systemd -a 'daemon_reload=yes' worker99

for FILE in containerd containerd-shim containerd-shim-runc-v2 ctr docker dockerd docker-init docker-proxy runc; do
  ansible -m file -a "path=/usr/local/bin/${FILE} state=absent" worker99
done

ansible -m shell -a 'sudo rm -fr /opt/tmp/docker*' worker99

ansible -m user -a 'name=docker state=absent' worker99
ansible -m group -a 'name=docker state=absent' worker99
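
Depending on your setup there may also be leftover state like images, volumes and configuration on disk. If you don’t need any of it anymore, something like this gets rid of it (the paths below are Docker’s default data, runtime and config directories; adjust them if yours differ):

bash

ansible -m shell -a 'sudo rm -fr /var/lib/docker /var/run/docker' worker99
ansible -m file -a 'path=/etc/docker state=absent' worker99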

Next I’ll install containerd, runc and the CNI plugins with the help of my Ansible roles githubixx.containerd, githubixx.runc and githubixx.cni.

For this to work I need to change the Ansible playbook for my Kubernetes cluster. There I have a play for the k8s_worker host group which currently looks like this:

yaml

-
  hosts: k8s_worker
  roles:
    -
      role: githubixx.harden_linux
      tags: role-harden-linux
    -
      role: githubixx.docker
      tags: role-docker
    -
      role: githubixx.kubernetes_worker
      tags: role-kubernetes-worker

For more information also see my blog post Kubernetes the not so hard way with Ansible - The worker. I’ll now replace the githubixx.docker role with githubixx.containerd and add githubixx.cni and githubixx.runc. So after the change it looks like this:

yaml

-
  hosts: k8s_worker
  roles:
    -
      role: githubixx.harden_linux
      tags: role-harden-linux
    -
      role: githubixx.cni
      tags: role-cni
    -
      role: githubixx.runc
      tags: role-runc
    -
      role: githubixx.containerd
      tags: role-containerd
    -
      role: githubixx.kubernetes_worker
      tags: role-kubernetes-worker

Now everything is set up to install the roles with Ansible on the host in question:

bash

ansible-playbook --tags=role-cni --limit=worker99 k8s.yml
ansible-playbook --tags=role-runc --limit=worker99 k8s.yml
ansible-playbook --tags=role-containerd --limit=worker99 k8s.yml
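
Before moving on it doesn’t hurt to verify that containerd is actually up and running on the node. E.g. something like this should print the client and server version without errors:

bash

ansible -m shell -a 'sudo ctr version' worker99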

Finally some changes to the systemd kubelet.service are needed. As I manage my Kubernetes workers with my Ansible Kubernetes worker role, a few changes were needed there too. Again, if you don’t use Ansible and my role, you can do the changes manually or deploy them with whatever deployment tool you use. For the kubelet service my role contained a variable called k8s_worker_kubelet_settings which looked like this (these are the parameters provided to the kubelet binary):

yaml

k8s_worker_kubelet_settings:
  "config": "{{k8s_worker_kubelet_conf_dir}}/kubelet-config.yaml"
  "node-ip": "{{hostvars[inventory_hostname]['ansible_' + k8s_interface].ipv4.address}}"
  "container-runtime": "docker"
  "image-pull-progress-deadline": "2m"
  "kubeconfig": "{{k8s_worker_kubelet_conf_dir}}/kubeconfig"
  "network-plugin": "cni"
  "cni-conf-dir": "{{k8s_cni_conf_dir}}"
  "cni-bin-dir": "{{k8s_cni_bin_dir}}"
  "cloud-provider": ""
  "register-node": "true"

These were the settings needed to make kubelet work with Docker/dockershim. As there is no Docker/dockershim anymore, a few settings can be dropped: they are already deprecated in Kubernetes v1.21 and will go away once the dockershim gets removed. So the new config looks like this:

yaml

k8s_worker_kubelet_settings:
  "config": "{{ k8s_worker_kubelet_conf_dir }}/kubelet-config.yaml"
  "node-ip": "{{ hostvars[inventory_hostname]['ansible_' + k8s_interface].ipv4.address }}"
  "kubeconfig": "{{ k8s_worker_kubelet_conf_dir }}/kubeconfig"

The previous settings image-pull-progress-deadline, network-plugin, cni-conf-dir and cni-bin-dir will all be removed together with the dockershim. cloud-provider will be removed in Kubernetes v1.23, in favor of removing the cloud provider code from the kubelet. container-runtime has only two possible values and changed from docker to remote. And finally one new setting is needed: container-runtime-endpoint, which points to containerd’s socket.
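
So with containerd in place the relevant part of the kubelet command line ends up looking roughly like this (just a sketch based on the settings above; /run/containerd/containerd.sock is containerd’s default socket path, adjust it if your containerd configuration uses a different one):

bash

# Sketch of the resulting kubelet invocation (socket path assumes containerd's
# default configuration):
/usr/local/bin/kubelet \
  --config=/var/lib/kubelet/kubelet-config.yaml \
  --container-runtime=remote \
  --container-runtime-endpoint=unix:///run/containerd/containerd.sock \
  --kubeconfig=/var/lib/kubelet/kubeconfig \
  --node-ip=10.8.0.205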

I also needed to change the [Unit] section of kubelet.service as the kubelet no longer depends on Docker but on containerd. So I needed

ini

[Unit]
After=docker.service
Requires=docker.service

to be

ini

[Unit]
After=containerd.service
Requires=containerd.service

Now the changes for the worker node can be deployed:

bash

ansible-playbook --tags=role-kubernetes-worker --limit=worker99 k8s.yml

This will also start kubelet.service and kube-proxy.service again. If both services start successfully you’ll notice that the DaemonSet controller already started the DaemonSet pods again, if any DaemonSets were running before (e.g. something like Cilium or Calico). So to finally allow Kubernetes to schedule regular Pods on that node again we need to

bash

kubectl uncordon worker99

To make sure everything works as intended I’d recommend rebooting the node now and checking that all needed services start successfully after the reboot and that new Pods are getting scheduled on that node. So with

bash

kubectl get pods -A -o wide

you should see Pods scheduled on the node that was just migrated from Docker to containerd!
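
If you only care about the node you just migrated, a field selector narrows the output down, e.g.:

bash

kubectl get pods -A -o wide --field-selector spec.nodeName=worker99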

You can also check which container runtime a node uses. After I migrated two out of four Kubernetes nodes to containerd, the node info looks like this:

bash

kubectl get nodes -o wide
NAME       STATUS   ROLES    AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
worker96   Ready    <none>   517d   v1.21.8   10.8.0.202    <none>        Ubuntu 20.04.3 LTS   5.13.0-23-generic   docker://20.10.12
worker97   Ready    <none>   517d   v1.21.8   10.8.0.203    <none>        Ubuntu 20.04.3 LTS   5.13.0-23-generic   docker://20.10.12
worker98   Ready    <none>   517d   v1.21.8   10.8.0.204    <none>        Ubuntu 20.04.3 LTS   5.13.0-23-generic   containerd://1.5.9
worker99   Ready    <none>   517d   v1.21.8   10.8.0.205    <none>        Ubuntu 20.04.3 LTS   5.13.0-23-generic   containerd://1.5.9

As you can see, worker99/98 are already using containerd while worker97/96 are still using the Docker container runtime.
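
If you just want the node name and the runtime, something like this also does the trick:

bash

kubectl get nodes -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion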

If you have your own container registry running, also make sure that your node is still able to pull new images. For me everything worked without issues.
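
A quick way to test this directly on the node is to pull an image with ctr (more on ctr below). The image used here is just an example; for a private registry you’d additionally need to pass credentials, e.g. via ctr’s --user flag:

bash

sudo ctr --namespace k8s.io images pull docker.io/library/busybox:latest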

Now you can continue with the next node until all nodes are using containerd without Docker.

Maybe one final note: of course the docker CLI on your worker nodes is now gone too, so you can’t do docker ps or something like that anymore. Normally this is something you want to avoid on a production system anyway, but sometimes it’s quite handy to check which containers are running or to do some lower level debugging. For this there is a little tool called ctr. It’s not as powerful as the docker CLI but for basic stuff it’s enough. E.g. to get a list of running containers on a Kubernetes node you can use

bash

sudo ctr --namespace k8s.io containers ls

In this case it’s important to specify a namespace as otherwise you won’t see any output. To get a list of available namespaces use

bash

sudo ctr namespaces ls

NAME   LABELS 
k8s.io

On a Kubernetes node you will normally only get one namespace called k8s.io. For more information on how to use ctr see Why and How to Use containerd From Command Line. And if you want something more powerful, have a look at nerdctl. nerdctl can do a lot more but that’s a different story 😉
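
E.g. to get a docker ps like view of the containers Kubernetes started, something like this should work:

bash

sudo nerdctl --namespace k8s.io ps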