Kubernetes: Replace dockershim with containerd and runc
Introduction
If you followed my Kubernetes the not so hard way with Ansible blog series, you probably know that I’ve used the Docker container runtime on my Kubernetes worker nodes up until now. But support for it (or for the dockershim, to be more precise) is deprecated and will be removed in Kubernetes v1.24. The removal was already planned for v1.22 but was postponed. I guess a lot of organizations out there still use the Docker runtime for Kubernetes…
I don’t want to go into detail about why and how this happened. But TL;DR: Docker as an underlying runtime is being deprecated in favor of runtimes that use the Container Runtime Interface (CRI) created for Kubernetes. Docker-produced images will continue to work in your cluster with all runtimes, as they always have.
If you need more information here are some resources:
- Docker runtime deprecation announced in Kubernetes v1.20 CHANGELOG
- Don’t Panic: Kubernetes and Docker
- Dockershim Deprecation FAQ
If you still want to use dockershim, this announcement may be for you: Mirantis to take over support of Kubernetes dockershim. Still, I’d think twice before going down that road. All the community effort out there is going into containerd and CRI-O. E.g. it’s possible to use gVisor as a RuntimeClass in Kubernetes with containerd. gVisor is an application kernel, written in Go, that implements a substantial portion of the Linux system call interface. It provides an additional layer of isolation between running applications and the host operating system. gVisor includes an Open Container Initiative (OCI) runtime called runsc that makes it easy to work with existing container tooling. The runsc runtime integrates with Kubernetes, making it simple to run sandboxed containers. This is only one example. The containerd and CRI-O user base will keep growing over time, and more blog posts and articles will appear. I guess this won’t be true so much for the dockershim.
One interesting fact is that technically the change itself isn’t that big at all. Unless you run a quite ancient Docker version, you already use containerd and runc. E.g. grepping for docker in the process list on one of my worker nodes shows the following result (only the important processes are included):
/usr/local/bin/dockerd --bip= --host=unix:///run/docker.sock --ip-masq=false --iptables=false --log-level=error --mtu=1472 --storage-driver=overlay2
containerd --config /var/run/docker/containerd/containerd.toml --log-level error
/usr/local/bin/kubelet --cloud-provider= --cni-bin-dir=/opt/cni/bin --cni-conf-dir=/etc/cni/net.d --config=/var/lib/kubelet/kubelet-config.yaml --container-runtime=docker --image-pull-progress-deadline=2m --kubeconfig=/var/lib/kubelet/kubeconfig --network-plugin=cni --node-ip=10.8.0.205 --register-node=true
/usr/local/bin/containerd-shim-runc-v2 -namespace moby -id 93aed22ecdeaad7bec4217325557bedab919cd466010144afd616db3aba2656a -address /var/run/docker/containerd/containerd.sock
/usr/local/bin/containerd-shim-runc-v2 -namespace moby -id e34afbec290f737834b463fbb3820a62b720210d10278f3df4ab72068ec61f45 -address /var/run/docker/containerd/containerd.sock
If I grep for the last ID in this list in the Docker process list I can see the container running:
sudo docker ps | grep e34afbec290f
e34afbec290f cbb49690db02 "/app/cmd/webhook/we…" 4 hours ago Up 4 hours k8s_cert-manager_cert-manager-webhook-5b68d59578-b9ftk_cert-manager_d4cac7e3-d9d0-4404-be5f-e7f2edd32623_9
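The link between the shim processes and the docker ps output is the container ID: docker ps shows the first 12 characters of the 64-character ID that is passed to the shim via -id. Below is a minimal sketch for pulling those short IDs out of a process list. The function name shim_short_ids is my own invention and not part of any tool.

```shell
# Hypothetical helper: read process command lines on stdin and print the
# 12-character short ID of every containerd-shim process found.
shim_short_ids() {
  sed -n '/containerd-shim/s/.*-id \([0-9a-f]\{12\}\).*/\1/p'
}

# Usage on a worker node:
#   ps -eo args | shim_short_ids
```

Each ID printed this way can then be looked up with sudo docker ps | grep <ID>, as done above.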
As you can see, dockerd already uses containerd and runc in the background, which becomes even more visible with pstree:
dockerd(748)─┬─containerd(963)─┬─{containerd}(968)
             │                 ├─{containerd}(969)
             │                 ├─{containerd}(970)
             │                 ├─{containerd}(971)
             │                 ├─{containerd}(972)
             │                 ├─{containerd}(1013)
             │                 ├─{containerd}(1014)
             │                 ├─{containerd}(1015)
             │                 ├─{containerd}(1017)
             │                 └─{containerd}(4981)
So containerd is a child process of dockerd. And containerd-shim-runc-v2 processes are children of the containerd process.
Evicting pods on K8s node
My goal is to remove Docker on all my Kubernetes worker nodes and replace it with containerd and runc. I’ll do this node by node. So my first step is to remove all workloads (containers) from the first worker node and then continue with all the other worker nodes.
Before starting, you can optionally configure a Pod disruption budget to ensure that your workloads remain available during maintenance. If availability is important for any applications that run or could run on the node(s) that you are draining, configure a PodDisruptionBudget first and then continue following this guide.
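As a rough illustration of what such a budget can look like, here is a small sketch that prints a PodDisruptionBudget manifest. The helper name pdb_manifest, the names my-app-pdb and my-app, and the app label key are all assumptions for this example. policy/v1 is available from Kubernetes v1.21 on; older clusters would need policy/v1beta1 instead.

```shell
# Hypothetical helper: print a PodDisruptionBudget manifest.
# $1 = PDB name, $2 = value of the (assumed) "app" label, $3 = minAvailable
pdb_manifest() {
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $1
spec:
  minAvailable: $3
  selector:
    matchLabels:
      app: $2
EOF
}

# Usage (on a machine with cluster access):
#   pdb_manifest my-app-pdb my-app 1 | kubectl apply -f -
```

With minAvailable set to 1, an eviction is only allowed as long as at least one matching pod stays available, so this only makes sense for workloads with more than one replica.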
You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node. Safe evictions allow the pod’s containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified. So let’s start:
kubectl drain worker99 --ignore-daemonsets
node/worker99 already cordoned
WARNING: ignoring DaemonSet-managed Pods: cilium/cilium-g2xgj, traefik/traefik-4jsng
evicting pod cert-manager/cert-manager-webhook-5b68d59578-b9ftk
evicting pod cilium/cilium-operator-7f9745f9b6-r7llz
pod/server-755f675b4c-wjkxp evicted
pod/cert-manager-webhook-5b68d59578-b9ftk evicted
pod/cilium-operator-7f9745f9b6-r7llz evicted
node/worker99 evicted
As there are two pods running on the node that are managed by the DaemonSet controller, I have to include --ignore-daemonsets. If there are DaemonSet-managed pods, drain will not proceed without --ignore-daemonsets, and regardless it will not delete any DaemonSet-managed pods, because those pods would be immediately replaced by the DaemonSet controller, which ignores unschedulable markings.
Now it’s safe to do whatever we want to do with the node as no new pods will be scheduled there:
kubectl get nodes worker99
NAME STATUS ROLES AGE VERSION
worker99 Ready,SchedulingDisabled <none> 516d v1.21.4
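If you migrate several nodes, it can be handy to check this in a script. A minimal sketch, assuming the default table output of kubectl get nodes as shown above (the function name cordoned_nodes is made up for this example):

```shell
# Hypothetical helper: read `kubectl get nodes` output on stdin and print the
# names of all nodes whose STATUS column contains SchedulingDisabled.
cordoned_nodes() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage:
#   kubectl get nodes | cordoned_nodes
```

To see which pods are still running on the drained node (normally only the DaemonSet-managed ones), kubectl get pods -A -o wide --field-selector spec.nodeName=worker99 should do the job.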
Shutdown K8s services
Next I’ll shut down a few services. As I’m using Ansible to manage my Kubernetes cluster, I’ll use Ansible to make the changes now. Of course you can use other tooling or just plain shell commands.
So first I’ll stop the kubelet, kube-proxy and docker services:
for SERVICE in kubelet kube-proxy docker; do
  ansible -m systemd -a "name=${SERVICE} state=stopped" worker99
done
Remove docker artifacts
Next I’ll remove everything Docker related as it’s no longer needed. As I used my Ansible Docker role to install Docker, I’ll basically just revert everything this role installed (if you installed Docker via a package manager, just remove or purge the Docker package accordingly now). I haven’t implemented an uninstall task in my Ansible role, so let’s do it manually (your paths and hostname most probably vary, so please adjust):
ansible -m file -a 'path=/etc/systemd/system/docker.service state=absent' worker99
ansible -m file -a 'path=/etc/systemd/system/docker.socket state=absent' worker99
ansible -m systemd -a 'daemon_reload=yes' worker99
for FILE in containerd containerd-shim containerd-shim-runc-v2 ctr docker dockerd docker-init docker-proxy runc; do
  ansible -m file -a "path=/usr/local/bin/${FILE} state=absent" worker99
done
ansible -m shell -a 'sudo rm -fr /opt/tmp/docker*' worker99
ansible -m user -a 'name=docker state=absent' worker99
ansible -m group -a 'name=docker state=absent' worker99
Install containerd, runc and CNI plugins
Next I’ll install containerd, runc and the CNI plugins with the help of my Ansible roles.
For this to work I need to change the Ansible playbook for my Kubernetes cluster. It contains a play for the host group k8s_worker, which currently looks like this:
-
  hosts: k8s_worker
  roles:
    -
      role: githubixx.harden_linux
      tags: role-harden-linux
    -
      role: githubixx.docker
      tags: role-docker
    -
      role: githubixx.kubernetes_worker
      tags: role-kubernetes-worker
For more information also see my blog post Kubernetes the not so hard way with Ansible - The worker. I’ll now replace the githubixx.docker role with githubixx.containerd and add githubixx.cni and githubixx.runc. So after the change it looks like this:
-
  hosts: k8s_worker
  roles:
    -
      role: githubixx.harden_linux
      tags: role-harden-linux
    -
      role: githubixx.cni
      tags: role-cni
    -
      role: githubixx.runc
      tags: role-runc
    -
      role: githubixx.containerd
      tags: role-containerd
    -
      role: githubixx.kubernetes_worker
      tags: role-kubernetes-worker
Now everything is set up to install the roles with Ansible on the host in question:
ansible-playbook --tags=role-cni --limit=worker99 k8s.yml
ansible-playbook --tags=role-runc --limit=worker99 k8s.yml
ansible-playbook --tags=role-containerd --limit=worker99 k8s.yml
Adjust parameters for kubelet binary
Finally, some changes to the systemd kubelet.service are needed. As I manage my Kubernetes workers with Ansible using my Kubernetes worker role, a few changes were needed there too. Again, if you don’t use Ansible and my role, you can make the changes manually or deploy them with whatever deployment tool you use. For the kubelet service my role contains a variable called k8s_worker_kubelet_settings, which looked like this (these are the parameters provided to the kubelet binary):
k8s_worker_kubelet_settings:
  "config": "{{k8s_worker_kubelet_conf_dir}}/kubelet-config.yaml"
  "node-ip": "{{hostvars[inventory_hostname]['ansible_' + k8s_interface].ipv4.address}}"
  "container-runtime": "docker"
  "image-pull-progress-deadline": "2m"
  "kubeconfig": "{{k8s_worker_kubelet_conf_dir}}/kubeconfig"
  "network-plugin": "cni"
  "cni-conf-dir": "{{k8s_cni_conf_dir}}"
  "cni-bin-dir": "{{k8s_cni_bin_dir}}"
  "cloud-provider": ""
  "register-node": "true"
These were the settings to make kubelet and Docker/dockershim work. As there is no Docker/dockershim anymore, a few of these settings are no longer needed: they will go away when dockershim gets removed and are already deprecated in Kubernetes v1.21. So the new config looks like this:
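k8s_worker_kubelet_settings:
  "config": "{{ k8s_worker_kubelet_conf_dir }}/kubelet-config.yaml"
  "node-ip": "{{ hostvars[inventory_hostname]['ansible_' + k8s_interface].ipv4.address }}"
  "container-runtime": "remote"
  # Assumption: containerd's default socket path; adjust if your setup differs
  "container-runtime-endpoint": "unix:///run/containerd/containerd.sock"
  "kubeconfig": "{{ k8s_worker_kubelet_conf_dir }}/kubeconfig"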
The previous settings image-pull-progress-deadline, network-plugin, cni-conf-dir and cni-bin-dir will all be removed with the dockershim removal. cloud-provider will be removed in Kubernetes v1.23, in favor of removing cloud provider code from Kubelet. container-runtime has only two possible values and changed from docker to remote. And finally one new setting is needed, container-runtime-endpoint, which points to containerd’s socket.
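If you are unsure whether a kubelet is still started with one of the dockershim-only flags, a small check like the following might help. It is only a sketch: the function name dockershim_flags is made up, and the flag list is simply the four settings named above.

```shell
# Hypothetical helper: read a kubelet command line on stdin and print every
# dockershim-only flag (the four named in the text above) that it contains.
dockershim_flags() {
  cmdline=" $(cat) "
  for flag in image-pull-progress-deadline network-plugin cni-conf-dir cni-bin-dir; do
    case "$cmdline" in
      *" --$flag="* | *" --$flag "*) echo "$flag" ;;
    esac
  done
}

# Usage on a worker node (assumes exactly one running kubelet process):
#   tr '\0' ' ' < "/proc/$(pgrep -x kubelet)/cmdline" | dockershim_flags
```

No output means none of these flags is set anymore.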
Adjust settings for kubelet.service
I also needed to change the [Unit] section of kubelet.service, as kubelet no longer depends on Docker but on containerd. So I needed
[Unit]
After=docker.service
Requires=docker.service
to be
[Unit]
After=containerd.service
Requires=containerd.service
Deploy the changes for the K8s worker node
Now the changes for the worker node can be deployed:
ansible-playbook --tags=role-kubernetes-worker --limit=worker99 k8s.yml
This will also start kubelet.service and kube-proxy.service again. If both services start successfully, you’ll notice that the DaemonSet controller has already started the DaemonSet pods again, if any DaemonSets were running before, e.g. something like Cilium or Calico. To finally allow Kubernetes to schedule regular Pods on that node again, we need to uncordon it:
kubectl uncordon worker99
Test the changes
To make sure everything works as intended, I’d recommend rebooting the node now and checking that all needed services start successfully after the reboot and that new Pods get scheduled on that node. So with
kubectl get pods -A -o wide
you should see Pods scheduled on the node that was just migrated from Docker to containerd!
You can also check which container runtime a node uses. After I migrated two out of four Kubernetes nodes to containerd, the node info looks like this:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
worker96 Ready <none> 517d v1.21.8 10.8.0.202 <none> Ubuntu 20.04.3 LTS 5.13.0-23-generic docker://20.10.12
worker97 Ready <none> 517d v1.21.8 10.8.0.203 <none> Ubuntu 20.04.3 LTS 5.13.0-23-generic docker://20.10.12
worker98 Ready <none> 517d v1.21.8 10.8.0.204 <none> Ubuntu 20.04.3 LTS 5.13.0-23-generic containerd://1.5.9
worker99 Ready <none> 517d v1.21.8 10.8.0.205 <none> Ubuntu 20.04.3 LTS 5.13.0-23-generic containerd://1.5.9
As you can see, worker99 and worker98 are already using containerd, while worker97 and worker96 are still using the Docker container runtime.
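To keep track of which nodes are still left to migrate, the same output can be filtered. A minimal sketch, assuming the -o wide table layout shown above, where CONTAINER-RUNTIME is the last column (the function name docker_nodes is my own):

```shell
# Hypothetical helper: read `kubectl get nodes -o wide` output on stdin and
# print the names of all nodes whose container runtime is still Docker.
# The OS-IMAGE column contains spaces, so the last field ($NF) is used.
docker_nodes() {
  awk 'NR > 1 && $NF ~ /^docker:\/\// { print $1 }'
}

# Usage:
#   kubectl get nodes -o wide | docker_nodes
```

A less layout-dependent alternative is to ask for the field directly, e.g. kubectl get nodes -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion.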
Test Docker registry (optional)
If you have your own Docker container registry running, also make sure that your node is still able to pull container images from it. For me everything worked without issues.
Now you can continue with the next node until all nodes are using containerd without Docker.
Using ctr/nerdctl instead of docker CLI command
Maybe one final note: of course the docker CLI command on your worker nodes is now gone too. So you can’t run docker ps or something like that anymore. Normally this is something you want to avoid on a production system anyway. But sometimes it’s quite handy to check which containers are running or to do some lower-level debugging. There is a little tool called ctr which can be used for that. It’s not as powerful as the docker CLI command, but for basic stuff it’s enough. E.g. to get a list of containers on a Kubernetes node you can use
sudo ctr --namespace k8s.io containers ls
In this case it’s important to specify a namespace, as otherwise you won’t see any output. To get a list of available namespaces use
sudo ctr namespaces ls
NAME LABELS
k8s.io
On a Kubernetes node you will normally only get one namespace called k8s.io. For more information on how to use ctr see Why and How to Use containerd From Command Line. And if you want something more powerful, have a look at nerdctl. nerdctl can do a lot more, but that’s a different story 😉
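As nearly every ctr call on a Kubernetes node needs the k8s.io namespace, a tiny wrapper can save some typing. This is just a sketch and the function name k8s_ctr is made up. The subcommands in the usage comments (containers ls, tasks ls, images ls) are regular ctr subcommands.

```shell
# Hypothetical wrapper: run ctr against the k8s.io namespace used by kubelet.
k8s_ctr() {
  sudo ctr --namespace k8s.io "$@"
}

# Usage on a worker node:
#   k8s_ctr containers ls   # containers known to containerd
#   k8s_ctr tasks ls        # running tasks (the containers' processes)
#   k8s_ctr images ls       # images available in this namespace
```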