Many rancher-agent containers running on a Rancher v2.x provisioned RKE cluster, where stopped containers are regularly deleted on hosts

Article Number: 000020202

Environment

A Rancher v2.x provisioned Rancher Kubernetes Engine (RKE) cluster.
Repeated deletion of stopped containers on hosts in the cluster, e.g. use of docker system prune, either manually or as part of an automated process such as a cronjob.

Situation

Issue

On a Rancher v2.x provisioned cluster, a host shows a large number of containers running the rancher-agent image, per the following output of docker ps | grep rancher-agent:

$ docker ps | grep rancher-agent
...
aeffe9725521        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   About a minute ago   Up About a minute                       sleepy_hopper
130120f49b71        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   6 minutes ago        Up 6 minutes                            stoic_hypatia
498b923d9b6e        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   11 minutes ago       Up 11 minutes                           laughing_elbakyan
3453865e5f70        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   16 minutes ago       Up 16 minutes                           wonderful_gagarin
f925209cd16a        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   21 minutes ago       Up 21 minutes                           silly_shannon
7d7fb5d4bf04        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   26 minutes ago       Up 26 minutes                           gifted_elgamal
...

Running docker inspect <container_id> against any of these containers shows a Path and Args of the following format:

"Path": "run.sh",
"Args": [
    "--server",
    "https://167.172.96.240",
    "--token",
    "gwrp7zlnwvsnzh2nhbvwcgdw45ccv6cq9pztzdd92j6xlv69xxhvnp",
    "--ca-checksum",
    "bbc8c7ca05c87a7140154554fa1a516178852f2710538c57718f4c874c29533c",
    "--no-register",
    "--only-write-certs"
],
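
The check above can be scripted to count how many running containers match this pattern. The following is a minimal sketch, not a supported tool: the function name count_cert_writers is hypothetical, and it assumes its input consists of lines of the form "<id> <image> <args-json>", as produced by docker inspect with the format string shown in the comment.

```shell
#!/bin/sh
# Hypothetical helper (not part of Rancher or Docker).
# Reads lines of the form "<id> <image> <args-json>" on stdin and prints the
# number of lines that use a rancher/rancher-agent image and were started
# with the --only-write-certs argument.
count_cert_writers() {
    grep 'rancher/rancher-agent' | grep -c -- '"--only-write-certs"'
}

# Example invocation on a host with running containers (assumes the docker CLI):
#   docker ps -q \
#     | xargs docker inspect --format '{{.Id}} {{.Config.Image}} {{json .Args}}' \
#     | count_cert_writers
```

A count that keeps growing between runs suggests that the host is affected by the behaviour described in this article.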

Cause

This behaviour is a result of the issue reported in Rancher GitHub issue #15364.

The share-mnt container is created on a Rancher provisioned Kubernetes cluster and exits upon completion, but is not removed, so that it can be invoked again.

Meanwhile, the Rancher node-agent Pod on a host will spawn a new share-mnt container if the existing share-mnt container has been removed. Upon starting, the share-mnt process spawns a rancher-agent container to write certificates. This rancher-agent container will run indefinitely, until the node-agent is triggered to reconnect to the Rancher server or the node-agent process is restarted.

Consequently, where the share-mnt container on a host is removed repeatedly, either manually or by an automated process, multiple running rancher-agent containers accumulate on that host.

Resolution

To trigger automatic removal of the rancher-agent containers, restart the node-agent container on the host. Identify the running node-agent container with docker ps | grep k8s_agent_cattle-node, then restart it with docker restart <container_id>.
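
The lookup and restart can be combined into one step. The following is a minimal sketch under stated assumptions: the function name node_agent_ids is hypothetical, and it expects "<id> <name>" lines as printed by docker ps --format '{{.ID}} {{.Names}}'. It matches on the container name only, which is slightly narrower than the grep above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Rancher or Docker).
# Reads "<id> <name>" lines on stdin and prints the IDs of containers whose
# name contains k8s_agent_cattle-node, i.e. the node-agent container.
node_agent_ids() {
    awk '$2 ~ /k8s_agent_cattle-node/ { print $1 }'
}

# Example invocation on a host (assumes the docker CLI):
#   for id in $(docker ps --format '{{.ID}} {{.Names}}' | node_agent_ids); do
#       docker restart "$id"
#   done
```

Printing the IDs first, without the restart loop, is a simple way to confirm that only the intended container is selected.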

In addition, you can prevent further creation of multiple rancher-agent container instances by removing whichever process is triggering the deletion of stopped containers.
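
Where the deletion is suspected to come from a cronjob, the crontab can be searched for prune commands. The following is a minimal sketch: the function name find_prune_jobs is hypothetical, and it assumes the deletion is performed with docker system prune or docker container prune. Other mechanisms, such as custom scripts or systemd timers, would not be found this way.

```shell
#!/bin/sh
# Hypothetical helper (not part of Rancher or Docker).
# Reads crontab lines on stdin, skips commented-out lines, and prints entries
# that invoke "docker system prune" or "docker container prune".
find_prune_jobs() {
    grep -v '^[[:space:]]*#' \
        | grep -E 'docker[[:space:]]+(system|container)[[:space:]]+prune'
}

# Example invocation for the current user's crontab:
#   crontab -l | find_prune_jobs
```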