Many rancher-agent containers running on a Rancher v2.x provisioned RKE cluster where stopped containers are regularly deleted on hosts
Article Number: 000020202
Environment
- A Rancher v2.x provisioned Rancher Kubernetes Engine (RKE) cluster.
- Repeated deletion of stopped containers on hosts in the cluster, e.g. use of docker system prune, either manually or as part of an automated process such as a cronjob.
Situation
Issue
On a Rancher v2.x provisioned cluster, a host shows a large number of containers running the rancher-agent image, per the following output of docker ps | grep rancher-agent:
$ docker ps | grep rancher-agent
...
aeffe9725521 rancher/rancher-agent:v2.3.3 "run.sh --server htt…" About a minute ago Up About a minute sleepy_hopper
130120f49b71 rancher/rancher-agent:v2.3.3 "run.sh --server htt…" 6 minutes ago Up 6 minutes stoic_hypatia
498b923d9b6e rancher/rancher-agent:v2.3.3 "run.sh --server htt…" 11 minutes ago Up 11 minutes laughing_elbakyan
3453865e5f70 rancher/rancher-agent:v2.3.3 "run.sh --server htt…" 16 minutes ago Up 16 minutes wonderful_gagarin
f925209cd16a rancher/rancher-agent:v2.3.3 "run.sh --server htt…" 21 minutes ago Up 21 minutes silly_shannon
7d7fb5d4bf04 rancher/rancher-agent:v2.3.3 "run.sh --server htt…" 26 minutes ago Up 26 minutes gifted_elgamal
...
Running docker inspect <container_id> on these containers shows that the Path and Args are of the following format:
    "Path": "run.sh",
    "Args": [
        "--server",
        "https://167.172.96.240",
        "--token",
        "gwrp7zlnwvsnzh2nhbvwcgdw45ccv6cq9pztzdd92j6xlv69xxhvnp",
        "--ca-checksum",
        "bbc8c7ca05c87a7140154554fa1a516178852f2710538c57718f4c874c29533c",
        "--no-register",
        "--only-write-certs"
    ],
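To pick out only the affected containers on a busy host, the two checks above (image name and arguments) can be combined. The following is a minimal sketch, not an official Rancher tool: the function name list_cert_agents is hypothetical, and it assumes the image name contains rancher/rancher-agent: (possibly behind a private registry prefix).

```shell
#!/bin/sh
# Hypothetical helper: print the IDs of running rancher-agent containers
# whose arguments include --only-write-certs.
list_cert_agents() {
    docker ps --format '{{.ID}} {{.Image}}' |
    while read -r id image; do
        case "$image" in
            # Assumption: the image name contains rancher/rancher-agent:<tag>
            *rancher/rancher-agent:*) ;;
            *) continue ;;
        esac
        # .Args is the same argument list shown by docker inspect
        if docker inspect --format '{{json .Args}}' "$id" </dev/null |
            grep -q -- '--only-write-certs'; then
            echo "$id"
        fi
    done
}
```

Running list_cert_agents | wc -l on a host then gives the number of certificate-writing agent containers currently running.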
Cause
This behaviour is a result of the issue reported in Rancher GitHub issue #15364.
The share-mnt container is created on a Rancher provisioned Kubernetes cluster and exits upon completion, but it is not removed, so that it can be invoked again.
Meanwhile, the Rancher node-agent Pod on a host will spawn a new share-mnt container if the existing share-mnt container is removed. Upon starting, the share-mnt process spawns a rancher-agent container to write certificates. This agent container will run indefinitely until the node-agent is triggered to reconnect to the Rancher server or the node-agent process is restarted.
As a result, where the share-mnt container on a host is removed repeatedly, either manually or by an automated process, the host ends up with multiple running rancher-agent containers.
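One way to confirm this cause on a host is to check whether the exited share-mnt container is still present. The sketch below assumes the container is literally named share-mnt; the function name share_mnt_status is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether a share-mnt container (running or
# exited) currently exists on this host.
share_mnt_status() {
    # Assumption: the container name contains "share-mnt".
    status=$(docker ps -a --filter 'name=share-mnt' --format '{{.Status}}' | head -n 1)
    if [ -z "$status" ]; then
        echo "missing"
    else
        echo "present: $status"
    fi
}
```

If share_mnt_status reports missing shortly after the container was recreated, something on the host is likely deleting stopped containers.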
Resolution
To trigger automatic removal of the rancher-agent containers, restart the node-agent container on the host. Identify the running agent container with docker ps | grep k8s_agent_cattle-node, then restart it with docker restart <container_id>.
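The two steps can be combined into one small script. This is a minimal sketch rather than a supported procedure; the function name restart_node_agent is hypothetical, and it restarts only the first container whose name matches k8s_agent_cattle-node.

```shell
#!/bin/sh
# Hypothetical helper: find the running node-agent container by name and
# restart it.
restart_node_agent() {
    id=$(docker ps --format '{{.ID}} {{.Names}}' |
        awk '/k8s_agent_cattle-node/ {print $1; exit}')
    if [ -z "$id" ]; then
        echo "no running k8s_agent_cattle-node container found" >&2
        return 1
    fi
    docker restart "$id"
}
```

Call restart_node_agent on each affected host, then re-run docker ps | grep rancher-agent to confirm that the extra containers have been removed.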
In addition, you can prevent further creation of multiple rancher-agent container instances by removing whichever process is triggering the deletion of stopped containers.
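Where the deletion comes from a scheduled job, searching the host's cron configuration for prune commands may help locate it. The sketch below is an illustration only: the function name find_prune_jobs is hypothetical, the paths to search are supplied by the caller, and it matches only docker system prune and docker container prune, not other deletion methods.

```shell
#!/bin/sh
# Hypothetical helper: list files under the given paths that contain a
# docker prune command. Unreadable or missing paths are skipped silently.
find_prune_jobs() {
    grep -rlsE 'docker (system|container) prune' "$@"
}
```

For example, find_prune_jobs /etc/crontab /etc/cron.d /var/spool/cron covers common cron locations on many Linux distributions; adjust the paths to match the host.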