Many rancher-agent containers running on a Rancher v2.x provisioned RKE cluster where stopped containers are regularly deleted on hosts
This document (000020202) is provided subject to the disclaimer at the end of this document.
Environment
- A Rancher v2.x provisioned Rancher Kubernetes Engine (RKE) cluster.
- Repeated deletion of stopped containers on hosts in the cluster, e.g. use of docker system prune, either manually or as part of an automated process such as a cron job (see the hypothetical example below).
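As an illustration, a host-level cron entry like the following would repeatedly delete stopped containers on the host. This is a hypothetical example; the schedule, file name, and binary path will vary per environment:
# /etc/cron.d/docker-prune - hypothetical example of an automated cleanup job
0 * * * * root /usr/bin/docker system prune -f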
Situation
Issue
On a Rancher v2.x provisioned cluster, a host shows a large number of containers running the rancher-agent image, per the following output of docker ps | grep rancher-agent:
$ docker ps | grep rancher-agent
...
aeffe9725521 rancher/rancher-agent:v2.3.3 "run.sh --server htt…" About a minute ago Up About a minute sleepy_hopper
130120f49b71 rancher/rancher-agent:v2.3.3 "run.sh --server htt…" 6 minutes ago Up 6 minutes stoic_hypatia
498b923d9b6e rancher/rancher-agent:v2.3.3 "run.sh --server htt…" 11 minutes ago Up 11 minutes laughing_elbakyan
3453865e5f70 rancher/rancher-agent:v2.3.3 "run.sh --server htt…" 16 minutes ago Up 16 minutes wonderful_gagarin
f925209cd16a rancher/rancher-agent:v2.3.3 "run.sh --server htt…" 21 minutes ago Up 21 minutes silly_shannon
7d7fb5d4bf04 rancher/rancher-agent:v2.3.3 "run.sh --server htt…" 26 minutes ago Up 26 minutes gifted_elgamal
...
A docker inspect <container_id> for these containers shows the Path and Args in the following format:
"Path": "run.sh",
"Args": [
"--server",
"https://167.172.96.240",
"--token",
"gwrp7zlnwvsnzh2nhbvwcgdw45ccv6cq9pztzdd92j6xlv69xxhvnp",
"--ca-checksum",
"bbc8c7ca05c87a7140154554fa1a516178852f2710538c57718f4c874c29533c",
"--no-register",
"--only-write-certs"
],
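Where many such containers are present, the same fields can be read per container in a single line using a Go template; this is standard docker inspect usage and <container_id> is a placeholder:
$ docker inspect --format '{{.Path}} {{.Args}}' <container_id>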
Resolution
To trigger automatic removal of the rancher-agent containers, restart the node-agent container on the host. Identify the running node-agent container with docker ps | grep k8s_agent_cattle-node, then restart it with docker restart <container_id>.
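For example, on an affected host (the container ID and truncated columns below are hypothetical and will differ):
$ docker ps | grep k8s_agent_cattle-node
1a2b3c4d5e6f rancher/rancher-agent:v2.3.3 "run.sh" ... k8s_agent_cattle-node-agent-...
$ docker restart 1a2b3c4d5e6f
Alternatively, the two steps can be combined as docker restart $(docker ps -q --filter name=k8s_agent_cattle-node).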
In addition, you can prevent the creation of further rancher-agent container instances by removing whichever process is triggering the deletion of stopped containers on the host.
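If a scheduled cleanup of stopped containers is still desired, one option is to make it selective so that the exited share-mnt helper container is preserved. The following is a minimal sketch rather than an official recommendation; it assumes the helper container's name contains share-mnt (confirm with docker ps -a on the host) and that GNU grep, awk and xargs are available:
# Remove exited containers, excluding any whose name contains "share-mnt"
$ docker ps -a --filter status=exited --format '{{.ID}} {{.Names}}' \
    | grep -v 'share-mnt' \
    | awk '{print $1}' \
    | xargs -r docker rm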
Cause
This behaviour is a result of the issue reported in Rancher GitHub issue #15364.
The share-mnt container is created on nodes in a Rancher provisioned Kubernetes cluster and exits upon completion; it is not removed, so that it can be invoked again.
Meanwhile, the Rancher node-agent Pod on a host will spawn a new share-mnt container if the existing one is removed. Upon starting, the share-mnt process spawns a rancher-agent container to write certificates. This agent container runs indefinitely, until the node-agent is triggered to reconnect to the Rancher server or the node-agent process is restarted.
As a result, where the share-mnt container on a host is removed repeatedly, either manually or by an automated process, multiple running rancher-agent containers accumulate on the host.
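To verify this on an affected host, check whether the share-mnt container is currently present among stopped containers; the filter value assumes the helper container's name contains share-mnt:
$ docker ps -a --filter "name=share-mnt"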
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.