system-reserved and kube-reserved resource reservations
Article Number: 000021853
Environment
- SUSE Rancher 2.x
- Downstream or Standalone/Custom RKE1/RKE2/K3s cluster
Situation
- Kubernetes nodes can be scheduled up to their 'Capacity', and by default pods can consume all the available capacity on a node. This is an issue because nodes typically run quite a few system daemons that power the OS and Kubernetes itself. Unless resources are set aside for these system daemons, pods and system daemons compete for resources, leading to resource starvation issues on the node.
- The kubelet exposes a feature named 'Node Allocatable' that helps reserve compute resources for system daemons. Kubernetes recommends that cluster administrators configure 'Node Allocatable' based on the workload density on each node.
- 'Allocatable' on a Kubernetes node is defined as the amount of compute resources that are available for pods: Allocatable = Node Capacity - kube-reserved - system-reserved (hard eviction thresholds further reduce allocatable memory and ephemeral storage). The scheduler treats 'Allocatable' as the capacity available for pods and does not over-subscribe it. 'cpu', 'memory', and 'ephemeral-storage' are the supported resources, as the sample output below illustrates.
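For example, with the sample reservations used later in this article applied on an 8-CPU/16Gi node, kubectl describe node shows the gap between the two (illustrative values only; the exact numbers depend on the node size and the configured reservations and eviction thresholds):

kubectl describe node <node-name>
...
Capacity:
  cpu:                8
  ephemeral-storage:  104845292Ki
  memory:             16374584Ki
  pods:               110
Allocatable:
  cpu:                6
  ephemeral-storage:  63951321Ki
  memory:             12180280Ki
  pods:               110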
Kube Reserved:
- KubeletConfiguration setting: kubeReserved: {}. Example value: {cpu: 100m, memory: 100Mi, ephemeral-storage: 1Gi, pid=1000}
- kubeReserved is meant to capture resource reservations for Kubernetes system daemons like the kubelet, container runtime, etc. (see the configuration sketch below).
- In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the given number of process IDs for Kubernetes system daemons.
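For reference, the same reservation expressed in an upstream kubelet configuration file looks like the following (a minimal sketch with the example values above; surrounding fields omitted):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: 100m
  memory: 100Mi
  ephemeral-storage: 1Gi
  pid: "1000"    # quoted, as kubeReserved is a map of strings

The Rancher and standalone examples in the Resolution below pass the equivalent --kube-reserved kubelet flag instead.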
System Reserved:
- KubeletConfiguration setting: systemReserved: {}. Example value: {cpu: 100m, memory: 100Mi, ephemeral-storage: 1Gi, pid=1000}
- systemReserved is meant to capture resource reservations for OS system daemons like sshd, udev, etc. systemReserved should reserve memory for the kernel too, since kernel memory is not accounted to pods in Kubernetes at this time. Reserving resources for user login sessions is also recommended (user.slice in the systemd world).
- In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the given number of process IDs for OS system daemons.
- Be careful when enforcing the system-reserved reservation, since it can lead to critical system services being CPU starved, OOM killed, or unable to fork on the node. The recommendation is to enforce system-reserved only after profiling the nodes exhaustively to come up with precise estimates, and only if you are confident in your ability to recover if any process in that group is OOM killed. A sketch of what enforcement looks like follows below.
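Enforcement is opt-in. In kubelet configuration terms it means pointing the kubelet at an already-existing cgroup (a sketch; the same applies to kube-reserved via kubeReservedCgroup):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: 100m
  memory: 100Mi
enforceNodeAllocatable:    # default is ["pods"], i.e. no enforcement of reservations
  - pods
  - system-reserved
systemReservedCgroup: /system.slice    # this cgroup must already exist on the node

The flag-based equivalents (enforce-node-allocatable, system-reserved-cgroup) appear in the RKE1 sample at the end of this article.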
Resolution
How to configure system-reserved and kube-reserved reservations:
Rancher-provisioned clusters:
- RKE2 cluster: In the Rancher UI go to Cluster management -> Select your RKE2 cluster -> Edit Config -> Advanced, then specify the parameters under 'Additional Kubelet Args'. Add the following sample lines and click Save to apply (a kubectl command to verify the change follows the RKE1 example below).
kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=5Gi
system-reserved=cpu=1,memory=1548Mi,ephemeral-storage=30Gi
- K3s cluster: In the Rancher UI go to Cluster management -> Select your K3s cluster -> Edit Config -> Advanced, then specify the parameters under 'Additional Kubelet Args'. Add the following sample lines and click Save to apply.
kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=5Gi
system-reserved=cpu=1,memory=1548Mi,ephemeral-storage=30Gi
- RKE1 cluster: In the Rancher UI go to Cluster management -> Select your RKE1 cluster -> Edit Config -> Edit as YAML, then under services -> kubelet -> extra_args add the following sample lines and click Save to apply.
kubelet:
  extra_args:
    kube-reserved: "cpu=1,memory=1Gi,ephemeral-storage=1Gi"
    system-reserved: "cpu=500m,memory=1Gi,ephemeral-storage=1Gi"
Standalone clusters:
- RKE2 cluster: Specify the system-reserved and kube-reserved parameters in the configuration file at /etc/rancher/rke2/config.yaml. Here is a sample /etc/rancher/rke2/config.yaml for a server node (the kubelet-arg entries apply equally on agent nodes):
token: rASpGladPp
node-name: wk1.cluster.local
node-ip: 10.10.24.1
cluster-domain: cluster.local
tls-san:
- cluster.local
cluster-cidr: 10.10.16.0/21
service-cidr: 10.10.0.0/20
cluster-dns: 10.10.0.10
service-node-port-range: 30000-32767
kube-apiserver-arg:
- request-timeout=2m
kubelet-arg:
- kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=5Gi
- system-reserved=cpu=1,memory=1548Mi,ephemeral-storage=30Gi
- eviction-hard=memory.available<500Mi,nodefs.available<4%
cni: calico
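The config.yaml is only read at service startup, so restart the RKE2 service for the new kubelet arguments to take effect (the unit name depends on the node role):

# On a server (control plane) node:
systemctl restart rke2-server
# On an agent (worker) node:
systemctl restart rke2-agent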
- K3s cluster: Specify the system-reserved and kube-reserved parameters in the configuration file at /etc/rancher/k3s/config.yaml. Here is a sample /etc/rancher/k3s/config.yaml for a server node:
token: xxxxxxxxx
cluster-domain: k3s.local
tls-san:
- k3s.local
kube-apiserver-arg:
- request-timeout=2m
kubelet-arg:
- kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=5Gi
- system-reserved=cpu=1,memory=1548Mi,ephemeral-storage=30Gi
- eviction-hard=memory.available<500Mi,nodefs.available<4%
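As with RKE2, restart the K3s service after editing the file so the kubelet picks up the new arguments:

# On a server node:
systemctl restart k3s
# On an agent node:
systemctl restart k3s-agent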
- RKE1 cluster: Specify the kube-reserved and system-reserved parameters under services -> kubelet -> extra_args in the cluster.yml used by RKE. Here is a sample cluster.yml:
nodes:
- address: 192.168.100.41
  internal_address: 192.168.100.41
  user: root
  role: [controlplane, etcd]
- address: 192.168.100.42
  internal_address: 192.168.100.42
  user: root
  role: [worker]
- address: 192.168.100.43
  internal_address: 192.168.100.43
  user: root
  role: [worker]
- address: 192.168.100.44
  internal_address: 192.168.100.44
  user: root
  role: [worker]
cluster_name: rke-sample
kubernetes_version: v1.26.15-rancher1-1
services:
  etcd:
    backup_config:
      interval_hours: 6
      retention: 30
  kube-api:
    service_cluster_ip_range: 10.41.0.0/16
  kube-controller:
    cluster_cidr: 10.40.0.0/16
    service_cluster_ip_range: 10.41.0.0/16
  kubelet:
    cluster_dns_server: 10.41.0.10
    extra_args:
      enforce-node-allocatable: "pods,kube-reserved,system-reserved"
      kube-reserved: "cpu=1,memory=1Gi,ephemeral-storage=1Gi"
      system-reserved: "cpu=500m,memory=1Gi,ephemeral-storage=1Gi"
      kube-reserved-cgroup: /kube.slice
      system-reserved-cgroup: /system.slice
      eviction-hard: "memory.available<500Mi,imagefs.available<10%,nodefs.available<10%,nodefs.inodesFree<5%"