system-reserved and kube-reserved resource reservations

Article Number: 000021853

Environment

  • SUSE Rancher 2.x
  • Downstream or Standalone/Custom RKE1/RKE2/K3s cluster

Situation

  • By default, pods can be scheduled up to a node's full capacity and can consume all of its resources. This is a problem because nodes typically run several system daemons that power the OS and Kubernetes itself. Unless resources are set aside for these daemons, they compete with pods for resources, which can lead to resource starvation on the node.
  • The scheduler treats 'Allocatable' as the capacity available for pods.
  • The kubelet exposes a feature named 'Node Allocatable' that helps to reserve compute resources for system daemons. Kubernetes recommends that cluster administrators configure 'Node Allocatable' based on their workload density on each node.
  • 'Allocatable' on a Kubernetes node is defined as the amount of compute resources that are available for pods (Allocatable = Total node capacity - kube-reserved - system-reserved - hard eviction threshold). The scheduler does not over-subscribe 'Allocatable'. 'cpu', 'memory', and 'ephemeral-storage' are supported values.
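
As a worked example of the Allocatable calculation, the following Python sketch computes the result for a hypothetical node with 4 CPUs and 16Gi of memory, using the sample reservations from this article. The node size, the function names, and the simplified quantity parser are assumptions for illustration only; the parser handles just the suffixes that appear here. The sketch also subtracts a 500Mi hard eviction threshold for memory, since the kubelet deducts that threshold in addition to the two reservations.

```python
from decimal import Decimal

# Binary and decimal suffixes used by Kubernetes quantities (subset).
SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
            "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_quantity(text):
    """Convert a quantity such as '500m', '1548Mi' or '2' to base units."""
    if text.endswith("m"):  # milli-units, e.g. 500m CPU = 0.5 cores
        return Decimal(text[:-1]) / 1000
    for suffix, factor in SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[:-len(suffix)]) * factor
    return Decimal(text)


def allocatable(capacity, kube_reserved, system_reserved, eviction_hard):
    """Capacity minus both reservations and the hard eviction threshold."""
    result = {}
    for resource, total in capacity.items():
        reserved = sum(parse_quantity(part.get(resource, "0"))
                       for part in (kube_reserved, system_reserved, eviction_hard))
        result[resource] = parse_quantity(total) - reserved
    return result


# Hypothetical node: 4 CPUs, 16Gi memory, with the sample reservations.
node = allocatable(
    capacity={"cpu": "4", "memory": "16Gi"},
    kube_reserved={"cpu": "1", "memory": "2Gi"},
    system_reserved={"cpu": "1", "memory": "1548Mi"},
    eviction_hard={"memory": "500Mi"},
)
print(node["cpu"])             # CPU cores left for pods
print(node["memory"] / 2**30)  # memory left for pods, in Gi
```

With these assumed values, 2 CPUs and 12Gi of memory remain schedulable for pods (16384Mi - 2048Mi - 1548Mi - 500Mi = 12288Mi).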

Kube Reserved:

  • KubeletConfiguration setting: kubeReserved: {}. Example value: {cpu: 100m, memory: 100Mi, ephemeral-storage: 1Gi, pid: "1000"}
  • kubeReserved is meant to capture resource reservation for Kubernetes system daemons like the kubelet, container runtime, etc.
  • In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the specified number of process IDs for Kubernetes system daemons.

System Reserved:

  • KubeletConfiguration setting: systemReserved: {}. Example value: {cpu: 100m, memory: 100Mi, ephemeral-storage: 1Gi, pid: "1000"}
  • systemReserved is meant to capture resource reservation for OS system daemons like sshd, udev, etc. systemReserved should reserve memory for the kernel too since kernel memory is not accounted to pods in Kubernetes at this time. Reserving resources for user login sessions is also recommended (user.slice in systemd world).
  • In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the specified number of process IDs for OS system daemons.
  • Be careful when enforcing the system-reserved reservation, since it can lead to critical system services being CPU starved, OOM-killed, or unable to fork on the node. The recommendation is to enforce system-reserved only if you have profiled your nodes exhaustively to come up with precise estimates and are confident in your ability to recover if any process in that group is OOM-killed.

Resolution

How to configure system-reserved and kube-reserved reservations:

Rancher-provisioned cluster:

  • RKE2 cluster: In the Rancher UI, go to Cluster Management -> select your RKE2 cluster -> Edit Config -> Advanced, then specify the parameters under 'Additional Kubelet Args'. Add the following lines (sample values) and click Save to apply.

kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=5Gi
system-reserved=cpu=1,memory=1548Mi,ephemeral-storage=30Gi

  • K3s cluster: In the Rancher UI, go to Cluster Management -> select your K3s cluster -> Edit Config -> Advanced, then specify the parameters under 'Additional Kubelet Args'. Add the following lines (sample values) and click Save to apply.

kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=5Gi
system-reserved=cpu=1,memory=1548Mi,ephemeral-storage=30Gi

  • RKE1 cluster: In the Rancher UI, go to Cluster Management -> select your RKE cluster -> Edit Config -> Edit as YAML, and under 'services' -> 'kubelet' -> 'extra_args', add the following lines (sample values), then click Save to apply.

  kubelet:
    extra_args:
      kube-reserved: "cpu=1,memory=1Gi,ephemeral-storage=1Gi"
      system-reserved: "cpu=500m,memory=1Gi,ephemeral-storage=1Gi"
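
Because these values are entered as free-form strings, a small pre-check can catch formatting mistakes before saving. The following Python sketch parses one 'Additional Kubelet Args' line and rejects resource names other than the four described in this article. The function name parse_reservation and the error messages are hypothetical, and the sketch does not validate the quantities themselves.

```python
# Resource names accepted by kube-reserved and system-reserved.
ALLOWED_RESOURCES = {"cpu", "memory", "ephemeral-storage", "pid"}
RESERVATION_FLAGS = ("kube-reserved", "system-reserved")


def parse_reservation(arg):
    """Split 'kube-reserved=cpu=1,memory=2Gi' into (flag, {resource: quantity})."""
    flag, _, value = arg.strip().partition("=")
    if flag not in RESERVATION_FLAGS:
        raise ValueError(f"unexpected flag: {flag!r}")
    resources = {}
    for pair in value.split(","):
        name, sep, quantity = pair.partition("=")
        if not sep or not quantity:
            raise ValueError(f"expected name=quantity, got {pair!r}")
        if name not in ALLOWED_RESOURCES:
            raise ValueError(f"unsupported resource: {name!r}")
        if name in resources:
            raise ValueError(f"duplicate resource: {name!r}")
        resources[name] = quantity
    return flag, resources


# The sample kube-reserved line from this article.
flag, resources = parse_reservation(
    "kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=5Gi")
print(flag, resources)
```

A line written in map style, such as kube-reserved=cpu=1,pid:1000, raises ValueError because the flag form expects name=quantity pairs.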

Standalone clusters:

  • RKE2 cluster: Specify the system-reserved and kube-reserved parameters as 'kubelet-arg' entries in the configuration file /etc/rancher/rke2/config.yaml. Here is a sample /etc/rancher/rke2/config.yaml for a worker (agent) node:

token:  rASpGladPp
node-name: wk1.cluster.local
node-ip: 10.10.24.1
cluster-domain: cluster.local
tls-san:
  - cluster.local
cluster-cidr: 10.10.16.0/21
service-cidr: 10.10.0.0/20
cluster-dns: 10.10.0.10
service-node-port-range: 30000-32767
kube-apiserver-arg:
  - request-timeout=2m
kubelet-arg:
  - kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=5Gi
  - system-reserved=cpu=1,memory=1548Mi,ephemeral-storage=30Gi
  - eviction-hard=memory.available<500Mi,nodefs.available<4%
cni: calico

  • K3s cluster: Specify the system-reserved and kube-reserved settings as 'kubelet-arg' entries in the configuration file /etc/rancher/k3s/config.yaml. Here is a sample /etc/rancher/k3s/config.yaml for a worker (agent) node:

token:  xxxxxxxxx
cluster-domain: k3s.local
tls-san:
  - k3s.local
kube-apiserver-arg:
  - request-timeout=2m
kubelet-arg:
  - kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=5Gi
  - system-reserved=cpu=1,memory=1548Mi,ephemeral-storage=30Gi
  - eviction-hard=memory.available<500Mi,nodefs.available<4%

  • RKE1 cluster: Specify the system-reserved and kube-reserved settings as 'extra_args' for the kubelet in the cluster configuration file 'cluster.yml'. Here is a sample cluster.yml:

nodes:
  - address: 192.168.100.41
    internal_address: 192.168.100.41
    user: root
    role: [controlplane, etcd]
  - address: 192.168.100.42
    internal_address: 192.168.100.42
    user: root
    role: [worker]
  - address: 192.168.100.43
    internal_address: 192.168.100.43
    user: root
    role: [worker]
  - address: 192.168.100.44
    internal_address: 192.168.100.44
    user: root
    role: [worker]
cluster_name: rke-sample
kubernetes_version: v1.26.15-rancher1-1
services:
  etcd:
    backup_config:
      interval_hours: 6
      retention: 30
  kube-api:
    service_cluster_ip_range: 10.41.0.0/16
  kube-controller:
    cluster_cidr: 10.40.0.0/16
    service_cluster_ip_range: 10.41.0.0/16
  kubelet:
    cluster_dns_server: 10.41.0.10
    extra_args:
      enforce-node-allocatable: "pods,kube-reserved,system-reserved"
      kube-reserved: "cpu=1,memory=1Gi,ephemeral-storage=1Gi"
      system-reserved: "cpu=500m,memory=1Gi,ephemeral-storage=1Gi"
      kube-reserved-cgroup: /kube.slice
      system-reserved-cgroup: /system.slice
      eviction-hard: "memory.available<500Mi,imagefs.available<10%,nodefs.available<10%,nodefs.inodesFree<5%"
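
One way to check that a reservation has taken effect is to compare a node's reported capacity with its allocatable resources. The following Python sketch does this for a Node object as returned by `kubectl get node <node-name> -o json`, using the `status.capacity` and `status.allocatable` fields. The sample values, the function names, and the simplified quantity parser are assumptions for illustration.

```python
from decimal import Decimal

# Binary and decimal suffixes used by Kubernetes quantities (subset).
SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
            "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_quantity(text):
    """Convert a quantity such as '3500m', '16Gi' or '110' to base units."""
    if text.endswith("m"):  # milli-units, e.g. 3500m CPU = 3.5 cores
        return Decimal(text[:-1]) / 1000
    for suffix, factor in SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[:-len(suffix)]) * factor
    return Decimal(text)


def reserved_per_resource(node):
    """Return capacity minus allocatable for each resource of a Node object."""
    status = node["status"]
    return {name: parse_quantity(total) - parse_quantity(status["allocatable"][name])
            for name, total in status["capacity"].items()
            if name in status["allocatable"]}


# Hypothetical excerpt of a Node object; real output has many more fields.
sample_node = {"status": {
    "capacity": {"cpu": "4", "memory": "16Gi", "pods": "110"},
    "allocatable": {"cpu": "2", "memory": "12Gi", "pods": "110"},
}}
gap = reserved_per_resource(sample_node)
print(gap["cpu"], gap["memory"] / 2**20)  # reserved CPU cores and memory in Mi
```

For a real node, the dictionary could be loaded with json.load from the saved kubectl output. The difference for each resource should roughly match kube-reserved plus system-reserved plus the hard eviction threshold for that resource.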