I have the alert 'NodeFilesystemSpaceFillingUp'. What does this mean?

Article Number: 000020004

Situation

Issue

When using Rancher Monitoring, you may see the alert 'NodeFilesystemSpaceFillingUp' without knowing what it means or what its impact is.

Background

The alert is an early warning for potentially more serious issues with the disk space on a node becoming full.

It is defined by the following PromQL expression:

(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

It can be divided into three parts, all of which must be true for the alert to trigger:

node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 15

This condition evaluates to true if less than 15% of the node's filesystem space is available.

predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0

This condition takes the trend in available space over the last 6 hours and extrapolates it linearly to predict how much space will be available 4 hours (4 * 60 * 60 seconds) from now. It evaluates to true if the predicted value is below zero, meaning that at the current rate the disk will have no available space left within 4 hours.

node_filesystem_readonly{fstype!="",job="node-exporter"} == 0

This is a check that the node's filesystem is not currently read-only. The metric node_filesystem_readonly evaluates to 0 if the filesystem is not read-only, and 1 if it is.
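
To make the logic of the three conditions concrete, they can be sketched in Python. This is an illustration only and not how Prometheus implements the rule: the function names (predict_linear, alert_fires) and the sample data are hypothetical, and the prediction is made relative to the timestamp of the last sample rather than the rule evaluation time.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, available bytes)


def predict_linear(samples: Sequence[Sample], seconds_ahead: float) -> float:
    """Fit a least-squares line through the samples and extrapolate it
    seconds_ahead past the last sample (a simplification of PromQL's
    predict_linear)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    t_end = samples[-1][0]
    # Shift timestamps so the last sample sits at t = 0.
    ts = [t - t_end for t, _ in samples]
    vs = [v for _, v in samples]
    mean_t = sum(ts) / n
    mean_v = sum(vs) / n
    var_t = sum((t - mean_t) ** 2 for t in ts)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(ts, vs)) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * seconds_ahead


def alert_fires(samples: Sequence[Sample], size_bytes: float, read_only: bool) -> bool:
    """Combine the three conditions of the alert expression."""
    avail_now = samples[-1][1]
    below_15_percent = avail_now / size_bytes * 100 < 15
    empty_within_4h = predict_linear(samples, 4 * 60 * 60) < 0
    return below_15_percent and empty_within_4h and not read_only


# Hypothetical 100 GiB filesystem losing 2 GiB of available space per hour,
# sampled hourly over 6 hours: 18 GiB available at the start, 6 GiB at the end.
GIB = 1024 ** 3
history = [(h * 3600.0, (18 - 2 * h) * GIB) for h in range(7)]
print(alert_fires(history, 100 * GIB, read_only=False))  # prints True
```

In this example only 6% of the space is available, and the trend of -2 GiB per hour predicts roughly -2 GiB after 4 hours, so both the threshold and the prediction conditions hold. If the same filesystem were read-only, the third condition would fail and the alert would not fire.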

It is important to make sure that you investigate the affected node for any potential issues with excessive logging, under-provisioned disk space, or perhaps something else that could cause such a situation.

Impact

If the Node does run out of disk space, this will cause a 'Disk Pressure' event on the node.

When a Node experiences Disk Pressure, the kubelet starts evicting pods to reclaim space, and no new pods are scheduled on the Node until enough disk space has been freed.

This is a serious situation, especially if the cause of the failure is a rogue container over-logging.

If an over-logging or failing workload is forced to reschedule onto another node, the issue follows the workload, so each node it lands on may in turn fill up and become unschedulable.

In a typical Kubernetes installation, Disk Pressure is triggered when the available space on the node's filesystem falls below 10%.

Investigative Steps

The cause of this issue can vary heavily across different environments, but as the alert is node-specific, you should start there.

The first thing to run is df -h on the affected node. It shows the percentage of disk space used on each filesystem, and you may be able to identify immediately which one is running out of space.

This is the fastest way to assess more urgent issues, and once you've identified the disk that may be reporting as nearly full, you can immediately take precautions, such as clearing out old log files or increasing the size of a disk.
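
Once df -h has pointed to a filesystem, the next question is usually which directories are taking up the space. The following Python sketch is one way to check this using only the standard library. The function names are hypothetical, and any path you pass in (for example /var/log) is an assumption about where the space is going on your node. It should be run with enough privileges to read the directories being inspected.

```python
import os
import shutil


def percent_available(path: str) -> float:
    """Percentage of the filesystem containing `path` that is still available."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def directory_size(path: str) -> int:
    """Total size in bytes of the files under `path`.

    Symbolic links are not followed, and entries that cannot be read
    are skipped rather than raising an error.
    """
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass
    return total


def largest_subdirectories(path: str, top: int = 5):
    """Return up to `top` (size_bytes, directory) pairs, largest first."""
    sizes = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((directory_size(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    print(f"{percent_available('/'):.1f}% of / is available")
```

For example, calling largest_subdirectories("/var/log") returns the five largest directories directly under /var/log, which can then be inspected one level deeper in the same way. Walking a large directory tree can take a while and adds disk reads, so it is best to start from a path you already suspect.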

As you use Rancher Monitoring, you can also look at Node-specific statistics graphed over time and assess when the issue began, comparing this against running workloads and other node logs.

During normal operations, an increase in logging or a gradual increase in storage used is often expected.

For example, if the alert is sounding because a specific workload encountered issues, logged more, but then recovered and cleaned up the logs, you may no longer have an issue on your hands but could consider reducing the logging of the workload to prevent further alerts.

The 'NodeFilesystemSpaceFillingUp' alert is preemptive and will always require investigation and understanding to ensure the best operational health of your Nodes and Clusters.

Resolution

Short-Term Solutions

If you can identify the reason for this alert, the solution is usually straightforward. You may need to delete some old files on a Node, or reduce logging, for example if debug logging is enabled.

If you do hit a Disk Pressure event and need to recover, access the node directly and free up disk space manually. Once enough space has been freed, restart either the Node or Docker on the Node so that it is marked as schedulable in Kubernetes again.

Long-Term Solutions

Several measures can help prevent this alert:

Having larger disks

While Rancher does not have specific requirements for disk space on a Node, at least 30GB is recommended to reduce how often this alert occurs.

Exporting container logs

Rancher provides a logging deployment you can configure to export your container logs, for example, to an in-house Elasticsearch Cluster.

While these logs will be buffered locally, they will then be exported remotely, thereby reducing the amount of accumulated logs over time.

Running regular cleanups on System Logs

Although this is outside the scope of Rancher, a large volume of system logs can contribute to this alert firing. It is encouraged to manage logging at the OS level, either with logrotate or by exporting logs.
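
Before setting up rotation, it can help to see how much space old log files currently occupy. The following Python sketch only reports files and never deletes anything. It is not a replacement for logrotate. The function name, the age threshold and the directory you pass in are all assumptions to adapt to your environment.

```python
import os
import time


def stale_files(log_dir: str, max_age_days: float):
    """Yield (path, size_bytes) for files under log_dir whose last
    modification is older than max_age_days. Nothing is modified."""
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # the file vanished or cannot be read
            if info.st_mtime < cutoff:
                yield path, info.st_size


def total_stale_bytes(log_dir: str, max_age_days: float) -> int:
    """Sum the sizes of all stale files under log_dir."""
    return sum(size for _path, size in stale_files(log_dir, max_age_days))
```

For example, total_stale_bytes("/var/log", 30) gives the number of bytes held by files under /var/log that have not been modified in the last 30 days, which indicates how much space a rotation or cleanup policy could reclaim.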