
rancher-logging-root-fluentd-0 pod keeps restarting continuously with exit code 137 even after increasing memory

This document (000021839) is provided subject to the disclaimer at the end of this document.

Environment

  • SUSE Rancher 2.9.x
  • Rancher-logging 104.1.x+

Situation

  • On a cluster with rancher-logging installed, the rancher-logging-root-fluentd-0 pod restarts continuously with exit code '137'. The issue persists even after increasing the memory significantly.

  • The rancher-logging-root-fluentd-0 pod only shows the error below:

Last status: Exited with 137: Error, Started: Fri Feb 28, 2025 4:50:53 PM, Exited: Fri Feb 28, 2025 4:57:07 PM
  • Upon further investigation, the rancher-logging-root-fluentbit pod shows the errors below:
[2025/03/04 11:01:38] [error] [net] TCP connection failed: rancher-logging-root-fluentd.cattle-logging-system.svc.cluster.local:24240 (Connection refused)
[2025/03/04 11:01:38] [error] [output:forward:forward.0] no upstream connections available
[2025/03/04 11:01:38] [ warn] [engine] failed to flush chunk '1-1741004097.135890300.flb', retry in 320 seconds: task_id=147, input=tail.0 > output=forward.0 (out_id=0)

Resolution

  • Configure the output buffer to use the type 'file' instead of 'memory'.
  • Below is an example Output snippet for Elasticsearch:
apiVersion: logging.banzaicloud.io/v1beta1
kind: Output
metadata:
  name: efk
  namespace: cattle-logging-system
spec:
  elasticsearch:
    buffer:
      flush_interval: 30s
      flush_mode: interval
      flush_thread_count: 4
      queued_chunks_limit_size: 300
      type: file                          # <<== use 'file' instead of the default 'memory'
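  • Optionally, the file buffer can also be given explicit size limits so a slow destination cannot fill the node's disk. The parameters below correspond to standard Fluentd buffer options exposed by the same 'buffer' section of the Output; the sizes are illustrative only and should be tuned to the cluster's log volume:
  elasticsearch:
    buffer:
      type: file
      chunk_limit_size: 8MB               # illustrative value: maximum size of each on-disk chunk
      total_limit_size: 4GB               # illustrative value: cap on total disk used by the buffer
      overflow_action: block              # block when the buffer is full instead of raising an error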
  • Furthermore, log in to Rancher >> explore the desired cluster >> Apps >> Installed Apps >> rancher-logging >> click "Edit/Upgrade" and review whether the 'Buffer_Chunk_Size' and 'Buffer_Max_Size' values shown below can be tuned further to values that best suit the cluster's needs, as per https://github.com/rancher/rancher-docs/issues/90 (an illustrative example follows the snippet):
inputTail:
    Buffer_Chunk_Size: ''
    Buffer_Max_Size: ''
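  • For example, when editing the chart values through "Edit/Upgrade" (Edit YAML), these keys sit under the fluentbit section. The sizes below are illustrative only and should be adjusted to the cluster's log line size and throughput:
fluentbit:
  inputTail:
    Buffer_Chunk_Size: '1MB'              # illustrative value: initial buffer size used to read file data
    Buffer_Max_Size: '5MB'                # illustrative value: per-file buffer limit; keep >= Buffer_Chunk_Size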
  • Observe that the rancher-logging-root-fluentd-0 pod no longer restarts and logs are sent successfully.

Cause

  • By default, when Fluent Bit processes data, it uses memory as a primary and temporary place to store the records. There are scenarios where it is ideal to have a persistent buffering mechanism based on the filesystem to provide aggregation and data safety capabilities.
  • Fluent Bit can run into these issues when the destination is slow or the cluster is producing large volumes of data.
  • It is important to understand the correct configuration for slow destinations or high backpressure.
  • More information can be found here: https://docs.fluentbit.io/manual/administration/buffering-and-storage.
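  • For illustration only, a minimal sketch of those Fluent Bit storage settings is shown below in Fluent Bit's YAML configuration format. In a rancher-logging installation these settings are normally managed through the Logging resource and chart values rather than edited by hand, and the path and limit shown are example values:
service:
  storage.path: /var/log/flb-storage/     # example path: enables buffering chunks on disk
  storage.backlog.mem_limit: 5M           # example limit: memory allowed for backlogged chunks
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      storage.type: filesystem            # buffer this input's records on the filesystem instead of in memory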

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.