
rancher-logging-root-fluentd-0 pod keeps restarting continuously with exit code 137 even after increasing memory

This document (000021839) is provided subject to the disclaimer at the end of this document.

Environment

  • SUSE Rancher 2.9.x
  • Rancher-logging 104.1.x+

Situation

  • On a cluster with rancher-logging installed, the rancher-logging-root-fluentd-0 pod restarts continuously with exit code '137'. The issue persists even after increasing the memory significantly.

  • The rancher-logging-root-fluentd-0 pod only shows the error below:

Last status: Exited with 137: Error, Started: Fri Feb 28, 2025 4:50:53 PM, Exited: Fri Feb 28, 2025 4:57:07 PM
  • Upon further investigation, the rancher-logging-root-fluentbit pod shows the errors below:
[2025/03/04 11:01:38] [error] [net] TCP connection failed: rancher-logging-root-fluentd.cattle-logging-system.svc.cluster.local:24240 (Connection refused)
[2025/03/04 11:01:38] [error] [output:forward:forward.0] no upstream connections available
[2025/03/04 11:01:38] [ warn] [engine] failed to flush chunk '1-1741004097.135890300.flb', retry in 320 seconds: task_id=147, input=tail.0 > output=forward.0 (out_id=0)

Resolution

  • Configure the output buffer to use the type 'file' instead of 'memory'.
  • Below is an example Output snippet for Elasticsearch:
apiVersion: logging.banzaicloud.io/v1beta1
kind: Output
metadata:
  name: efk
  namespace: cattle-logging-system
spec:
  elasticsearch:
    buffer:
      flush_interval: 30s
      flush_mode: interval
      flush_thread_count: 4
      queued_chunks_limit_size: 300
      type: file                          # <<== use 'file' instead of the default 'memory'
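  • Optionally, the file buffer can also be given explicit size limits so a slow destination cannot fill the node's disk. The parameters below correspond to standard Fluentd buffer options exposed by the same 'buffer' section of the Output; the sizes are illustrative only and should be tuned to the cluster's log volume:
  elasticsearch:
    buffer:
      type: file
      chunk_limit_size: 8MB               # illustrative value: maximum size of each on-disk chunk
      total_limit_size: 4GB               # illustrative value: cap on total disk used by the buffer
      overflow_action: block              # block when the buffer is full instead of raising an error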
  • Furthermore, log in to Rancher >> explore the desired cluster >> Apps >> Installed Apps >> rancher-logging >> click "Edit/Upgrade" and review whether the 'Buffer_Chunk_Size' and 'Buffer_Max_Size' values shown below can be tuned further to values that best suit the cluster's needs, as per https://github.com/rancher/rancher-docs/issues/90 (an illustrative example follows the snippet):
inputTail:
    Buffer_Chunk_Size: ''
    Buffer_Max_Size: ''
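  • For example, when editing the chart values through "Edit/Upgrade" (Edit YAML), these keys sit under the fluentbit section. The sizes below are illustrative only and should be adjusted to the cluster's log line size and throughput:
fluentbit:
  inputTail:
    Buffer_Chunk_Size: '1MB'              # illustrative value: initial buffer size used to read file data
    Buffer_Max_Size: '5MB'                # illustrative value: per-file buffer limit; keep >= Buffer_Chunk_Size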
  • Observe that the rancher-logging-root-fluentd-0 pod no longer restarts and logs are sent successfully.

Cause

  • By default, when Fluent Bit processes data, it uses memory as a primary and temporary place to store the records. There are scenarios where it is ideal to have a persistent buffering mechanism based on the filesystem to provide aggregation and data safety capabilities.
  • Fluent Bit can run into these issues when the destination is slow or the cluster is producing large volumes of data.
  • It is important to understand the correct configuration for slow destinations or high backpressure.
  • More information can be found here: https://docs.fluentbit.io/manual/administration/buffering-and-storage.
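  • For illustration only, a minimal sketch of those Fluent Bit storage settings is shown below in Fluent Bit's YAML configuration format. In a rancher-logging installation these settings are normally managed through the Logging resource and chart values rather than edited by hand, and the path and limit shown are example values:
service:
  storage.path: /var/log/flb-storage/     # example path: enables buffering chunks on disk
  storage.backlog.mem_limit: 5M           # example limit: memory allowed for backlogged chunks
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      storage.type: filesystem            # buffer this input's records on the filesystem instead of in memory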

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.