Calico Cloud documentation

Fluent Bit metrics

Big picture

Use Prometheus to monitor Fluent Bit metrics and alert on problems, so that network visibility is not interrupted.

Value

Platform engineering teams rely on logs for visibility into their networks. If log collection or storage is disrupted, network visibility can suffer. Prometheus can monitor log collection and storage metrics so that platform engineering teams are alerted to problems early.

Concepts

Component	Description
Prometheus	Monitoring tool that scrapes metrics from instrumented jobs and displays time series data in a visualizer (such as Grafana). For Calico Cloud, the “job” that Prometheus harvests metrics from is the Fluent Bit component.
Fluent Bit	Ships Calico Cloud logs to the log storage backend.

How to

Create Prometheus alerts for Fluent Bit

The following example creates a Prometheus rule that monitors two important Fluent Bit metrics and alerts when they cross the configured thresholds:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tigera-prometheus-log-collection-monitoring
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: tigera-log-collection.rules
    rules:
    - alert: FluentBitPodConsistentlyLowBufferSpace
      expr: avg_over_time(fluentbit_output_chunk_available_capacity_percent[5m]) < 75
      labels:
        severity: Warning
      annotations:
        summary: "Fluent Bit pod {{$labels.pod}}'s output buffer space is consistently below 75 percent capacity."
        description: "Fluent Bit pod {{$labels.pod}} has very low buffer space for output {{$labels.name}}. There may be
          connection issues between Fluent Bit and the log destination, or there are too many logs to write out; check the logs
          for the Fluent Bit pod."
    - alert: FluentBitPodDroppingLogs
      expr: rate(fluentbit_output_retries_failed_total[5m]) > 0
      labels:
        severity: Warning
      annotations:
        summary: "Fluent Bit pod {{$labels.pod}} is dropping log chunks."
        description: "Fluent Bit pod {{$labels.pod}} gave up retrying to deliver log chunks for output {{$labels.name}}.
          Logs are being lost; check the logs for the Fluent Bit pod and the health of the log destination."

The alerts created in the example are described as follows:

Alert	Severity	Requires	Issue/reason
FluentBitPodConsistentlyLowBufferSpace	Non-critical, warning	Immediate investigation to ensure logs are being gathered correctly.	A Fluent Bit pod’s available output buffer capacity has averaged less than 75% over the last 5 minutes. This could mean Fluent Bit is having trouble communicating with a log destination, the destination is down, or there are simply too many logs to process.
FluentBitPodDroppingLogs	Non-critical, warning	Immediate investigation to ensure logs are not being lost.	A Fluent Bit pod exhausted the retries for one or more log chunks and dropped them.
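
To make the two alert conditions concrete, the following Python sketch approximates them on a handful of hypothetical samples. It is an illustration only: the sample values, the helper names low_buffer_space and dropping_logs, and the one-sample-per-minute spacing are assumptions, and the rate calculation is a simplification of what Prometheus does (it ignores counter resets and window-edge extrapolation).

```python
from statistics import mean

# Hypothetical samples from one 5-minute window (illustrative values only,
# not taken from a real cluster).
capacity_percent = [82.0, 78.0, 71.0, 69.0, 66.0]  # fluentbit_output_chunk_available_capacity_percent
retries_failed_total = [12, 12, 13, 15, 15]        # fluentbit_output_retries_failed_total (a counter)
WINDOW_SECONDS = 300


def low_buffer_space(samples, threshold=75.0):
    """Approximate avg_over_time(metric[5m]) < threshold."""
    return mean(samples) < threshold


def dropping_logs(counter_samples, window_seconds=WINDOW_SECONDS):
    """Approximate rate(counter[5m]) > 0 as (last - first) / window.

    Prometheus also handles counter resets and extrapolates to the window
    edges; this sketch ignores both.
    """
    per_second = (counter_samples[-1] - counter_samples[0]) / window_seconds
    return per_second > 0


# Average available capacity is 73.2, which is below 75, so the condition holds.
print(low_buffer_space(capacity_percent))
# The counter rose by 3 during the window, so the condition holds.
print(dropping_logs(retries_failed_total))
```

Note that the first condition is based on an average, so a single brief dip below 75% does not trigger it, while the second triggers on any increase in failed retries within the window.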
