Fluent Bit metrics
Big picture
Use Prometheus to monitor Fluent Bit metrics and raise alerts, so that network visibility stays continuous.
Value
Platform engineering teams rely on logs for visibility into their networks. If log collection or storage is disrupted, network visibility can suffer. Prometheus can monitor log collection and storage metrics so that platform engineering teams are alerted to problems early, before logs are lost.
Concepts
| Component | Description |
|---|---|
| Prometheus | Monitoring tool that scrapes metrics from instrumented jobs and displays time series data in a visualizer (such as Grafana). For Calico Cloud, the “job” that Prometheus scrapes metrics from is the Fluent Bit component. |
| Fluent Bit | Ships Calico Cloud logs to the log storage backend. |
How to
Create Prometheus alerts for Fluent Bit
The following example creates a Prometheus rule that monitors two key Fluent Bit metrics and alerts when they cross the configured thresholds:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tigera-prometheus-log-collection-monitoring
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
    - name: tigera-log-collection.rules
      rules:
        - alert: FluentBitPodConsistentlyLowBufferSpace
          expr: avg_over_time(fluentbit_output_chunk_available_capacity_percent[5m]) < 75
          labels:
            severity: Warning
          annotations:
            summary: "Fluent Bit pod {{$labels.pod}}'s output buffer space is consistently below 75 percent capacity."
            description: "Fluent Bit pod {{$labels.pod}} has very low buffer space for output {{$labels.name}}. There may be
              connection issues between Fluent Bit and the log destination, or there are too many logs to write out; check the logs
              for the Fluent Bit pod."
        - alert: FluentBitPodDroppingLogs
          expr: rate(fluentbit_output_retries_failed_total[5m]) > 0
          labels:
            severity: Warning
          annotations:
            summary: "Fluent Bit pod {{$labels.pod}} is dropping log chunks."
            description: "Fluent Bit pod {{$labels.pod}} gave up retrying to deliver log chunks for output {{$labels.name}}.
              Logs are being lost; check the logs for the Fluent Bit pod and the health of the log destination."
The alerts created in the example are described as follows:
| Alert | Severity | Requires | Issue/reason |
|---|---|---|---|
| FluentBitPodConsistentlyLowBufferSpace | Non-critical, warning | Immediate investigation to ensure logs are being gathered correctly. | A Fluent Bit pod’s available output buffer capacity has averaged less than 75% over the last 5 minutes. This could mean Fluent Bit is having trouble communicating with a log destination, the destination is down, or there are simply too many logs to process. |
| FluentBitPodDroppingLogs | Non-critical, warning | Immediate investigation to ensure logs are not being lost. | A Fluent Bit pod exhausted the retries for one or more log chunks and dropped them. |
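If the low-buffer alert turns out to be noisy in your cluster, one possible adjustment is to add a `for` clause so the alert fires only after the condition has held for some time. The following is a minimal sketch of that variation, not part of the example above. The `10m` duration is an assumed value chosen for illustration and should be tuned to your environment.

```yaml
# Sketch only: a variant of the low-buffer rule with a hold period.
# The 10m duration is an assumption for illustration, not a recommended value.
- alert: FluentBitPodConsistentlyLowBufferSpace
  expr: avg_over_time(fluentbit_output_chunk_available_capacity_percent[5m]) < 75
  # The alert stays pending until the expression has been true for this long.
  for: 10m
  labels:
    severity: Warning
  annotations:
    summary: "Fluent Bit pod {{$labels.pod}}'s output buffer space is consistently below 75 percent capacity."
```

This entry would take the place of the matching item under `rules` in the `PrometheusRule`. A longer hold period reduces alerts from brief dips in buffer capacity, at the cost of a later notification when the problem is real.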