Introduction
I started by logging everything. Then I realized that the only useful graphs were the ones that answered a question. Now I keep a focused telemetry bundle.
Metrics I keep
- Total rack power draw
- Per-shelf temperature gradient
- Network interface saturation
Metrics I removed
- Every fan speed
- Per-process CPU usage
- One-off debug counters
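The keep/remove split above can be sketched as a small allowlist filter. This is only an illustration: the metric names, the prefix scheme, and the sample values are all made up, and a real collector will have its own naming and its own filtering mechanism.

```python
# Hypothetical metric name prefixes; real names depend on your collector.
KEEP_PREFIXES = (
    "rack.power.total",      # total rack power draw
    "shelf.temp_gradient.",  # per-shelf temperature gradient
    "net.saturation.",       # network interface saturation
)


def filter_bundle(samples: dict[str, float]) -> dict[str, float]:
    """Return only the samples whose name starts with an allowlisted prefix."""
    return {
        name: value
        for name, value in samples.items()
        if name.startswith(KEEP_PREFIXES)
    }


# Made-up readings, purely to show the shape of the data.
raw = {
    "rack.power.total": 4200.0,
    "shelf.temp_gradient.1": 3.5,
    "net.saturation.eth0": 0.62,
    "fan.3.rpm": 5400.0,         # dropped: every fan speed
    "proc.1234.cpu_pct": 12.0,   # dropped: per-process CPU usage
    "debug.retry_counter": 7.0,  # dropped: one-off debug counter
}

bundle = filter_bundle(raw)
```

An allowlist rather than a blocklist means any new counter stays out of the bundle until it is added on purpose.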
The goal is to stay informed without drowning in noise.