Cloudflare API to Prometheus: Making the Exporter Reliable

Introduction

In Part 1, I focused on architecture and metric semantics. This part is about reliability, because that is where things usually get painful after the first dashboard looks good.

Reliability Changes Everything

The first version of my pipeline worked, but it was fragile in exactly the ways that matter during incidents. Data sometimes arrived late, scrapes were occasionally missed, and short restarts created visibility gaps that were easy to ignore until I needed to debug something important. I learned quickly that having metrics is not the same as having trustworthy history.

Bucket Completeness and Delayed Collection

The first reliability issue was bucket completeness. Cloudflare analytics is bucketed, but a bucket's values are not necessarily final the moment its window closes. If I queried too aggressively at the edge of a new window, I could fetch counts that were still settling. One scrape would show a value and a later scrape would show a revised value for the same logical period, which made charts noisy and lowered my confidence in them.

My fix was simple and very effective: intentional delay. I query with an offset using CF_DELAY_SECONDS so I collect from a bucket that has had time to finalize. This means I am slightly behind real time, but the data is much more stable. In practice, that trade-off is absolutely worth it for operational dashboards.

PYTHON
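from datetime import datetime, timedelta, timezone

# Step back by the configured delay so the newest bucket has had time to settle.
delayed_now = datetime.now(timezone.utc) - timedelta(seconds=CF_DELAY_SECONDS)
# Floor the delayed time to a bucket boundary; that boundary is the window end.
bucket_end_timestamp = (int(delayed_now.timestamp()) // CF_BUCKET_SECONDS) * CF_BUCKET_SECONDS
end = datetime.fromtimestamp(bucket_end_timestamp, timezone.utc)
start = end - timedelta(seconds=CF_BUCKET_SECONDS)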
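
To make the arithmetic concrete, here is a minimal sketch that wraps the same calculation in a function and runs it against a fixed clock. The function name `completed_bucket_window` and the values 300 and 3600 are assumptions for illustration, not the exporter's actual configuration.

```python
from datetime import datetime, timedelta, timezone


def completed_bucket_window(now, delay_seconds, bucket_seconds):
    """Return (start, end) of the newest bucket that closed at least delay_seconds ago."""
    delayed_now = now - timedelta(seconds=delay_seconds)
    # Floor the delayed time to a bucket boundary; that boundary is the window end.
    bucket_end = (int(delayed_now.timestamp()) // bucket_seconds) * bucket_seconds
    end = datetime.fromtimestamp(bucket_end, timezone.utc)
    return end - timedelta(seconds=bucket_seconds), end


# Hypothetical values: hourly buckets with a five-minute settling delay.
now = datetime(2024, 5, 1, 14, 3, 0, tzinfo=timezone.utc)
start, end = completed_bucket_window(now, delay_seconds=300, bucket_seconds=3600)
print(start.isoformat(), end.isoformat())
# prints: 2024-05-01T12:00:00+00:00 2024-05-01T13:00:00+00:00
```

At 14:03 the 13:00 to 14:00 bucket closed only three minutes ago, so with a five-minute delay the sketch still returns the 12:00 to 13:00 window. It moves on to the newer bucket from 14:05 onward.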

Why I Keep Buckets Stable

The second reliability issue was historical drift. Even with delayed collection, there is still a design choice about whether already-collected windows should keep changing as new polls run. For this workflow, I chose stability. Once a completed bucket is captured, I keep it stable in the primary series path so historical interpretation remains consistent.

That decision made a bigger difference than I expected. It improved week-over-week comparisons, made deploy impact reviews cleaner, and reduced the amount of second-guessing when discussing traffic shifts. If historical points remain soft forever, the dashboard becomes harder to trust exactly when stakes are high.

PYTHON
# Keep a completed bucket immutable once first collected so Prometheus
# always sees one stable sample per Cloudflare bucket timestamp.
if cf_metrics["last_window_start_timestamp"] != window_start_timestamp:
    cf_metrics["requests"] = request_count
    cf_metrics["unique_visitors"] = unique_visitors
    cf_metrics["last_window_start_timestamp"] = window_start_timestamp
    cf_metrics["last_window_end_timestamp"] = window_end_timestamp
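
The effect of that guard is easiest to see when the same bucket is polled twice with different upstream values. The sketch below is a simplified, self-contained version. The helper name `record_bucket` and the sample numbers are hypothetical.

```python
def record_bucket(cf_metrics, window_start, window_end, request_count, unique_visitors):
    """Store a bucket only the first time its start timestamp is seen."""
    if cf_metrics.get("last_window_start_timestamp") == window_start:
        return False  # Same bucket polled again: keep the first captured values.
    cf_metrics.update(
        requests=request_count,
        unique_visitors=unique_visitors,
        last_window_start_timestamp=window_start,
        last_window_end_timestamp=window_end,
    )
    return True


cf_metrics = {}
first = record_bucket(cf_metrics, 1714564800, 1714568400, 1200, 340)
# A later poll of the same window returns revised upstream numbers.
repeat = record_bucket(cf_metrics, 1714564800, 1714568400, 1250, 355)
print(first, repeat, cf_metrics["requests"])
# prints: True False 1200
```

The second call is ignored, so the exported value stays at the first captured count until a new window start arrives.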

Scrape Gaps and Local Cache

The third reliability issue was scrape gaps. Prometheus is solid, but real systems still miss scrapes during rollouts, pod restarts, and short network interruptions. Without a resilience layer, those gaps become blind spots that distort trend analysis.

To address this, I persist recent Cloudflare buckets in a local SQLite cache inside the exporter runtime. I do not treat this cache as a TSDB replacement. I treat it as a short-term durability layer between upstream API collection and scrape ingestion. That gives me continuity through restarts, a cleaner recovery path after brief outages, and a safer way to repair missing windows.

PYTHON
with cf_cache_connect() as connection:
    connection.execute(
        """
        INSERT INTO cloudflare_hourly_cache (
            window_start_timestamp,
            window_end_timestamp,
            requests,
            unique_visitors,
            collected_at_timestamp
        )
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(window_start_timestamp) DO UPDATE SET
            requests=excluded.requests,
            unique_visitors=excluded.unique_visitors,
            collected_at_timestamp=excluded.collected_at_timestamp
        """,
        (window_start_timestamp, window_end_timestamp, requests, unique_visitors, time.time()),
    )
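
For the upsert to work, the table needs a PRIMARY KEY or UNIQUE constraint on window_start_timestamp, and SQLite must be version 3.24 or newer. The following sketch runs the same statement against an in-memory database. The schema and the helper name `cache_bucket` are assumptions, since the real table definition is not shown here.

```python
import sqlite3
import time

# Hypothetical schema: ON CONFLICT needs a uniqueness constraint to match against.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cloudflare_hourly_cache (
    window_start_timestamp INTEGER PRIMARY KEY,
    window_end_timestamp INTEGER NOT NULL,
    requests INTEGER NOT NULL,
    unique_visitors INTEGER NOT NULL,
    collected_at_timestamp REAL NOT NULL
)
"""

UPSERT = """
INSERT INTO cloudflare_hourly_cache (
    window_start_timestamp, window_end_timestamp,
    requests, unique_visitors, collected_at_timestamp
)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(window_start_timestamp) DO UPDATE SET
    requests=excluded.requests,
    unique_visitors=excluded.unique_visitors,
    collected_at_timestamp=excluded.collected_at_timestamp
"""


def cache_bucket(connection, window_start, window_end, requests, unique_visitors):
    """Insert a bucket, or overwrite the cached row for the same window start."""
    with connection:  # Commits on success, rolls back if execute raises.
        connection.execute(
            UPSERT, (window_start, window_end, requests, unique_visitors, time.time())
        )


connection = sqlite3.connect(":memory:")
connection.execute(SCHEMA)
cache_bucket(connection, 1714564800, 1714568400, 1200, 340)
cache_bucket(connection, 1714564800, 1714568400, 1250, 355)  # Same window again.
rows = connection.execute(
    "SELECT window_start_timestamp, requests FROM cloudflare_hourly_cache"
).fetchall()
print(rows)
# prints: [(1714564800, 1250)]
```

Writing the same window twice leaves a single row holding the latest values, so the cache cannot accumulate duplicate buckets.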

Replay and Backfill

From there, replay becomes practical. I expose a bounded recent replay window through /metrics using explicit timestamps so Prometheus can catch up automatically after temporary downtime. I keep this intentionally bounded with CF_METRICS_REPLAY_HOURS so the exporter does not over-emit historical samples and create unnecessary load.

PYTHON
for row in rows:
    sample_timestamp_ms = row["window_start_timestamp"] * 1000
    lines.append(
        f'cloudflare_zone_requests{{{labels}}} {row["requests"]} {sample_timestamp_ms}'
    )
    lines.append(
        f'cloudflare_zone_unique_visitors{{{labels}}} {row["unique_visitors"]} {sample_timestamp_ms}'
    )
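
A self-contained sketch of bounded replay might look like the following. It drops buckets older than the replay cutoff and emits each metric's samples as one group, which is the layout the Prometheus text exposition format describes. The function name `render_replay`, the label value, and the sample rows are hypothetical, and `replay_hours` stands in for CF_METRICS_REPLAY_HOURS.

```python
def render_replay(rows, labels, now_timestamp, replay_hours):
    """Render cached buckets inside the replay window as timestamped samples."""
    cutoff = now_timestamp - replay_hours * 3600
    recent = sorted(
        (row for row in rows if row["window_start_timestamp"] >= cutoff),
        key=lambda row: row["window_start_timestamp"],
    )
    lines = []
    metrics = (
        ("cloudflare_zone_requests", "requests"),
        ("cloudflare_zone_unique_visitors", "unique_visitors"),
    )
    for metric, column in metrics:
        for row in recent:
            # Exposition-format timestamps are milliseconds since the epoch.
            timestamp_ms = row["window_start_timestamp"] * 1000
            lines.append(f"{metric}{{{labels}}} {row[column]} {timestamp_ms}")
    return "".join(line + "\n" for line in lines)


rows = [
    {"window_start_timestamp": 1714557600, "requests": 1000, "unique_visitors": 300},
    {"window_start_timestamp": 1714561200, "requests": 1100, "unique_visitors": 310},
    {"window_start_timestamp": 1714564800, "requests": 1200, "unique_visitors": 340},
]
text = render_replay(rows, 'zone="example.com"', now_timestamp=1714568400, replay_hours=2)
print(text, end="")
```

With a two-hour replay window, only the two most recent buckets are emitted and the oldest row is skipped.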

For larger holes or one-off repairs, I use explicit backfill workflows instead of expecting normal scrape behavior to fix everything. My pattern is straightforward: generate controlled backfill output from cache or source data, import with explicit timestamps, verify in Prometheus and Grafana, then remove temporary backfill artifacts. I treat this as maintenance work, not implicit background behavior, because explicit workflows stay auditable.

PYTHON
if parsed.path == "/cloudflare/backfill":
    query = urllib.parse.parse_qs(parsed.query)
    hours = int(query.get("hours", [str(CF_BACKFILL_MAX_HOURS)])[0])
    step_seconds = int(query.get("step_seconds", [str(CF_COLLECTION_INTERVAL_SECONDS)])[0])
    extend_to_now = query.get("extend_to_now", ["1"])[0] != "0"
    body = cloudflare_backfill_text(hours, step_seconds, extend_to_now).encode()
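
The handler excerpt passes query values straight to int(), and whether they are validated elsewhere is not shown. As a sketch of one defensive option, the parser below falls back to defaults on malformed numbers and clamps hours to the configured maximum. The function name `parse_backfill_query` and both constant values are assumptions.

```python
import urllib.parse

CF_BACKFILL_MAX_HOURS = 72  # Hypothetical value.
CF_COLLECTION_INTERVAL_SECONDS = 3600  # Hypothetical value.


def parse_backfill_query(raw_query):
    """Parse backfill parameters, tolerating malformed or oversized values."""
    query = urllib.parse.parse_qs(raw_query)

    def int_param(name, default):
        try:
            return int(query.get(name, [str(default)])[0])
        except ValueError:
            return default  # Malformed number: fall back to the default.

    # Clamp hours into [1, CF_BACKFILL_MAX_HOURS] so a request cannot ask for
    # an unbounded amount of history.
    hours = min(max(int_param("hours", CF_BACKFILL_MAX_HOURS), 1), CF_BACKFILL_MAX_HOURS)
    step_seconds = max(int_param("step_seconds", CF_COLLECTION_INTERVAL_SECONDS), 1)
    extend_to_now = query.get("extend_to_now", ["1"])[0] != "0"
    return hours, step_seconds, extend_to_now


print(parse_backfill_query("hours=500&step_seconds=abc&extend_to_now=0"))
# prints: (72, 3600, False)
```

An oversized hours value is capped, a non-numeric step falls back to the collection interval, and an empty query string yields the defaults.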

Monitoring the Collector

I also monitor the exporter itself as a first-class component. Traffic metrics alone are not enough. I want to know when collection last succeeded, how long each collection run takes, how errors accumulate, and whether the cached window range is healthy, so I can separate “traffic changed” from “collector reliability changed.” Those are two different operational questions, and I need clear answers to both.
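
As a rough illustration of those four signals, the sketch below tracks them in a small dataclass and renders them as plain text samples. The metric names, the `CollectorHealth` type, and the convention that the collect callable returns the number of cached windows are all hypothetical. The injected clock exists only to keep the example deterministic.

```python
import time
from dataclasses import dataclass


@dataclass
class CollectorHealth:
    last_success_timestamp: float = 0.0
    last_duration_seconds: float = 0.0
    errors_total: int = 0
    cached_windows: int = 0


def run_collection(health, collect, clock=time.time):
    """Run one collection attempt and record its outcome on the health object."""
    started = clock()
    try:
        health.cached_windows = collect()
        health.last_success_timestamp = clock()
    except Exception:
        # Broad on purpose: any collection failure counts as one error.
        health.errors_total += 1
    finally:
        health.last_duration_seconds = clock() - started


def render_health(health):
    """Render the health fields as untyped Prometheus text samples."""
    return (
        f"cloudflare_collector_last_success_timestamp_seconds {health.last_success_timestamp}\n"
        f"cloudflare_collector_last_duration_seconds {health.last_duration_seconds}\n"
        f"cloudflare_collector_errors_total {health.errors_total}\n"
        f"cloudflare_collector_cached_windows {health.cached_windows}\n"
    )


def failing_collect():
    raise RuntimeError("upstream API error")


health = CollectorHealth()
ticks = iter([100.0, 102.5, 102.5, 200.0, 200.5])  # Fake clock readings.
run_collection(health, lambda: 24, clock=lambda: next(ticks))  # Succeeds.
run_collection(health, failing_collect, clock=lambda: next(ticks))  # Fails.
print(render_health(health), end="")
```

After one success and one failure, the last-success timestamp stays at the earlier run while the error counter increases. That combination is what separates a collector problem from a real traffic change.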

Kubernetes Guardrails

On Kubernetes, most of my early reliability pain was not about Cloudflare API correctness. It was about lifecycle and state. The setup became much steadier once I aligned readiness behavior with real collector health, ensured the cache path was persistent, kept resource limits conservative, and made scrape intervals explicit rather than assumed.
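
One way to tie readiness to collector health is to report ready only while the last successful collection is recent enough. This is a minimal sketch of that decision, not the exporter's actual probe logic. The function name `readiness_status` and the two-hour staleness threshold are assumptions.

```python
def readiness_status(last_success_timestamp, now_timestamp, max_staleness_seconds):
    """Return the HTTP status a readiness endpoint could answer with."""
    if last_success_timestamp is None:
        return 503  # No successful collection yet.
    age = now_timestamp - last_success_timestamp
    # Kubernetes HTTP probes treat status codes from 200 to 399 as success.
    return 200 if age <= max_staleness_seconds else 503


# Hypothetical threshold: two missed hourly collections mark the pod unready.
print(readiness_status(None, 1714568400, 7200))
print(readiness_status(1714564800, 1714568400, 7200))
print(readiness_status(1714557600, 1714568400, 7200))
```

The three calls cover a collector that has never succeeded, one that succeeded an hour ago, and one whose last success is three hours old.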

Trade-Offs

This does add complexity compared to a stateless API-to-metrics exporter. There is local state, replay logic, and backfill tooling to maintain. But the payoff is better continuity during outages, cleaner historical interpretation, and faster confidence during incident analysis. For my use case, that trade is worth it.

My Reliability Checklist

If I deploy this in a new environment, I now consider it incomplete until delayed collection is configured, cache durability is verified, replay windows are bounded, exporter health panels are visible, and the backfill path is tested before it is urgently needed.

Next

Part 3 will move into Kubernetes and Prometheus integration details, including deployment structure, Service and ServiceMonitor choices, scrape configuration patterns, and the operational decisions that keep this maintainable over time.
