Introduction
Part 2 - Storage Is Where Things Really Broke
ARM was not the hardest part.
Storage was.
The Raspberry Pis handled K3s surprisingly well. Stateless services behaved exactly as expected. The control plane was stable. CPU usage was reasonable. Memory pressure was manageable.
From the outside, the cluster looked healthy.
Everything changed the moment I introduced real state.
Databases.
Persistent volumes.
Logging stacks.
Metrics compactions.
That is when the cluster stopped feeling solid.
Nothing dramatic happened. No explosions. No catastrophic failures.
But the system started to feel unstable.
Not because of CPU.
Not because of memory.
Because of disk I/O.
The Illusion of “It Works”
At first, everything looked fine.
Pods started. PVCs bound. Applications responded. Dashboards were green. If you only looked at health checks, you would call it production-ready.
But under moderate load, subtle symptoms appeared.
Pod startup times increased. PVC attachments felt sluggish. Prometheus compactions triggered latency spikes. The API server occasionally felt unresponsive in short bursts.
Nothing crashed.
But the cluster felt heavy.
And heaviness in Kubernetes is usually storage pressure long before it is CPU exhaustion.
That is the difference between “functional” and “predictable.”
The Raspberry Pi Reality
Even with SSDs connected over USB, small ARM nodes have physical constraints you cannot abstract away.
The USB bus is shared.
I/O throughput is limited.
Container layers, logs, databases, and the OS often live on the same disk.
Everything writes, all the time.
It is easy to assume that “SSD” automatically means “fast enough.”
But Kubernetes does not care about peak throughput. It cares about consistent latency.
And small ARM boards are not designed for sustained, mixed random I/O workloads.
When logging, database writes, and container filesystem operations compete on the same disk, you introduce unpredictable latency.
Kubernetes tolerates high load.
It does not tolerate jitter well.
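You can see that jitter directly instead of inferring it from dashboards. A sustained 4k random-write test with a sync after every write approximates what databases and anything with a write-ahead log actually ask of the disk, and the tail latency is the number that matters. Below is a minimal sketch of such a test as a one-off Job, assuming an Alpine image that installs fio at runtime; the node name and sizes are placeholders.

```yaml
# Hypothetical one-off Job: measure sustained random-write latency,
# with a sync after every write, on one node's own disk.
apiVersion: batch/v1
kind: Job
metadata:
  name: disk-jitter-check
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: pi-node-1   # placeholder: the node under test
      containers:
        - name: fio
          image: alpine:3.20
          command: ["sh", "-c"]
          args:
            - >
              apk add --no-cache fio &&
              fio --name=sync-randwrite --directory=/scratch
              --rw=randwrite --bs=4k --size=256m
              --fdatasync=1 --runtime=60 --time_based
          volumeMounts:
            - name: scratch
              mountPath: /scratch
      volumes:
        - name: scratch
          emptyDir: {}   # lives on the node disk, next to container layers and logs
```

What matters in the output is the completion-latency percentile table, not the average: the 99th percentile under this kind of load is roughly what a database commit experiences at a bad moment.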
The Breaking Point: Databases on Pi
Initially, I ran my database inside the cluster.
Architecturally, that felt clean. Everything self-contained. Everything orchestrated.
And technically, it worked.
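For context, the shape of an in-cluster setup like this is roughly the following (illustrative only: a Postgres image stands in for whatever the database actually was, and local-path is K3s's default StorageClass). The important detail is the volume: local-path carves the PVC out of the node's own filesystem, so every database write lands on the same disk as container layers and logs.

```yaml
# Illustrative in-cluster database: the PVC comes from K3s's default
# local-path provisioner, i.e. a directory on the node's own disk.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 1
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:16   # stand-in image
          env:
            - name: POSTGRES_PASSWORD
              value: change-me   # placeholder
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: local-path   # K3s default: node-local directory
        resources:
          requests:
            storage: 20Gi
```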
But write-heavy workloads increased disk contention almost immediately. PVC reattachments during node restarts were slow. Database restarts amplified I/O spikes. The monitoring stack and the database competed for the same physical disk resources.
The cluster was not failing.
It was fragile.
Fragility is worse than failure.
Failure is visible. Fragility hides until the wrong moment.
Stepping Back Instead of Tuning Harder
My first instinct was to optimize.
Tune storage classes. Adjust resource requests. Separate volumes. Reduce retention. Squeeze more performance out of the Pis.
But then I asked a more important question:
Does the database actually need to run inside Kubernetes?
The honest answer was no.
I was keeping it there out of architectural purity, not necessity.
So I moved it.
Moving the Database to TrueNAS
I redeployed the database in a dedicated VM on my TrueNAS server.
The difference was immediate.
The VM had dedicated CPU and memory allocation. Storage was backed by ZFS instead of a USB-attached SSD. There was no shared bus contention with container layers or log streams. No PVC reattachment delays during node drains.
The Raspberry Pi cluster became lighter overnight.
Kubernetes handled application logic and orchestration.
TrueNAS handled persistent state.
Clear separation of concerns.
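Inside the cluster, nothing had to change to follow the database out. A Service without a selector, paired with a manually managed Endpoints object, keeps a stable in-cluster DNS name pointing at the VM. A minimal sketch, assuming a Postgres-style port; the address is a placeholder:

```yaml
# Hypothetical glue: a selector-less Service plus explicit Endpoints
# gives the external database a stable in-cluster DNS name.
apiVersion: v1
kind: Service
metadata:
  name: db
  namespace: default
spec:
  ports:
    - port: 5432
      targetPort: 5432
---
apiVersion: v1
kind: Endpoints
metadata:
  name: db               # must match the Service name
  namespace: default
subsets:
  - addresses:
      - ip: 192.168.1.50   # placeholder: the database VM's address
    ports:
      - port: 5432
```

Applications keep connecting to db.default.svc.cluster.local as if the database were still a pod; if the VM ever moves, only the Endpoints object changes.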
This was not about abandoning Kubernetes.
It was about respecting hardware boundaries.
What Improved Immediately
After moving the database out of the cluster, several things stabilized at once.
Disk pressure dropped significantly. Pod scheduling became consistent. Monitoring stopped triggering unpredictable latency spikes. Node drains became routine instead of stressful. API responsiveness normalized.
The cluster felt calm again.
Not more powerful.
More intentional.
There is a difference.
The Important Lesson
Kubernetes is not a religion.
Not everything must run inside it.
On constrained hardware, especially ARM clusters, stateful workloads deserve proper storage guarantees. Isolation often beats architectural ideology. Externalizing state can simplify the entire system.
By separating orchestration from heavy persistence:
- Failure domains became clearer.
- Storage bottlenecks became isolated.
- The cluster became easier to reason about.
That is infrastructure thinking.
Not “everything in Kubernetes,” but “the right thing in the right place.”
What Still Runs Inside the Cluster
Not everything moved out.
Stateless services remain inside. Internal APIs, lightweight workers, edge-facing workloads: these are exactly what the cluster is good at.
But anything write-heavy or storage-sensitive now gets evaluated carefully.
The cluster is optimized for orchestration.
Not for heavy storage.
And once I accepted that, the architecture became simpler instead of more complex.
Next: Observability Without Overwhelming the Cluster
Moving the database fixed the biggest instability.
But observability introduced a new class of problems.
Prometheus. Loki. Metrics retention. Label cardinality. Background compactions. All of these quietly consume resources on small ARM nodes.
You can destabilize a cluster with monitoring before your application even gets stressed.
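To give a sense of the knobs involved, here is a minimal sketch of bounding Prometheus on constrained nodes, written as kube-prometheus-stack Helm values (assuming that chart; every number is a placeholder, not a recommendation):

```yaml
# Hypothetical kube-prometheus-stack values: cap retention by time and
# size so compactions and memory stay bounded on small nodes.
prometheus:
  prometheusSpec:
    retention: 7d          # time-based cap
    retentionSize: 8GB     # size-based cap; whichever limit hits first wins
    resources:
      requests:
        cpu: 250m
        memory: 512Mi
      limits:
        memory: 1Gi
```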
In the next part, I will break down:
- How I structured monitoring for a constrained cluster
- The mistakes I made with metrics and logs
- Why retention policies matter more than dashboards
- How to observe your cluster without crushing it
Because observability should strengthen your infrastructure.
Not compete with it.
