Fixing CNPG PostgreSQL CrashLoopBackOff: A Deep Dive into PVC Resize Failure & Disk Pressure on OpenEBS LVM LocalPV
┌─────────────────────────────────────────┐
│   CloudNativePG PostgreSQL Cluster      │
├─────────────────────────────────────────┤
│  customer-database-1 (Primary Pod)      │
│  customer-database-2 (Replica Pod)      │
└─────────────────────────────────────────┘
          │                    │
          ▼                    ▼
┌───────────────────┐  ┌───────────────────┐
│  PVC: cust-db-1   │  │  PVC: cust-db-2   │
└───────────────────┘  └───────────────────┘
          │                    │
          └─────────┬──────────┘
                    ▼
        ┌──────────────────────────┐
        │ StorageClass:            │
        │  openebs-myapp-database  │
        │  provisioner: local.csi  │
        │  volgroup: data_vg       │
        └──────────────────────────┘
              │            │
      ┌───────┘            └───────┐
      ▼                            ▼
┌───────────────────────┐  ┌───────────────────────┐
│     Worker Node 1     │  │     Worker Node 2     │
│     sbt-db-node-1     │  │     sbt-db-node-2     │
│                       │  │                       │
│  LV → PVC2            │  │  LV → PVC1            │
│  (LocalPV LVM)        │  │  (LocalPV LVM)        │
│                       │  │                       │
│  VG: data_vg          │  │  VG: data_vg          │
│  (70Gi free)          │  │  (70Gi free)          │
│  FS: XFS              │  │  FS: XFS              │
└───────────────────────┘  └───────────────────────┘

┌──────────────────────────────────┐
│          CNPG Operator           │
│   WAL/Data Disk Space Checks     │
│   NodeExpandVolume interactions  │
└──────────────────────────────────┘
                │
                ▼
┌──────────────────────────────────┐
│              Errors              │
│ - Insufficient Disk Space        │
│ - NodeExpandVolume exit status 5 │
│ - CrashLoopBackOff               │
└──────────────────────────────────┘
A Real-World Production Issue & How We Resolved It
In this article, I’ll walk you through a production-grade debugging scenario where our CloudNativePG (CNPG) PostgreSQL cluster went into a full outage.
Both the primary and replica pods suddenly got stuck in:
CrashLoopBackOff
STATUS: Not enough disk space
Even though the PVCs showed enough free space, the pods refused to start.
This was a tricky issue involving CNPG disk space validation, OpenEBS LVM LocalPV, PVC expansion, and node-level LVM debugging.
Let’s break down the entire story—from symptoms to deep cause analysis and the final fix.
- Incident Summary
- Cluster: customer-database
- StorageClass: openebs-myapp-database (OpenEBS LocalPV LVM)
- PVC Size: 50Gi
- Issue:
- CNPG cluster status → Not enough disk space
- Pods → CrashLoopBackOff
- PostgreSQL readiness → HTTP 500
- NodeExpandVolume failing (exit status 5)
- Symptoms We Observed
CrashLoopBackOff on both pods
$ kubectl get po -n sbt-prod | grep customer
customer-database-1 0/1 CrashLoopBackOff 385
customer-database-2 0/1 CrashLoopBackOff 414
CNPG operator logs
- "msg":"Insufficient disk space detected in a pod. PostgreSQL cannot proceed until the PVC group is enlarged"
[user@sbt-adm-1 db]$ kubectl -n cnpg-system logs --tail=100 -f cnpg-cloudnative-pg-12v2nj3456-rqx24
{"level":"info","ts":"2025-11-22T06:06:43.85562733Z","msg":"Cannot extract Pod status","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"customer-database","namespace":"sbt-prod"},"namespace":"sbt-prod","name":"customer-database","reconcileID":"4dfd7c15-f213-45e6-bf5d-4dbc27179526","podName":"customer-database-2","error":"Get \"https://12.1.9.233.0000/pg/status\": dial tcp 12.1.9.233.0000: connect: connection refused"}
Readiness probe failing
- Readiness probe failed: HTTP probe failed with statuscode: 500
- PVC events showing expansion attempts
- ExternalExpanding
- FileSystemResizeRequired
- VolumeResizeFailed: exit status 5
[user@sbt-adm-1 db]$ kubectl get events -n sbt-prod
LAST SEEN TYPE REASON OBJECT MESSAGE
2s Warning Unhealthy pod/customer-database-2 Readiness probe failed: HTTP probe failed with statuscode: 500
50s Warning Unhealthy pod/customer-database-1 Readiness probe failed: HTTP probe failed with statuscode: 500
7s Warning Unhealthy pod/customer-database-2 Readiness probe failed: HTTP probe failed with statuscode: 500
19s Normal ExternalExpanding persistentvolumeclaim/customer-database-1 waiting for an external controller to expand this PVC
19s Normal FileSystemResizeRequired persistentvolumeclaim/customer-database-1 Require file system resize of volume on node
1s Warning VolumeResizeFailed pod/customer-database-2 NodeExpandVolume.NodeExpandVolume failed for volume "pvc-211b11f0-f0d5-4cbd-9bd2-2f2bc7ed68ec" : Expander.NodeExpand failed to expand the volume : rpc error: code = Internal desc = failed to handle NodeExpandVolume Request for pvc-211b11f0-f0d5-4cbd-9bd2-2f2bc7ed68ec, {exit status 5}
The cluster status also reflected a critical issue:
STATUS: Not enough disk space
Even though the PVC "looked fine", CNPG refused to start PostgreSQL.
- Understanding How CNPG Checks Disk Space
CloudNativePG runs pre-flight disk checks before starting PostgreSQL.
It validates:
- WAL directory free space
- Data directory free space
- Threshold: less than 10% free → PostgreSQL startup is blocked
This is a safety mechanism to prevent data corruption.
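To make the check concrete, here is a minimal shell sketch of the same rule. This is illustrative only: CNPG's actual check is implemented in Go inside the instance manager. The `has_min_free_pct` helper and the fallback to `/tmp` are my own illustration; on a CNPG pod the mount to check would be the PGDATA volume.

```shell
# Illustrative sketch of CNPG's "block startup below 10% free" rule.
# NOT CNPG's real code (that lives in the Go instance manager).

# Returns 0 (pass) when the filesystem holding $1 has at least $2 percent free.
has_min_free_pct() {
  dir="$1"; min="$2"
  # Column 5 of POSIX `df -P` is the used-capacity percentage, e.g. "42%".
  used=$(df -P "$dir" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
  free=$((100 - used))
  [ "$free" -ge "$min" ]
}

mnt="${PGDATA:-/tmp}"   # on a CNPG pod this would be /var/lib/postgresql/data
if has_min_free_pct "$mnt" 10; then
  echo "disk check passed: PostgreSQL may start"
else
  echo "disk check failed: startup would be blocked"
fi
```

This mirrors the behaviour we observed: the check runs before PostgreSQL starts, so a failing check surfaces as a pod that never becomes ready rather than a database error.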
- Investigation & Debugging
Step 1 — PVCs Look Normal
kubectl describe pvc customer-database-1
Capacity: 50Gi
Status: Bound
Events:
From Kubernetes perspective → the PVCs were healthy.
Step 2 — But CNPG still reports disk shortage
kubectl get cluster -A
STATUS: Not enough disk space
This meant the issue was at the node level, not the Kubernetes level.
Step 3 — Check the actual node storage (LVM level)
PVC annotation tells us where the volume lives:
volume.kubernetes.io/selected-node: sbt-db-node-2
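A quick way to read that annotation is a one-line filter over the PVC manifest. The sketch below runs against a captured YAML fragment so it works offline; in the cluster you would pipe live `kubectl get pvc customer-database-1 -n sbt-prod -o yaml` output instead.

```shell
# Sketch: find which node a LocalPV PVC is pinned to.
# Fed from a captured `kubectl get pvc ... -o yaml` fragment so it runs offline.
pvc_yaml='metadata:
  annotations:
    volume.kubernetes.io/selected-node: sbt-db-node-2'

# Split on ": " and take the value after the annotation key.
node=$(printf '%s\n' "$pvc_yaml" | awk -F': ' '/selected-node/ { print $2 }')
echo "PVC volume lives on node: $node"
```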
SSH into both DB nodes:
sudo vgs
Output:
VG       VSize    VFree
data_vg  199.99g  69.99g
rl        96.99g       0
Critical takeaway:
- The OS volume group rl = 0 free space
- The database VolumeGroup data_vg = ~70Gi free (this is OK)
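This triage step can also be scripted. The sketch below parses `vgs`-style report text, fed here from the captured numbers above so it runs anywhere, and flags any volume group with zero free space; on a node you would feed it real `sudo vgs` output instead.

```shell
# Sketch: flag volume groups with no free space from `vgs` report text.
# Sample input matches the captured output from the incident.
vgs_out='VG      VSize    VFree
data_vg 199.99g  69.99g
rl       96.99g       0'

printf '%s\n' "$vgs_out" | awk 'NR > 1 {
  if ($3 == "0") print $1 ": NO free space (OS disk pressure!)"
  else           print $1 ": OK, " $3 " free"
}'
```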
So why was PVC expansion failing?
- Root Cause of the Failure
The StorageClass used:
provisioner: local.csi.openebs.io
parameters:
  storage: "lvm"
  volgroup: "data_vg"
OpenEBS tried to expand the PVC, but the expansion failed inside LVM on the node:
VolumeResizeFailed: exit status 5
This typically indicates:
- LVM LV metadata locked
- Filesystem required manual resize
- Previous resize attempt left LV inconsistent
- CSI node plugin failed resizing
- Kernel-level XFS quota issues
So CNPG correctly detected insufficient disk space and refused to start PostgreSQL, and the repeated startup failures put both pods into CrashLoopBackOff.
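For completeness: had the CSI node plugin stayed stuck, the manual recovery path on the node would look roughly like the sketch below. It is a dry run only (commands are echoed, never executed), and the LV path and mount point are hypothetical placeholders; find the real ones with `sudo lvs -o lv_path` and `mount`.

```shell
# DRY-RUN sketch of a manual LVM + XFS resize on the node.
# Commands are echoed, not executed. LV path and mount point are placeholders.
run() { echo "+ $*"; }   # replace the echo with "$@" only after careful review

# Grow the LV backing the PVC to the new target size
# (lvextend -r would also resize the filesystem in one step).
run lvextend -L 100G /dev/data_vg/pvc-211b11f0-f0d5-4cbd-9bd2-2f2bc7ed68ec
# XFS grows online, addressed via the mount point, not the device.
run xfs_growfs '<pvc-mount-point>'
```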
- How We Fixed the Issue
Step 1 — Increase storage in postgresql.yaml
[user@sbt-adm-1 db]$ vim postgresql.yaml
storage:
  size: 100Gi
Apply it:
[user@sbt-adm-1 db]$ kubectl apply -f postgresql.yaml --dry-run=server
cluster.postgresql.cnpg.io/myapp-database unchanged (server dry run)
cluster.postgresql.cnpg.io/customer-database configured (server dry run)
[user@sbt-adm-1 db]$ kubectl apply -f postgresql.yaml
cluster.postgresql.cnpg.io/myapp-database unchanged
cluster.postgresql.cnpg.io/customer-database configured
Step 2 — Delete both pods to force NodeExpandVolume
[user@sbt-adm-1 db]$ kubectl delete po -n sbt-prod customer-database-2
pod "customer-database-2" deleted
[user@sbt-adm-1 db]$ kubectl delete po -n sbt-prod customer-database-1
pod "customer-database-1" deleted
[user@sbt-adm-1 db]$ kubectl get cluster -A
NAMESPACE NAME AGE INSTANCES READY STATUS PRIMARY
sbt-prod customer-database 212d 2 Waiting for the instances to become active customer-database-1
sbt-prod myapp-database 212d 2 2 Cluster in healthy state myapp-database-1
This triggers:
- LVM LV extension
- XFS filesystem resize
- Fresh pod scheduling with expanded volume
Step 3 — Validate PVC Expansion
kubectl get events -n sbt-prod | grep -i resize
[user@sbt-adm-1 db]$ kubectl get events -n sbt-prod
LAST SEEN TYPE REASON OBJECT MESSAGE
16s Normal FileSystemResizeSuccessful pod/customer-database-1 MountVolume.NodeExpandVolume succeeded for volume "pvc-554f83db-36f7-49ae-a69d-fdedbf0b7297" sbt-db-node-2
16s Normal FileSystemResizeSuccessful persistentvolumeclaim/customer-database-1 MountVolume.NodeExpandVolume succeeded for volume
We finally saw:
FileSystemResizeSuccessful
MountVolume.NodeExpandVolume succeeded
Step 4 — Verify Pods Recover
kubectl get po -n sbt-prod | grep customer
customer-database-1 1/1 Running
customer-database-2 1/1 Running
[user@sbt-adm-1 db]$ kubectl get po -n sbt-prod | grep -i customer
customer-database-1 1/1 Running 0 94s
customer-database-2 0/1 Running 0 88s
Step 5 — Cluster Becomes Healthy
kubectl get cluster -A
customer-database → Cluster in healthy state
[user@sbt-adm-1 db]$ kubectl get cluster -A
NAMESPACE NAME AGE INSTANCES READY STATUS PRIMARY
sbt-prod customer-database 212d 2 1 Waiting for the instances to become active customer-database-1
sbt-prod myapp-database 212d 2 2 Cluster in healthy state myapp-database-1
[user@sbt-adm-1 db]$ kubectl logs -n sbt-prod --tail=-1 -f customer-database-1
Defaulted container "postgres" out of: postgres, bootstrap-controller (init)
{"level":"info","ts":"2025-11-22T06:30:45.516600723Z","msg":"Starting CloudNativePG Instance Manager","logger":"instance-manager","logging_pod":"customer-database-1","version":"1.25.1","build":{"Version":"1.25.1","Commit":"c56e00d4","Date":"2025-02-28"}}
{"level":"info","ts":"2025-11-22T06:30:45.516809527Z","msg":"Checking for free disk space for WALs before starting PostgreSQL","logger":"instance-manager","logging_pod":"customer-database-1"}
UTC","virtual_transaction_id":"9/1535","transaction_id":"56833618","error_severity":"LOG","sql_state_code":"00000","message":"autovacuum: dropping orphan temp table \"customer.pg_temp_10.dms_tmp_register_filter\"","backend_type":"autovacuum worker","query_id":"0"}}
{"level":"info","ts":"2025-11-22T06:33:22.81405764Z","logger":"wal-archive","msg":"Executing barman-cloud-wal-archive","logging_pod":"customer-database-1","walName":"pg_wal/00000001000000910000000M","options"
- Key Learnings from This Incident
- PVC showing free space ≠ LVM having free space
Kubernetes PVC capacity is logical.
Actual data resides inside LVM LVs.
- OpenEBS LVM LocalPV must be monitored at node level
sudo vgs, sudo lvs, df -h, and mount become critical tools.
- CNPG’s disk checks protect data integrity
PostgreSQL refuses to start instead of corrupting WAL/data.
- PVC expansion is NOT enough
The underlying filesystem must resize successfully.
- Deleting pods (NOT PVCs) can fix NodeExpandVolume
Pod restart triggers storage driver operations.
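The "PVC expansion is NOT enough" lesson is easy to verify directly. The sketch below (echoed only, since it needs a live cluster) compares the capacity Kubernetes reports for the PVC with what the filesystem inside the pod actually sees; a mismatch means the node-side resize never completed. The PGDATA path is CNPG's default data mount.

```shell
# DRY-RUN sketch: compare PVC capacity (control plane view) with the
# filesystem size inside the pod (node view). Commands are echoed only.
run() { echo "+ $*"; }   # swap for direct execution against a real cluster

run kubectl get pvc customer-database-1 -n sbt-prod \
    -o jsonpath='{.status.capacity.storage}'
run kubectl exec -n sbt-prod customer-database-1 -- \
    df -h /var/lib/postgresql/data
```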
- Commands Cheat Sheet
Check cluster status:
kubectl get cluster -A
Check PVC:
kubectl describe pvc customer-database-1 -n sbt-prod
Check pod logs:
kubectl logs -n sbt-prod customer-database-1
Check CNPG operator logs:
kubectl logs -n cnpg-system -l app=cnpg
Node LVM:
sudo vgs
sudo lvs
sudo df -h
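As a companion to the node-level commands, here is a small helper (my own, not part of any tool above) that flags any mount at or above a usage threshold. It reads `df -P`-style text on stdin so it can be demoed offline; on a node you would run `df -P | flag_full_mounts 90`.

```shell
# Flag any mount at or above a usage threshold (percent) from `df -P` input.
flag_full_mounts() {
  awk -v t="$1" 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 >= t) print "WARN: " $6 " at " $5 "% used" }'
}

# Demo against captured-style numbers resembling the incident
# (OS root full, database VG mount healthy):
printf '%s\n' \
  'Filesystem 1024-blocks Used Available Capacity Mounted-on' \
  '/dev/mapper/rl-root 101711872 101711872 0 100% /' \
  '/dev/mapper/data_vg-pvc 52428800 20971520 31457280 40% /var/lib/data' \
  | flag_full_mounts 90
```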
Conclusion
This incident showed how deeply interconnected Kubernetes storage, CNPG logic, and LVM-based OpenEBS volumes are.
Even when PVCs appear healthy, underlying LVM constraints can silently cause volume expansion failures—ultimately crashing PostgreSQL pods.
With a methodical approach—checking logs, PVC events, node-level LVM, and CNPG operator—we resolved the issue and restored the cluster to normal operation.