CNPG Disk Error

Fixing CNPG PostgreSQL CrashLoopBackOff: A Deep Dive into PVC Resize Failure & Disk Pressure on OpenEBS LVM LocalPV

                   ┌─────────────────────────────────────────┐
                   │     CloudNativePG PostgreSQL Cluster    │
                   │─────────────────────────────────────────│
                   │  customer-database-1 (Primary Pod)      │
                   │  customer-database-2 (Replica Pod)      │
                   └─────────────────────────────────────────┘
                                │                │
                                │                │
                   ┌───────────────────┐   ┌───────────────────┐
                   │ PVC: cust-db-1    │   │ PVC: cust-db-2    │
                   └───────────────────┘   └───────────────────┘
                                │                │
                                └──────┬─────────┘
                                       ▼
                         ┌──────────────────────────┐
                         │ StorageClass:            │
                         │  openebs-myapp-database  │
                         │  provisioner: local.csi  │
                         │  volgroup: data_vg       │
                         └──────────────────────────┘
                                │                │
         ┌──────────────────────┘                └──────────────────────┐
         ▼                                                             ▼
┌───────────────────────┐                                 ┌───────────────────────┐
│ Worker Node 1         │                                 │ Worker Node 2         │
│ sbt-db-node-1         │                                 │ sbt-db-node-2         │
│                       │                                 │                       │
│ LV → PVC2 LV          │                                 │ LV → PVC1 LV          │
│ (LocalPV LVM)         │                                 │ (LocalPV LVM)         │
│                       │                                 │                       │
│ VG: data_vg: 70Gi free│                                 │ VG: data_vg: 70Gi free│
│ FS: XFS               │                                 │ FS: XFS               │
└───────────────────────┘                                 └───────────────────────┘

                     ┌──────────────────────────────────┐
                     │        CNPG Operator             │
                     │ WAL/Data Disk Space Checks       │
                     │ NodeExpandVolume interactions    │
                     └──────────────────────────────────┘
                                      │
                                      ▼
                     ┌──────────────────────────────────┐
                      │              Errors              │
                     │ - Insufficient Disk Space        │
                     │ - NodeExpandVolume exit status 5 │
                     │ - CrashLoopBackOff               │
                     └──────────────────────────────────┘


A Real-World Production Issue & How We Resolved It

In this article, I’ll walk you through a production-grade debugging scenario where our CloudNativePG (CNPG) PostgreSQL cluster went into a full outage.

Both the primary and replica pods suddenly got stuck in:

CrashLoopBackOff
STATUS: Not enough disk space

Even though the PVCs showed enough free space, the pods refused to start.
This was a tricky issue involving CNPG disk space validation, OpenEBS LVM LocalPV, PVC expansion, and node-level LVM debugging.

Let’s break down the entire story—from symptoms to deep cause analysis and the final fix.


  1. Incident Summary

  • Cluster: customer-database
  • StorageClass: openebs-myapp-database (OpenEBS LocalPV LVM)
  • PVC Size: 50Gi
  • Issue:
    - CNPG cluster status → Not enough disk space
    - Pods → CrashLoopBackOff
    - PostgreSQL readiness → HTTP 500
    - NodeExpandVolume failing (exit status 5)

  2. Symptoms We Observed

CrashLoopBackOff on both pods

$ kubectl get po -n sbt-prod | grep customer
customer-database-1 0/1 CrashLoopBackOff 385
customer-database-2 0/1 CrashLoopBackOff 414

CNPG operator logs

  • "msg":"Insufficient disk space detected in a pod. PostgreSQL cannot proceed until the PVC group is enlarged"

[user@sbt-adm-1 db]$ kubectl -n cnpg-system logs --tail=100 -f cnpg-cloudnative-pg-12v2nj3456-rqx24

{"level":"info","ts":"2025-11-22T06:06:43.85562733Z","msg":"Cannot extract Pod status","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"customer-database","namespace":"sbt-prod"},"namespace":"sbt-prod","name":"customer-database","reconcileID":"4dfd7c15-f213-45e6-bf5d-4dbc27179526","podName":"customer-database-2","error":"Get \"https://12.1.9.233.0000/pg/status\": dial tcp 12.1.9.233.0000: connect: connection refused"}
Readiness probe failing

  • Readiness probe failed: HTTP probe failed with statuscode: 500

PVC events showing expansion attempts:

  • ExternalExpanding
  • FileSystemResizeRequired
  • VolumeResizeFailed: exit status 5

[user@sbt-adm-1 db]$ kubectl get events -n sbt-prod
LAST SEEN TYPE REASON OBJECT MESSAGE
2s Warning Unhealthy pod/customer-database-2 Readiness probe failed: HTTP probe failed with statuscode: 500
50s Warning Unhealthy pod/customer-database-1 Readiness probe failed: HTTP probe failed with statuscode: 500
7s Warning Unhealthy pod/customer-database-2 Readiness probe failed: HTTP probe failed with statuscode: 500
19s Normal ExternalExpanding persistentvolumeclaim/customer-database-1 waiting for an external controller to expand this PVC
19s Normal FileSystemResizeRequired persistentvolumeclaim/customer-database-1 Require file system resize of volume on node
1s Warning VolumeResizeFailed pod/customer-database-2 NodeExpandVolume.NodeExpandVolume failed for volume "pvc-211b11f0-f0d5-4cbd-9bd2-2f2bc7ed68ec" : Expander.NodeExpand failed to expand the volume : rpc error: code = Internal desc = failed to handle NodeExpandVolume Request for pvc-211b11f0-f0d5-4cbd-9bd2-2f2bc7ed68ec, {exit status 5}

The cluster status also reflected a critical issue:
STATUS: Not enough disk space

Even though the PVC “looked fine”, CNPG refused to start PostgreSQL.


  3. Understanding How CNPG Checks Disk Space

CloudNativePG runs pre-flight disk checks before starting PostgreSQL.

It validates:

  • WAL directory free space
  • Data directory free space
  • Threshold: less than 10% free → PostgreSQL startup is blocked

This is a safety mechanism to prevent data corruption.
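The check is easy to reproduce by hand. Here is a minimal sketch of the idea (a hypothetical re-implementation using `df`, not CNPG's actual code):

```shell
# Sketch of the kind of pre-flight check CNPG performs before starting
# PostgreSQL: refuse to start when free space drops below a threshold.
check_free_space() {
  # $1 = mount point, $2 = minimum free percent required
  local mount=$1 min_free=$2
  local used_pct free_pct
  # df -P (POSIX mode) guarantees one record per line; column 5 is "Use%"
  used_pct=$(df -P "$mount" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
  free_pct=$((100 - used_pct))
  if [ "$free_pct" -lt "$min_free" ]; then
    echo "BLOCKED: only ${free_pct}% free on $mount"
    return 1
  fi
  echo "OK: ${free_pct}% free on $mount"
}

# Example: the article's 10% rule applied to the root filesystem
check_free_space / 10 || echo "PostgreSQL startup would be blocked here"
```

In our incident this is exactly the gate the instance manager hit: the PVC object said 50Gi, but the filesystem underneath was nearly full.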


  4. Investigation & Debugging

Step 1 — PVCs Look Normal

kubectl describe pvc customer-database-1
Capacity: 50Gi
Status: Bound
Events:

From Kubernetes perspective → the PVCs were healthy.

Step 2 — But CNPG still reports disk shortage

kubectl get cluster -A
STATUS: Not enough disk space

This means the issue was at node level, not Kubernetes level.

Step 3 — Check the actual node storage (LVM level)

PVC annotation tells us where the volume lives:

volume.kubernetes.io/selected-node: sbt-db-node-2
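That annotation can also be read directly with kubectl. A small helper (PVC names from this incident; the annotation key is the standard one shown above):

```shell
# Print the node a LocalPV PVC is pinned to. LocalPV volumes never move,
# so any expansion must succeed on this exact node.
pvc_node() {
  kubectl get pvc "$1" -n sbt-prod -o \
    jsonpath='{.metadata.annotations.volume\.kubernetes\.io/selected-node}'
}

# Usage against the live cluster:
#   pvc_node customer-database-1
```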

SSH into both DB nodes:

sudo vgs

Output:
VG VSize VFree
data_vg 199.99g 69.99g
rl 96.99g 0

Critical takeaway:

  • The OS volume group rl = 0 free space
  • The database VolumeGroup data_vg = ~70Gi free (this is OK)

So why was PVC expansion failing?


  5. Root Cause of the Failure

The StorageClass used:

Provisioner: local.csi.openebs.io
storage: lvm
volgroup: data_vg
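Put together, the StorageClass looks roughly like this (a sketch; parameter names follow the OpenEBS LocalPV-LVM provisioner, and `allowVolumeExpansion` must be true for PVC resize to be attempted at all):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-myapp-database
provisioner: local.csi.openebs.io
allowVolumeExpansion: true
parameters:
  storage: "lvm"
  volgroup: "data_vg"
  fsType: "xfs"
```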

OpenEBS tried to expand the PVC, but the expansion failed inside LVM on the node, showing:

VolumeResizeFailed: exit status 5

This typically indicates:

  • LVM LV metadata locked
  • Filesystem required manual resize
  • Previous resize attempt left LV inconsistent
  • CSI node plugin failed resizing
  • Kernel-level XFS quota issues

Because the filesystem never grew, CNPG correctly detected insufficient disk space and blocked PostgreSQL startup, which left both pods in CrashLoopBackOff.
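If the CSI path keeps failing, the same expansion can be attempted by hand on the node. A recovery sketch (a hypothetical helper; the LV path and mount point must first be looked up with `sudo lvs` and `findmnt`, and note that `xfs_growfs` takes the mount point, not the device):

```shell
# Hypothetical manual-recovery helper for an OpenEBS LVM LocalPV volume:
# grow the logical volume, then grow the XFS filesystem on top of it.
# Run as root on the node that hosts the PV.
expand_localpv() {
  local lv_path=$1 grow_by=$2 mount_point=$3
  lvextend -L "+${grow_by}" "$lv_path" || return 1  # e.g. grow_by=50G
  xfs_growfs "$mount_point"                          # XFS grows online, via its mount point
}

# Usage (paths are illustrative, look yours up with `sudo lvs` / `findmnt`):
#   expand_localpv /dev/data_vg/pvc-211b11f0-f0d5-4cbd-9bd2-2f2bc7ed68ec 50G <mount-point>
```

In our case we did not need this: once the Cluster spec was bumped and the pods recreated, the CSI driver completed the resize itself, as shown next.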


  6. How We Fixed the Issue

Step 1 — Increase storage in postgresql.yaml

[user@sbt-adm-1 db]$ vim postgresql.yaml

storage:
  size: 100Gi

Apply it:

[user@sbt-adm-1 db]$ kubectl apply -f postgresql.yaml --dry-run=server
cluster.postgresql.cnpg.io/myapp-database unchanged (server dry run)
cluster.postgresql.cnpg.io/customer-database configured (server dry run)

[user@sbt-adm-1 db]$ kubectl apply -f postgresql.yaml
cluster.postgresql.cnpg.io/myapp-database unchanged
cluster.postgresql.cnpg.io/customer-database configured
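For context, the relevant part of postgresql.yaml sits inside the CNPG Cluster spec, roughly like this (a minimal sketch; field names per the CNPG Cluster API, other spec fields omitted):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: customer-database
  namespace: sbt-prod
spec:
  instances: 2
  storage:
    storageClass: openebs-myapp-database
    size: 100Gi        # raised from 50Gi to trigger PVC expansion
```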


Step 2 — Delete both pods to force NodeExpandVolume

[user@sbt-adm-1 db]$ kubectl delete po -n sbt-prod customer-database-2
pod "customer-database-2" deleted

[user@sbt-adm-1 db]$ kubectl delete po -n sbt-prod customer-database-1
pod "customer-database-1" deleted

[user@sbt-adm-1 db]$ kubectl get cluster -A
NAMESPACE NAME AGE INSTANCES READY STATUS PRIMARY
sbt-prod customer-database 212d 2 Waiting for the instances to become active customer-database-1
sbt-prod myapp-database 212d 2 2 Cluster in healthy state myapp-database-1

This triggers:

  • LVM LV extension
  • XFS filesystem resize
  • Fresh pod scheduling with expanded volume

Step 3 — Validate PVC Expansion

kubectl get events -n sbt-prod | grep -i resize

[user@sbt-adm-1 db]$ kubectl get events -n sbt-prod
LAST SEEN TYPE REASON OBJECT MESSAGE
16s Normal FileSystemResizeSuccessful pod/customer-database-1 MountVolume.NodeExpandVolume succeeded for volume "pvc-554f83db-36f7-49ae-a69d-fdedbf0b7297" sbt-db-node-2
16s Normal FileSystemResizeSuccessful persistentvolumeclaim/customer-database-1 MountVolume.NodeExpandVolume succeeded for volume

We finally saw:

FileSystemResizeSuccessful
MountVolume.NodeExpandVolume succeeded
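It is also worth confirming that the filesystem inside the pod actually grew, not just the PVC object. A small helper (the PGDATA mount path follows CNPG's default layout):

```shell
# Print the size of the PGDATA filesystem as PostgreSQL sees it.
pg_fs_size() {
  kubectl exec -n sbt-prod "$1" -c postgres -- df -h /var/lib/postgresql/data
}

# Usage: pg_fs_size customer-database-1
# The Size column should now reflect the expanded volume, not the old 50Gi.
```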


Step 4 — Verify Pods Recover

kubectl get po -n sbt-prod | grep customer
customer-database-1 1/1 Running
customer-database-2 1/1 Running

[user@sbt-adm-1 db]$ kubectl get po -n sbt-prod | grep -i customer
customer-database-1 1/1 Running 0 94s
customer-database-2 0/1 Running 0 88s


Step 5 — Cluster Becomes Healthy

kubectl get cluster -A
customer-database → Cluster in healthy state

[user@sbt-adm-1 db]$ kubectl get cluster -A
NAMESPACE NAME AGE INSTANCES READY STATUS PRIMARY
sbt-prod customer-database 212d 2 1 Waiting for the instances to become active customer-database-1
sbt-prod myapp-database 212d 2 2 Cluster in healthy state myapp-database-1

[user@sbt-adm-1 db]$ kubectl logs -n sbt-prod --tail=-1 -f customer-database-1

Defaulted container "postgres" out of: postgres, bootstrap-controller (init)
{"level":"info","ts":"2025-11-22T06:30:45.516600723Z","msg":"Starting CloudNativePG Instance Manager","logger":"instance-manager","logging_pod":"customer-database-1","version":"1.25.1","build":{"Version":"1.25.1","Commit":"c56e00d4","Date":"2025-02-28"}}
{"level":"info","ts":"2025-11-22T06:30:45.516809527Z","msg":"Checking for free disk space for WALs before starting PostgreSQL","logger":"instance-manager","logging_pod":"customer-database-1"}
UTC","virtual_transaction_id":"9/1535","transaction_id":"56833618","error_severity":"LOG","sql_state_code":"00000","message":"autovacuum: dropping orphan temp table \"customer.pg_temp_10.dms_tmp_register_filter\"","backend_type":"autovacuum worker","query_id":"0"}}
{"level":"info","ts":"2025-11-22T06:33:22.81405764Z","logger":"wal-archive","msg":"Executing barman-cloud-wal-archive","logging_pod":"customer-database-1","walName":"pg_wal/00000001000000910000000M","options"

  7. Key Learnings from This Incident

  1. PVC showing free space ≠ LVM having free space

Kubernetes PVC capacity is logical.
Actual data resides inside LVM LVs.

  2. OpenEBS LVM LocalPV must be monitored at node level

sudo vgs, sudo lvs, df -h, and mount become critical tools.
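A tiny check along those lines (a hypothetical helper for a node-level cron job or exporter; threshold in GiB, parsing the machine-readable output of `vgs`):

```shell
# Warn when a volume group's free space drops below a threshold (GiB).
vg_free_check() {
  local vg=$1 min_free_g=$2
  local free_g
  free_g=$(sudo vgs --noheadings --units g --nosuffix -o vg_free "$vg" | tr -d ' ')
  if awk -v f="$free_g" -v m="$min_free_g" 'BEGIN { exit (f < m) }'; then
    echo "OK: ${free_g}G free in $vg"
  else
    echo "WARN: only ${free_g}G free in $vg"
  fi
}

# Usage on a DB node: vg_free_check data_vg 20
```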

  3. CNPG's disk checks protect data integrity

PostgreSQL refuses to start instead of corrupting WAL/data.

  4. PVC expansion is NOT enough

The underlying filesystem must resize successfully.

  5. Deleting pods (NOT PVCs) can fix NodeExpandVolume

Pod restart triggers storage driver operations.


  8. Commands Cheat Sheet

Check cluster status:

kubectl get cluster -A

Check PVC:

kubectl describe pvc customer-database-1 -n sbt-prod

Check pod logs:

kubectl logs -n sbt-prod customer-database-1

Check CNPG operator logs:

kubectl logs -n cnpg-system -l app=cnpg

Node LVM:

sudo vgs
sudo lvs
sudo df -h


Conclusion

This incident showed how deeply interconnected Kubernetes storage, CNPG logic, and LVM-based OpenEBS volumes are.

Even when PVCs appear healthy, underlying LVM constraints can silently cause volume expansion failures—ultimately crashing PostgreSQL pods.

With a methodical approach—checking logs, PVC events, node-level LVM, and CNPG operator—we resolved the issue and restored the cluster to normal operation.