CNPG Disk Error

Fixing CNPG PostgreSQL CrashLoopBackOff: A Deep Dive into PVC Resize Failure & Disk Pressure on OpenEBS LVM LocalPV

                   ┌─────────────────────────────────────────┐
                   │     CloudNativePG PostgreSQL Cluster    │
                   │─────────────────────────────────────────│
                   │  customer-database-1 (Primary Pod)      │
                   │  customer-database-2 (Replica Pod)      │
                   └─────────────────────────────────────────┘
                                │                │
                                │                │
                   ┌───────────────────┐   ┌───────────────────┐
                   │ PVC: cust-db-1    │   │ PVC: cust-db-2    │
                   └───────────────────┘   └───────────────────┘
                                │                │
                                └──────┬─────────┘
                                       ▼
                         ┌──────────────────────────┐
                         │ StorageClass:            │
                         │  openebs-myapp-database  │
                         │  provisioner: local.csi  │
                         │  volgroup: data_vg       │
                         └──────────────────────────┘
                                │                │
         ┌──────────────────────┘                └──────────────────────┐
         ▼                                                             ▼
┌───────────────────────┐                                 ┌───────────────────────┐
│ Worker Node 1         │                                 │ Worker Node 2         │
│ sbt-db-node-1         │                                 │ sbt-db-node-2         │
│                       │                                 │                       │
│ LV → PVC2 LV          │                                 │ LV → PVC1 LV          │
│ (LocalPV LVM)         │                                 │ (LocalPV LVM)         │
│                       │                                 │                       │
│ VG: data_vg: 70Gi free│                                 │ VG: data_vg: 70Gi free│
│ FS: XFS               │                                 │ FS: XFS               │
└───────────────────────┘                                 └───────────────────────┘

                     ┌──────────────────────────────────┐
                     │        CNPG Operator             │
                     │ WAL/Data Disk Space Checks       │
                     │ NodeExpandVolume interactions    │
                     └──────────────────────────────────┘
                                      │
                                      ▼
                     ┌──────────────────────────────────┐
                      │              Errors              │
                     │ - Insufficient Disk Space        │
                     │ - NodeExpandVolume exit status 5 │
                     │ - CrashLoopBackOff               │
                     └──────────────────────────────────┘


A Real-World Production Issue & How We Resolved It

In this article, I’ll walk you through a production-grade debugging scenario where our CloudNativePG (CNPG) PostgreSQL cluster went into a full outage.

Both the primary and replica pods suddenly got stuck in:

CrashLoopBackOff
STATUS: Not enough disk space

Even though the PVCs showed enough free space, the pods refused to start.
This was a tricky issue involving CNPG disk space validation, OpenEBS LVM LocalPV, PVC expansion, and node-level LVM debugging.

Let’s break down the entire story—from symptoms to deep cause analysis and the final fix.


  1. Incident Summary

  • Cluster: customer-database
  • StorageClass: openebs-myapp-database (OpenEBS LocalPV LVM)
  • PVC Size: 50Gi
  • Issue:
    - CNPG cluster status → Not enough disk space
    - Pods → CrashLoopBackOff
    - PostgreSQL readiness → HTTP 500
    - NodeExpandVolume failing (exit status 5)

  2. Symptoms We Observed

CrashLoopBackOff on both pods

$ kubectl get po -n sbt-prod | grep customer
customer-database-1 0/1 CrashLoopBackOff 385
customer-database-2 0/1 CrashLoopBackOff 414

CNPG operator logs

  • "msg":"Insufficient disk space detected in a pod. PostgreSQL cannot proceed until the PVC group is enlarged"

[user@sbt-adm-1 db]$ kubectl -n cnpg-system logs --tail=100 -f cnpg-cloudnative-pg-12v2nj3456-rqx24

{"level":"info","ts":"2025-11-22T06:06:43.85562733Z","msg":"Cannot extract Pod status","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"customer-database","namespace":"sbt-prod"},"namespace":"sbt-prod","name":"customer-database","reconcileID":"4dfd7c15-f213-45e6-bf5d-4dbc27179526","podName":"customer-database-2","error":"Get \"https://12.1.9.233.0000/pg/status\": dial tcp 12.1.9.233.0000: connect: connection refused"}
Readiness probe failing

  • Readiness probe failed: HTTP probe failed with statuscode: 500

PVC events showing expansion attempts:

  • ExternalExpanding
  • FileSystemResizeRequired
  • VolumeResizeFailed: exit status 5

[user@sbt-adm-1 db]$ kubectl get events -n sbt-prod
LAST SEEN TYPE REASON OBJECT MESSAGE
2s Warning Unhealthy pod/customer-database-2 Readiness probe failed: HTTP probe failed with statuscode: 500
50s Warning Unhealthy pod/customer-database-1 Readiness probe failed: HTTP probe failed with statuscode: 500
7s Warning Unhealthy pod/customer-database-2 Readiness probe failed: HTTP probe failed with statuscode: 500
19s Normal ExternalExpanding persistentvolumeclaim/customer-database-1 waiting for an external controller to expand this PVC
19s Normal FileSystemResizeRequired persistentvolumeclaim/customer-database-1 Require file system resize of volume on node
1s Warning VolumeResizeFailed pod/customer-database-2 NodeExpandVolume.NodeExpandVolume failed for volume "pvc-211b11f0-f0d5-4cbd-9bd2-2f2bc7ed68ec" : Expander.NodeExpand failed to expand the volume : rpc error: code = Internal desc = failed to handle NodeExpandVolume Request for pvc-211b11f0-f0d5-4cbd-9bd2-2f2bc7ed68ec, {exit status 5}

The cluster status also reflected a critical issue:
STATUS: Not enough disk space

Even though the PVC “looked fine”, CNPG refused to start PostgreSQL.


  3. Understanding How CNPG Checks Disk Space

CloudNativePG runs pre-flight disk checks before starting PostgreSQL.

It validates:

  • WAL directory free space
  • Data directory free space
  • Threshold: less than 10% free → PostgreSQL startup is blocked

This is a safety mechanism to prevent data corruption.
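The check is easy to reproduce by hand. Here is a minimal sketch of the idea (a hypothetical re-implementation using `df`, not CNPG's actual code):

```shell
# Sketch of the kind of pre-flight check CNPG performs before starting
# PostgreSQL: refuse to start when free space drops below a threshold.
check_free_space() {
  # $1 = mount point, $2 = minimum free percent required
  local mount=$1 min_free=$2
  local used_pct free_pct
  # df -P (POSIX mode) guarantees one record per line; column 5 is "Use%"
  used_pct=$(df -P "$mount" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
  free_pct=$((100 - used_pct))
  if [ "$free_pct" -lt "$min_free" ]; then
    echo "BLOCKED: only ${free_pct}% free on $mount"
    return 1
  fi
  echo "OK: ${free_pct}% free on $mount"
}

# Example: the article's 10% rule applied to the root filesystem
check_free_space / 10 || echo "PostgreSQL startup would be blocked here"
```

In our incident this is exactly the gate the instance manager hit: the PVC object said 50Gi, but the filesystem underneath was nearly full.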


  4. Investigation & Debugging

Step 1 — PVCs Look Normal

kubectl describe pvc customer-database-1
Capacity: 50Gi
Status: Bound
Events:

From Kubernetes perspective → the PVCs were healthy.

Step 2 — But CNPG still reports disk shortage

kubectl get cluster -A
STATUS: Not enough disk space

This means the issue was at node level, not Kubernetes level.

Step 3 — Check the actual node storage (LVM level)

PVC annotation tells us where the volume lives:

volume.kubernetes.io/selected-node: sbt-db-node-2
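That annotation can also be read directly with kubectl. A small helper (PVC names from this incident; the annotation key is the standard one shown above):

```shell
# Print the node a LocalPV PVC is pinned to. LocalPV volumes never move,
# so any expansion must succeed on this exact node.
pvc_node() {
  kubectl get pvc "$1" -n sbt-prod -o \
    jsonpath='{.metadata.annotations.volume\.kubernetes\.io/selected-node}'
}

# Usage against the live cluster:
#   pvc_node customer-database-1
```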

SSH into both DB nodes:

sudo vgs

Output:
VG VSize VFree
data_vg 199.99g 69.99g
rl 96.99g 0

Critical takeaway:

  • The OS volume group rl = 0 free space
  • The database VolumeGroup data_vg = ~70Gi free (this is OK)

So why was PVC expansion failing?


  5. Root Cause of the Failure

The StorageClass used:

Provisioner: local.csi.openebs.io
storage: lvm
volgroup: data_vg
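Put together, the StorageClass looks roughly like this (a sketch; parameter names follow the OpenEBS LocalPV-LVM provisioner, and `allowVolumeExpansion` must be true for PVC resize to be attempted at all):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-myapp-database
provisioner: local.csi.openebs.io
allowVolumeExpansion: true
parameters:
  storage: "lvm"
  volgroup: "data_vg"
  fsType: "xfs"
```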

OpenEBS tried to expand the PVC, but the expansion failed inside LVM on the node, showing:

VolumeResizeFailed: exit status 5

This typically indicates:

  • LVM LV metadata locked
  • Filesystem required manual resize
  • Previous resize attempt left LV inconsistent
  • CSI node plugin failed resizing
  • Kernel-level XFS quota issues

Because the filesystem never grew, CNPG correctly detected insufficient disk space and blocked PostgreSQL startup, which left both pods in CrashLoopBackOff.
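If the CSI path keeps failing, the same expansion can be attempted by hand on the node. A recovery sketch (a hypothetical helper; the LV path and mount point must first be looked up with `sudo lvs` and `findmnt`, and note that `xfs_growfs` takes the mount point, not the device):

```shell
# Hypothetical manual-recovery helper for an OpenEBS LVM LocalPV volume:
# grow the logical volume, then grow the XFS filesystem on top of it.
# Run as root on the node that hosts the PV.
expand_localpv() {
  local lv_path=$1 grow_by=$2 mount_point=$3
  lvextend -L "+${grow_by}" "$lv_path" || return 1  # e.g. grow_by=50G
  xfs_growfs "$mount_point"                          # XFS grows online, via its mount point
}

# Usage (paths are illustrative, look yours up with `sudo lvs` / `findmnt`):
#   expand_localpv /dev/data_vg/pvc-211b11f0-f0d5-4cbd-9bd2-2f2bc7ed68ec 50G <mount-point>
```

In our case we did not need this: once the Cluster spec was bumped and the pods recreated, the CSI driver completed the resize itself, as shown next.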


  6. How We Fixed the Issue

Step 1 — Increase storage in postgresql.yaml

[user@sbt-adm-1 db]$ vim postgresql.yaml

storage:
  size: 100Gi

Apply it:

[user@sbt-adm-1 db]$ kubectl apply -f postgresql.yaml --dry-run=server
cluster.postgresql.cnpg.io/myapp-database unchanged (server dry run)
cluster.postgresql.cnpg.io/customer-database configured (server dry run)

[user@sbt-adm-1 db]$ kubectl apply -f postgresql.yaml
cluster.postgresql.cnpg.io/myapp-database unchanged
cluster.postgresql.cnpg.io/customer-database configured
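For context, the relevant part of postgresql.yaml sits inside the CNPG Cluster spec, roughly like this (a minimal sketch; field names per the CNPG Cluster API, other spec fields omitted):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: customer-database
  namespace: sbt-prod
spec:
  instances: 2
  storage:
    storageClass: openebs-myapp-database
    size: 100Gi        # raised from 50Gi to trigger PVC expansion
```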


Step 2 — Delete both pods to force NodeExpandVolume

[user@sbt-adm-1 db]$ kubectl delete po -n sbt-prod customer-database-2
pod "customer-database-2" deleted

[user@sbt-adm-1 db]$ kubectl delete po -n sbt-prod customer-database-1
pod "customer-database-1" deleted

[user@sbt-adm-1 db]$ kubectl get cluster -A
NAMESPACE NAME AGE INSTANCES READY STATUS PRIMARY
sbt-prod customer-database 212d 2 Waiting for the instances to become active customer-database-1
sbt-prod myapp-database 212d 2 2 Cluster in healthy state myapp-database-1

This triggers:

  • LVM LV extension
  • XFS filesystem resize
  • Fresh pod scheduling with expanded volume

Step 3 — Validate PVC Expansion

kubectl get events -n sbt-prod | grep -i resize

[user@sbt-adm-1 db]$ kubectl get events -n sbt-prod
LAST SEEN TYPE REASON OBJECT MESSAGE
16s Normal FileSystemResizeSuccessful pod/customer-database-1 MountVolume.NodeExpandVolume succeeded for volume "pvc-554f83db-36f7-49ae-a69d-fdedbf0b7297" sbt-db-node-2
16s Normal FileSystemResizeSuccessful persistentvolumeclaim/customer-database-1 MountVolume.NodeExpandVolume succeeded for volume

We finally saw:

FileSystemResizeSuccessful
MountVolume.NodeExpandVolume succeeded
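It is also worth confirming that the filesystem inside the pod actually grew, not just the PVC object. A small helper (the PGDATA mount path follows CNPG's default layout):

```shell
# Print the size of the PGDATA filesystem as PostgreSQL sees it.
pg_fs_size() {
  kubectl exec -n sbt-prod "$1" -c postgres -- df -h /var/lib/postgresql/data
}

# Usage: pg_fs_size customer-database-1
# The Size column should now reflect the expanded volume, not the old 50Gi.
```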


Step 4 — Verify Pods Recover

kubectl get po -n sbt-prod | grep customer
customer-database-1 1/1 Running
customer-database-2 1/1 Running

[user@sbt-adm-1 db]$ kubectl get po -n sbt-prod | grep -i customer
customer-database-1 1/1 Running 0 94s
customer-database-2 0/1 Running 0 88s


Step 5 — Cluster Becomes Healthy

kubectl get cluster -A
customer-database → Cluster in healthy state

[user@sbt-adm-1 db]$ kubectl get cluster -A
NAMESPACE NAME AGE INSTANCES READY STATUS PRIMARY
sbt-prod customer-database 212d 2 1 Waiting for the instances to become active customer-database-1
sbt-prod myapp-database 212d 2 2 Cluster in healthy state myapp-database-1

[user@sbt-adm-1 db]$ kubectl logs -n sbt-prod --tail=-1 -f customer-database-1

Defaulted container "postgres" out of: postgres, bootstrap-controller (init)
{"level":"info","ts":"2025-11-22T06:30:45.516600723Z","msg":"Starting CloudNativePG Instance Manager","logger":"instance-manager","logging_pod":"customer-database-1","version":"1.25.1","build":{"Version":"1.25.1","Commit":"c56e00d4","Date":"2025-02-28"}}
{"level":"info","ts":"2025-11-22T06:30:45.516809527Z","msg":"Checking for free disk space for WALs before starting PostgreSQL","logger":"instance-manager","logging_pod":"customer-database-1"}
UTC","virtual_transaction_id":"9/1535","transaction_id":"56833618","error_severity":"LOG","sql_state_code":"00000","message":"autovacuum: dropping orphan temp table \"customer.pg_temp_10.dms_tmp_register_filter\"","backend_type":"autovacuum worker","query_id":"0"}}
{"level":"info","ts":"2025-11-22T06:33:22.81405764Z","logger":"wal-archive","msg":"Executing barman-cloud-wal-archive","logging_pod":"customer-database-1","walName":"pg_wal/00000001000000910000000M","options"

  7. Key Learnings from This Incident

  1. PVC showing free space ≠ LVM having free space

Kubernetes PVC capacity is logical.
Actual data resides inside LVM LVs.

  2. OpenEBS LVM LocalPV must be monitored at node level

sudo vgs, sudo lvs, df -h, and mount become critical tools.
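A tiny check along those lines (a hypothetical helper for a node-level cron job or exporter; threshold in GiB, parsing the machine-readable output of `vgs`):

```shell
# Warn when a volume group's free space drops below a threshold (GiB).
vg_free_check() {
  local vg=$1 min_free_g=$2
  local free_g
  free_g=$(sudo vgs --noheadings --units g --nosuffix -o vg_free "$vg" | tr -d ' ')
  if awk -v f="$free_g" -v m="$min_free_g" 'BEGIN { exit (f < m) }'; then
    echo "OK: ${free_g}G free in $vg"
  else
    echo "WARN: only ${free_g}G free in $vg"
  fi
}

# Usage on a DB node: vg_free_check data_vg 20
```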

  3. CNPG's disk checks protect data integrity

PostgreSQL refuses to start instead of corrupting WAL/data.

  4. PVC expansion is NOT enough

The underlying filesystem must resize successfully.

  5. Deleting pods (NOT PVCs) can fix NodeExpandVolume

Pod restart triggers storage driver operations.


  8. Commands Cheat Sheet

Check cluster status:

kubectl get cluster -A

Check PVC:

kubectl describe pvc customer-database-1 -n sbt-prod

Check pod logs:

kubectl logs -n sbt-prod customer-database-1

Check CNPG operator logs:

kubectl logs -n cnpg-system -l app=cnpg

Node LVM:

sudo vgs
sudo lvs
sudo df -h


Conclusion

This incident showed how deeply interconnected Kubernetes storage, CNPG logic, and LVM-based OpenEBS volumes are.

Even when PVCs appear healthy, underlying LVM constraints can silently cause volume expansion failures—ultimately crashing PostgreSQL pods.

With a methodical approach—checking logs, PVC events, node-level LVM, and CNPG operator—we resolved the issue and restored the cluster to normal operation.