Looks like this is a Ceph recovery issue. The OSDs appear to be stuck peering, which blocks cluster access.

controller-1:/var/log/ceph$ ceph health detail | head
HEALTH_WARN Reduced data availability: 64 pgs inactive, 64 pgs peering
PG_AVAILABILITY Reduced data availability: 64 pgs inactive, 64 pgs peering
    pg 1.0 is stuck peering for 6124.369015, current state peering, last acting [1,0]
    pg 1.1 is stuck peering for 6119.913881, current state peering, last acting [0,1]
    pg 1.2 is stuck peering for 6149.545071, current state peering, last acting [1,0]
    pg 1.3 is stuck peering for 6148.786716, current state peering, last acting [1,0]
    pg 1.4 is stuck peering for 6187.537512, current state peering, last acting [1,0]
    pg 1.5 is stuck peering for 6182.361878, current state peering, last acting [1,0]
    pg 1.6 is stuck peering for 6119.918444, current state peering, last acting [0,1]
    pg 1.7 is stuck peering for 6119.918502, current state peering, last acting [0,1]

Our current recovery logic will not detect this scenario. A restart of either OSD daemon will fix the issue (see the detection sketch after the log excerpts below).

controller-1:~$ ceph -s
  cluster:
    id:     427bf4e1-20f5-4c9a-a1f9-337796696e3a
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive, 64 pgs peering

  services:
    mon: 3 daemons, quorum controller-0,controller-1,compute-0
    mgr: controller-1(active), standbys: controller-0
    osd: 2 osds: 2 up, 2 in

  data:
    pools:   1 pools, 64 pgs
    objects: 73.06 k objects, 283 GiB
    usage:   568 GiB used, 324 GiB / 892 GiB avail
    pgs:     100.000% pgs not active
             64 peering

controller-1:~$ ceph osd tree
ID CLASS WEIGHT  TYPE NAME                  STATUS REWEIGHT PRI-AFF
-1       0.87097 root storage-tier
-2       0.87097     chassis group-0
-4       0.43549         host controller-0
 0   ssd 0.43549             osd.0              up  1.00000 1.00000
-3       0.43549         host controller-1
 1   ssd 0.43549             osd.1              up  1.00000 1.00000

controller-1:~$ tail /var/log/ceph/ceph-osd.1.log
2019-11-13 15:33:52.302 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:53.315 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:54.273 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:55.279 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:56.267 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2723 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:57.273 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2725 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:58.315 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2725 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:59.274 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2725 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:00.281 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2725 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:01.318 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2726 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)

controller-1:~$ ssh sysadmin@controller-0 tail /var/log/ceph/ceph-osd.0.log
sysadmin@controller-0's password:
2019-11-13 15:34:41.298 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1819 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:42.273 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1820 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:43.299 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1820 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:44.299 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1820 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:45.275 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1820 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:46.244 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1821 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:47.260 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1822 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:48.249 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1822 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:49.245 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1822 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:50.290 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1822 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
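
For illustration only, here is a minimal shell sketch of the kind of check that could catch this state: it parses the same "ceph health detail" output shown above for PGs stuck in 'peering' and points at the primary OSD of the first stuck PG. The 600-second threshold and the "systemctl restart ceph-osd@<id>" command are assumptions for this sketch, not the existing recovery logic; the real fix should restart the daemon through whatever mechanism actually manages the OSDs on these hosts.

#!/bin/bash
# Illustrative sketch only (not the shipped recovery logic): detect PGs that
# have been stuck in 'peering' and suggest which OSD daemon to restart.
# Assumption: the restart command printed at the end is a placeholder; adapt
# it to however the OSD daemons are managed on the node.

THRESHOLD=600   # seconds a PG may stay in 'peering' before we call it stuck

# 'ceph health detail' prints lines like:
#   pg 1.0 is stuck peering for 6124.369015, current state peering, last acting [1,0]
# Field 7 is the time stuck (awk's numeric coercion drops the trailing comma)
# and the last field is the acting set.
stuck=$(ceph health detail 2>/dev/null |
        awk -v t="$THRESHOLD" '/is stuck peering for/ { if ($7 + 0 > t) print $2, $NF }')

if [ -z "$stuck" ]; then
    echo "no PGs stuck peering"
    exit 0
fi

echo "PGs stuck peering for more than ${THRESHOLD}s (pgid, last acting set):"
echo "$stuck"

# Primary of the first stuck PG = first OSD id inside the acting-set brackets.
primary=$(echo "$stuck" | head -n1 | sed 's/.*\[\([0-9][0-9]*\).*/\1/')
echo "suggested fix: restart osd.${primary}, e.g.: sudo systemctl restart ceph-osd@${primary}"

Restarting the other OSD from the acting set would work just as well here, since either daemon restart clears the hang in this report.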