This looks like a Ceph recovery issue: the OSDs appear to be stuck peering, which blocks cluster access.
controller-1:/var/log/ceph$ ceph health detail | head
HEALTH_WARN Reduced data availability: 64 pgs inactive, 64 pgs peering
PG_AVAILABILITY Reduced data availability: 64 pgs inactive, 64 pgs peering
pg 1.0 is stuck peering for 6124.369015, current state peering, last acting [1,0]
pg 1.1 is stuck peering for 6119.913881, current state peering, last acting [0,1]
pg 1.2 is stuck peering for 6149.545071, current state peering, last acting [1,0]
pg 1.3 is stuck peering for 6148.786716, current state peering, last acting [1,0]
pg 1.4 is stuck peering for 6187.537512, current state peering, last acting [1,0]
pg 1.5 is stuck peering for 6182.361878, current state peering, last acting [1,0]
pg 1.6 is stuck peering for 6119.918444, current state peering, last acting [0,1]
pg 1.7 is stuck peering for 6119.918502, current state peering, last acting [0,1]
Our current recovery logic will not detect this scenario. A restart of either OSD daemon will fix the issue.
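A minimal sketch of that workaround, assuming systemd-managed OSDs; on this platform the daemons may instead be managed through sysvinit/pmon, so the exact service command is an assumption:

controller-1:~$ sudo systemctl restart ceph-osd@1    # or, on sysvinit-style deployments: sudo /etc/init.d/ceph restart osd.1
controller-1:~$ ceph -s                              # PGs should move out of 'peering' shortly afterwards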
controller-1:~$ ceph -s
  cluster:
    id:     427bf4e1-20f5-4c9a-a1f9-337796696e3a
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive, 64 pgs peering

  services:
    mon: 3 daemons, quorum controller-0,controller-1,compute-0
    mgr: controller-1(active), standbys: controller-0
    osd: 2 osds: 2 up, 2 in

  data:
    pools:   1 pools, 64 pgs
    objects: 73.06 k objects, 283 GiB
    usage:   568 GiB used, 324 GiB / 892 GiB avail
    pgs:     100.000% pgs not active
             64 peering
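With both OSDs up and in but 100% of PGs inactive, the condition is straightforward to key on. A hypothetical check that the recovery/audit logic could run periodically; the count check and the logging action are illustrative assumptions, not the existing implementation:

# count PGs that are stuck inactive and currently in 'peering'
# (ceph pg dump_stuck prints its status line to stderr, hence the redirect)
stuck=$(ceph pg dump_stuck inactive 2>/dev/null | grep -c peering)
if [ "${stuck:-0}" -gt 0 ]; then
    # in this report the count would be 64; flag it so recovery can act on an acting OSD
    echo "ceph: ${stuck} PGs stuck peering"
fi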
controller-1:~$ ceph osd tree
ID CLASS WEIGHT  TYPE NAME                  STATUS REWEIGHT PRI-AFF
-1       0.87097 root storage-tier
-2       0.87097     chassis group-0
-4       0.43549         host controller-0
 0   ssd 0.43549             osd.0              up  1.00000 1.00000
-3       0.43549         host controller-1
 1   ssd 0.43549             osd.1              up  1.00000 1.00000
controller-1:~$ tail /var/log/ceph/ceph-osd.1.log
2019-11-13 15:33:52.302 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:53.315 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:54.273 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:55.279 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:56.267 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2723 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:57.273 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2725 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:58.315 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2725 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:59.274 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2725 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:00.281 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2725 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:01.318 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2726 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
controller-1:~$ ssh sysadmin@controller-0 tail /var/log/ceph/ceph-osd.0.log
sysadmin@controller-0's password:
<snip>
2019-11-13 15:34:41.298 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1819 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:42.273 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1820 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:43.299 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1820 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:44.299 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1820 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:45.275 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1820 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:46.244 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1821 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:47.260 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1822 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:48.249 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1822 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:49.245 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1822 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:34:50.290 7fa2ecd6c700 -1 osd.0 1740 get_health_metrics reporting 1822 slow ops, oldest is osd_op(client.2797209.0:3802671 1.38 1.a0d2b9f8 (undecoded) ondisk+write+known_if_redirected e1739)
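Both OSDs show the same signature: slow ops keep accumulating while the oldest op stays (undecoded), consistent with peering never completing rather than an I/O failure. Besides restarting a daemon, marking one of the acting OSDs down (the still-running daemon reasserts itself up and re-peers) might also clear the state; that is an assumption about this particular hang, not something verified here:

controller-1:~$ ceph osd down 1    # forces osd.1 to re-peer once it is marked up again
controller-1:~$ ceph -s            # confirm the PGs leave 'peering'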