Comment 0 for bug 1789908

Frank Miller (sensfan22) wrote :

Brief Description
-----------------
During a backup and restore (B&R), I noticed that Ceph was in the following HEALTH_WARN state and appeared to be stuck:

[wrsroot@controller-0 scratch(keystone_admin)]$ ceph -s
    cluster 2d62cbb0-2f6c-4382-a4ea-a024c0dc166e
     health HEALTH_WARN
            555 pgs degraded
            555 pgs stuck degraded
            1536 pgs stuck unclean
            555 pgs stuck undersized
            555 pgs undersized
     monmap e1: 3 mons at {controller-0=192.168.215.103:6789/0,controller-1=192.168.215.104:6789/0,storage-0=192.168.215.105:6789/0}
            election epoch 6, quorum 0,1,2 controller-0,controller-1,storage-0
     osdmap e82: 12 osds: 12 up, 12 in; 981 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v449: 1920 pgs, 10 pools, 1588 bytes data, 1116 objects
            460 MB used, 11383 GB / 11384 GB avail
                 561 active+remapped
                 555 active+undersized+degraded
                 420 active
                 384 active+clean
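
The detailed PG state was not captured at the time. If this is reproduced, something along these lines (a suggested diagnostic sketch only, not output from this system) should show which PGs are undersized/unclean and which CRUSH rules they map through:

ceph health detail
ceph pg dump_stuck unclean
ceph pg dump_stuck undersized
ceph osd crush rule dump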

ceph osd tree reports the following:

[wrsroot@controller-0 scratch(keystone_admin)]$ ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -7 8.21172 root default
 -6 1.45279 host storage-2
  4 0.72639 osd.4 up 1.00000 1.00000
  5 0.72639 osd.5 up 1.00000 1.00000
 -8 2.25298 host storage-3
  9 1.81749 osd.9 up 1.00000 1.00000
  8 0.43549 osd.8 up 1.00000 1.00000
 -9 2.25298 host storage-5
 11 1.81749 osd.11 up 1.00000 1.00000
 10 0.43549 osd.10 up 1.00000 1.00000
-10 2.25298 host storage-4
  7 1.81749 osd.7 up 1.00000 1.00000
  6 0.43549 osd.6 up 1.00000 1.00000
 -2 0 root cache-tier
 -1 2.90558 root storage-tier
 -3 2.90558 chassis group-0
 -4 1.45279 host storage-0
  0 0.72639 osd.0 up 1.00000 1.00000
  1 0.72639 osd.1 up 1.00000 1.00000
 -5 1.45279 host storage-1
  2 0.72639 osd.2 up 1.00000 1.00000
  3 0.72639 osd.3 up 1.00000 1.00000
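
Note that storage-2 through storage-5 ended up under root default, whereas storage-0 and storage-1 sit under root storage-tier / chassis group-0. Assuming the intended layout is all storage hosts under the storage-tier root (an assumption based on the storage-0/storage-1 placement, not confirmed), the misplaced hosts could in principle be moved back manually, e.g. (untested sketch; the exact chassis bucket would have to match the original layout):

ceph osd crush move storage-2 root=storage-tier
ceph osd crush move storage-3 root=storage-tier
ceph osd crush move storage-4 root=storage-tier
ceph osd crush move storage-5 root=storage-tier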

Severity
--------
Major: B&R fails when using more than 2 storage nodes

Steps to Reproduce
------------------
With more than 2 storage nodes, execute a B&R

Expected Behavior
------------------
No Ceph health warning should occur after the restore completes

Actual Behavior
----------------
Ceph enters and remains in HEALTH_WARN as shown above

Reproducibility
---------------
100% reproducible with >2 storage nodes

System Configuration
--------------------
Dedicated storage config with >2 storage nodes

Branch/Pull Time/Commit
-----------------------
Any StarlingX load

Timestamp/Logs
--------------
n/a