storage: stuck pgs cause block storage hangs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
R2.1 | Fix Committed | Critical | Jeya ganesh babu J |
R2.20 | Fix Committed | Critical | Jeya ganesh babu J |
Trunk | Fix Committed | Critical | Jeya ganesh babu J |
Bug Description
This is with Contrail storage 2.01/43.
We are seeing frequent outages in access to block storage. Virtual machines that have volumes become blocked or extremely slow. If the VM has a boot volume, we see guest kernel hangs.
When this occurs, Ceph reports blocked pgs:
root@gngsvm009d
cluster eaaeaa55-
health HEALTH_WARN 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean; nodeep-scrub flag(s) set; mon.gngsvm009d low disk space
monmap e70: 10 mons at {gngsvc009a=
osdmap e48008: 428 osds: 426 up, 426 in
flags nodeep-scrub
pgmap v9304154: 37620 pgs, 4 pools, 36162 GB data, 9207 kobjects
107 TB used, 1240 TB / 1348 TB avail
client io 12871 kB/s rd, 26707 kB/s wr, 2060 op/s
root@gngsvm009d
HEALTH_WARN 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean; nodeep-scrub flag(s) set; mon.gngsvm009d low disk space
pg 4.154c is stuck inactive for 138204.365400, current state peering, last acting [328,80,280]
pg 4.1793 is stuck inactive for 75241.376101, current state peering, last acting [80,328,35]
pg 4.154c is stuck unclean for 138204.365679, current state peering, last acting [328,80,280]
pg 4.1793 is stuck unclean for 75241.376375, current state peering, last acting [80,328,35]
pg 4.1793 is peering, acting [80,328,35]
pg 4.154c is peering, acting [328,80,280]
nodeep-scrub flag(s) set
mon.gngsvm009d low disk space -- 26% avail
In the output above, two OSDs are common to both stuck pgs: 80 and 328. The log files for these OSDs show no errors or any other indication of a problem. Observing the OSD processes in "top" also shows no unusual behavior.
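Picking out the shared OSDs by eye gets harder as the number of stuck pgs grows. The following is a minimal sketch, not part of the original report, that parses health output in the format shown above and intersects the "last acting" sets. The function name `common_osds` and the regular expression are assumptions made for illustration; the pattern is only known to match the lines quoted in this report.

```python
import re

# Matches lines in the format quoted in this report, for example:
#   pg 4.154c is stuck inactive for 138204.365400, current state peering, last acting [328,80,280]
# Assumption: other Ceph releases may format these lines differently.
STUCK_PG = re.compile(
    r"^pg (\S+) is stuck \w+ for [\d.]+, current state \S+, last acting \[([\d,]+)\]"
)


def common_osds(health_detail: str) -> set:
    """Return the OSD ids present in the acting set of every stuck pg."""
    acting = {}
    for line in health_detail.splitlines():
        match = STUCK_PG.match(line.strip())
        if match:
            pg_id = match.group(1)
            # A pg can be listed twice (inactive and unclean); keep one entry per pg.
            acting[pg_id] = {int(osd) for osd in match.group(2).split(",")}
    if not acting:
        return set()
    return set.intersection(*acting.values())


if __name__ == "__main__":
    sample = """\
pg 4.154c is stuck inactive for 138204.365400, current state peering, last acting [328,80,280]
pg 4.1793 is stuck inactive for 75241.376101, current state peering, last acting [80,328,35]
pg 4.154c is stuck unclean for 138204.365679, current state peering, last acting [328,80,280]
pg 4.1793 is stuck unclean for 75241.376375, current state peering, last acting [80,328,35]
"""
    print(sorted(common_osds(sample)))
```

Run against the four "stuck" lines quoted above, this yields OSDs 80 and 328, matching the manual observation.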
Restarting the common OSDs can bring the service back. In the case above, restarting OSD 80 was sufficient. However, sometimes we need to restart each of them in turn until we find the one that is blocking.
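The try-each-candidate workaround can be sketched as a small loop. This is an illustration only, not the fix that was committed: `restart_until_unblocked`, `restart`, and `is_unblocked` are hypothetical names, and the two callables stand in for whatever site-specific mechanism restarts an OSD and checks whether any pgs are still stuck, since neither is specified in this report.

```python
from typing import Callable, Iterable, Optional


def restart_until_unblocked(
    candidates: Iterable[int],
    restart: Callable[[int], None],
    is_unblocked: Callable[[], bool],
) -> Optional[int]:
    """Restart candidate OSDs one at a time until the stuck pgs clear.

    `restart` and `is_unblocked` are hypothetical hooks supplied by the
    caller; any wait for peering to settle belongs inside `is_unblocked`.
    Returns the id of the OSD whose restart cleared the blockage, or None
    if none of the candidates helped.
    """
    for osd_id in candidates:
        restart(osd_id)
        if is_unblocked():
            return osd_id
    return None


if __name__ == "__main__":
    # Dry run with fakes: pretend OSD 80 is the one that is blocking.
    restarted = []
    culprit = restart_until_unblocked(
        [328, 80],
        restart=restarted.append,
        is_unblocked=lambda: 80 in restarted,
    )
    print(culprit, restarted)
```

Stopping at the first restart that clears the stuck pgs avoids bouncing more OSDs than necessary, which matters on a cluster that is already peering.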
Changed in juniperopenstack:
importance: Undecided → Critical
assignee: nobody → saravanan purushothaman (spuru)
information type: Proprietary → Public
Review in progress for https://review.opencontrail.org/10635
Submitter: Jeya ganesh babu (<email address hidden>)