storage: stuck pgs cause block storage hangs

Bug #1450827 reported by Tom Murray
Affects: Juniper Openstack (status tracked in Trunk)

  Milestone   Status          Importance   Assigned to
  R2.1        Fix Committed   Critical     Jeya ganesh babu J
  R2.20       Fix Committed   Critical     Jeya ganesh babu J
  Trunk       Fix Committed   Critical     Jeya ganesh babu J

Bug Description

This is with contrail storage 2.01/43

We are seeing frequent outages in access to block storage. Virtual machines with attached volumes become blocked or extremely slow, and if a VM has a boot volume we see guest kernel hangs.

When this occurs we see that ceph has blocked pgs:

root@gngsvm009d:/opt/contrail/utils# ceph -s
    cluster eaaeaa55-a8e7-4531-a5eb-03d73028b59d
     health HEALTH_WARN 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean; nodeep-scrub flag(s) set; mon.gngsvm009d low disk space
     monmap e70: 10 mons at {gngsvc009a=10.163.43.1:6789/0,gngsvc009b=10.163.43.2:6789/0,gngsvc010a=10.163.43.5:6789/0,gngsvc010b=10.163.43.6:6789/0,gngsvc011a=10.163.43.9:6789/0,gngsvc011b=10.163.43.10:6789/0,gngsvc011c=10.163.43.11:6789/0,gngsvm009d=10.163.43.4:6789/0,gngsvm010d=10.163.43.8:6789/0,gngsvm011d=10.163.43.12:6789/0}, election epoch 21144, quorum 0,1,2,3,4,5,6,7,8,9 gngsvc009a,gngsvc009b,gngsvm009d,gngsvc010a,gngsvc010b,gngsvm010d,gngsvc011a,gngsvc011b,gngsvc011c,gngsvm011d
     osdmap e48008: 428 osds: 426 up, 426 in
            flags nodeep-scrub
      pgmap v9304154: 37620 pgs, 4 pools, 36162 GB data, 9207 kobjects
            107 TB used, 1240 TB / 1348 TB avail
               37618 active+clean
                   2 peering
  client io 12871 kB/s rd, 26707 kB/s wr, 2060 op/s
root@gngsvm009d:/opt/contrail/utils# ceph health detail
HEALTH_WARN 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean; nodeep-scrub flag(s) set; mon.gngsvm009d low disk space
pg 4.154c is stuck inactive for 138204.365400, current state peering, last acting [328,80,280]
pg 4.1793 is stuck inactive for 75241.376101, current state peering, last acting [80,328,35]
pg 4.154c is stuck unclean for 138204.365679, current state peering, last acting [328,80,280]
pg 4.1793 is stuck unclean for 75241.376375, current state peering, last acting [80,328,35]
pg 4.1793 is peering, acting [80,328,35]
pg 4.154c is peering, acting [328,80,280]
nodeep-scrub flag(s) set
mon.gngsvm009d low disk space -- 26% avail

In the output above, two OSDs, 80 and 328, are common to the acting sets of both stuck pgs. The log files for these OSDs show no errors or any other indication of a problem, and observing the processes in "top" shows no unusual behavior.
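The common-OSD check can be scripted. Below is a minimal Python sketch (not part of Contrail storage; the regex assumes the "pg ... is stuck ... last acting [...]" line format shown in the `ceph health detail` output above) that counts how often each OSD id appears in the acting sets of stuck pgs, so the most-shared OSDs surface first:

```python
import re
from collections import Counter

def common_osds(health_detail):
    """Rank OSD ids by how many distinct stuck PGs list them in their
    acting set, most frequent first."""
    counts = Counter()
    seen = set()
    for line in health_detail.splitlines():
        m = re.search(r"pg (\S+) is stuck \w+ for .*acting \[([\d,]+)\]", line)
        if m and m.group(1) not in seen:
            seen.add(m.group(1))  # count each stuck PG only once
            counts.update(int(osd) for osd in m.group(2).split(","))
    return counts.most_common()

sample = """\
pg 4.154c is stuck inactive for 138204.365400, current state peering, last acting [328,80,280]
pg 4.1793 is stuck inactive for 75241.376101, current state peering, last acting [80,328,35]
pg 4.154c is stuck unclean for 138204.365679, current state peering, last acting [328,80,280]
pg 4.1793 is stuck unclean for 75241.376375, current state peering, last acting [80,328,35]
"""
print(common_osds(sample))  # OSDs 80 and 328 appear in both stuck PGs
```

On the health output above this ranks 328 and 80 ahead of 280 and 35, matching the manual diagnosis.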

Restarting the common OSDs brings the service back; in the case above, restarting OSD 80 was sufficient. However, sometimes we need to restart each common OSD in turn until we find the one that is blocking.

Tags: storage
Jeba Paulaiyan (jebap)
Changed in juniperopenstack:
importance: Undecided → Critical
assignee: nobody → saravanan purushothaman (spuru)
information type: Proprietary → Public
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : R2.20

Review in progress for https://review.opencontrail.org/10635
Submitter: Jeya ganesh babu (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/10635
Committed: http://github.org/Juniper/contrail-packaging/commit/61c1e31d232410068d3cfc71f959d8efd93b8a7e
Submitter: Zuul
Branch: R2.20

commit 61c1e31d232410068d3cfc71f959d8efd93b8a7e
Author: Jeya ganesh babu J <email address hidden>
Date: Wed May 20 20:29:30 2015 -0700

Updating ceph with latest fixes

Closes-Bug: #1450827
Issue: Issue was seen in Ganges cluster where some of the osds
get stuck in peering process.
Fix: Pulled in the latest Giant patch which has the peering fix.

Change-Id: I2351cdc30e8592197fa44301b205bc1970bf4818

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/11198
Submitter: Jeya ganesh babu (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11198
Committed: http://github.org/Juniper/contrail-packaging/commit/478cc9728755e010b3bcf5397a3192b7a8bf9f2e
Submitter: Zuul
Branch: master

commit 478cc9728755e010b3bcf5397a3192b7a8bf9f2e
Author: Jeya ganesh babu J <email address hidden>
Date: Tue Jun 2 14:24:08 2015 -0700

Updating ceph with latest fixes

Closes-Bug: #1450827
Issue: Issue was seen in Ganges cluster where some of the osds
get stuck in peering process.
Fix: Pulled in the latest Giant patch which has the peering fix.

Change-Id: Ibe4017ba604c7c4c77e71fe105692f6094721422

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.1

Review in progress for https://review.opencontrail.org/12111
Submitter: Jeya ganesh babu (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/12111
Committed: http://github.org/Juniper/contrail-packaging/commit/3dea3edcb044dbbff148d5a7690852369b1aaeca
Submitter: Zuul
Branch: R2.1

commit 3dea3edcb044dbbff148d5a7690852369b1aaeca
Author: Jeya ganesh babu J <email address hidden>
Date: Tue Jun 30 13:48:31 2015 -0700

storage package fix merge

Closes-Bug: #1450827
Issue: Issue was seen in Ganges cluster where some of the osds
get stuck in peering process.
Fix: Pulled in the latest Giant patch which has the peering fix

Change-Id: I1841bc2a90292a3be8b096ac0dec589c71af4014
