After detaching a volume from a VM, the multipath device is in failed state

Bug #1208799 reported by shay berman
This bug affects 12 people
Affects                    Status    Importance   Assigned to   Milestone
OpenStack Compute (nova)   Invalid   High         Unassigned
os-brick                   Invalid   Undecided    Unassigned

Bug Description

I created a cinder volume (fibre channel connectivity) and then attached it to a nova VM.

Here is the multipath -l output (all the paths are in the active state, which is correct):
-------------------------------------------------------------------------------------
mpath380 (200173800fe0226d5) dm-0 IBM,2810XIV
size=964M features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-1 status=active
  |- 2:0:16:100 sde 8:64 active undef running
  |- 2:0:13:100 sdb 8:16 active undef running
  |- 2:0:15:100 sdd 8:48 active undef running
  |- 3:0:4:100 sdi 8:128 active undef running
  |- 3:0:1:100 sdf 8:80 active undef running
  |- 3:0:2:100 sdg 8:96 active undef running
  `- 3:0:3:100 sdh 8:112 active undef running

But after I detach the volume from the nova VM, the multipath -l output shows all paths in failed status:
-------------------------------------------------------------------------------------
mpath380 (200173800fe0226d5) dm-0 IBM,2810XIV
size=964M features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=-1 status=enabled
  |- 2:0:16:100 sde 8:64 failed undef running
  |- 2:0:15:100 sdd 8:48 failed undef running
  |- 3:0:4:100 sdi 8:128 failed undef running
  |- 3:0:1:100 sdf 8:80 failed undef running
  |- 3:0:2:100 sdg 8:96 failed undef running
  `- 3:0:3:100 sdh 8:112 failed undef running

Note: these failed paths in multipath caused a lot of bad-path messages in /var/log/syslog.

Basic information on the environment :
---------------------------------------------------------------------
- nova version is Grizzly 1:2013.1.1-0ubuntu2~cloud0, on Ubuntu.
- cinder and nova-compute are installed on the same host.
- multipath-tools package version is 0.4.9-3ubuntu5

How I recovered those failed paths:
--------------------------------------------------------------
1. For each failed device path shown in multipath -l, I executed:
     # echo 1 > /sys/block/${device_path}/device/delete

2. For each failed multipath device, I executed:
    dmsetup message ${multipath_device} 0 "fail_if_no_path"

3. Then I refreshed multipath by executing:
    # multipath -F

After that, the multipath device and its failed device paths were gone; a consolidated sketch of these three steps follows below.
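
For reference, the three steps consolidated into a rough script (a sketch only: the map name mpath380 and the " failed " match are assumptions based on the multipath -l output above, and the commands need root):

#!/bin/bash
# Sketch: clean up one stale multipath map left behind after a detach.
# MPATH is the map name as shown by "multipath -l"; adjust per host.
MPATH=mpath380

# 1. Delete every SCSI path device (sdX) of the map that is in "failed" state.
for dev in $(multipath -l "$MPATH" | awk '/ failed / { print $3 }'); do
    echo 1 > /sys/block/"$dev"/device/delete
done

# 2. Tell the map to stop queueing I/O now that its paths are gone.
dmsetup message "$MPATH" 0 "fail_if_no_path"

# 3. Flush unused multipath maps so the stale entry disappears.
multipath -F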

melanie witt (melwitt)
tags: added: volumes
removed: detach grizzly multipath
Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Ihor Kaharlichenko (madkinder) wrote :

The same problem is reproduced on CentOS with the Havana release. Please see the patch that solved the problem for me.

Revision history for this message
Ilja Livenson (ilja-t) wrote :

I think the importance of this should be increased: after the 'dirty' detach, all new volumes/VMs booted from volume fail to detect the multipath device, which results in a serious degradation of service.

Revision history for this message
Andres Toomsalu (andres-active) wrote :

Still reproducing this with Icehouse (RDO 2014.1.2-1.el6), and it's a critical/blocker bug when using cinder with the FC driver and multipath.

Revision history for this message
Andres Toomsalu (andres-active) wrote :

Proper (online) cleanup procedure for failed paths (multipath -F is intrusive!):

MPATHDEV="/dev/dm-58"
multipath -ll $MPATHDEV
for i in $(multipath -ll $MPATHDEV | awk '/ failed / { print $3 }'); do
    echo "Removing: $i"
    echo 1 > /sys/block/${i}/device/delete
done
multipath -ll $MPATHDEV
multipath -f $MPATHDEV

More details on the Icehouse volume detach / multipath cleanup problem:

1) It does not always happen: after the nova host is cleaned of failed multipath devices, things might work again for a while.
2) But once it happens it keeps recurring (i.e. it is reproducible) until the multipath cleanup is done manually (or the host is rebooted).

Perhaps it's related to multipath/nova timing? A quick way to check whether a host is back in this state is sketched below.
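
For what it's worth, a rough way to spot when a host has drifted back into this state (an illustrative sketch; the awk matching assumes the same multipath -ll layout as in the report above):

#!/bin/bash
# Sketch: list multipath maps that still carry paths in "failed" state,
# so the cleanup above can be re-run before new attach/detach operations misbehave.
multipath -ll | awk '
    /dm-[0-9]+/ { map = $1 }      # map header line such as "mpath380 (...) dm-0 IBM,2810XIV"
    / failed /  { bad[map]++ }    # path line under that map in failed state
    END { for (m in bad) printf "%s: %d failed path(s)\n", m, bad[m] }
'

If this prints anything, the host still has stale maps and the per-path cleanup needs to be repeated.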

melanie witt (melwitt)
Changed in nova:
importance: Medium → High
Revision history for this message
Andres Toomsalu (andres-active) wrote :

This issue is also linked to: https://bugs.launchpad.net/nova/+bug/1290681

The patch presented previously by Ihor Kaharlichenko for Havana no longer solves the problem in Icehouse.

tags: added: multipath
Revision history for this message
Matt Riedemann (mriedem) wrote :

Is this fixed by bug 1382440 in Kilo?

Revision history for this message
Luo Gangyi (luogangyi) wrote :

@Matt Riedemann

The patch you pointed to seems to only fix iSCSI-based volumes. I will check whether my FC-based volumes work fine.

Revision history for this message
Luo Gangyi (luogangyi) wrote :

I tested it in my cluster (Icehouse, FC SAN), and it seems there is no problem.

Revision history for this message
Keiichi KII (k-keiichi) wrote :

I tested it in my test environment.
It seems this bug is already fixed in master for Liberty, after migrating to the os-brick library.

Revision history for this message
Walt Boring (walter-boring) wrote :

This bug did exist back in Grizzly. I don't think the FC-based libvirt volume driver was doing a multipath -f <dev> after detach, which leads to a dead multipath entry showing up in multipath -l. I believe this is already fixed.

Revision history for this message
Walt Boring (walter-boring) wrote :

For what it's worth, the current os-brick FC connector does do a multipath -f after removing the device from the host here:

https://github.com/openstack/os-brick/blob/master/os_brick/initiator/connector.py#L1122
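
For illustration only, the detach-side cleanup described there boils down to something like the following at the shell level (not the os-brick code itself; the WWID and sd* names are placeholders taken from the original report):

#!/bin/bash
# Illustrative sketch of the flush-on-detach idea: remove the volume's SCSI
# path devices from the host, then flush its multipath map so no dead entry
# is left behind for "multipath -l".
for dev in sdb sdd sde sdf sdg sdh sdi; do
    echo 1 > /sys/block/"$dev"/device/delete
done
multipath -f 200173800fe0226d5   # flush the map by its WWID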

Revision history for this message
Alon Marx (alonma) wrote :

This bug originally started in our lab, and we haven't seen it happen for a long time. I think we can close it.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Per comment 11 it sounds like this should already be fixed in os-brick, which nova uses in Liberty, so I'm marking the bug as invalid. If this can be recreated in Liberty, please re-open the bug with details.

Changed in nova:
status: Confirmed → Invalid
Changed in os-brick:
status: New → Invalid