Concurrent deletion of instances leads to residual multipath devices

Bug #2048837 reported by zhou zhong
This bug affects 8 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========
A 100G shared iSCSI volume was attached to 3 instances scheduled on the same node (node-2). I then deleted the 3 instances concurrently; the instances were deleted successfully, but the output of 'multipath -ll' showed residual, faulty paths, as follows.

[root@node-2 ~]# multipath -ll
Jan 10 10:25:42 | sdj: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdl: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdk: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdn: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdi: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdo: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdm: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdp: prio = const (setting: emergency fallback - alua failed)
mpathaj (36001405acb21c8bbf33e1449b295c517) dm-2 ESSTOR,IBLOCK
size=100G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| |- 24:0:0:39 sdj 8:144 failed faulty running
| |- 17:0:0:39 sdl 8:176 failed faulty running
| |- 22:0:0:39 sdk 8:160 failed faulty running
| `- 19:0:0:39 sdn 8:208 failed faulty running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 23:0:0:39 sdi 8:128 failed faulty running
  |- 18:0:0:39 sdo 8:224 failed faulty running
  |- 21:0:0:39 sdm 8:192 failed faulty running
  `- 20:0:0:39 sdp 8:240 failed faulty running

Steps to reproduce
==================
1. Boot 3 instances using RBD as the root disk (the root-disk protocol does not matter for this step).
2. Create an iSCSI shared (multiattach) volume to use as the data disk; any commercial or other storage system that uses the iSCSI protocol will do.
3. Attach the shared volume to each of the 3 instances.
4. Make sure the volume is attached to all 3 instances successfully, then delete the instances concurrently (example commands below).
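A rough CLI sketch of these steps, assuming admin credentials; the volume type, flavor, image, network and host names below are placeholders, and the backend must support multiattach:

# Multiattach volume type and the shared 100G volume (names are illustrative).
openstack volume type create multiattach-type
openstack volume type set --property multiattach="<is> True" multiattach-type
openstack volume create --type multiattach-type --size 100 shared-vol

# Boot 3 instances on the same host (the AZ:host form requires admin).
for i in 1 2 3; do
  openstack server create --flavor m1.small --image rbd-image \
      --network private --availability-zone nova:node-2 vm-$i
done

# Attach the shared volume to each instance.
for i in 1 2 3; do openstack server add volume vm-$i shared-vol; done

# Delete the 3 instances concurrently.
for i in 1 2 3; do openstack server delete vm-$i & done
wait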

Expected result
===============
The 3 instances are deleted completely, and 'multipath -ll' shows no residual multipath devices.

Actual result
=============
The 3 instances were deleted, but the node was left with residual multipath devices, as shown in the output in the description above.
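A manual diagnostic/cleanup sketch on the compute node (the map name mpathaj comes from the output above; the target IQN and portal are placeholders, and a map should only be flushed after confirming nothing else uses it):

# Inspect residual multipath maps and iSCSI sessions.
multipath -ll
iscsiadm -m session

# Flush the stale map and log out of the now-unused session(s).
multipath -f mpathaj
iscsiadm -m node -T <target-iqn> -p <portal> --logout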

Environment
===========
1. Exact version of OpenStack you are running:
   Wallaby Nova & Cinder, connected to a commercial storage backend over iSCSI.

2. Which hypervisor did you use?
   Libvirt 8.0.0 + qemu-kvm 6.2.0

3. Which storage type did you use?
   RBD as the root disk; 1 shared iSCSI volume attached as a data disk to 3 instances scheduled on the same node.

4. Which networking type did you use?
   omit...

Logs & Configs
==============
According to the deletion code, nova will not disconnect the shared volume from the instance when the volume is also attached to other instances on the same node; it logs 'Detected multiple connections on this host for volume'. node-2 nova-compute output:

2024-01-10 11:05:29.904 +0800 ¦ node-2 ¦ nova-compute-d94f6 ¦ nova-compute ¦ 2024-01-10T11:05:29.904196604+08:00 stdout F 2024-01-10 11:05:29.903 59580 INFO nova.virt.libvirt.driver [req-c9082d4c-457a-4859-a0be-c2c23953a17c fa0faf20c0e84275a5505eb6cb2673a8 793aac4869d643b19e60248715c3735b - default default] Detected multiple connections on this host for volume: f31b8fd2-1651-4667-af05-7364ac501cf9, skipping target disconnect.
2024-01-10 11:05:30.143 +0800 ¦ node-2 ¦ nova-compute-d94f6 ¦ nova-compute ¦ 2024-01-10T11:05:30.143536178+08:00 stdout F 2024-01-10 11:05:30.143 59580 INFO nova.virt.libvirt.driver [req-065c2b2b-ae16-453f-abb7-a5756ed87f3a fa0faf20c0e84275a5505eb6cb2673a8 793aac4869d643b19e60248715c3735b - default default] Detected multiple connections on this host for volume: f31b8fd2-1651-4667-af05-7364ac501cf9, skipping target disconnect.
2024-01-10 11:05:30.334 +0800 ¦ node-2 ¦ nova-compute-d94f6 ¦ nova-compute ¦ 2024-01-10T11:05:30.334997487+08:00 stdout F 2024-01-10 11:05:30.334 59580 INFO nova.virt.libvirt.driver [req-41afd565-599f-4b35-b4cb-acf074332079 fa0faf20c0e84275a5505eb6cb2673a8 793aac4869d643b19e60248715c3735b - default default] Detected multiple connections on this host for volume: f31b8fd2-1651-4667-af05-7364ac501cf9, skipping target disconnect.
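One way to see the condition Nova is reacting to (roughly, multiple attachments of the same volume on this host) is to inspect the volume's attachments while the instances still exist; the volume ID below is the one from the log, and note that Nova's actual check is against its own block-device records for the host, so this is only an approximation:

# Shows the (server, host, device) tuple for each attachment of the shared volume.
openstack volume show f31b8fd2-1651-4667-af05-7364ac501cf9 -c attachments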

The 'multipath -ll' output after the deletions is the same residual, faulty-path output shown in the Description above.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/907958

sean mooney (sean-k-mooney) wrote :

"According to code of deleting, nova will not disconnect the shared volume from instance when the volume also attached to the other instances on the same node,"

I'm not sure this is correct. We won't disconnect it from the host,
but we would disconnect the volume from the instance domain.

Those are two different things.

In the special case of deleting an instance, the entire domain is being removed, so once that completes there is no instance left for the volume to be connected to.
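A quick way to see the two levels on the compute node while an instance is still running (the libvirt domain name below is illustrative): the domain-level attachment is what libvirt reports for the guest, while the host-level iSCSI sessions and multipath map exist independently and are what get left behind here:

# Domain-level: the disk attached to one specific guest.
virsh domblklist instance-00000001

# Host-level: the iSCSI sessions and multipath map backing that disk;
# these stay on the node even after the domain is undefined.
iscsiadm -m session
multipath -ll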

For what it's worth, this bug may have security implications.

We will need to review this carefully.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/916322

Changed in nova:
status: New → In Progress