Bug #2007779 “Online extension not seen in guest instance withou...” : Bugs : Cinder

Revision history for this message

Walid Moghrabi (walid-fdj) wrote on 2023-02-19:

#1

I forgot to give some details on the setup :
* Canonical Charmed Openstack based on Ubuntu 20.04
* Openstack release : stable/Yoga
* cinnder-purestorage : 1.18.1
* cinder : 20.0.0

Sofia Enriquez (lsofia-enriquez) on 2023-02-22

Changed in cinder:
importance:	Undecided → High
tags:	added: live-extend pure-storage

Revision history for this message

Terry Cowart (tcowart) wrote on 2023-02-22:

#2

We have experienced this behavior as well, also using Pure storage but on FC.

If you resize the volume a second time the guest will be "aware" of the previous resize.

As a workaround we have been using this procedure, which I'm sure is not the "right" way.
Do the online resize then go to the host, find the instance and get the instance name and mpath device using virsh dumpxml and find the full mpath. Then issue the below command replacing the path, new size and instance-name from the example below.
virsh -c qemu:///system blockresize --path /dev/disk/by-id/dm-uuid-mpath-NUMBER --size 100G instance-00005a47

I also saw a post on the discuss mailing list that has further info, https://lists.openstack.org/pipermail/openstack-discuss/2023-January/031764.html

Revision history for this message

Simon Dodsley (simon-dodsley) wrote on 2023-02-23:

#3

Can you confirm that Pure's Linux recommended settings have been applied to the Nova compute nodes?

Specifically, the UDEV rules to enable all the correct SCSI Unit Attentions.

Details can be found here: https://support.purestorage.com/Solutions/Linux/Linux_Reference/Linux_Recommended_Settings#SCSI_Unit_Attentions

Revision history for this message

Terry Cowart (tcowart) wrote on 2023-02-23:

#4

Download full text (8.2 KiB)

I have contacted pure to get the ubuntu settings per their note on the compatibility matirx. However I do not believe this issue is due to the udev rules.

Before Resize:
3624a937081a2fb5441394dd20005aafa dm-87 PURE,FlashArray
size=40G features='0' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  |- 15:0:1:9 sdhr 134:16 active ready running
  |- 17:0:6:9 sdhw 134:96 active ready running
  |- 16:0:0:9 sdht 134:48 active ready running
  |- 18:0:2:9 sdhy 134:128 active ready running
  |- 15:0:3:9 sdhs 134:32 active ready running
  |- 17:0:5:9 sdhv 134:80 active ready running
  |- 16:0:2:9 sdhu 134:64 active ready running
  `- 18:0:0:9 sdhx 134:112 active ready running

Resize issued, dmesg output:
[22883977.119323] sd 18:0:0:9: Capacity data has changed [22883977.134358] sd 15:0:1:9: Capacity data has changed
[22883977.148433] sd 17:0:6:9: Capacity data has changed
[22883977.164201] sd 16:0:0:9: Capacity data has changed
[22883977.177540] sd 18:0:2:9: Capacity data has changed
[22883977.191926] sd 15:0:3:9: Capacity data has changed
[22883977.204963] sd 17:0:5:9: Capacity data has changed
[22883977.217926] sd 16:0:2:9: Capacity data has changed
[22883977.669103] sd 15:0:3:9: alua: supports implicit TPGS
[22883977.669107] sd 15:0:3:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 0 rel port 3
[22883977.670099] sd 15:0:3:9: [sdhs] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)
[22883977.670503] sdhs: detected capacity change from 42949672960 to 48318382080
[22883977.679523] sd 15:0:3:9: alua: port group 00 state A non-preferred supports TolUSNA
[22883977.704455] sd 15:0:1:9: alua: supports implicit TPGS
[22883977.704459] sd 15:0:1:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 1 rel port 13
[22883977.705406] sd 15:0:1:9: [sdhr] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)
[22883977.705886] sdhr: detected capacity change from 42949672960 to 48318382080
[22883977.715478] sd 15:0:1:9: alua: port group 01 state A non-preferred supports TolUSNA
[22883977.735927] sd 16:0:2:9: alua: supports implicit TPGS
[22883977.735931] sd 16:0:2:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 0 rel port 8
[22883977.736992] sd 16:0:2:9: [sdhu] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)
[22883977.737408] sdhu: detected capacity change from 42949672960 to 48318382080
[22883977.747491] sd 16:0:2:9: alua: port group 00 state A non-preferred supports TolUSNA
[22883977.764231] sd 16:0:0:9: alua: supports implicit TPGS
[22883977.764236] sd 16:0:0:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 1 rel port 18
[22883977.765291] sd 16:0:0:9: [sdht] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)
[22883977.765804] sdht: detected capacity change from 42949672960 to 48318382080
[22883977.775476] sd 16:0:0:9: alua: port group 01 state A non-preferred supports TolUSNA
[22883977.794191] sd 17:0:6:9: alua: supports implicit TPGS
[22883977.794195] sd 17:0:6:9: alua: device naa.624a937081a2fb54413...

I have contacted pure to get the ubuntu settings per their note on the compatibility matirx.  However I do not believe this issue is due to the udev rules.

Before Resize:
3624a937081a2fb5441394dd20005aafa dm-87 PURE,FlashArray
size=40G features='0' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  |- 15:0:1:9   sdhr 134:16  active ready running
  |- 17:0:6:9   sdhw 134:96  active ready running
  |- 16:0:0:9   sdht 134:48  active ready running
  |- 18:0:2:9   sdhy 134:128 active ready running
  |- 15:0:3:9   sdhs 134:32  active ready running
  |- 17:0:5:9   sdhv 134:80  active ready running
  |- 16:0:2:9   sdhu 134:64  active ready running
  `- 18:0:0:9   sdhx 134:112 active ready running

Resize issued, dmesg output:
[22883977.119323] sd 18:0:0:9: Capacity data has changed                                                                                                                                                         [22883977.134358] sd 15:0:1:9: Capacity data has changed
[22883977.148433] sd 17:0:6:9: Capacity data has changed
[22883977.164201] sd 16:0:0:9: Capacity data has changed
[22883977.177540] sd 18:0:2:9: Capacity data has changed
[22883977.191926] sd 15:0:3:9: Capacity data has changed
[22883977.204963] sd 17:0:5:9: Capacity data has changed
[22883977.217926] sd 16:0:2:9: Capacity data has changed
[22883977.669103] sd 15:0:3:9: alua: supports implicit TPGS
[22883977.669107] sd 15:0:3:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 0 rel port 3
[22883977.670099] sd 15:0:3:9: [sdhs] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)
[22883977.670503] sdhs: detected capacity change from 42949672960 to 48318382080
[22883977.679523] sd 15:0:3:9: alua: port group 00 state A non-preferred supports TolUSNA
[22883977.704455] sd 15:0:1:9: alua: supports implicit TPGS
[22883977.704459] sd 15:0:1:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 1 rel port 13
[22883977.705406] sd 15:0:1:9: [sdhr] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)
[22883977.705886] sdhr: detected capacity change from 42949672960 to 48318382080
[22883977.715478] sd 15:0:1:9: alua: port group 01 state A non-preferred supports TolUSNA
[22883977.735927] sd 16:0:2:9: alua: supports implicit TPGS
[22883977.735931] sd 16:0:2:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 0 rel port 8
[22883977.736992] sd 16:0:2:9: [sdhu] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)
[22883977.737408] sdhu: detected capacity change from 42949672960 to 48318382080
[22883977.747491] sd 16:0:2:9: alua: port group 00 state A non-preferred supports TolUSNA
[22883977.764231] sd 16:0:0:9: alua: supports implicit TPGS
[22883977.764236] sd 16:0:0:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 1 rel port 18
[22883977.765291] sd 16:0:0:9: [sdht] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)
[22883977.765804] sdht: detected capacity change from 42949672960 to 48318382080
[22883977.775476] sd 16:0:0:9: alua: port group 01 state A non-preferred supports TolUSNA
[22883977.794191] sd 17:0:6:9: alua: supports implicit TPGS
[22883977.794195] sd 17:0:6:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 0 rel port 4
[22883977.795055] sd 17:0:6:9: [sdhw] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)
[22883977.795492] sdhw: detected capacity change from 42949672960 to 48318382080
[22883977.803538] sd 17:0:6:9: alua: port group 00 state A non-preferred supports TolUSNA
[22883977.824102] sd 17:0:5:9: alua: supports implicit TPGS
[22883977.824107] sd 17:0:5:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 1 rel port 14
[22883977.825243] sd 17:0:5:9: [sdhv] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)
[22883977.825795] sdhv: detected capacity change from 42949672960 to 48318382080
[22883977.835536] sd 17:0:5:9: alua: port group 01 state A non-preferred supports TolUSNA
[22883977.850297] sd 18:0:2:9: alua: supports implicit TPGS
[22883977.850302] sd 18:0:2:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 0 rel port 7
[22883977.851273] sd 18:0:2:9: [sdhy] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)
[22883977.851719] sdhy: detected capacity change from 42949672960 to 48318382080
[22883977.859437] sd 18:0:2:9: alua: port group 00 state A non-preferred supports TolUSNA
[22883977.875338] sd 18:0:0:9: alua: supports implicit TPGS
[22883977.875342] sd 18:0:0:9: alua: device naa.624a937081a2fb5441394dd20005aafa port group 1 rel port 17
[22883977.876298] sd 18:0:0:9: [sdhx] 94371840 512-byte logical blocks: (48.3 GB/45.0 GiB)                                                                                                              [86/9567][22883977.876765] sdhx: detected capacity change from 42949672960 to 48318382080
[22883977.887507] sd 18:0:0:9: alua: port group 01 state A non-preferred supports TolUSNA
[22883978.499493] sd 15:0:4:1: alua: port group 01 state A non-preferred supports TolUSNA
[22883978.499502] sd 15:0:8:1: alua: port group 00 state A non-preferred supports TolUSNA
[22883978.499733] sd 15:0:8:1: alua: port group 00 state A non-preferred supports TolUSNA
[22883978.499759] sd 15:0:4:1: alua: port group 01 state A non-preferred supports TolUSNA

Multipath output after resize:
3624a937081a2fb5441394dd20005aafa dm-87 PURE,FlashArray
size=45G features='0' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
  |- 15:0:1:9   sdhr 134:16  active ready running
  |- 17:0:5:9   sdhv 134:80  active ready running
  |- 16:0:0:9   sdht 134:48  active ready running
  |- 18:0:0:9   sdhx 134:112 active ready running
  |- 15:0:3:9   sdhs 134:32  active ready running
  |- 17:0:6:9   sdhw 134:96  active ready running
  |- 16:0:2:9   sdhu 134:64  active ready running
  `- 18:0:2:9   sdhy 134:128 active ready running

As you can see the volume resized fine, Here is the nova-compute output of the resize:
2023-02-23 15:15:00.556 41292 INFO nova.compute.manager [req-44956955-90fa-4ca4-9664-4ceedbba1dfa 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] [instance: 42030162-328b-4069-b8f7-715e79401a95] Cinder extended volume 0b113500-e81e-4cb0-8997-4bff13513854; extending it to detect new size
2023-02-23 15:15:00.953 41292 INFO os_brick.initiator.linuxscsi [req-44956955-90fa-4ca4-9664-4ceedbba1dfa 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] Find Multipath device file for volume WWN 3624a937081a2fb5441394dd20005aafa
2023-02-23 15:15:00.977 41292 INFO os_brick.initiator.linuxscsi [req-44956955-90fa-4ca4-9664-4ceedbba1dfa 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] mpath(/dev/disk/by-id/dm-uuid-mpath-3624a937081a2fb5441394dd20005aafa) current size 42949672960
2023-02-23 15:15:01.003 41292 INFO os_brick.initiator.linuxscsi [req-44956955-90fa-4ca4-9664-4ceedbba1dfa 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] mpath(/dev/disk/by-id/dm-uuid-mpath-3624a937081a2fb5441394dd20005aafa) new size 42949672960

The current size and new size are the same, extending the volume a second time +1GB:
2023-02-23 15:22:04.331 41292 INFO nova.compute.manager [req-5bdcdeb3-614e-4efd-8c22-dfa91278ab69 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] [instance: 42030162-328b-4069-b8f7-715e79401a95] Cinder extended volume 0b113500-e81e-4cb0-8997-4bff13513854; extending it to detect new size
2023-02-23 15:22:04.770 41292 INFO os_brick.initiator.linuxscsi [req-5bdcdeb3-614e-4efd-8c22-dfa91278ab69 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] Find Multipath device file for volume WWN 3624a937081a2fb5441394dd20005aafa
2023-02-23 15:22:04.794 41292 INFO os_brick.initiator.linuxscsi [req-5bdcdeb3-614e-4efd-8c22-dfa91278ab69 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] mpath(/dev/disk/by-id/dm-uuid-mpath-3624a937081a2fb5441394dd20005aafa) current size 48318382080
2023-02-23 15:22:04.820 41292 INFO os_brick.initiator.linuxscsi [req-5bdcdeb3-614e-4efd-8c22-dfa91278ab69 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] mpath(/dev/disk/by-id/dm-uuid-mpath-3624a937081a2fb5441394dd20005aafa) new size 48318382080

Now the guest is aware of the previous 5GB added, a stop/start will show the correct full size.

Revision history for this message

Simon Dodsley (simon-dodsley) wrote on 2023-02-25:

#5

As an aside, glad to see you are not using freindly_names as this can cause issues.

Pure Storage have not been able to emulate your specific issue, so more information would be helpful.

Can you check how many multipath devices there are on the compute node when you experience this problem?

We have seen strange os-brick related behaviour where a large number of multipath devices can cause issues as os-brick gets back timeout errors from multipathd. This is currently being investigated.

Do you experience the same issue when there are only a few multipath device on the compute host?

Can you post the contents of your multipath.conf, or at least the Pure portion of it?

Revision history for this message

Terry Cowart (tcowart) wrote on 2023-02-27:

#6

Download full text (7.7 KiB)

I tested doing a resize today on a compute node with just the 1 test instance on it, nova reported it failed. I will paste the section of the nova log below, the volume itself did extend fine and dmesg/multipath show the new size but the guest did not see it. I think this is the first time it's reported an error in the resize.

To clarify Stein on ubuntu 18.04.

Partial multipath.conf
defaults {
        polling_interval 10
        path_selector "round-robin 0"
        path_grouping_policy multibus
        uid_attribute ID_SERIAL
        rr_min_io 100
        failback immediate
        no_path_retry fail
        user_friendly_names no
        features 0
}
        device {
                vendor "PURE"
                product "FlashArray"
                path_grouping_policy group_by_prio
                hardware_handler "1 alua"
                prio alua
                failback "immediate"
                fast_io_fail_tmo 10
        }
overrides {
        no_path_retry fail
}

2023-02-27 15:03:34.273 2765 INFO nova.compute.manager [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] [instance: 473c05bf-447a-446a-951f-2574a00e20c4] Cinder extended volume 255341c1-1a05-4b23-958a-e913cc9c61f5; extending it to detect new size
2023-02-27 15:03:34.622 2765 INFO os_brick.initiator.linuxscsi [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] Find Multipath device file for volume WWN 3624a937083e6a675e5d8425b0041f043
2023-02-27 15:03:34.641 2765 INFO os_brick.initiator.linuxscsi [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] mpath(/dev/disk/by-id/dm-uuid-mpath-3624a937083e6a675e5d8425b0041f043) current size 42949672960
2023-02-27 15:03:34.657 2765 INFO os_brick.initiator.linuxscsi [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] mpath(/dev/disk/by-id/dm-uuid-mpath-3624a937083e6a675e5d8425b0041f043) new size 42949672960
2023-02-27 15:03:34.658 2765 WARNING nova.compute.manager [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] [instance: 473c05bf-447a-446a-951f-2574a00e20c4] Extend volume failed, volume_id=255341c1-1a05-4b23-958a-e913cc9c61f5, reason: 'device_path'
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] Exception during message handling: KeyError: 'device_path'
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2023-0...

I tested doing a resize today on a compute node with just the 1 test instance on it, nova reported it failed.  I will paste the section of the nova log below, the volume itself did extend fine and dmesg/multipath show the new size but the guest did not see it.  I think this is the first time it's reported an error in the resize.

To clarify Stein on ubuntu 18.04.

Partial multipath.conf
defaults {
        polling_interval        10
        path_selector           "round-robin 0"
        path_grouping_policy    multibus
        uid_attribute           ID_SERIAL
        rr_min_io               100
        failback                immediate
        no_path_retry           fail
        user_friendly_names     no
        features                0
}
        device {
                vendor "PURE"
                product "FlashArray"
                path_grouping_policy group_by_prio
                hardware_handler "1 alua"
                prio alua
                failback "immediate"
                fast_io_fail_tmo 10
        }
overrides {
        no_path_retry                   fail
}

2023-02-27 15:03:34.273 2765 INFO nova.compute.manager [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] [instance: 473c05bf-447a-446a-951f-2574a00e20c4] Cinder extended volume 255341c1-1a05-4b23-958a-e913cc9c61f5; extending it to detect new size
2023-02-27 15:03:34.622 2765 INFO os_brick.initiator.linuxscsi [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] Find Multipath device file for volume WWN 3624a937083e6a675e5d8425b0041f043
2023-02-27 15:03:34.641 2765 INFO os_brick.initiator.linuxscsi [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] mpath(/dev/disk/by-id/dm-uuid-mpath-3624a937083e6a675e5d8425b0041f043) current size 42949672960
2023-02-27 15:03:34.657 2765 INFO os_brick.initiator.linuxscsi [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] mpath(/dev/disk/by-id/dm-uuid-mpath-3624a937083e6a675e5d8425b0041f043) new size 42949672960
2023-02-27 15:03:34.658 2765 WARNING nova.compute.manager [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] [instance: 473c05bf-447a-446a-951f-2574a00e20c4] Extend volume failed, volume_id=255341c1-1a05-4b23-958a-e913cc9c61f5, reason: 'device_path'
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server [req-ff00d827-e607-46fa-bf3c-67c89088353a 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] Exception during message handling: KeyError: 'device_path'
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 79, in wrapped
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     function_name, call_dict, binary, tb)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     self.force_reraise()
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     raise value
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 69, in wrapped
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 8853, in external_instance_event
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     self.extend_volume(context, instance, event.tag)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/nova/compute/utils.py", line 1346, in decorated_function
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 215, in decorated_function
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     kwargs['instance'], e, sys.exc_info())
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     self.force_reraise()
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     raise value
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 203, in decorated_function
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 8811, in extend_volume
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     bdm.volume_size * units.Gi)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1812, in extend_volume
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     requested_size)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1443, in _extend_volume
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     requested_size)
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/volume/fibrechannel.py", line 83, in extend_volume
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server     connection_info['data']['device_path'],
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server KeyError: 'device_path'
2023-02-27 15:03:34.724 2765 ERROR oslo_messaging.rpc.server

Revision history for this message

Terry Cowart (tcowart) wrote on 2023-02-27:

#7

That instance was live migrated to the host for testing, maybe the 'device_path' doesn't get updated properly when an instance is migrated in.

After doing a stop/start and another resize the error did not appear again, but the guest did not see the live resize.

Revision history for this message

Walid Moghrabi (walid-fdj) wrote on 2023-02-28:

#8

Download full text (4.0 KiB)

After digging in a little deeper, I encountered 2 issues :
* first, I had permission issues due to aa-profile-mode set ton "enforce", I disabled it but that didn't fixed the issue, but now, I have another error message not related to permissions.

* second, now, I have these errors :

2023-02-28 10:27:19.220 78388 WARNING nova.compute.manager [req-d5d6e677-486b-4df8-b965-a7a97b6a3e65 e1fde3044dbc4d7a9e0ff5a22b27e13a 3796795147bb44fca375bd5cb79d6cba - 78513ca621904bef99dd7746ce1000ec 78513ca621904bef99dd7746ce1000ec] [instance: d4bb35ec-5886-4565-8035-817cc650499c] Extend volume failed, volume_id=9df89dce-d206-48c7-8c3f-51db551e992f, reason: Unexpected error while running command.
Command: multipathd resize map 3624a9370b6c0989ecb274ee8001010c7
Exit code: 1
Stdout: 'fail\n'
Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2023-02-28 10:27:19.264 78388 ERROR oslo_messaging.rpc.server [req-d5d6e677-486b-4df8-b965-a7a97b6a3e65 e1fde3044dbc4d7a9e0ff5a22b27e13a 3796795147bb44fca375bd5cb79d6cba - 78513ca621904bef99dd7746ce1000ec 78513ca621904bef99dd7746ce1000ec] Exception during message handling: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Command: multipathd resize map 3624a9370b6c0989ecb274ee8001010c7
Exit code: 1

For full context, I'm running yoga/stable on Ubuntu 20.04.
This is brand new cluster it is fully updated to releases available for these channels as of today.
Regarding multiple path, I only have 1 VM running with 16 path pervolume as can be seen here for the 2 volumes that are attached to the instance :

mpathc (3624a9370b6c0989ecb274ee8001010c7) dm-12 PURE,FlashArray
size=15G features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:2 sdw 65:96 active ready running
  |- 10:0:0:2 sdai 66:32 active ready running
  |- 11:0:0:2 sdaa 65:160 active ready running
  |- 12:0:0:2 sdac 65:192 active ready running
  |- 13:0:0:2 sdak 66:64 active ready running
  |- 14:0:0:2 sdah 66:16 active ready running
  |- 15:0:0:2 sdv 65:80 active ready running
  |- 16:0:0:2 sdae 65:224 active ready running
  |- 2:0:0:2 sdx 65:112 active ready running
  |- 3:0:0:2 sdad 65:208 active ready running
  |- 4:0:0:2 sdz 65:144 active ready running
  |- 5:0:0:2 sdy 65:128 active ready running
  |- 6:0:0:2 sdaf 65:240 active ready running
  |- 7:0:0:2 sdag 66:0 active ready running
  |- 8:0:0:2 sdaj 66:48 active ready running
  `- 9:0:0:2 sdab 65:176 active ready running
mpathb (3624a9370b6c0989ecb274ee8001010a6) dm-8 PURE,FlashArray
size=20G features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:1 sdf 8:80 active ready running
  |- 10:0:0:1 sdo 8:224 active ready running
  |- 11:0:0:1 sdp 8:240 active ready running
  |- 12:0:0:1 sdq 65:0 active ready running
  |- 13:0:0:1 sdr 65:16 active ready running
  |- 14:0:0:1 sds 65:32 active ready running
  |- 15:0:0:1 sdt 65:48 active ready running
  |- 16:0:0:1 sdu 65:64 active ready running
  |- 2:0:0:1 sdj 8:144 active ready running
  |- 3:0:0:1 sdg 8:96 active ready running
  |- 4:0:0...

After digging in a little deeper, I encountered 2 issues :
* first, I had permission issues due to aa-profile-mode set ton "enforce", I disabled it but that didn't fixed the issue, but now, I have another error message not related to permissions.

* second, now, I have these errors :

2023-02-28 10:27:19.220 78388 WARNING nova.compute.manager [req-d5d6e677-486b-4df8-b965-a7a97b6a3e65 e1fde3044dbc4d7a9e0ff5a22b27e13a 3796795147bb44fca375bd5cb79d6cba - 78513ca621904bef99dd7746ce1000ec 78513ca621904bef99dd7746ce1000ec] [instance: d4bb35ec-5886-4565-8035-817cc650499c] Extend volume failed, volume_id=9df89dce-d206-48c7-8c3f-51db551e992f, reason: Unexpected error while running command.
Command: multipathd resize map 3624a9370b6c0989ecb274ee8001010c7
Exit code: 1
Stdout: 'fail\n'
Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2023-02-28 10:27:19.264 78388 ERROR oslo_messaging.rpc.server [req-d5d6e677-486b-4df8-b965-a7a97b6a3e65 e1fde3044dbc4d7a9e0ff5a22b27e13a 3796795147bb44fca375bd5cb79d6cba - 78513ca621904bef99dd7746ce1000ec 78513ca621904bef99dd7746ce1000ec] Exception during message handling: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Command: multipathd resize map 3624a9370b6c0989ecb274ee8001010c7
Exit code: 1

For full context, I'm running yoga/stable on Ubuntu 20.04.
This is brand new cluster it is fully updated to releases available for these channels as of today.
Regarding multiple path, I only have 1 VM running with 16 path pervolume as can be seen here for the 2 volumes that are attached to the instance :

mpathc (3624a9370b6c0989ecb274ee8001010c7) dm-12 PURE,FlashArray
size=15G features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:2  sdw  65:96  active ready running
  |- 10:0:0:2 sdai 66:32  active ready running
  |- 11:0:0:2 sdaa 65:160 active ready running
  |- 12:0:0:2 sdac 65:192 active ready running
  |- 13:0:0:2 sdak 66:64  active ready running
  |- 14:0:0:2 sdah 66:16  active ready running
  |- 15:0:0:2 sdv  65:80  active ready running
  |- 16:0:0:2 sdae 65:224 active ready running
  |- 2:0:0:2  sdx  65:112 active ready running
  |- 3:0:0:2  sdad 65:208 active ready running
  |- 4:0:0:2  sdz  65:144 active ready running
  |- 5:0:0:2  sdy  65:128 active ready running
  |- 6:0:0:2  sdaf 65:240 active ready running
  |- 7:0:0:2  sdag 66:0   active ready running
  |- 8:0:0:2  sdaj 66:48  active ready running
  `- 9:0:0:2  sdab 65:176 active ready running
mpathb (3624a9370b6c0989ecb274ee8001010a6) dm-8 PURE,FlashArray
size=20G features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:1  sdf  8:80   active ready running
  |- 10:0:0:1 sdo  8:224  active ready running
  |- 11:0:0:1 sdp  8:240  active ready running
  |- 12:0:0:1 sdq  65:0   active ready running
  |- 13:0:0:1 sdr  65:16  active ready running
  |- 14:0:0:1 sds  65:32  active ready running
  |- 15:0:0:1 sdt  65:48  active ready running
  |- 16:0:0:1 sdu  65:64  active ready running
  |- 2:0:0:1  sdj  8:144  active ready running
  |- 3:0:0:1  sdg  8:96   active ready running
  |- 4:0:0:1  sdh  8:112  active ready running
  |- 5:0:0:1  sdi  8:128  active ready running
  |- 6:0:0:1  sdk  8:160  active ready running
  |- 7:0:0:1  sdl  8:176  active ready running
  |- 8:0:0:1  sdm  8:192  active ready running
  `- 9:0:0:1  sdn  8:208  active ready running

apparently, there is a failure with the "multipathd resize map 3624a9370b6c0989ecb274ee8001010c7" command, I can't explain why.
Anyway, I suppose that this failure make the whole notification process stop before qemu is notified of a change on the volume size and so, the guest VM is unaware of the change until you stop/start it again.
However, detaching/attaching the volume back and forth works correctly, only the resize fails to refresh the qemu state.

Which process does that qemu volume size refresh ? I'm pretty sure that in the case of detach/attach, this is cinder itself but what about resize ? Looks like the pure storage driver is faulty here.

Revision history for this message

Simon Dodsley (simon-dodsley) wrote on 2023-03-02:

#9

The Pure Storage driver doesn't do anything other than request the volume is resized and then hands back to os-brick and nova.
If the volume has resized correctly on the Flasharray then the Pure driver has done its work correctly.
The multipath resize map is part of os-brick and that is what is failing.
The very fact that if you detach/attach and the size is correct shows that the Pure driver is not at fault.
There are multiple known issues with multipath in os-brick that have still to be addressed.
I have been working with os-brick developers to try and emulate your specific error but with no success. The resize works correctly in all the tempest tests and in manula testing through Horizon and the CLI. In fact the tempest tests for resize are part of the acceptance criteria for any Cinder driver.

The latest recommended multipath settings from Pure are as follows:

    device {
        vendor "PURE"
        product "FlashArray"
        path_selector "service-time 0"
        hardware_handler "1 alua"
        path_grouping_policy group_by_prio
        prio alua
        failback immediate
        path_checker tur
        fast_io_fail_tmo 10
        user_friendly_names no
        no_path_retry 0
        features 0
        dev_loss_tmo 600
    }

Please make sure you have these configured to eliminate these as any part of the root cause.

Revision history for this message

Walid Moghrabi (walid-fdj) wrote on 2023-03-02:

#10

Hi,

I can confirm that the volume is properly extended on the Pure Storage array side and Cinder is aware of that, only the qemu process notification is failing and apparently, this is a multipath issue.

In my /etc/multipah.conf, I only have this :

===================================
defaults {
user_friendly_names yes
}
===================================

I added the recommended settings you pointed me out, I was thinking that the cinder-purestorage charm would tweak the multipath conf itself so I never looked at that part.

Once added, it now works as expected so this is not a bug but a misconfiguration.
Thanks for your help!

Revision history for this message

Terry Cowart (tcowart) wrote on 2023-03-02:

#11

I have replaced our previous pure section in multipath and restarted multipathd, issue still persists in our case.

2023-03-02 14:47:17.019 2765 INFO nova.compute.manager [req-44726fb0-9478-4a78-b327-6201151ac21b 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] [instance: 473c05bf-447a-446a-951f-2574a00e20c4] Cinder extended volume 255341c1-1a05-4b23-958a-e913cc9c61f5; extending it to detect new size
2023-03-02 14:47:17.379 2765 INFO os_brick.initiator.linuxscsi [req-44726fb0-9478-4a78-b327-6201151ac21b 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] Find Multipath device file for volume WWN 3624a937083e6a675e5d8425b0041f043
2023-03-02 14:47:17.396 2765 INFO os_brick.initiator.linuxscsi [req-44726fb0-9478-4a78-b327-6201151ac21b 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] mpath(/dev/disk/by-id/dm-uuid-mpath-3624a937083e6a675e5d8425b0041f043) current size 50465865728
2023-03-02 14:47:17.413 2765 INFO os_brick.initiator.linuxscsi [req-44726fb0-9478-4a78-b327-6201151ac21b 9c3716d5b3814e7397200140aadfc843 01cc76d46a064bd89b1f670415ee3f9a - default default] mpath(/dev/disk/by-id/dm-uuid-mpath-3624a937083e6a675e5d8425b0041f043) new size 50465865728

In our case I do not believe it specifically a pure issue, but probably more of a timing issue as suggested in the mailing list post: https://lists.openstack.org/pipermail/openstack-discuss/2023-January/031764.html

It seems like nova is checking the resize before the volume gets done.

Revision history for this message

Simon Dodsley (simon-dodsley) wrote on 2023-03-02 (last edit on 2023-03-02):

#12

Needs investigation as to whether multipath settings should be done by the vendor cinder charm or by core cinder charm

Changed in cinder:
assignee:	nobody → Simon Dodsley (simon-dodsley)
status:	New → In Progress
Changed in charm-cinder-purestorage:
status:	New → In Progress
assignee:	nobody → Simon Dodsley (simon-dodsley)
Changed in cinder:
status:	In Progress → Invalid
Changed in charm-cinder-purestorage:
status:	In Progress → Confirmed
assignee:	Simon Dodsley (simon-dodsley) → nobody

Simon Dodsley (simon-dodsley) on 2023-03-02

Changed in charm-cinder-purestorage:
assignee:	nobody → Simon Dodsley (simon-dodsley)

Revision history for this message

Terry Cowart (tcowart) wrote on 2023-03-06:

#13

Just to clarify we are still seeing this issue after the multipath changes, we do NOT use charmed openstack, ubuntu with the cloud archive.

Revision history for this message

Simon Dodsley (simon-dodsley) wrote on 2023-03-06:

#14

Terry - Please contact me directly (<email address hidden>) so we can try and work this out for you.

Cinder

Online extension not seen in guest instance without stop/start or re-attach of the volume

Bug Description

Other bug subscribers

Remote bug watches

Affects		Status	Importance	Assigned to	Milestone
	Cinder	Invalid	High	Simon Dodsley
	OpenStack Cinder Pure Storage Charm	Confirmed	Undecided	Simon Dodsley