Cinder PowerMax driver volume attach/detach hang issues

Bug #2021999 reported by prashkre
Affects: Cinder | Status: In Progress | Importance: Low | Assigned to: prashkre

Bug Description

We have a customer using the PowerMax Cinder driver from the stable/yoga branch against a PowerMax array [0] with array model 'PowerMax_8000'.

[0] PowerMax details

2023-04-20 07:47:05.666 1618765 INFO cinder.volume.drivers.dell_emc.powermax.rest [req-b73fc9b4-39ab-435e-837b-d01b9feb0584 - - - - -] Unisphere version V9.2.3.22 meets minimum requirement of version 9.2.0.0.

2023-04-20 07:47:05.678 1618765 INFO powervc_cinder.volume.drivers.emc.vmax_product_rest [req-b73fc9b4-39ab-435e-837b-d01b9feb0584 - - - - -] PowerVC common extensions loaded. Array_info: {'RestServerIp': '*********, 'RestServerPort': 8443, 'RestUserName': 'powervc', 'SerialNumber': '*********', 'srpName': 'SRP_1', 'SSLVerify': '**********', 'PortGroups': ['PG1', 'PG2', 'PG3', 'PG4', 'SRDF_RM_Test'], 'is_V3': True, 'generation': 'V4', 'firmware': '5978.711.711', 'array_model': 'PowerMax_8000'}

2023-04-20 07:47:09.555 1618791 INFO cinder.service [req-b73fc9b4-39ab-435e-837b-d01b9feb0584 - - - - -] Starting cinder-volume node (version 18.0.0)
2023-04-20 07:47:09.674 1618791 INFO cinder.volume.manager [req-b73fc9b4-39ab-435e-837b-d01b9feb0584 - - - - -] Starting volume driver VMaxRestProductDriver (4.4.0)

In the customer environment, the following two operations are repeated in sequence.

1. Concurrently deploy 3 VMs (dappt0068, dappt0069, dwebt0044), each with 2 disks (1 boot volume and 1 data volume):

volume-dappt0068_alt_disk
volume-dappt0068-boot

volume-dappt0069_alt_disk
volume-dappt0069-boot

volume-dwebt0044_alt_disk
volume-dwebt0044-boot

2. Once the 3 VMs are deployed, delete them concurrently.

With OS-no_SLO-SG as the default storage group, after some repeated runs of the above two operations, volumes became stuck in 'attaching' during deploys and in 'detaching' during VM deletes.
While analyzing the logs, we found a couple of issues with locking on the default storage group [2.c], [2.d], [3].

Issue 1:

After several cycles of the above execution, during a VM deploy, one volume's attach operation failed to create an initiator group [1] because an initiator group already existed (created by a previous run and not cleaned up). The driver then entered the rollback flow [2], where it tries to take the lock on the default storage group twice [2.c], [2.d], causing it to hang; further attach/detach of volumes is then blocked.

[1] Initiator group creation failed:

2023-04-20 10:57:36.862 1646354 ERROR cinder.volume.drivers.dell_emc.powermax.masking [req-249b24fd-0388-4dd2-9178-78855e6d5d7c d4ca8b69730ffe7ee8b6f229874e4b9e4e33ec957df1418d87f9287d692b3e28 f65345a5ee994f48a6571c43180f85a1 - - -] Found initiator group OS-t_v3v47b1240def5-F-IG, but could not find initiator_names ['c050760c299702de', 'c050760c299702df', 'c050760c299702e2', 'c050760c299702e3', 'c050760c299702dc', 'c050760c299702dd', 'c050760c299702e0', 'c050760c299702e1'] in the login table. The contained initiators ['c050760c299702cc', 'c050760c299702cd', 'c050760c299702ce', 'c050760c299702cf', 'c050760c299702d0', 'c050760c299702d1', 'c050760c299702d2', 'c050760c299702d3'] do match up with those in the connector object. Delete initiator group OS-t_v3v47b1240def5-F-IG and retry.
2023-04-20 10:57:36.875 1646354 INFO cinder.volume.drivers.dell_emc.powermax.rest [req-249b24fd-0388-4dd2-9178-78855e6d5d7c d4ca8b69730ffe7ee8b6f229874e4b9e4e33ec957df1418d87f9287d692b3e28 f65345a5ee994f48a6571c43180f85a1 - - -] GET request to httpspvc://172.30.115.229:8443/univmax/restapi/92/sloprovisioning/symmetrix/000197602020/maskingview/OS-t_v3v47b1240def5-F-PG4-MV has returned with a status code of: 404.
2023-04-20 10:57:36.876 1646354 INFO cinder.volume.drivers.dell_emc.powermax.rest [req-249b24fd-0388-4dd2-9178-78855e6d5d7c d4ca8b69730ffe7ee8b6f229874e4b9e4e33ec957df1418d87f9287d692b3e28 f65345a5ee994f48a6571c43180f85a1 - - -] Masking view details in get_element_from_masking_view are None
2023-04-20 10:57:36.876 1646354 ERROR cinder.volume.drivers.dell_emc.powermax.rest [req-249b24fd-0388-4dd2-9178-78855e6d5d7c d4ca8b69730ffe7ee8b6f229874e4b9e4e33ec957df1418d87f9287d692b3e28 f65345a5ee994f48a6571c43180f85a1 - - -] Error retrieving masking group.
2023-04-20 10:57:36.876 1646354 ERROR cinder.volume.drivers.dell_emc.powermax.masking [req-249b24fd-0388-4dd2-9178-78855e6d5d7c d4ca8b69730ffe7ee8b6f229874e4b9e4e33ec957df1418d87f9287d692b3e28 f65345a5ee994f48a6571c43180f85a1 - - -] Masking View creation or retrieval was not successful for masking view OS-t_v3v47b1240def5-F-PG4-MV. Attempting rollback.: cinder.exception.VolumeBackendAPIException: Error retrieving masking group.
2023-04-20 10:57:36.876 1646354 ERROR cinder.volume.drivers.dell_emc.powermax.masking Traceback (most recent call last):
2023-04-20 10:57:36.876 1646354 ERROR cinder.volume.drivers.dell_emc.powermax.masking File "/usr/lib/python3.9/site-packages/cinder/volume/drivers/dell_emc/powermax/masking.py", line 108, in get_or_create_masking_view_and_map_lun
2023-04-20 10:57:36.876 1646354 ERROR cinder.volume.drivers.dell_emc.powermax.masking self.rest.get_element_from_masking_view(
2023-04-20 10:57:36.876 1646354 ERROR cinder.volume.drivers.dell_emc.powermax.masking File "/usr/lib/python3.9/site-packages/cinder/volume/drivers/dell_emc/powermax/rest.py", line 1999, in get_element_from_masking_view
2023-04-20 10:57:36.876 1646354 ERROR cinder.volume.drivers.dell_emc.powermax.masking raise exception.VolumeBackendAPIException(
2023-04-20 10:57:36.876 1646354 ERROR cinder.volume.drivers.dell_emc.powermax.masking cinder.exception.VolumeBackendAPIException: Error retrieving masking group.
2023-04-20 10:57:36.876 1646354 ERROR cinder.volume.drivers.dell_emc.powermax.masking

[2] rollback flow:

[2.a]
https://github.com/openstack/cinder/blob/stable/yoga/cinder/volume/drivers/dell_emc/powermax/masking.py#L118-L119

self.check_if_rollback_action_for_masking_required(
serial_number, volume, device_id, rollback_dict)

[2.b]
https://github.com/openstack/cinder/blob/stable/yoga/cinder/volume/drivers/dell_emc/powermax/masking.py#L1028-L1032
            self.remove_and_reset_members(
                serial_number, volume, device_id,
                rollback_dict[utils.VOL_NAME],
                rollback_dict[utils.EXTRA_SPECS], reset,
                rollback_dict[utils.CONNECTOR])

[2.c]
https://github.com/openstack/cinder/blob/stable/yoga/cinder/volume/drivers/dell_emc/powermax/masking.py#L1324-L1325
            @coordination.synchronized("emc-sg-{sg_name}-{serial_number}")
            def do_remove_volume_from_sg(sg_name, serial_number):

2023-04-20 10:57:37.121 1646354 INFO cinder.coordination [req-249b24fd-0388-4dd2-9178-78855e6d5d7c d4ca8b69730ffe7ee8b6f229874e4b9e4e33ec957df1418d87f9287d692b3e28 f65345a5ee994f48a6571c43180f85a1 - - -] Lock "/var/lib/cinder/cinder-emc-sg-OS-no_SLO-SG-000197602020" acquired by "do_remove_volume_from_sg" :: waited 0.000s

[2.d]
https://github.com/openstack/cinder/blob/stable/yoga/cinder/volume/drivers/dell_emc/powermax/masking.py#L1663-L1667

            @coordination.synchronized(
                "emc-sg-{default_sg_name}-{serial_number}")
            def _move_vol_to_default_sg(default_sg_name, serial_number,
                                        device_id):
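The hang in [2.c]/[2.d] can be shown in miniature: the locks taken by coordination.synchronized are not reentrant, so a function that already holds the emc-sg-{default_sg}-{serial} lock deadlocks when the rollback path tries to take the same lock again. The sketch below is a simplified stand-in (the synchronized decorator and function bodies are illustrative, not the driver's actual code); it probes the second acquisition non-blockingly instead of actually deadlocking:

```python
import threading
from collections import defaultdict

# Hypothetical stand-in for cinder.coordination: one non-reentrant
# lock per lock name, similar to the file locks tooz provides.
_locks = defaultdict(threading.Lock)

def synchronized(name_tmpl):
    """Simplified version of @coordination.synchronized."""
    def decorator(fn):
        def wrapper(**kwargs):
            # The lock name is derived from the call arguments, so two
            # functions keyed on the same SG name share one lock.
            name = name_tmpl.format(**kwargs)
            with _locks[name]:
                return fn(**kwargs)
        return wrapper
    return decorator

@synchronized("emc-sg-{sg_name}-{serial}")
def do_remove_volume_from_sg(sg_name, serial):
    # Rollback path [2.c] -> [2.d]: while still holding the lock taken
    # above, it calls another function synchronized on the *same*
    # default SG name.
    return _move_vol_to_default_sg(sg_name=sg_name, serial=serial)

def _move_vol_to_default_sg(sg_name, serial):
    # Second acquisition of the same non-reentrant lock: in the real
    # driver this blocks forever. Here we probe non-blockingly instead
    # of deadlocking the example.
    lock = _locks["emc-sg-{0}-{1}".format(sg_name, serial)]
    acquired = lock.acquire(blocking=False)
    if acquired:
        lock.release()
    return acquired  # False -> the real code would hang at this point

print(do_remove_volume_from_sg(sg_name="OS-no_SLO-SG",
                               serial="000197602020"))  # prints False
```

The second acquire fails because the outer with-block still holds the lock; with a blocking acquire, as in the driver, the thread would wait on itself indefinitely.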

________________________________________________________________________________________

Issue 2:

During concurrent VM deletes, we found that the delete of the volume with device_id 00208 acquired the lock on "/var/lib/cinder/cinder-emc-sg-OS-no_SLO-SG-000197602020" [3] and never released it; we did not see any log of a response to the API call below. Because this is a common lock, further detach operations on other volumes hung behind it.

GET request to httpspvc://***********:8443/univmax/restapi/92/sloprovisioning/symmetrix/<<array_serial_number>>/volume/00208

[3]
https://github.com/openstack/cinder/blob/stable/yoga/cinder/volume/drivers/dell_emc/powermax/masking.py#L1360-L1363

            @coordination.synchronized("emc-mv-{parent_name}-{serial_number}")
            @coordination.synchronized("emc-mv-{mv_name}-{serial_number}")
            @coordination.synchronized("emc-sg-{sg_name}-{serial_number}")
            def do_remove_volume_from_sg(
                    mv_name, sg_name, parent_name, serial_number):

2023-04-29 13:00:21.765 3785904 INFO cinder.coordination [req-d0fe3a3f-6fe3-47c8-ace3-25e13b0f2de9 0688b01e6439ca32d698d20789d52169126fb41fb1a4ddafcebb97d854e836c9 f65345a5ee994f48a6571c43180f85a1 - - -] Lock "/var/lib/cinder/cinder-emc-sg-OS-no_SLO-SG-000197602020" acquired by "do_remove_volume_from_sg" :: waited 36.105s
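One defensive pattern (a sketch under the assumption that the missing response is a backend call hanging, and not the driver's actual fix) is to bound the time spent on the backend request while the SG lock is held, so a stuck GET surfaces as a failure instead of pinning the common lock forever:

```python
import concurrent.futures
import threading
import time

# Stand-in for the per-SG coordination lock (hypothetical names).
sg_lock = threading.Lock()

def slow_rest_get():
    # Simulates the Unisphere GET .../volume/00208 that never answered.
    time.sleep(2)
    return {"volumeId": "00208"}

def remove_volume_with_timeout(timeout=0.5):
    # Hold the SG lock only for a bounded time: if the backend call
    # exceeds `timeout`, give up and release the lock so other detach
    # operations can proceed instead of queueing behind it.
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    with sg_lock:
        future = executor.submit(slow_rest_get)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            return None  # caller can retry or report the stuck backend
        finally:
            executor.shutdown(wait=False)

print(remove_volume_with_timeout())  # the slow call times out -> None
print(sg_lock.locked())              # the lock was released -> False
```

In the real driver the equivalent would be a read timeout on the REST session rather than a worker thread, but the effect is the same: a non-responding array produces an error the caller can handle, not an indefinitely held lock.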

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/884957

Changed in cinder:
status: New → In Progress
tags: added: drivers powermax
Changed in cinder:
importance: Undecided → Low
prashkre (prashkre)
Changed in cinder:
assignee: nobody → prashkre (prashkre)