Upgrading from SMI-S to RESTAPI based driver fails for long hostnames

Bug #1844314 reported by ramakrishnan
This bug affects 5 people
Affects: Cinder | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

Upgrading from the SMI-S based driver to the RESTAPI based driver fails

Seamless upgrades from an SMI-S based driver to a RESTAPI based driver, following the setup instructions above, are supported with a few exceptions:

    1. Live migration functionality will not work on already attached/in-use legacy volumes. These volumes will first need to be detached and reattached using the RESTAPI based driver. This is because we have changed the masking view architecture from Pike to better support this functionality.
    2. Consistency groups are deprecated in Pike. Generic Volume Groups are supported from Pike onwards.

The problem is with #1: detaching the volume so that it can later be attached into a new masking view with cascaded storage groups. I send the terminate_connection request using the connector that performed the original attach under SMI-S management. This warning can be seen in the log:

2019-02-04 17:24:48.448 128227 WARNING cinder.volume.drivers.dell_emc.vmax.fc [req-67448355-283e-4ebc-bcef-5406895555e9 c516a445257698992a7ae02c3a2eeba62147432cdea674d451bed35828522ecf 04f81a9bf1b74495be8d28428e12c310 - 201a3bd6c7a34802bbd66f3a9d345d92 201a3bd6c7a34802bbd66f3a9d345d92] Volume volume-csky-old-da6b04fc-00000004-boot-0-7d464d26-b652 is not in any masking view.

But this is not true. It is in a legacy masking view.

Our server hostnames are longer than 16 characters but shorter than the limit allowed by the SMI-S driver. For example, the connector may contain 'host': 'csky-old-da6b04fc-00000004', which is 26 characters long.
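To illustrate the mismatch, here is a sketch of how a 16-character unique truncation of a 26-character host name stops matching a legacy masking view name that embeds the full name. The truncation scheme and masking view name format below are hypothetical stand-ins, not the driver's actual algorithm:

```python
import hashlib

def trunc_host(host_name, limit=16):
    # Hypothetical unique-truncation scheme: keep a prefix of the name
    # and append a short hash so two long names stay distinct.
    if host_name and len(host_name) > limit:
        digest = hashlib.md5(host_name.encode()).hexdigest()[:8]
        return host_name[:limit - 8] + digest
    return host_name

host = 'csky-old-da6b04fc-00000004'    # 26 chars, from the connector
legacy_mv = 'OS-' + host + '-I-MV'     # illustrative legacy MV name
short = trunc_host(host)               # 16-char mangled name
# The substring test used by the masking view lookup now fails:
print(short.lower() in legacy_mv.lower())  # False
```

Because the mangled name contains hash characters that never appear at that position in the legacy masking view name, the substring comparison cannot succeed.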

tags: added: dell drivers powermax vmax
Revision history for this message
ramakrishnan (sriramasan) wrote :

Code from Queens:
In File (cinder-stable-queens/cinder/volume/drivers/dell_emc/vmax/utils.py Line 253)
    def generate_unique_trunc_host(self, host_name):
        """Create a unique short host name under 16 characters.

        :param host_name: long host name
        :returns: truncated host name
        """
        if host_name and len(host_name) > 16:
        ...

Similar code from Ocata (SMI-S) level:
In File (cinder-stable-ocata/cinder/volume/drivers/dell_emc/vmax/utils.py Line 2547)

    def generate_unique_trunc_host(self, hostName):
        """Create a unique short host name under 40 chars

        :param sgName: long storage group name
        :returns: truncated storage group name
        """
        if hostName and len(hostName) > 38:
        ....

It seems this is what causes the masking view not to be found. In fc->terminate_connection(), there is:
In File (cinder-stable-queens/cinder/volume/drivers/dell_emc/vmax/fc.py Line 292)

        if connector:
            zoning_mappings = self._get_zoning_mappings(volume, connector)

        if zoning_mappings:
            self.common.terminate_connection(volume, connector)
            data = self._cleanup_zones(zoning_mappings)
        return data
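A minimal sketch (with hypothetical stand-in callables, not the driver's real signatures) of why this early return makes a failed lookup look like a successful detach:

```python
def terminate_connection(volume, connector, get_zoning_mappings, do_terminate):
    # Mirrors the control flow above: if the lookup returns nothing,
    # the actual terminate step is skipped but a normal payload is returned.
    data = {'driver_volume_type': 'fibre_channel', 'data': {}}
    zoning_mappings = {}
    if connector:
        zoning_mappings = get_zoning_mappings(volume, connector)
    if zoning_mappings:
        do_terminate(volume, connector)
    return data

calls = []
result = terminate_connection(
    'vol-1', {'host': 'csky-old-da6b04fc-00000004'},
    lambda vol, conn: {},                 # lookup finds no masking view
    lambda vol, conn: calls.append(vol))  # records any real terminate call
print(calls, result)  # [] -> terminate never ran, yet the reply looks normal
```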

If no zoning mappings are found, the terminate_connection flow is skipped entirely. The call responds as if the detach worked, but it does nothing.
The reason is that _get_zoning_mappings() eventually reaches _get_masking_views_from_volume() to do the lookup. Because the 'host' in use is the shortened (mangled) name, the host comparison never evaluates to true: the old masking view name embeds the full 26-character host name, which does not match the now 16-character host name.

                if host_compare:
                    if host.lower() in mv.lower():
                        maskingview_list.append(mv)
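The failed comparison can be reproduced in isolation (the masking view name and the 16-character truncated host below are illustrative):

```python
# Legacy MV name embedding the full 26-char host (illustrative format):
mvs = ['OS-csky-old-da6b04fc-00000004-I-MV']
host = 'csky-olf3ab12cd0'   # hypothetical 16-char truncated host name
host_compare = True

maskingview_list = []
for mv in mvs:
    if host_compare:
        if host.lower() in mv.lower():
            maskingview_list.append(mv)

print(maskingview_list)  # [] -> triggers the "not in any masking view" warning
```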

This suggests the workaround of passing an empty 'host' value in the connector. I tried this, and the flow deadlocks; the analysis follows:

common._remove_members()
  -->masking.remove_and_reset_members()
    -->masking._cleanup_deletion()
      Loop over storage groups because no 'host':
      -->masking.remove_volume_from_sg(storagegroup_name=OS-no_SLO-SG)
        -->do_remove_volume_from_sg(mv-sg) [lock on OS-no_SLO-SG]
          -->masking.multiple_vols_in_sg()
            -->masking.add_volume_to_default_storage_group(src_sg=<dft-sg>) [move=true flow]
              -->masking.get_or_create_default_storage_group()
                -->_move_vol_to_default_sg() [already there, deadlocks on OS-no_SLO-SG because that lock already held]
                  -->rest.move_volume_between_storage_groups()

It tries to operate on the default storage group in a nested fashion, causing the deadlock.
Therefore, it appears the driver needs to be fixed for the original case (passing the host in the connector) so that the terminate flow is not skipped.
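The call chain above reduces to a non-reentrant lock being acquired twice for the same storage group. A sketch using a plain threading.Lock, with a timeout so the demo does not actually hang (the driver itself uses coordination locks keyed by SG name; the function names mirror the trace but are simplified):

```python
import threading

sg_lock = threading.Lock()  # stands in for the per-SG lock on OS-no_SLO-SG

def move_vol_to_default_sg():
    # Re-acquires the same SG lock that remove_volume_from_sg() still holds.
    # A non-reentrant lock blocks forever here; the timeout just exposes it.
    acquired = sg_lock.acquire(timeout=0.1)
    if acquired:
        sg_lock.release()
    return acquired

def remove_volume_from_sg():
    with sg_lock:  # lock taken, as in do_remove_volume_from_sg()
        return move_vol_to_default_sg()

print(remove_volume_from_sg())  # False -> the nested acquire can never succeed
```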
