os-brick

Bug #1947370
Comment 2 for bug 1947370

OpenStack Infra (hudson-openstack) wrote on 2021-11-12: Fix merged to os-brick (master)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/814139
Committed: https://opendev.org/openstack/os-brick/commit/6a43669edc583f8fbcfb4c0f1c7bf6cebad9abd7
Submitter: "Zuul (22348)"
Branch: master

commit 6a43669edc583f8fbcfb4c0f1c7bf6cebad9abd7
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200

    Use file locks in connectors

    Currently os-brick uses in-process locks, which only prevent concurrent
    access to critical sections by threads within a single process.

    However, based on the comment in the iSCSI code, it seems the code
    assumed that these were file-based locks that prevented concurrent
    access from multiple processes.

    That iSCSI comment is being removed because it is not correct that
    our current retry mechanism will work with connect and disconnect
    concurrency issues.

    The reason we haven't seen errors in Nova is that it runs as a
    single process, so the in-process locks are effective.

    This is probably also not an issue in some transport protocols, such as
    FC and RBD, and it wouldn't be an issue in iSCSI connections that don't
    share targets.

    But for others, such as iSCSI with shared targets and NVMe-OF, not
    using file locks will create race conditions in the following cases:

    - More than 1 cinder backend: Because we can have one backend doing a
      detach in a create volume from image and the other an attach for an
      offline migration.

    - Backup/Restore if backup and volume services are running on the same
      host.

    - HCI scenarios where cinder volume and nova compute are running on the
      same host, even if the same lock path is configured.

    - Glance using Cinder as its backend while running on the same node as
      cinder-volume or cinder-backup.

    The problematic race conditions happen because the disconnect will do a
    logout of the iSCSI target once the connect call has already confirmed
    that the session to the target exists.
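
The race described above is a check-then-act problem. The following is a minimal, deterministic sketch of that interleaving, not os-brick code: the `TARGET` name, the `sessions` set, and the `simulate` function are all hypothetical stand-ins for the host's iSCSI session state and the connect/disconnect steps.

```python
# Hypothetical target name, used only for illustration.
TARGET = "iqn.2021-10.example:shared-target"


def simulate(disconnect_between_check_and_use):
    """Replay connect's two steps, optionally with a disconnect in between.

    Returns (connect_succeeded, log).
    """
    sessions = {TARGET}  # stand-in for the sessions currently logged in
    log = []

    # connect, step 1: confirm that a session to the shared target exists.
    session_seen = TARGET in sessions
    log.append(f"connect: session exists = {session_seen}")

    # Another process runs disconnect here and logs out of the target.
    if disconnect_between_check_and_use:
        sessions.discard(TARGET)
        log.append("disconnect: logged out of target")

    # connect, step 2: rely on the earlier check and look for the device.
    if session_seen and TARGET not in sessions:
        log.append("connect: failed, session vanished after the check")
        return False, log

    log.append("connect: device found")
    return True, log
```

With a lock that both processes honor, the disconnect could not run between the two connect steps, which corresponds to calling `simulate(False)`.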

    We could just add the file locks to iSCSI and NVMe, but I think it's
    safer to add them to all the connectors and then, after proper testing,
    we can change back the locks that can be changed, and remove or reduce
    the critical section in others.

    Closes-Bug: #1947370
    Change-Id: I6f7f7d19540361204d4ae3ead2bd6dcddb8fcd68
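
To illustrate the difference the commit message draws between in-process locks and file locks, here is a minimal sketch using only Python's standard `fcntl` module on Linux. It is not the code from the review above; the helper names `file_lock`, `lock_is_free`, and `demo`, and the lock file name, are assumptions made for this example. A `threading.Lock` lives in one process's memory, while an advisory lock on a file is visible to any process that opens the same path.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def file_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is acquired
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def lock_is_free(path):
    """Return True if the lock can be taken right now; release it again."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False  # some other open file description holds the lock
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)


def demo():
    """Probe the lock while it is held and again after it is released."""
    lock_path = os.path.join(tempfile.mkdtemp(), "connect_volume.lock")
    with file_lock(lock_path):
        free_while_held = lock_is_free(lock_path)
    free_after_release = lock_is_free(lock_path)
    return free_while_held, free_after_release
```

The probe in `demo` opens the file a second time, which is what a separate process would do, so the lock is reported as busy while held and free afterwards. This only protects services that point at the same lock file, so in this sketch every cooperating process would need to agree on the lock path.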
