Errors on connect_volume due to race conditions

Bug #1947370 reported by Gorka Eguileor
This bug affects 2 people.

Affects: os-brick
Status: Fix Released
Importance: High
Assigned to: Gorka Eguileor

Bug Description

Cinder can hit race conditions between the connect_volume and disconnect_volume method calls, at least with iSCSI shared targets and NVMe-oF, that prevent the connection from being established.

The problem occurs when these two calls run in different processes, as in the following cases (a minimal sketch of the underlying issue follows the list):

- Cinder has more than one backend: one backend can be doing a detach as part of a create-volume-from-image operation while another does an attach for an offline migration.

- Backup/restore, if the backup and volume services are running on the same host.

- HCI scenarios where cinder-volume and nova-compute are running on the same host, even if the same lock path is configured.

- Glance using Cinder as its backend, running on the same node as cinder-volume or cinder-backup.
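
The root cause is that a lock object created in one process is invisible to every other process. The following self-contained sketch (hypothetical stand-in functions, not os-brick code) shows how a per-process lock fails to serialize a connect against a concurrent disconnect:

    # Hypothetical sketch: a threading.Lock is per-process, so two
    # processes can both be inside the "critical section" at once.
    import multiprocessing
    import threading
    import time

    _lock = threading.Lock()  # every process gets its own instance


    def connect_volume():
        with _lock:  # serializes threads in THIS process only
            print("connect: session to shared target confirmed")
            time.sleep(0.5)  # window where another process can log us out
            print("connect: continuing, but the session may be gone")


    def disconnect_volume():
        with _lock:
            print("disconnect: logging out of the shared target")


    if __name__ == "__main__":
        procs = [multiprocessing.Process(target=connect_volume),
                 multiprocessing.Process(target=disconnect_volume)]
        for p in procs:
            p.start()  # both enter their critical sections concurrently
        for p in procs:
            p.join()

Running it shows both processes printing from inside their critical sections at the same time; with a file-based lock the second process would block instead.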

OpenStack Infra (hudson-openstack) wrote: Fix proposed to os-brick (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-brick/+/814139

Changed in os-brick:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote: Fix merged to os-brick (master)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/814139
Committed: https://opendev.org/openstack/os-brick/commit/6a43669edc583f8fbcfb4c0f1c7bf6cebad9abd7
Submitter: "Zuul (22348)"
Branch: master

commit 6a43669edc583f8fbcfb4c0f1c7bf6cebad9abd7
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200

    Use file locks in connectors

    Currently os-brick uses in-process locks, which only prevent
    concurrent access to critical sections by threads within a single
    process.

    But judging from a comment in the iSCSI connector, the code assumed
    these were file-based locks that prevented concurrent access from
    multiple processes.

    That iSCSI comment is being removed, because it is not correct that
    our current retry mechanism handles connect and disconnect
    concurrency issues.

    The reason we haven't seen errors in Nova is that it runs a single
    process, so in-process locks are effective there.

    This is probably not an issue for some transport protocols, such as
    FC and RBD, and it isn't an issue for iSCSI connections that don't
    share targets.

    But for others, such as iSCSI with shared targets and NVMe-oF, not
    using file locks creates race conditions in the following cases:

    - More than one Cinder backend: one backend can be doing a detach as
      part of a create-volume-from-image operation while another does an
      attach for an offline migration.

    - Backup/restore, if the backup and volume services are running on
      the same host.

    - HCI scenarios where cinder-volume and nova-compute are running on
      the same host, even if the same lock path is configured.

    - Glance using Cinder as its backend, running on the same node as
      cinder-volume or cinder-backup.

    The problematic race condition happens because the disconnect logs
    out of the iSCSI target after the connect call has already confirmed
    that the session to the target exists.

    We could add the file locks only to iSCSI and NVMe, but it is safer
    to add them to all the connectors and then, after proper testing,
    change back the locks that can be changed and remove or reduce the
    critical sections in others.

    Closes-Bug: #1947370
    Change-Id: I6f7f7d19540361204d4ae3ead2bd6dcddb8fcd68
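
The pattern this commit describes maps onto oslo.concurrency's external locks, which back each named lock with a file under the configured lock_path so it is shared between processes. A hedged sketch of that pattern (the connector class, decorator usage, and lock names are illustrative and may differ from the actual os-brick code):

    # Sketch of the file-lock pattern using oslo.concurrency; assumes
    # lock_path is configured. Class and lock names are illustrative.
    from oslo_concurrency import lockutils

    # Prefix the lock files so they are recognizable under lock_path.
    synchronized = lockutils.synchronized_with_prefix('os-brick-')


    class Connector(object):
        @synchronized('connect_volume', external=True)
        def connect_volume(self, connection_properties):
            # external=True takes a file lock, so a concurrent
            # disconnect_volume in another process on the same host
            # (e.g. cinder-backup vs cinder-volume) must wait.
            pass

        @synchronized('connect_volume', external=True)
        def disconnect_volume(self, connection_properties, device_info):
            # Same lock name as connect_volume: the two operations are
            # mutually exclusive across processes sharing lock_path.
            pass

Both methods sharing one lock name is the point: it makes connect and disconnect mutually exclusive, closing the window between session confirmation and target logout.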

Changed in os-brick:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote: Fix proposed to os-brick (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/os-brick/+/817671

OpenStack Infra (hudson-openstack) wrote: Fix proposed to os-brick (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/os-brick/+/817708

OpenStack Infra (hudson-openstack) wrote: Fix proposed to os-brick (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/os-brick/+/817711

OpenStack Infra (hudson-openstack) wrote: Fix proposed to os-brick (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/os-brick/+/817712

OpenStack Infra (hudson-openstack) wrote: Fix merged to os-brick (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/817671
Committed: https://opendev.org/openstack/os-brick/commit/19a4820f5c4ccca10d50d08198873b1515ee257b
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 19a4820f5c4ccca10d50d08198873b1515ee257b
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200

    Use file locks in connectors

    (Commit message identical to the master commit above; backport
    trailers follow.)

    Closes-Bug: #1947370
    Change-Id: I6f7f7d19540361204d4ae3ead2bd6dcddb8fcd68
    (cherry picked from commit 6a43669edc583f8fbcfb4c0f1c7bf6cebad9abd7)

tags: added: in-stable-xena
OpenStack Infra (hudson-openstack) wrote: Fix merged to os-brick (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/817708
Committed: https://opendev.org/openstack/os-brick/commit/ecaf7f8962e12b43f9759ddc1b608f30eb9f5ebb
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit ecaf7f8962e12b43f9759ddc1b608f30eb9f5ebb
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200

    Use file locks in connectors

    (Commit message identical to the master commit above; backport
    trailers follow.)

    Closes-Bug: #1947370
    Change-Id: I6f7f7d19540361204d4ae3ead2bd6dcddb8fcd68
    (cherry picked from commit 6a43669edc583f8fbcfb4c0f1c7bf6cebad9abd7)
    (cherry picked from commit 19a4820f5c4ccca10d50d08198873b1515ee257b)
    Conflicts: os_brick/initiator/connectors/iscsi.py

tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote: Fix merged to os-brick (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/817712
Committed: https://opendev.org/openstack/os-brick/commit/9d3ce01eabc0b95d50302dfa9891ae6f0bf4cc6c
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 9d3ce01eabc0b95d50302dfa9891ae6f0bf4cc6c
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200

    Use file locks in connectors

    (Commit message identical to the master commit above; backport
    trailers follow.)

    Closes-Bug: #1947370
    Change-Id: I6f7f7d19540361204d4ae3ead2bd6dcddb8fcd68
    (cherry picked from commit 6a43669edc583f8fbcfb4c0f1c7bf6cebad9abd7)
    (cherry picked from commit 19a4820f5c4ccca10d50d08198873b1515ee257b)
    Conflicts: os_brick/initiator/connectors/iscsi.py
    (cherry picked from commit 08ddf69d648c5622a5a300a8636bda727b5b2722)
    Conflicts: os_brick/initiator/connectors/nvmeof.py
    (cherry picked from commit b03eca8353f1db7fed5dc630d225126b1edb1055)
    Conflicts: os_brick/tests/base.py

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote: Fix merged to os-brick (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/817711
Committed: https://opendev.org/openstack/os-brick/commit/b03eca8353f1db7fed5dc630d225126b1edb1055
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit b03eca8353f1db7fed5dc630d225126b1edb1055
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200

    Use file locks in connectors

    (Commit message identical to the master commit above; backport
    trailers follow.)

    Closes-Bug: #1947370
    Change-Id: I6f7f7d19540361204d4ae3ead2bd6dcddb8fcd68
    (cherry picked from commit 6a43669edc583f8fbcfb4c0f1c7bf6cebad9abd7)
    (cherry picked from commit 19a4820f5c4ccca10d50d08198873b1515ee257b)
    Conflicts: os_brick/initiator/connectors/iscsi.py
    (cherry picked from commit ecaf7f8962e12b43f9759ddc1b608f30eb9f5ebb)
    Conflicts: os_brick/initiator/connectors/nvmeof.py

tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote: Fix included in openstack/os-brick 3.0.8

This issue was fixed in the openstack/os-brick 3.0.8 release.

OpenStack Infra (hudson-openstack) wrote: Fix included in openstack/os-brick 5.1.0

This issue was fixed in the openstack/os-brick 5.1.0 release.

OpenStack Infra (hudson-openstack) wrote: Fix included in openstack/os-brick 5.0.2

This issue was fixed in the openstack/os-brick 5.0.2 release.

OpenStack Infra (hudson-openstack) wrote: Fix included in openstack/os-brick 4.3.3

This issue was fixed in the openstack/os-brick 4.3.3 release.

OpenStack Infra (hudson-openstack) wrote: Fix included in openstack/os-brick 4.0.5

This issue was fixed in the openstack/os-brick 4.0.5 release.

OpenStack Infra (hudson-openstack) wrote: Fix proposed to os-brick (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/os-brick/+/846957

OpenStack Infra (hudson-openstack) wrote: Fix merged to os-brick (stable/train)

Reviewed: https://review.opendev.org/c/openstack/os-brick/+/846957
Committed: https://opendev.org/openstack/os-brick/commit/9090cb79ef24ff93c2455e0db0660de2664b270b
Submitter: "Zuul (22348)"
Branch: stable/train

commit 9090cb79ef24ff93c2455e0db0660de2664b270b
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200

    Use file locks in connectors

    (Commit message identical to the master commit above; backport
    trailers follow.)

    Closes-Bug: #1947370
    Change-Id: I6f7f7d19540361204d4ae3ead2bd6dcddb8fcd68
    (cherry picked from commit 6a43669edc583f8fbcfb4c0f1c7bf6cebad9abd7)
    (cherry picked from commit 19a4820f5c4ccca10d50d08198873b1515ee257b)
    Conflicts: os_brick/initiator/connectors/iscsi.py
    (cherry picked from commit 08ddf69d648c5622a5a300a8636bda727b5b2722)
    Conflicts: os_brick/initiator/connectors/nvmeof.py
    (cherry picked from commit b03eca8353f1db7fed5dc630d225126b1edb1055)
    Conflicts: os_brick/tests/base.py
    Conflicts: os_brick/initiator/connectors/iscsi.py

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote: Fix included in openstack/os-brick train-eol

This issue was fixed in the openstack/os-brick train-eol release.
