Errors on connect_volume due to race conditions
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
os-brick | Fix Released | High | Gorka Eguileor |
Bug Description
Cinder can hit race conditions between the connect_volume and disconnect_volume method calls, at least with iSCSI shared targets and NVMe-OF, that prevent the connection from being established.
The problem occurs when these two calls run in different processes, as in the following cases:
- Cinder has more than one backend: one backend can be doing a detach as part of a create-volume-from-image operation while another does an attach for an offline migration.
- Backup/restore, if the backup and volume services are running on the same host.
- HCI scenarios where cinder-volume and nova-compute run on the same host, even if the same lock path is configured.
- Glance using Cinder as a backend while running on the same node as cinder-volume or cinder-backup.
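The failure is a classic check-then-act race across processes. Below is a minimal, self-contained sketch (not os-brick code; the target name and the simulated session table are illustrative) of why a per-process lock cannot serialize connect_volume and disconnect_volume when they run in different services:

```python
import multiprocessing
import threading
import time

# Each process constructs its OWN copy of this lock, so it serializes
# threads inside one process but does nothing across processes.
_lock = threading.Lock()

TARGET = 'iqn.2004-10.example:shared-target'  # illustrative shared target

def connect_volume(sessions):
    with _lock:
        assert TARGET in sessions        # session to the target exists...
        time.sleep(0.2)                  # ...window for a concurrent logout
        # A disconnect_volume in another process may have logged out of
        # the shared target by now, so the attach fails.
        if TARGET not in sessions:
            print('connect_volume failed: session vanished under us')

def disconnect_volume(sessions):
    with _lock:
        sessions.pop(TARGET, None)       # simulated iSCSI logout

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    sessions = manager.dict({TARGET: True})
    attach = multiprocessing.Process(target=connect_volume, args=(sessions,))
    detach = multiprocessing.Process(target=disconnect_volume, args=(sessions,))
    attach.start()
    time.sleep(0.1)   # let connect_volume pass its session check first
    detach.start()    # the in-process lock cannot block this logout
    attach.join()
    detach.join()
```

Run it and connect_volume reports the vanished session; move both calls into a single process as threads and the same lock serializes them correctly.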
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (master) | #1 |
Changed in os-brick:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (master) | #2 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 6a43669edc583f8
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200
Use file locks in connectors
Currently os-brick uses in-process locks that only prevent concurrent
access to critical sections by threads within a single process.
But based on a comment in the iSCSI code, it seems the code assumed
that these were file-based locks that prevented concurrent access from
multiple processes.
That iSCSI comment is being removed because it is not correct that
our current retry mechanism can handle connect and disconnect
concurrency issues.
We haven't seen these errors in Nova because it runs in a single
process, where in-process locks are effective.
This is probably also not an issue in some transport protocols, such as
FC and RBD, and it wouldn't be an issue in iSCSI connections that don't
share targets.
But for others, such as iSCSI with shared targets and NVMe-OF, not
using file locks will create race conditions in the following cases:
- More than one cinder backend: we can have one backend doing a detach
in a create volume from image while the other does an attach for an
offline migration.
- Backup/restore, if the backup and volume services are running on the
same host.
- HCI scenarios where cinder-volume and nova-compute are running on the
same host, even if the same lock path is configured.
- Glance using Cinder as a backend while running on the same node as
cinder-volume or cinder-backup.
The problematic race conditions happen because the disconnect logs out
of the iSCSI target after the connect call has already confirmed that
the session to the target exists.
We could add the file locks only to iSCSI and NVMe, but I think it's
safer to add them to all the connectors and then, after proper testing,
change back the locks that can be changed and remove or reduce the
critical sections in the others.
Closes-Bug: #1947370
Change-Id: I6f7f7d19540361
Changed in os-brick:
status: In Progress → Fix Released
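For reference, the shape of the fix is to switch the connector critical sections from in-process locks to oslo.concurrency's external (file-based) locks. This is a hedged sketch rather than the exact os-brick patch: the 'os-brick-' lock-file prefix, the lock_path value, and the simplified method signatures are assumptions (deployments normally take lock_path from the [oslo_concurrency] configuration section):

```python
from oslo_concurrency import lockutils

# external=True makes this an interprocess file lock stored under
# lock_path, so cinder-volume, cinder-backup, glance-api and
# nova-compute on the same host all contend on the same lock file.
@lockutils.synchronized('connect_volume', 'os-brick-', external=True,
                        lock_path='/var/lib/os-brick')  # assumed path
def connect_volume(connection_properties):
    pass  # attach logic runs with no concurrent logout possible

@lockutils.synchronized('connect_volume', 'os-brick-', external=True,
                        lock_path='/var/lib/os-brick')
def disconnect_volume(connection_properties, device_info):
    pass  # detach/logout logic, serialized against connect_volume
```

Both methods take the same named lock on purpose: it is the connect/disconnect interleaving on a shared target that must be excluded, not just concurrent connects.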
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/xena) | #3 |
Fix proposed to branch: stable/xena
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/wallaby) | #4 |
Fix proposed to branch: stable/wallaby
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/victoria) | #5 |
Fix proposed to branch: stable/victoria
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/ussuri) | #6 |
Fix proposed to branch: stable/ussuri
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/xena) | #7 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/xena
commit 19a4820f5c4ccca
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200
Use file locks in connectors
(Commit message body identical to the master commit in comment #2.)
Closes-Bug: #1947370
Change-Id: I6f7f7d19540361
(cherry picked from commit 6a43669edc583f8
tags: added: in-stable-xena
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/wallaby) | #8 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/wallaby
commit ecaf7f8962e12b4
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200
Use file locks in connectors
(Commit message body identical to the master commit in comment #2.)
Closes-Bug: #1947370
Change-Id: I6f7f7d19540361
(cherry picked from commit 6a43669edc583f8
(cherry picked from commit 19a4820f5c4ccca
Conflicts: os_brick/
tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/ussuri) | #9 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/ussuri
commit 9d3ce01eabc0b95
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200
Use file locks in connectors
(Commit message body identical to the master commit in comment #2.)
Closes-Bug: #1947370
Change-Id: I6f7f7d19540361
(cherry picked from commit 6a43669edc583f8
(cherry picked from commit 19a4820f5c4ccca
Conflicts: os_brick/
(cherry picked from commit 08ddf69d648c562
Conflicts: os_brick/
(cherry picked from commit b03eca8353f1db7
Conflicts: os_brick/
tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/victoria) | #10 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/victoria
commit b03eca8353f1db7
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200
Use file locks in connectors
(Commit message body identical to the master commit in comment #2.)
Closes-Bug: #1947370
Change-Id: I6f7f7d19540361
(cherry picked from commit 6a43669edc583f8
(cherry picked from commit 19a4820f5c4ccca
Conflicts: os_brick/
(cherry picked from commit ecaf7f8962e12b4
Conflicts: os_brick/
tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 3.0.8 | #11 |
This issue was fixed in the openstack/os-brick 3.0.8 release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 5.1.0 | #12 |
This issue was fixed in the openstack/os-brick 5.1.0 release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 5.0.2 | #13 |
This issue was fixed in the openstack/os-brick 5.0.2 release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 4.3.3 | #14 |
This issue was fixed in the openstack/os-brick 4.3.3 release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 4.0.5 | #15 |
This issue was fixed in the openstack/os-brick 4.0.5 release.
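To check whether a deployment already carries the fix, compare the installed os-brick version against the releases listed above (3.0.8, 4.0.5, 4.3.3, 5.0.2, 5.1.0). A quick check using only the standard library:

```python
# Print the installed os-brick version to compare against the fixed
# releases listed in this bug report.
from importlib.metadata import version

print(version('os-brick'))
```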
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/train) | #16 |
Fix proposed to branch: stable/train
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/train) | #17 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/train
commit 9090cb79ef24ff9
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 15 14:33:57 2021 +0200
Use file locks in connectors
(Commit message body identical to the master commit in comment #2.)
Closes-Bug: #1947370
Change-Id: I6f7f7d19540361
(cherry picked from commit 6a43669edc583f8
(cherry picked from commit 19a4820f5c4ccca
Conflicts: os_brick/
(cherry picked from commit 08ddf69d648c562
Conflicts: os_brick/
(cherry picked from commit b03eca8353f1db7
Conflicts: os_brick/
Conflicts: os_brick/
tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick train-eol | #18 |
This issue was fixed in the openstack/os-brick train-eol release.
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-brick/+/814139