Bug #1937084 “Nova thinks deleted volume is still attached” : Bugs : Cinder

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-07-22: Fix proposed to cinder (master)

#1

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/801913

Changed in cinder:
status:	New → In Progress

Sofia Enriquez (lsofia-enriquez) on 2021-08-18

Changed in cinder:
importance:	Undecided → Medium
tags:	added: attach cinder-csi nova
Changed in cinder:
importance:	Medium → High
tags:	added: volume

Lee Yarwood (lyarwood) wrote on 2021-10-01:

#2

https://review.opendev.org/c/openstack/nova/+/812127 - Nova will address the 404 from c-api during an attachment delete here.

Changed in nova:
status:	New → In Progress
importance:	Undecided → Medium
assignee:	nobody → Lee Yarwood (lyarwood)

OpenStack Infra (hudson-openstack) wrote on 2021-11-16: Fix merged to cinder (master)

#3

Reviewed:  https://review.opendev.org/c/openstack/cinder/+/801913
Committed: https://opendev.org/openstack/cinder/commit/2ec2222841f6116707fe25bdcdae6ad6c2b9beb7
Submitter: "Zuul (22348)"
Branch:    master

commit 2ec2222841f6116707fe25bdcdae6ad6c2b9beb7
Author: Gorka Eguileor <geguileo@redhat.com>
Date:   Wed Jul 21 15:20:38 2021 +0200

Fix: Race between attachment and volume deletion
    
    There are cases where requests to delete an attachment made by Nova can
    race other third-party requests to delete the overall volume.
    
    This has been observed when running cinder-csi, where it first requests
    that Nova detaches a volume before itself requesting that the overall
    volume is deleted once it becomes `available`.
    
    This is a cinder race condition, and like most race conditions is not
    simple to explain.
    
    Some context on the issue:
    
    - Cinder API uses the volume "status" field as a locking mechanism to
      prevent concurrent request processing on the same volume.
    
    - Most cinder operations are asynchronous, so the API returns before the
      operation has been completed by the cinder-volume service, but the
      attachment operations such as creating/updating/deleting an attachment
      are synchronous, so the API only returns to the caller after the
      cinder-volume service has completed the operation.
    
    - Our current code **incorrectly** modifies the status of the volume
      both on the cinder-volume and the cinder-api services on the
      attachment delete operation.
    
    The actual set of events that leads to the issue reported in this bug
    are:
    
    [Cinder-CSI]
    - Requests Nova to detach volume (Request R1)
    
    [Nova]
    - R1: Asks cinder-api to delete the attachment and **waits**
    
    [Cinder-API]
    - R1: Checks the status of the volume
    - R1: Sends terminate connection request (R1) to cinder-volume and
      **waits**
    
    [Cinder-Volume]
    - R1: Ask the driver to terminate the connection
    - R1: The driver asks the backend to unmap and unexport the volume
    - R1: The last attachment is removed from the DB and the status of the
          volume is changed in the DB to "available"
    
    [Cinder-CSI]
    - Checks that there are no attachments in the volume and asks Cinder to
      delete it (Request R2)
    
    [Cinder-API]
    
    - R2: Check that the volume's status is valid. It doesn't have
      attachments and is available, so it can be deleted.
    - R2: Tell cinder-volume to delete the volume and return immediately.
    
    [Cinder-Volume]
    - R2: Volume is deleted and DB entry is deleted
    - R1: Finish the termination of the connection
    
    [Cinder-API]
    - R1: Now that cinder-volume has finished the termination the code
      continues
    - R1: Try to modify the volume in the DB
    - R1: DB layer raises VolumeNotFound since the volume has been deleted
      from the DB
    - R1: VolumeNotFound is converted to HTTP 404 status code which is
      returned to Nova
    
    [Nova]
    - R1: Cinder responds with 404 on the attachment delete request
    - R1: Nova leaves the volume as attached, since the attachment delete
      failed
    
    At this point the Cinder and Nova DBs are out of sync, because Nova
    thinks that the attachment is connected and Cinder has detached the
    volume and even deleted it.
    
    Hardening is also being done on the Nova side [2] to accept that the
    volume attachment may be gone.
    
    This patch fixes the issue mentioned above, but there is a request on
    Cinder-CSI [1] to use Nova as the source of truth regarding its
    attachments that, when implemented, would also fix the issue.
    
    [1]: https://github.com/kubernetes/cloud-provider-openstack/issues/1645
    [2]: https://review.opendev.org/q/topic:%2522bug/1937084%2522+project:openstack/nova
    
    Closes-Bug: #1937084
    Change-Id: Iaf149dadad5791e81a3c0efd089d0ee66a1a5614
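
The interleaving described in the commit message above can be replayed deterministically. The following is only a minimal sketch under stated assumptions: `FakeCinderDB`, the helper functions, and the numeric status codes are hypothetical stand-ins written for illustration and are not Cinder or Nova code. R2 is injected at the point where cinder-api is waiting on cinder-volume, so no threads are needed.

```python
class VolumeNotFound(Exception):
    """Stand-in for the error raised by the DB layer for a missing volume."""


class FakeCinderDB:
    """Hypothetical in-memory replacement for the Cinder database."""

    def __init__(self):
        self.volumes = {}      # volume_id -> {"status": str}
        self.attachments = {}  # attachment_id -> volume_id

    def volume_update(self, volume_id, **fields):
        if volume_id not in self.volumes:
            raise VolumeNotFound(volume_id)
        self.volumes[volume_id].update(fields)


def volume_terminate_connection(db, attachment_id):
    """cinder-volume side of R1: remove the attachment, free the volume."""
    volume_id = db.attachments.pop(attachment_id)
    if volume_id not in db.attachments.values():
        db.volume_update(volume_id, status="available")
    return volume_id


def api_delete_volume(db, volume_id):
    """R2: allowed once the volume is 'available' with no attachments."""
    volume = db.volumes.get(volume_id)
    if volume is None:
        raise VolumeNotFound(volume_id)
    if volume["status"] != "available" or volume_id in db.attachments.values():
        return 400  # illustrative "invalid state" code
    del db.volumes[volume_id]  # cinder-volume deletes the volume and DB entry
    return 202


def api_delete_attachment(db, attachment_id, interleave=None):
    """cinder-api side of R1; `interleave` runs while the API 'waits'."""
    volume_id = volume_terminate_connection(db, attachment_id)
    if interleave is not None:
        interleave(volume_id)
    try:
        # The second status write on the API side is what fails.
        db.volume_update(volume_id, status="available")
    except VolumeNotFound:
        return 404
    return 200


def nova_detach(db, bdms, attachment_id, interleave=None):
    """Nova side of R1 before hardening: clean up only on success."""
    code = api_delete_attachment(db, attachment_id, interleave)
    if code == 200:
        bdms.discard(attachment_id)
    return code


def replay(interleave_r2):
    """Run R1, optionally injecting R2 in the middle, and report the state."""
    db = FakeCinderDB()
    db.volumes["vol-1"] = {"status": "in-use"}
    db.attachments["att-1"] = "vol-1"
    bdms = {"att-1"}
    r2_codes = []
    hook = None
    if interleave_r2:
        hook = lambda vol_id: r2_codes.append(api_delete_volume(db, vol_id))
    code = nova_detach(db, bdms, "att-1", hook)
    return code, r2_codes, sorted(bdms), sorted(db.volumes)


print(replay(interleave_r2=True))
print(replay(interleave_r2=False))
```

With R2 injected, the sketch ends with the volume gone from the fake Cinder DB while the attachment is still recorded on the Nova side, which is the out-of-sync state the bug title describes.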

Changed in cinder:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2021-11-23: Fix proposed to cinder (stable/wallaby)

#4

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/cinder/+/818886

OpenStack Infra (hudson-openstack) wrote on 2022-01-14: Related fix merged to nova (master)

#5

Reviewed: https://review.opendev.org/c/openstack/nova/+/812126
Committed: https://opendev.org/openstack/nova/commit/10c7e718488a6daad5bcea97e00aece24179168e
Submitter: "Zuul (22348)"
Branch: master

commit 10c7e718488a6daad5bcea97e00aece24179168e
Author: Lee Yarwood <email address hidden>
Date: Fri Oct 1 12:05:15 2021 +0100

Add regression test for bug #1937084

    This regression test asserts the behaviour of Nova when Cinder raises a
    404 during a DELETE request against an attachment.

    In the context of bug 1937084 this could happen if a caller attempted to
    DELETE a volume attachment through Nova's os-volume_attachments API and
    then made a separate DELETE request against the underlying volume in
    Cinder when it was marked as available.

    Related-Bug: #1937084
    Change-Id: I56106d16ed1d24793c4cddad0caa365a641ea4fd
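
A test of the kind described in this commit might look roughly like the sketch below. It is not the Nova regression test itself: `detach`, `CinderNotFound`, and the list of attachment IDs are hypothetical stand-ins, and the sketch assumes that, before any hardening, an error from the volume client aborts the cleanup. Only the standard library `unittest` and `unittest.mock` modules are used.

```python
import io
import unittest
from unittest import mock


class CinderNotFound(Exception):
    """Stand-in for a 404 response from the volume service."""


def detach(volume_client, bdms, attachment_id):
    """Hypothetical detach without hardening: any client error aborts cleanup."""
    volume_client.attachment_delete(attachment_id)
    bdms.remove(attachment_id)


class TestDetachWhenAttachmentIsGone(unittest.TestCase):
    def test_404_leaves_volume_attached(self):
        # The mocked client answers the attachment DELETE with a 404-like error.
        client = mock.Mock()
        client.attachment_delete.side_effect = CinderNotFound("att-1")
        bdms = ["att-1"]

        with self.assertRaises(CinderNotFound):
            detach(client, bdms, "att-1")

        client.attachment_delete.assert_called_once_with("att-1")
        # The record survives, so the volume still looks attached.
        self.assertEqual(["att-1"], bdms)


def run_tests():
    """Run the test case quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        TestDetachWhenAttachmentIsGone)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)


result = run_tests()
print("passed" if result.wasSuccessful() else "failed")
```

Pinning the faulty behaviour down first makes a later fix visible as a change to the final assertion.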

OpenStack Infra (hudson-openstack) wrote on 2022-01-17: Fix proposed to cinder (stable/xena)

#6

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/cinder/+/824979

OpenStack Infra (hudson-openstack) wrote on 2022-01-18: Fix proposed to cinder (stable/victoria)

#7

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/cinder/+/825106

OpenStack Infra (hudson-openstack) wrote on 2022-01-25: Fix merged to nova (master)

#8

Reviewed: https://review.opendev.org/c/openstack/nova/+/812127
Committed: https://opendev.org/openstack/nova/commit/067cd93424ea1e62c77744986a5479d1b99b0ffe
Submitter: "Zuul (22348)"
Branch: master

commit 067cd93424ea1e62c77744986a5479d1b99b0ffe
Author: Lee Yarwood <email address hidden>
Date: Fri Oct 1 12:21:57 2021 +0100

block_device: Ignore VolumeAttachmentNotFound during detach

    Bug #1937084 details a race condition within Cinder where requests to
    delete an attachment and later delete the underlying volume can race,
    leading to the initial request returning a 404 if the volume delete
    completes first.

    This change attempts to handle this within Nova during a detach as we
    ultimately don't care that the volume and/or volume attachment are no
    longer available within Cinder. This allows Nova to complete its own
    cleanup of the BlockDeviceMapping record resulting in the volume no
    longer appearing attached in Nova's APIs.

    Closes-Bug: #1937084
    Change-Id: I191552652d8ff5206abad7558c99bce27979dc84
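
The approach in this commit message, treating a missing attachment as already detached so that local cleanup can finish, can be sketched as follows. This is an illustration under assumptions, not the merged Nova change: `FakeVolumeAPI`, `detach_volume`, and the dictionary used as a block device mapping table are hypothetical, and the exception class here is a local stand-in that only borrows the name from the commit title.

```python
class VolumeAttachmentNotFound(Exception):
    """Local stand-in for the error raised when Cinder answers 404."""


class FakeVolumeAPI:
    """Hypothetical volume client whose attachment has already vanished."""

    def attachment_delete(self, context, attachment_id):
        raise VolumeAttachmentNotFound(attachment_id)


def detach_volume(volume_api, context, bdm, bdm_table):
    """Delete the attachment, tolerating the case where it is already gone."""
    try:
        volume_api.attachment_delete(context, bdm["attachment_id"])
    except VolumeAttachmentNotFound:
        # Nothing is left to delete on the Cinder side, so carry on and
        # finish the local cleanup instead of failing the detach.
        pass
    # Remove the local record so the volume no longer appears attached.
    bdm_table.pop(bdm["id"], None)


bdm_table = {1: {"id": 1, "attachment_id": "att-1"}}
detach_volume(FakeVolumeAPI(), None, bdm_table[1], bdm_table)
print(bdm_table)
```

Only the "not found" case is swallowed in this sketch; any other error from the client still propagates and leaves the record in place.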

Changed in nova:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2022-01-28: Fix merged to cinder (stable/xena)

#9

Reviewed:  https://review.opendev.org/c/openstack/cinder/+/824979
Committed: https://opendev.org/openstack/cinder/commit/ed0be0c8fa1d26c3f366dd3d58ad4a8318695dcb
Submitter: "Zuul (22348)"
Branch:    stable/xena

commit ed0be0c8fa1d26c3f366dd3d58ad4a8318695dcb
Author: Gorka Eguileor <geguileo@redhat.com>
Date:   Wed Jul 21 15:20:38 2021 +0200

Fix: Race between attachment and volume deletion
    
    Closes-Bug: #1937084
    Change-Id: Iaf149dadad5791e81a3c0efd089d0ee66a1a5614
    (cherry picked from commit 2ec2222841f6116707fe25bdcdae6ad6c2b9beb7)

tags:	added: in-stable-xena

OpenStack Infra (hudson-openstack) wrote on 2022-02-06: Fix merged to cinder (stable/wallaby)

#10

Reviewed:  https://review.opendev.org/c/openstack/cinder/+/818886
Committed: https://opendev.org/openstack/cinder/commit/7210c914c4ee08a06e6ec00e9a861b977d794ec8
Submitter: "Zuul (22348)"
Branch:    stable/wallaby

commit 7210c914c4ee08a06e6ec00e9a861b977d794ec8
Author: Gorka Eguileor <geguileo@redhat.com>
Date:   Wed Jul 21 15:20:38 2021 +0200

Fix: Race between attachment and volume deletion
    
    Closes-Bug: #1937084
    Change-Id: Iaf149dadad5791e81a3c0efd089d0ee66a1a5614
    (cherry picked from commit 2ec2222841f6116707fe25bdcdae6ad6c2b9beb7)
    Conflicts:
            cinder/tests/unit/attachments/test_attachments_manager.py
            cinder/volume/manager.py
    (cherry picked from commit ed0be0c8fa1d26c3f366dd3d58ad4a8318695dcb)

tags:	added: in-stable-wallaby

OpenStack Infra (hudson-openstack) wrote on 2022-02-10: Fix merged to cinder (stable/victoria)

#11

Reviewed:  https://review.opendev.org/c/openstack/cinder/+/825106
Committed: https://opendev.org/openstack/cinder/commit/399da0782da1c37cdda4b5e9f737b820a66c016f
Submitter: "Zuul (22348)"
Branch:    stable/victoria

commit 399da0782da1c37cdda4b5e9f737b820a66c016f
Author: Gorka Eguileor <geguileo@redhat.com>
Date:   Wed Jul 21 15:20:38 2021 +0200

Fix: Race between attachment and volume deletion
    
    Closes-Bug: #1937084
    Change-Id: Iaf149dadad5791e81a3c0efd089d0ee66a1a5614
    (cherry picked from commit 2ec2222841f6116707fe25bdcdae6ad6c2b9beb7)
    Conflicts:
            cinder/tests/unit/attachments/test_attachments_manager.py
            cinder/volume/manager.py
    (cherry picked from commit ed0be0c8fa1d26c3f366dd3d58ad4a8318695dcb)
    (cherry picked from commit 7210c914c4ee08a06e6ec00e9a861b977d794ec8)
    Conflicts:
            cinder/db/sqlalchemy/api.py

tags:	added: in-stable-victoria

OpenStack Infra (hudson-openstack) wrote on 2022-03-04: Fix included in openstack/cinder 19.1.0

#12

This issue was fixed in the openstack/cinder 19.1.0 release.

OpenStack Infra (hudson-openstack) wrote on 2022-03-07: Fix included in openstack/cinder 18.2.0

#13

This issue was fixed in the openstack/cinder 18.2.0 release.

OpenStack Infra (hudson-openstack) wrote on 2022-03-07: Fix included in openstack/cinder 17.3.0

#14

This issue was fixed in the openstack/cinder 17.3.0 release.

OpenStack Infra (hudson-openstack) wrote on 2022-03-11: Fix included in openstack/nova 25.0.0.0rc1

#15

This issue was fixed in the openstack/nova 25.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote on 2022-03-14: Fix included in openstack/cinder 20.0.0.0rc1

#16

This issue was fixed in the openstack/cinder 20.0.0.0rc1 release candidate.

Affects	Status	Importance	Assigned to
Cinder	Fix Released	High	Gorka Eguileor
OpenStack Compute (nova)	Fix Released	Medium	Lee Yarwood
Ussuri	New	Undecided	Unassigned
Victoria	New	Undecided	Unassigned
Wallaby	New	Undecided	Unassigned
Xena	New	Undecided	Unassigned