VM reboot to Error state caused by deletion of dangling BDMs

Bug #2048154 reported by Sang Tran
This bug affects 5 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Amit Uniyal

Bug Description

Hi Community,

I hit an error with this patch when rebooting a VM whose volume was created from an image: the instance ends up in the Error state with VolumeDeviceNotFound (tested with the NetApp and PowerStore iSCSI SAN drivers).

Environment:
OpenStack Bobcat stable version
Cinder driver: tested with both the NetApp and PowerStore iSCSI drivers

Steps to reproduce:
- Create a VM from an image (source type image, destination type volume)
- Hard reboot or soft reboot the VM
- The VM ends up in the Error state; the attachment record in cinder.volume_attachment has been deleted
- The LUN mapping has been removed from the SAN storage (because the volume_attachment record was deleted)

Workaround:
- Recover the attachment record by setting its deleted column back to 0
- Manually restore the LUN mapping on the SAN storage to the LUN id from the attachment record
- Hard reboot; the VM comes back to the running state

Reference:
[1] https://review.opendev.org/c/openstack/nova/+/882284

        for bdm in bdms.objects:
            # The following condition leads to the bug:
            if bdm.volume_id and bdm.source_type == 'volume' and \
                bdm.destination_type == 'volume':
                try:
                    self.volume_api.attachment_get(context, bdm.attachment_id)
                except exception.VolumeAttachmentNotFound:
                    LOG.info(
                        f"Removing stale volume attachment "
                        f"'{bdm.attachment_id}' from instance for "
                        f"volume '{bdm.volume_id}'.", instance=instance)
                    bdm.destroy()
                    bdms_to_delete.append(bdm)
                else:
                    nova_attachments.append(bdm.attachment_id)
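The failure mode can be reduced to a few lines. Below is a minimal sketch using a hypothetical stand-in object (not the real nova BlockDeviceMapping) and the UUIDs quoted later in this report: because source_type is 'image', the condition never adds the attachment to nova_attachments, so the set difference later flags a perfectly valid cinder attachment as stale.

```python
from types import SimpleNamespace

# Stand-in for a boot-from-volume BDM whose volume was created from an image.
bdm = SimpleNamespace(
    volume_id='e67a9299-eaf1-45bd-9e01-eccb1998d6eb',
    attachment_id='46447930-2393-4690-b74f-9d6541597617',
    source_type='image',
    destination_type='volume')

nova_attachments = []
# Same condition as the excerpt above: only source_type == 'volume' qualifies.
if bdm.volume_id and bdm.source_type == 'volume' and \
        bdm.destination_type == 'volume':
    nova_attachments.append(bdm.attachment_id)

# Cinder still knows about the attachment, so the set difference is non-empty
# and the valid attachment is wrongly treated as stale.
cinder_attachments = [bdm.attachment_id]
stale = set(cinder_attachments) - set(nova_attachments)
print(stale)  # {'46447930-2393-4690-b74f-9d6541597617'}
```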

Revision history for this message
Uggla (rene-ribaud) wrote :

Hello Sang,

Thanks for reporting this bug. Can you check whether my assumption below is correct:

        # attachments present in nova DB, ones nova knows about
        nova_attachments = []
        bdms_to_delete = []
        for bdm in bdms.objects:
            # <-- here, when bdm.source_type == 'image', this condition is false
            if bdm.volume_id and bdm.source_type == 'volume' and \
                bdm.destination_type == 'volume':
                try:
                    self.volume_api.attachment_get(context, bdm.attachment_id)
                except exception.VolumeAttachmentNotFound:
                    LOG.info(
                        f"Removing stale volume attachment "
                        f"'{bdm.attachment_id}' from instance for "
                        f"volume '{bdm.volume_id}'.", instance=instance)
                    bdm.destroy()
                    bdms_to_delete.append(bdm)
                else:
                    nova_attachments.append(bdm.attachment_id)  # <-- we never reach this branch

        cinder_attachments = [each['id'] for each in cinder_attachments]

        if len(set(cinder_attachments) - set(nova_attachments)):  # <-- so nova_attachments is empty
            LOG.info(
                "Removing stale volume attachments of instance from "
                "Cinder", instance=instance)
        for each_attach in set(cinder_attachments) - set(nova_attachments):
            # delete only cinder known attachments, from cinder DB.
            LOG.debug(
                f"Removing attachment '{each_attach}'", instance=instance)
            self.volume_api.attachment_delete(context, each_attach)  # <-- we delete the valid cinder attachment!

        # refresh bdms object
        for bdm in bdms_to_delete:
            bdms.objects.remove(bdm)

Can you check whether you have "Removing attachment..." in your logs (debug mode required)?

So we need to fix:
if bdm.volume_id and bdm.source_type == 'volume' and \
                bdm.destination_type == 'volume':

to
if bdm.volume_id and (bdm.source_type == 'volume' or bdm.source_type == 'image') and \
                bdm.destination_type == 'volume':

Or maybe we should remove the bdm.source_type check entirely?
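The suggested widening can be sketched as follows, again with hypothetical stand-in objects rather than the real nova loop; with the condition accepting both source types, an image-sourced BFV attachment is now kept:

```python
from types import SimpleNamespace

def known_attachments(bdms):
    """Collect the attachment ids nova should treat as known."""
    nova_attachments = []
    for bdm in bdms:
        # Widened check: accept both 'volume' and 'image' source types.
        if bdm.volume_id and bdm.source_type in ('volume', 'image') and \
                bdm.destination_type == 'volume':
            nova_attachments.append(bdm.attachment_id)
    return nova_attachments

bdms = [
    SimpleNamespace(volume_id='v1', attachment_id='a-img',
                    source_type='image', destination_type='volume'),
    SimpleNamespace(volume_id='v2', attachment_id='a-vol',
                    source_type='volume', destination_type='volume'),
]
print(known_attachments(bdms))  # ['a-img', 'a-vol']
```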

Changed in nova:
status: New → Triaged
Revision history for this message
Uggla (rene-ribaud) wrote :

Just to show that source_type can be 'image':

mysql> select volume_id, source_type, destination_type, attachment_id from block_device_mapping where volume_id='e67a9299-eaf1-45bd-9e01-eccb1998d6eb';
+--------------------------------------+-------------+------------------+--------------------------------------+
| volume_id | source_type | destination_type | attachment_id |
+--------------------------------------+-------------+------------------+--------------------------------------+
| e67a9299-eaf1-45bd-9e01-eccb1998d6eb | image | volume | 46447930-2393-4690-b74f-9d6541597617 |
+--------------------------------------+-------------+------------------+--------------------------------------+
1 row in set (0.00 sec)

mysql> use cinder
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select id, volume_id from volume_attachment where id ='46447930-2393-4690-b74f-9d6541597617';
+--------------------------------------+--------------------------------------+
| id | volume_id |
+--------------------------------------+--------------------------------------+
| 46447930-2393-4690-b74f-9d6541597617 | e67a9299-eaf1-45bd-9e01-eccb1998d6eb |
+--------------------------------------+--------------------------------------+
1 row in set (0.00 sec)

Changed in nova:
status: Triaged → In Progress
Amit Uniyal (auniyal)
Changed in nova:
assignee: nobody → Amit Uniyal (auniyal)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/904817
Committed: https://opendev.org/openstack/nova/commit/b5173b419219437b50f49c88bce9727ed0ed1ee8
Submitter: "Zuul (22348)"
Branch: master

commit b5173b419219437b50f49c88bce9727ed0ed1ee8
Author: Amit Uniyal <email address hidden>
Date: Fri Jan 5 08:41:29 2024 +0000

    Fixes: bfv vm reboot ends up in an error state.

    We only need to verify that the bdm has an attachment id and that it is present in both the nova and cinder DBs.

    For tests coverage, added tests for bfv server to test different bdm source type.

    Closes-Bug: 2048154
    Closes-Bug: 2048184
    Change-Id: Icffcbad27d99a800e3f285565c0b823f697e388c
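The commit message describes the merged approach only at a high level. A hedged sketch of that idea (stand-in objects, not the actual diff from the review) would key solely on the presence of attachment_id:

```python
from types import SimpleNamespace

def known_attachments(bdms):
    # Any BDM carrying an attachment_id is a nova-known volume attachment,
    # regardless of whether its source_type is 'volume' or 'image'.
    return [bdm.attachment_id for bdm in bdms if bdm.attachment_id]

bdms = [
    SimpleNamespace(attachment_id='a-image'),   # boot-from-volume, image source
    SimpleNamespace(attachment_id='a-volume'),  # boot-from-volume, volume source
    SimpleNamespace(attachment_id=None),        # non-volume BDM, e.g. a swap disk
]
print(known_attachments(bdms))  # ['a-image', 'a-volume']
```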

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/906089

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/nova/+/906089
Committed: https://opendev.org/openstack/nova/commit/0a55ec33e6f541c9fc7d2e1b9f49d8d295a5b430
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 0a55ec33e6f541c9fc7d2e1b9f49d8d295a5b430
Author: Amit Uniyal <email address hidden>
Date: Fri Jan 5 08:41:29 2024 +0000

    Fixes: bfv vm reboot ends up in an error state.

    We only need to verify that the bdm has an attachment id and that it is present in both the nova and cinder DBs.

    For tests coverage, added tests for bfv server to test different bdm source type.

    Changes: master to 2023.2
       added datetime import
       which was added in master by I789eeae86947e9a3cbd7d5fcc58d2aabe3b8b84c

    Closes-Bug: 2048154
    Closes-Bug: 2048184
    Change-Id: Icffcbad27d99a800e3f285565c0b823f697e388c
    (cherry picked from commit b5173b419219437b50f49c88bce9727ed0ed1ee8)

Revision history for this message
Ehsan Aliakbar (ealiakbar) wrote :

I'd like to bring to your attention that this issue impacts RBD volumes in another way. Because of how RBD connections are handled, the instance still boots successfully after soft reboots. However, the bug changes the status of the instance's volumes from 'in-use' to 'available' following a soft/hard reboot.

We encountered a situation where a customer mistakenly deleted these 'available' volumes, assuming they were leftovers from instance deletion. Fortunately, due to deferred deletion in RBD (coupled with volumes having connected clients, preventing immediate deletion by Ceph), we hope this hasn't led to any data loss.

Our workaround involved the following steps:
- Restore the volume from the Ceph trash.
- Recover the volume in the Cinder database by setting 'deleted=0'.
- Shelve and then unshelve the instance (this process will recreate volume attachments).

We've tested the fix, and it has proven effective.
I believe this issue will impact numerous users. Could we include a "known issue" section in the release notes, highlighting this particular issue?

Revision history for this message
Amit Uniyal (auniyal) wrote (last edit ):

Hello Ehsan,

I could not reproduce this in my local devstack setup with Ceph. The volume never went from 'in-use' to 'available'.

I might not be testing it as expected, can you please provide steps to reproduce it?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/908688

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/2023.1)

Change abandoned by "Vlad Gusev <email address hidden>" on branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/908688

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 29.0.0.0rc1

This issue was fixed in the openstack/nova 29.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 28.1.0

This issue was fixed in the openstack/nova 28.1.0 release.
