hitachi: one LDEV is assigned to multiple objects

Bug #2072317 reported by Atsushi Kawai
Affects: Cinder
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

Bug Summary:

If one LDEV is assigned to multiple objects (volumes or snapshots), there is a risk of data loss or data corruption.

Occurrence conditions:

The conditions under which one LDEV is assigned to multiple objects are shown below.

(1) Object deletion fails with the following message:

... ERROR cinder.volume.drivers.hitachi.hbsd_utils [...] MSGID0731-E: Failed to communicate with the REST API server. (exception: <class 'requests.exceptions.ReadTimeout'>, message: HTTPSConnectionPool(host='....', port=...): Read timed out. (read timeout=...), method: DELETE, url: https://.../ConfigurationManager/v1/objects/storages/.../ldevs/..., params: None, body: None): requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='...', port=...): Read timed out. (read timeout=...)

(2) If (1) has occurred and another object is created, a new LDEV with the same ID as the LDEV from (1) is assigned to it. One LDEV is now assigned to both objects: the one whose deletion failed in (1) and the one created in (2).

(3) If, after (2), the delete command for the object that failed to be deleted in (1) is re-executed, both the object and its assigned LDEV are deleted. Because the deleted LDEV was also assigned to the object created in (2), the data of the object created in (2) is lost.

(4) If a new object is created after the data loss, it is assigned the same LDEV ID as the LDEV of the object created in (2). That LDEV is again assigned to both objects: the one whose data was lost in (3) and the one created in (4).

(5) If the delete command is executed for the object created in (2), the same data loss as in (3) occurs.

How to check:

If the count of any ``provider_location`` value is 2 or more, this phenomenon has occurred.
Note that the higher the ``provider_location`` count, the higher the risk of data loss or data corruption.
Use the following command to find which ``provider_location`` values have a count of 2 or more:

$ podman exec galera-bundle-podman-0 mysql --table -e "select provider_location, count(provider_location) provider_location_count from (select provider_location from cinder.volumes where deleted = 0 union all select provider_location from cinder.snapshots where deleted = 0) cinder_objects group by provider_location having provider_location_count > 1 order by provider_location_count, provider_location;"

Output example of the command:
+-------------------+-------------------------+
| provider_location | provider_location_count |
+-------------------+-------------------------+
| 2165              |                       2 |
| 49001             |                       3 |
| 2164              |                       4 |
| 2163              |                       5 |
| 27163             |                       6 |
| 2160              |                       8 |
| 2161              |                       8 |
| 2162              |                       8 |
+-------------------+-------------------------+

Recovery steps:

Overview of the steps:
1. For each ``provider_location`` whose count is 2 or more, keep the latest object and destruct all other objects.
2. Data in the LDEVs assigned to the destructed objects cannot be recovered.

Details of recovery steps:
(1) Finding ``provider_location`` values whose count is 2 or more
Run the command in the section "How to check".

(2) Extracting object information
Extract the object ID and creation date for each ``provider_location`` whose count is 2 or more, as found in step (1). The list is sorted by creation date in descending order, so keep the object at the top of the list (the latest) and destruct all the others.

Command:
podman exec galera-bundle-podman-0 mysql --table -e "select 'volume' object_type, id, created_at from cinder.volumes where provider_location = <provider_location> and deleted = 0 union all select 'snapshot', id, created_at from cinder.snapshots where provider_location = <provider_location> and deleted = 0 order by created_at desc;"

Output example of the command:
$ podman exec galera-bundle-podman-0 mysql --table -e "select 'volume' object_type, id, created_at from cinder.volumes where provider_location = 2164 and deleted = 0 union all select 'snapshot', id, created_at from cinder.snapshots where provider_location = 2164 and deleted = 0 order by created_at desc;"
+-------------+--------------------------------------+---------------------+
| object_type | id                                   | created_at          |
+-------------+--------------------------------------+---------------------+
| volume      | ae023696-f779-45fd-8142-9618179b21f4 | 2024-04-11 03:43:16 |
| snapshot    | 2fde155f-1f28-44a2-89c1-9a75874cfc8f | 2024-04-11 03:11:48 |
| snapshot    | 66aa1ee4-82ba-44c5-a79c-1b735f1d2a39 | 2024-04-10 00:47:13 |
| volume      | b827bba1-44be-42e5-841c-ba777af840e8 | 2024-04-09 20:31:17 |
+-------------+--------------------------------------+---------------------+
$

In the above example, the object whose ID is ``ae023696-f779-45fd-8142-9618179b21f4`` is kept, and the others should be destructed.

(3) Destructing the objects
To destruct the objects extracted in step (2), set the ``deleted`` field to 1 in the DB.
Note:
Do not use the ``delete`` command in cinder, because the ``delete`` command will delete the LDEV even if other objects use it.

The following commands and examples destruct volume(s) or snapshot(s):

(A) Destructing volume(s)
Commands:
cinder list | grep <volume id>
podman exec galera-bundle-podman-0 mysql --table -e "update cinder.volumes set deleted = 1 where id = '<volume id>';"
cinder list | grep <volume id>

Running the ``cinder list | grep <volume id>`` commands before and after the update confirms that the volume has been destructed successfully.

Output example of the commands:
$ cinder list | grep b827bba1-44be-42e5-841c-ba777af840e8
| b827bba1-44be-42e5-841c-ba777af840e8 | available | full_clone_of_thin_clone_of_unused | 1 | type2 | true | |
$ podman exec galera-bundle-podman-0 mysql --table -e "update cinder.volumes set deleted = 1 where id = 'b827bba1-44be-42e5-841c-ba777af840e8';"
$ cinder list | grep b827bba1-44be-42e5-841c-ba777af840e8
$

(B) Destructing snapshot(s)
Commands:
cinder snapshot-list | grep <snapshot id>
podman exec galera-bundle-podman-0 mysql --table -e "update cinder.snapshots set deleted = 1 where id = '<snapshot id>';"
cinder snapshot-list | grep <snapshot id>

Output example of the commands:
$ cinder snapshot-list | grep 66aa1ee4-82ba-44c5-a79c-1b735f1d2a39
| 66aa1ee4-82ba-44c5-a79c-1b735f1d2a39 | c79e7bcf-6b1f-4307-b843-1c9d40b84e40 | available | data-snapshot | 1 | 73f5a710e4284a609729f57d0796091f |
$ podman exec galera-bundle-podman-0 mysql --table -e "update cinder.snapshots set deleted = 1 where id = '66aa1ee4-82ba-44c5-a79c-1b735f1d2a39';"
$ cinder snapshot-list | grep 66aa1ee4-82ba-44c5-a79c-1b735f1d2a39
$

Workaround:

If deleting a volume or snapshot fails with a message containing ``MSGID0731-E: Failed to communicate with the REST API server`` in the logs ``/var/log/containers/cinder/cinder-volume.log.*`` on the controller node, perform the following steps before creating another volume or snapshot:

(1) Getting the objects whose ``provider_location`` is the same as that of the volume (or snapshot) that failed to be deleted
Command:
podman exec galera-bundle-podman-0 mysql --table -e "select object_type, id from (select 'volume' object_type, id, provider_location from cinder.volumes where deleted = 0 union all select 'snapshot', id, provider_location from cinder.snapshots where deleted = 0) provider_location_list where provider_location = (select provider_location from (select id, provider_location from cinder.volumes where deleted = 0 union all select id, provider_location from cinder.snapshots where deleted = 0) provider_location_list2 where id = '<volume id or snapshot id>') order by object_type;"

Output example of the command:
$ podman exec galera-bundle-podman-0 mysql --table -e "select object_type, id from (select 'volume' object_type, id, provider_location from cinder.volumes where deleted = 0 union all select 'snapshot', id, provider_location from cinder.snapshots where deleted = 0) provider_location_list where provider_location = (select provider_location from (select id, provider_location from cinder.volumes where deleted = 0 union all select id, provider_location from cinder.snapshots where deleted = 0) provider_location_list2 where id = 'b827bba1-44be-42e5-841c-ba777af840e8') order by object_type;"
+-------------+--------------------------------------+
| object_type | id                                   |
+-------------+--------------------------------------+
| snapshot    | 2fde155f-1f28-44a2-89c1-9a75874cfc8f |
| snapshot    | 66aa1ee4-82ba-44c5-a79c-1b735f1d2a39 |
| volume      | ae023696-f779-45fd-8142-9618179b21f4 |
| volume      | b827bba1-44be-42e5-841c-ba777af840e8 |
+-------------+--------------------------------------+

(2) Deleting the object(s)
(a) When multiple objects are listed
Destruct the objects following the steps in the section "Recovery steps".

(b) When a single object is listed
Delete the volume or snapshot via Cinder. Reset the state of the volume or snapshot to ``error`` before deleting, because a volume or snapshot in the ``error_deleting`` state cannot be deleted.

(b)-1 Deleting a volume
Commands:
cinder list | grep <volume>
cinder reset-state --state error <volume>
cinder list | grep <volume>
cinder delete <volume>
cinder list | grep <volume>

Output example of the commands:
$ cinder list | grep 7a6601f8-d03d-4317-807a-da16af1b4bfd
| 7a6601f8-d03d-4317-807a-da16af1b4bfd | error_deleting | volume | 1 | 422222_FC | false | |
$ cinder reset-state --state error 7a6601f8-d03d-4317-807a-da16af1b4bfd
$ cinder list | grep 7a6601f8-d03d-4317-807a-da16af1b4bfd
| 7a6601f8-d03d-4317-807a-da16af1b4bfd | error | volume | 1 | 422222_FC | false | |
$ cinder delete 7a6601f8-d03d-4317-807a-da16af1b4bfd
Request to delete volume 7a6601f8-d03d-4317-807a-da16af1b4bfd has been accepted.
$ cinder list | grep 7a6601f8-d03d-4317-807a-da16af1b4bfd
$

(b)-2 Deleting a snapshot
Commands:
cinder snapshot-list | grep <snapshot>
cinder snapshot-reset-state --state error <snapshot>
cinder snapshot-list | grep <snapshot>
cinder snapshot-delete <snapshot>
cinder snapshot-list | grep <snapshot>

Output example of the commands:
$ cinder snapshot-list | grep c6898d6b-3ed0-4222-8375-1db2c2408a2f
| c6898d6b-3ed0-4222-8375-1db2c2408a2f | 7a6601f8-d03d-4317-807a-da16af1b4bfd | error_deleting | snapshot | 1 | 73f5a710e4284a609729f57d0796091f |
$ cinder snapshot-reset-state --state error c6898d6b-3ed0-4222-8375-1db2c2408a2f
$ cinder snapshot-list | grep c6898d6b-3ed0-4222-8375-1db2c2408a2f
| c6898d6b-3ed0-4222-8375-1db2c2408a2f | 7a6601f8-d03d-4317-807a-da16af1b4bfd | error | snapshot | 1 | 73f5a710e4284a609729f57d0796091f |
$ cinder snapshot-delete c6898d6b-3ed0-4222-8375-1db2c2408a2f
$ cinder snapshot-list | grep c6898d6b-3ed0-4222-8375-1db2c2408a2f

Driver fix:

With the driver fix, when deleting an object created under the following conditions, the driver checks whether the LDEV was created together with that object. If so, it deletes both the object and the LDEV; if not, it deletes only the object.

- Volumes created by upstream drivers since the Wallaby release
- Snapshots created by the driver after applying the patch

The fixed driver performs this check by comparing the ID of the object to be deleted with the object ID stored in the LDEV nickname. The driver stores an object ID(*) in the LDEV nickname when creating an object.

(*) The stored value is the object ID with the "-" characters removed, because the maximum length of an LDEV nickname is 32 characters.

The current upstream driver already stores the object ID in the LDEV nickname when creating a volume. The fixed driver adds the following features:

- Storing an object ID when creating a snapshot
- Comparing both object IDs when deleting an object.

With the fixed driver, if the LDEV nickname does not hold a stored object ID (32 hexadecimal digits), the driver deletes both the object and the LDEV without comparison, because keeping such LDEVs would risk filling up the disk.
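
The following Python sketch illustrates the comparison described above. It is a minimal sketch only, not the actual driver code, and the helper names (``nickname_from_object_id``, ``should_delete_ldev``) are hypothetical:

import re

# An LDEV nickname that stores an object ID is 32 hexadecimal digits.
NICKNAME_IS_OBJECT_ID = re.compile(r'^[0-9a-f]{32}$')

def nickname_from_object_id(object_id):
    # "ae023696-f779-45fd-8142-9618179b21f4" (36 characters) becomes
    # "ae023696f77945fd81429618179b21f4" (32 characters), which fits
    # the 32-character LDEV nickname limit.
    return object_id.replace('-', '')

def should_delete_ldev(object_id, ldev_nickname):
    # Hypothetical check, mirroring the description above.
    if not NICKNAME_IS_OBJECT_ID.match(ldev_nickname):
        # The nickname does not hold a stored object ID: delete the LDEV
        # anyway, because never deleting such LDEVs would risk disk full.
        return True
    # Delete the LDEV only if it was created for the object being deleted.
    return ldev_nickname == nickname_from_object_id(object_id)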

Note:

- Do not change the LDEV nickname.
  If a user changes the nickname, the driver cannot compare the object IDs correctly.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/923618

Changed in cinder:
status: New → In Progress
Eric Harney (eharney)
Changed in cinder:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/923618
Committed: https://opendev.org/openstack/cinder/commit/d04db6fe8874525a34e44c63b4c7a81c468c7ef9
Submitter: "Zuul (22348)"
Branch: master

commit d04db6fe8874525a34e44c63b4c7a81c468c7ef9
Author: Atsushi Kawai <email address hidden>
Date: Thu Aug 8 16:44:49 2024 +0900

    Hitachi: Prevent to delete a LDEV assigned to multi objects

    This patch prevents to delete a LDEV that is unexpectedly assigned to
    two or more objects(volumes or snapshots).

    In the unexpected situation, if ``delete`` command for one of objects
    is run again, the data which is used by other objects is lost.

    In order to prevent the data-loss, when creating an object,
    the driver creates a LDEV and stores a value obtained by omitting
    the hyphen from the object ID(*1) to ``LDEV nickname``.
    When deleting an object, the driver compares the own object ID and
    the object ID in ``LDEV nickname``, then, the object and the LDEV is
    deleted only if both object IDs are same.
    On the other hand, if both object IDs are not same, only the object
    is deleted and the LDEV is kept, to prevent data-loss.

    If format of ``LDEV nickname`` is not object ID(*2), both the object
    and the LDEV is deleted without comparison, because it avoids disk
    full risk, due to not deleting any LDEVs.
    This patch implements only the object ID storing while creating a
    snapshot and comparing IDs while deleting, because the feature to
    store the object ID while creating a volume has already been
    implemented.
    (*1) Max length of ``LDEV nickname`` is 32 digits characters on
    Hitachi storage.
    (*2) 32 digits hexadecimal

    Closes-Bug: #2072317
    Change-Id: I7c6bd9a75dd1d7165d4f8614abb3d59fa642212d

Changed in cinder:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/2024.1)

Fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/cinder/+/926572

OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/2024.1)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/926572
Committed: https://opendev.org/openstack/cinder/commit/83399aceb0025b12baa0bcd82f04706cbda8d18c
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit 83399aceb0025b12baa0bcd82f04706cbda8d18c
Author: Atsushi Kawai <email address hidden>
Date: Thu Aug 8 16:44:49 2024 +0900

    Hitachi: Prevent to delete a LDEV assigned to multi objects

    (The rest of the commit message is identical to the commit on master, above.)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/cinder/+/926734

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/cinder/+/926742
