If the volume-attach API fails, the Block Device Mapping record remains

Bug #1377161 reported by Rikimaru Honjo
This bug affects 6 people
Affects                   | Status  | Importance | Assigned to
Cinder                    | Invalid | Undecided  | haruka tanizawa
OpenStack Compute (nova)  | Invalid | Medium     | Unassigned
oslo.messaging            | Invalid | Undecided  | Unassigned
python-cinderclient       | Invalid | Undecided  | haruka tanizawa

Bug Description

I executed the volume-attach API (nova v2 API) while RabbitMQ was down.
The API call failed and the volume's status remained 'available'.
However, a block device mapping record remains in the nova DB.
This state is inconsistent.

The leftover block device mapping record may also cause other problems. (I am researching this now.)

I used OpenStack juno-3.

------------------------------------------------------------------------------------------------------------------
* Before executing volume-attach API:

$ nova list
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| 0b529526-4c8d-4650-8295-b7155a977ba7 | testVM | ACTIVE | - | Running | private=10.0.0.104 |
+--------------------------------------+--------+--------+------------+-------------+--------------------+
$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
| e93478bf-ee37-430f-93df-b3cf26540212 | available | None | 1 | None | false | |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+

mysql> select * from block_device_mapping where instance_uuid = '0b529526-4c8d-4650-8295-b7155a977ba7';
+---------------------+---------------------+------------+-----+-------------+-----------------------+-------------+-----------+-------------+-----------+-----------------+--------------------------------------+---------+-------------+------------------+--------------+-------------+----------+------------+--------------------------------------+
| created_at | updated_at | deleted_at | id | device_name | delete_on_termination | snapshot_id | volume_id | volume_size | no_device | connection_info | instance_uuid | deleted | source_type | destination_type | guest_format | device_type | disk_bus | boot_index | image_id |
+---------------------+---------------------+------------+-----+-------------+-----------------------+-------------+-----------+-------------+-----------+-----------------+--------------------------------------+---------+-------------+------------------+--------------+-------------+----------+------------+--------------------------------------+
| 2014-10-02 18:36:08 | 2014-10-02 18:36:10 | NULL | 145 | /dev/vda | 1 | NULL | NULL | NULL | NULL | NULL | 0b529526-4c8d-4650-8295-b7155a977ba7 | 0 | image | local | NULL | disk | NULL | 0 | c1d264fd-c559-446e-9b94-934ba8249ae1 |
+---------------------+---------------------+------------+-----+-------------+-----------------------+-------------+-----------+-------------+-----------+-----------------+--------------------------------------+---------+-------------+------------------+--------------+-------------+----------+------------+--------------------------------------+
1 row in set (0.00 sec)

* After executing volume-attach API:
$ nova list --all-t
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+--------------------+
| 0b529526-4c8d-4650-8295-b7155a977ba7 | testVM | ACTIVE | - | Running | private=10.0.0.104 |
+--------------------------------------+--------+--------+------------+-------------+--------------------+
$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+
| e93478bf-ee37-430f-93df-b3cf26540212 | available | None | 1 | None | false | |
+--------------------------------------+-----------+--------------+------+-------------+----------+-------------+

mysql> select * from block_device_mapping where instance_uuid = '0b529526-4c8d-4650-8295-b7155a977ba7';
+---------------------+---------------------+------------+-----+-------------+-----------------------+-------------+--------------------------------------+-------------+-----------+-----------------+--------------------------------------+---------+-------------+------------------+--------------+-------------+----------+------------+--------------------------------------+
| created_at | updated_at | deleted_at | id | device_name | delete_on_termination | snapshot_id | volume_id | volume_size | no_device | connection_info | instance_uuid | deleted | source_type | destination_type | guest_format | device_type | disk_bus | boot_index | image_id |
+---------------------+---------------------+------------+-----+-------------+-----------------------+-------------+--------------------------------------+-------------+-----------+-----------------+--------------------------------------+---------+-------------+------------------+--------------+-------------+----------+------------+--------------------------------------+
| 2014-10-02 18:36:08 | 2014-10-02 18:36:10 | NULL | 145 | /dev/vda | 1 | NULL | NULL | NULL | NULL | NULL | 0b529526-4c8d-4650-8295-b7155a977ba7 | 0 | image | local | NULL | disk | NULL | 0 | c1d264fd-c559-446e-9b94-934ba8249ae1 |
| 2014-10-02 18:45:28 | NULL | NULL | 146 | /dev/vdb | 0 | NULL | 0b529526-4c8d-4650-8295-b7155a977ba7 | NULL | NULL | NULL | 0b529526-4c8d-4650-8295-b7155a977ba7 | 0 | volume | volume | NULL | NULL | NULL | NULL | NULL |
+---------------------+---------------------+------------+-----+-------------+-----------------------+-------------+--------------------------------------+-------------+-----------+-----------------+--------------------------------------+---------+-------------+------------------+--------------+-------------+----------+------------+--------------------------------------+
2 rows in set (0.00 sec)

------------------------------------------------------------------------------------------------------------------
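The kind of leftover row shown in the second dump can be spotted mechanically. Below is a minimal sketch (using sqlite3 as a stand-in for Nova's MySQL database, with a heavily simplified schema; the column names are taken from the dumps above, the rest is illustrative, not Nova code) of a query flagging volume-type BDMs whose connection_info was never populated:

```python
# Sketch: detect orphaned volume-type BDM rows left by a failed attach.
# sqlite3 stands in for Nova's MySQL DB; the schema is reduced to the
# columns relevant here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE block_device_mapping (
    id INTEGER PRIMARY KEY, instance_uuid TEXT, source_type TEXT,
    volume_id TEXT, connection_info TEXT, deleted INTEGER)""")
# Row 145: the image-backed root disk (no volume_id expected).
conn.execute("INSERT INTO block_device_mapping VALUES "
             "(145, '0b529526-4c8d-4650-8295-b7155a977ba7', 'image', NULL, NULL, 0)")
# Row 146: the leftover volume BDM from the failed attach.
conn.execute("INSERT INTO block_device_mapping VALUES "
             "(146, '0b529526-4c8d-4650-8295-b7155a977ba7', 'volume', "
             "'e93478bf-ee37-430f-93df-b3cf26540212', NULL, 0)")

# A live (deleted = 0) volume-type BDM with no connection_info matches
# the inconsistency described in this report.
orphans = conn.execute("""SELECT id FROM block_device_mapping
    WHERE source_type = 'volume' AND connection_info IS NULL
      AND deleted = 0""").fetchall()
print(orphans)
```

Row 146 is flagged while the root-disk row 145 is not, matching the before/after dumps.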

Tags: volumes
Revision history for this message
Christopher Yeoh (cyeoh-0) wrote :

Do you happen to still have the nova api and nova compute logs? If so could you attach them?

Changed in nova:
status: New → Incomplete
Revision history for this message
Rikimaru Honjo (honjo-rikimaru-c6) wrote :

Hi Christopher,

I found a problem caused by the leftover block device mapping:

1. RabbitMQ goes down.
2. Bob executes the volume-attach API.
$ nova volume-attach testVM1 51765f50-5803-4a13-b111-799ca5f8a948 /dev/vdb
3. The volume-attach API fails because RabbitMQ is down.
4. RabbitMQ is restored.
5. The volume's status is still 'available', but the block device mapping record remains in the nova DB.
6. Alice attaches the same volume to another VM instance.
$ nova volume-attach testVM2 51765f50-5803-4a13-b111-799ca5f8a948 /dev/vdb
7. This API call succeeds. The volume's status changes to "in-use" and it is attached to testVM2.
    But the connection_info is saved in testVM1's block device mapping!
8. Bob deletes testVM1. As a result, the volume's status changes back to "available"!

The attached file contains the console log and DB information for the problem above.
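The eight steps above can be simulated with a toy model (plain Python, deliberately not Nova code; all class and method names are invented for illustration). The stale BDM from the failed attach absorbs the connection info of the later attach, so deleting the first instance frees the volume out from under the second:

```python
# Toy model of the Bob/Alice scenario: a stale BDM left by a failed
# attach captures a later attach's connection info.
class FakeCloud:
    def __init__(self):
        self.bdms = []                  # block_device_mapping table
        self.volume_status = "available"
        self.mq_up = True

    def attach(self, instance, volume):
        # reserve_block_device_name creates the BDM first...
        bdm = {"instance": instance, "volume": volume, "connection_info": None}
        self.bdms.append(bdm)
        if not self.mq_up:
            raise RuntimeError("RPC failed")   # ...but the BDM stays behind
        # The first BDM found for this volume gets the connection info,
        # which is the stale one if a failed attach preceded us.
        target = next(b for b in self.bdms if b["volume"] == volume)
        target["connection_info"] = {"attached_to": instance}
        self.volume_status = "in-use"

    def delete_instance(self, instance):
        for b in list(self.bdms):
            if b["instance"] == instance:
                self.bdms.remove(b)
                self.volume_status = "available"   # volume freed by mistake

cloud = FakeCloud()
cloud.mq_up = False
try:
    cloud.attach("testVM1", "vol1")   # steps 2-3: fails, BDM remains
except RuntimeError:
    pass
cloud.mq_up = True                    # step 4
cloud.attach("testVM2", "vol1")       # steps 6-7: succeeds...
stale = cloud.bdms[0]
print(stale["instance"], stale["connection_info"])  # info lands on testVM1
cloud.delete_instance("testVM1")      # step 8
print(cloud.volume_status)            # 'available', though testVM2 uses it
```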

Revision history for this message
haruka tanizawa (h-tanizawa) wrote :

I think this problem is reproducible.
When compute receives the reserve_block_device_name RPC message, a BDM is created; if the attach then fails, this leaves the DB inconsistent.

Changed in nova:
assignee: nobody → haruka tanizawa (h-tanizawa)
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/132957

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by haruka tanizawa (<email address hidden>) on branch: master
Review: https://review.openstack.org/132957
Reason: I will re-think of way to modify.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/151135

Changed in nova:
importance: Undecided → Low
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/162893

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by haruka tanizawa (<email address hidden>) on branch: master
Review: https://review.openstack.org/151135
Reason: This is not appropriate patch from point of deleting instance of which does not have connection_info.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by haruka tanizawa (<email address hidden>) on branch: master
Review: https://review.openstack.org/162893
Reason: Considering race condition ...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/166691

Changed in cinder:
assignee: nobody → haruka tanizawa (h-tanizawa)
status: New → In Progress
Changed in python-cinderclient:
assignee: nobody → haruka tanizawa (h-tanizawa)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-cinderclient (master)

Fix proposed to branch: master
Review: https://review.openstack.org/166692

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/166695

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on python-cinderclient (master)

Change abandoned by haruka tanizawa (<email address hidden>) on branch: master
Review: https://review.openstack.org/166692
Reason: Try to modify just only in nova code.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by haruka tanizawa (<email address hidden>) on branch: master
Review: https://review.openstack.org/166691
Reason: Try to modify just only in nova code.

Changed in nova:
assignee: haruka tanizawa (h-tanizawa) → Mitsuhiro Tanino (mitsuhiro-tanino)
Changed in cinder:
status: In Progress → Invalid
Changed in python-cinderclient:
status: In Progress → Invalid
Changed in nova:
importance: Low → Medium
Revision history for this message
Mitsuhiro Tanino (mitsuhiro-tanino) wrote :

We can solve this issue with a nova-only patch.

Fix proposed to nova (master)
Review: https://review.openstack.org/166695

Revision history for this message
Mark Wu (wudx05) wrote :

I don't understand why the reserve_block_device_name call can be done even if rabbitmq is down. Could you please explain it? Thanks!

Revision history for this message
Nikola Đipanov (ndipanov) wrote :

As commented on the patch: I really think we need to make sure that whatever gets created also gets cleaned up on errors; that said, the patch https://review.openstack.org/166695 has some good ideas.

What I also noticed (when testing this some time ago) is that the RPC client does not actually time out; the failure you see is likely the Nova API timing out the request. This might be an issue with the oslo.messaging rabbitmq driver (which would never time out a request), or with our assumption that it would.
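The "clean up whatever gets created" pattern suggested here can be sketched as follows (invented names, not the actual Nova code or the proposed patch): if the RPC call fails, the BDM created for the request is rolled back instead of being left behind.

```python
# Sketch of cleanup-on-error for a reserve/attach flow. All names are
# hypothetical stand-ins for the Nova internals discussed above.
bdms = []

def create_bdm(instance, volume):
    bdm = {"instance": instance, "volume": volume}
    bdms.append(bdm)
    return bdm

def rpc_reserve_block_device_name(bdm, mq_up):
    # Stand-in for the RPC to compute; fails when the MQ is down.
    if not mq_up:
        raise RuntimeError("MessagingTimeout")

def attach_volume(instance, volume, mq_up=True):
    bdm = create_bdm(instance, volume)
    try:
        rpc_reserve_block_device_name(bdm, mq_up)
    except RuntimeError:
        bdms.remove(bdm)   # roll back: don't leave an orphaned record
        raise

try:
    attach_volume("testVM1", "vol1", mq_up=False)
except RuntimeError:
    pass
print(bdms)   # empty: no stale BDM survives the failed attach
```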

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184045

Changed in nova:
assignee: Mitsuhiro Tanino (mitsuhiro-tanino) → Abhijeet Malawade (abhijeet-malawade)
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: New → Invalid
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

So I was not able to reproduce this.

Nova at present initializes all of its RPC clients without a timeout, meaning both the reply thread and the sender thread will wait forever.

If the API accepted the request while the MQ was down, it will keep retrying until either the process is killed or rabbit comes back up, in which case the message succeeds.

If we do use a client timeout, it appears that even when the timeout is reached, the spawned greenthread in the server lives on (this might be a bug in the eventlet WSGI server), and once connections are back they keep doing their thing, i.e. attaching volumes.

If the actual attach then fails for some reason (the volume is gone, for example), the BDM is destroyed by the API error-handling code.
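The "caller times out but the worker lives on" behaviour described here can be illustrated with plain threads (not eventlet or oslo.messaging; this is only an analogy for the observed behaviour): the caller giving up on a timeout does not stop the spawned worker, which carries on and completes its side effect.

```python
# Illustration: a timed-out caller does not cancel a running worker.
import concurrent.futures
import threading
import time

attached = threading.Event()

def slow_attach():
    time.sleep(0.2)      # stand-in for waiting on a recovered RabbitMQ
    attached.set()       # the attach still happens after the caller gave up

with concurrent.futures.ThreadPoolExecutor() as pool:
    future = pool.submit(slow_attach)
    try:
        future.result(timeout=0.05)   # API-side timeout fires first
    except concurrent.futures.TimeoutError:
        print("caller timed out")     # the request "failed"...

attached.wait(timeout=2)
print(attached.is_set())              # ...but the work finished anyway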

Changed in nova:
status: In Progress → Incomplete
tags: added: volumes
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by John Garbutt (<email address hidden>) on branch: master
Review: https://review.openstack.org/166695
Reason: This patch seems to have stalled, lets abandon it.
Please restore the patch if that is no longer true.

Any questions, please catch me via email or on IRC johnthetubaguy

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/184045
Reason: This patch has been stalled for quite a while, so I am going to abandon it to keep the code review queue sane. Please restore the change when it is ready for review.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Abhijeet Malawade:
I haven't found any open patches from you, so I'm removing you as assignee. Comment #20 says the issue is not reproducible, but it would be nice if you could double-check, because I couldn't find a commit that could have solved it.

Changed in nova:
assignee: Abhijeet Malawade (abhijeet-malawade) → nobody
Revision history for this message
Lee Yarwood (lyarwood) wrote :

Moving to invalid given that a valid reproducer has not been provided for some time now. Please feel free to reopen if one is found against a supported release in the future.

Changed in nova:
status: Incomplete → Invalid