Forget to change volume's status when an error occurs in swap volume

Bug #1222656 reported by yasunori jitsukawa
This bug affects 6 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Vladik Romanovsky

Bug Description

If the blockrebase (swap volume) operation fails, the status of the attached volume remains 'detaching'.

Tested at commit ID: b037993984229bb698050f20e8719b8c06ff2be3

1. Before swap volume
$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| 66230802-aed6-4a90-9dec-42fec910d13d | in-use | vol01 | 1 | None | False | 86d8d79c-be27-43c4-b148-b5637c435899 |
| c3d51a52-764a-4007-976a-136b544c561b | available | vol02 | 1 | None | False | |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+

2. Swap volume running
$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| 66230802-aed6-4a90-9dec-42fec910d13d | detaching | vol01 | 1 | None | False | 86d8d79c-be27-43c4-b148-b5637c435899 |
| c3d51a52-764a-4007-976a-136b544c561b | attaching | vol02 | 1 | None | False | |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+

3. An error occurs during swap volume
For example, cancel the blockrebase job:
libvirtError: virDomainGetBlockJobInfo() failed

4. After the error occurs
$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| 66230802-aed6-4a90-9dec-42fec910d13d | detaching | vol01 | 1 | None | False | 86d8d79c-be27-43c4-b148-b5637c435899 |
| c3d51a52-764a-4007-976a-136b544c561b | available | vol02 | 1 | None | False | |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+

Tags: volumes
tags: added: volumes
Changed in nova:
assignee: nobody → Vladik Romanovsky (vladik-romanovsky)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/49395

Changed in nova:
status: New → In Progress
yasunori jitsukawa (y-jitsukawa) wrote :

With this patch, the status of the volume is still 'detaching'; however, nova can handle the libvirt exception.

Currently, when nova swaps a volume, nova-api calls the Cinder API, which changes the status of the volume to 'detaching'.
  nova-api swap_volume()
   self.volume_api.begin_detaching(context, old_volume['id'])

When libvirt raises an exception, nova-compute doesn't revert the status of the source swap volume; however, nova-compute does call the Cinder API that changes the migration status.
In Cinder's terminate_connection(), the status of the destination swap volume goes back to 'available'.
  nova-compute swap_volume()
   self.volume_api.terminate_connection(context, new_volume_id, connector)
   self.volume_api.migrate_volume_completion(context, old_volume_id, new_volume_id, error=True)

IMO, nova should revert not only the migration status but also the status of the source swap volume, back to 'in-use'.
To be concrete, when libvirt raises an exception, nova needs to call Cinder's roll_detaching().
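
The rollback proposed here can be sketched roughly as follows. This is only an illustration under stated assumptions, not nova's code: FakeVolumeAPI, FakeDriverError, set_attaching() and the simplified swap_volume() are hypothetical stand-ins that keep volume status in a dict. Only the names begin_detaching, roll_detaching and terminate_connection come from this thread.

```python
class FakeDriverError(Exception):
    """Hypothetical stand-in for libvirtError."""


class FakeVolumeAPI:
    """Hypothetical in-memory stand-in for the Cinder calls named above."""

    def __init__(self):
        self.status = {"vol01": "in-use", "vol02": "available"}

    def begin_detaching(self, volume_id):
        self.status[volume_id] = "detaching"

    def roll_detaching(self, volume_id):
        self.status[volume_id] = "in-use"

    def set_attaching(self, volume_id):
        # Hypothetical helper: marks the destination volume as 'attaching'.
        self.status[volume_id] = "attaching"

    def terminate_connection(self, volume_id):
        self.status[volume_id] = "available"


def failing_driver_swap():
    # Simulates the failure from step 3 of the bug description.
    raise FakeDriverError("virDomainGetBlockJobInfo() failed")


def swap_volume(volume_api, old_id, new_id, driver_swap, rollback_source):
    volume_api.begin_detaching(old_id)
    volume_api.set_attaching(new_id)
    try:
        driver_swap()
    except FakeDriverError:
        # The destination volume is already cleaned up today.
        volume_api.terminate_connection(new_id)
        if rollback_source:
            # The proposed addition: also revert the source volume.
            volume_api.roll_detaching(old_id)
        raise


def run(rollback_source):
    api = FakeVolumeAPI()
    try:
        swap_volume(api, "vol01", "vol02", failing_driver_swap, rollback_source)
    except FakeDriverError:
        pass
    return api.status
```

With rollback_source=False the sketch ends in the state shown in step 4 of the description (vol01 'detaching', vol02 'available'); with rollback_source=True, vol01 goes back to 'in-use'.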

Vladik Romanovsky (vladik-romanovsky) wrote :

Hi,

roll_detaching is called in nova.compute.api.swap_volume() on exception, because

self.volume_api.terminate_connection() in compute.manager.swap_volume
is called in the context of:
            with excutils.save_and_reraise_exception()

However, you are right, the problem still exists.
The obvious problem is caused by compute.rpcapi.swap_volume sending a cast instead of a call, so it does not wait for the answer.
I will send a tested patch later today.
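
To illustrate why an RPC cast (fire and forget) hides the failure from the API layer while an RPC call (wait for the reply) does not, here is a minimal sketch. It assumes nothing about nova's RPC code: a thread pool stands in for the message bus, and SwapFailed, compute_swap_volume() and api_swap_volume() are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class SwapFailed(Exception):
    """Hypothetical error raised on the compute side."""


def compute_swap_volume():
    # Stand-in for the compute manager: it cleans up and re-raises.
    raise SwapFailed("Volumes swap failed")


def api_swap_volume(executor, status, use_call):
    status["vol01"] = "detaching"        # corresponds to begin_detaching
    future = executor.submit(compute_swap_volume)
    if not use_call:
        # cast: return immediately; the remote failure never reaches
        # this frame, so the rollback below cannot run.
        return future
    try:
        future.result()                  # call: block until the reply arrives
    except SwapFailed:
        status["vol01"] = "in-use"       # corresponds to roll_detaching
        # A real handler would re-raise here; omitted to keep the sketch short.
    return future


def run(use_call):
    status = {"vol01": "in-use"}
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = api_swap_volume(executor, status, use_call)
        wait([future])                   # let the worker finish before checking
    return status["vol01"]
```

In this sketch run(True) leaves the volume 'in-use', while run(False) leaves it 'detaching', which matches the behaviour reported in this bug.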

Unfortunately, I see another problem that is not related to libvirt exception handling (also, blockinfo --abort doesn't produce a libvirt error).

The problem is that, in some cases, domain.blockJobInfo may return an empty status in _wait_for_block_job(),
which will then always return False because of

        try:
            cur = status.get('cur', 0)
            end = status.get('end', 0)
        except Exception:
            return False

leaving driver._swap_volume stuck forever on

        while self._wait_for_block_job(domain, disk_path):
            time.sleep(0.5)
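
One way to keep such a polling loop from spinning forever is to bound it with a deadline. The following is a sketch under assumptions, not the nova driver code: FakeDomain, BlockJobTimeout, block_job_running(), wait_for_block_job() and finishes() are hypothetical, and treating an empty status as "unknown, keep polling until the deadline" is a design choice made for this example only.

```python
import time


class BlockJobTimeout(Exception):
    """Hypothetical error for a job that never reports completion."""


class FakeDomain:
    """Hypothetical stand-in for a libvirt domain; replays canned statuses."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def blockJobInfo(self, disk_path, flags):
        # Hand out statuses in order, then keep repeating the last one.
        if len(self._statuses) > 1:
            return self._statuses.pop(0)
        return self._statuses[0]


def block_job_running(domain, disk_path):
    """Return True/False when progress is known, None when it is not."""
    status = domain.blockJobInfo(disk_path, 0)
    if not isinstance(status, dict) or not status:
        # An empty or malformed status carries no progress information.
        return None
    return status.get("cur", 0) != status.get("end", 0)


def wait_for_block_job(domain, disk_path, timeout=1.0, interval=0.01):
    deadline = time.monotonic() + timeout
    while True:
        if block_job_running(domain, disk_path) is False:
            return
        if time.monotonic() >= deadline:
            raise BlockJobTimeout(disk_path)
        time.sleep(interval)


def finishes(statuses, timeout=0.05):
    """Usage helper: True if the job completes before the deadline."""
    try:
        wait_for_block_job(FakeDomain(statuses), "vda", timeout=timeout)
    except BlockJobTimeout:
        return False
    return True
```

Here finishes([{"cur": 0, "end": 10}, {"cur": 10, "end": 10}]) completes normally, while finishes([{}]) gives up at the deadline instead of hanging.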

I was concerned about something else in my patch, but I think that it's not valid.

Thanks,
Vladik

Vladik Romanovsky (vladik-romanovsky) wrote :

[vladikr@localhost devstack]$ nova list
+--------------------------------------+-----------+--------+------------+-------------+-------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-----------+--------+------------+-------------+-------------------+
| 5da0f8be-c6d8-4992-a6c6-bbf2da3aefa9 | nhljkdsf | ACTIVE | deleting | Running | private=10.0.0.31 |
| bd3912c8-6c7f-4297-84b3-3fde73a51297 | test_swap | ACTIVE | None | Running | private=10.0.0.45 |
+--------------------------------------+-----------+--------+------------+-------------+-------------------+
[vladikr@localhost devstack]$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+
| 21f0808c-53f0-4eaf-805f-1fef5d4d3f89 | in-use | | 2 | None | True | bd3912c8-6c7f-4297-84b3-3fde73a51297 |
| b30af6f7-d957-437e-be9d-e7e70e9426d1 | available | | 2 | None | True | |
+--------------------------------------+-----------+--------------+------+-------------+----------+--------------------------------------+

From the log:

2013-10-16 11:13:16.409 ERROR nova.api.openstack [req-86547e4f-789d-47d7-b22e-84a10c8f55d9 admin admin] Caught error: Volumes swap failed
Traceback (most recent call last):

  File "/home/vladikr/devel/openstack/nova/nova/openstack/common/rpc/amqp.py", line 461, in _process_data
    **args)

  File "/home/vladikr/devel/openstack/nova/nova/openstack/common/rpc/dispatcher.py", line 172, in dispatch
    result = getattr(proxyobj, method)(ctxt, **kwargs)

  File "/home/vladikr/devel/openstack/nova/nova/exception.py", line 90, in wrapped
    payload)

  File "/home/vladikr/devel/openstack/nova/nova/exception.py", line 73, in wrapped
    return f(self, context, *args, **kw)

  File "/home/vladikr/devel/openstack/nova/nova/compute/manager.py", line 244, in decorated_function
    pass

  File "/home/vladikr/devel/openstack/nova/nova/compute/manager.py", line 230, in decorated_function
    return function(self, context, *args, **kwargs)

  File "/home/vladikr/devel/openstack/nova/nova/compute/manager.py", line 272, in decorated_function
    e, sys.exc_info())

  File "/home/vladikr/devel/openstack/nova/nova/compute/manager.py", line 259, in decorated_function
    return function(self, context, *args, **kwargs)

  File "/home/vladikr/devel/openstack/nova/nova/compute/manager.py", line 3835, in swap_volume
    error=True)

  File "/home/vladikr/devel/openstack/nova/nova/compute/manager.py", line 3820, in swap_volume
    self.driver.swap_volume(old_cinfo, new_cinfo, instance, mountpoint)

  File "/home/vladikr/devel/openstack/nova/nova/virt/libvirt/driver.py", line 1174, in swap_volume
    self._sw...


OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/49395
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=724493d21fdfcbb4c095b54975c0c1d612f0a856
Submitter: Jenkins
Branch: master

commit 724493d21fdfcbb4c095b54975c0c1d612f0a856
Author: Vladik Romanovsky <email address hidden>
Date: Mon Sep 30 13:47:58 2013 -0400

    Clean swap_volume rollback, on libvirt exception.

    Handling swap_volume command's rollback on the manager side
    instead of the API, to allow clean rollback.

    Change-Id: I7d5c1e66bf01fd12fcaa783c1c3c90d92b009aea
    Closes-Bug: #1222656

Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
milestone: none → icehouse-3
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-3 → 2014.1