Gate jobs are not detaching volumes and eventually fail with no devices left

Bug #1530847 reported by Valeriy Ponomaryov
This bug affects 1 person
Affects                                          Status        Importance  Assigned to          Milestone
OpenStack Compute (nova)                         Fix Released  Medium      Matt Riedemann       -
OpenStack Shared File Systems Service (Manila)   Fix Released  Critical    Valeriy Ponomaryov   mitaka-2

Bug Description

The Manila project has been hitting the following error since approximately "2016, January 3, ~16:00+" in Zuul's timezone:

Paste: http://paste.openstack.org/show/482932/

Logs: http://logs.openstack.org/07/263207/1/check/gate-manila-tempest-dsvm-neutron-no-share-servers-multibackend/6717bb8/logs/screen-n-api.txt.gz?level=TRACE#_2016-01-04_11_47_11_087

Raw:
2016-01-04 11:47:11.087 ERROR nova.api.openstack.extensions [req-d3ea820b-5b7e-4174-aa92-60b5d6283ee9 nova service] Unexpected exception in API method
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions Traceback (most recent call last):
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/openstack/extensions.py", line 478, in wrapped
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions return f(*args, **kwargs)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/validation/__init__.py", line 73, in wrapper
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions return func(*args, **kwargs)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/openstack/compute/volumes.py", line 283, in create
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions volume_id, device)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 235, in wrapped
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions return func(self, context, target, *args, **kwargs)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 224, in inner
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions return function(self, context, instance, *args, **kwargs)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 205, in inner
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions return f(self, context, instance, *args, **kw)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 3082, in attach_volume
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions disk_bus, device_type)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 3055, in _attach_volume
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions device_type=device_type)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/rpcapi.py", line 813, in reserve_block_device_name
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions volume_bdm = cctxt.call(ctxt, 'reserve_block_device_name', **kw)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions retry=self.retry)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions timeout=timeout, retry=retry)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 464, in send
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions retry=retry)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 455, in _send
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions raise result
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions TypeError: __init__() takes at most 2 arguments (3 given)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions Traceback (most recent call last):
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 143, in _dispatch_and_reply
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions executor_callback))
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 189, in _dispatch
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions executor_callback)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions result = func(ctxt, **new_args)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/exception.py", line 110, in wrapped
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions payload)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions six.reraise(self.type_, self.value, self.tb)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/exception.py", line 89, in wrapped
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions return f(self, context, *args, **kw)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/manager.py", line 355, in decorated_function
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions LOG.warning(msg, e, instance=instance)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions six.reraise(self.type_, self.value, self.tb)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/manager.py", line 328, in decorated_function
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions return function(self, context, *args, **kwargs)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/manager.py", line 383, in decorated_function
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions kwargs['instance'], e, sys.exc_info())
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions six.reraise(self.type_, self.value, self.tb)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/manager.py", line 371, in decorated_function
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions return function(self, context, *args, **kwargs)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/manager.py", line 4649, in reserve_block_device_name
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions return do_reserve()
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions return f(*args, **kwargs)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/manager.py", line 4643, in do_reserve
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions instance, bdms, new_bdm)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/manager.py", line 1663, in _get_device_name_for_instance
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions instance, bdms, block_device_obj)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 7249, in get_device_name_for_instance
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions block_device_obj, mapping=instance_info['mapping'])
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/virt/libvirt/blockinfo.py", line 397, in get_info_from_bdm
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions device_name = find_disk_dev_for_disk_bus(padded_mapping, bdm_bus)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/virt/libvirt/blockinfo.py", line 201, in find_disk_dev_for_disk_bus
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions dev_prefix)
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions
2016-01-04 11:47:11.087 26423 ERROR nova.api.openstack.extensions TypeError: __init__() takes at most 2 arguments (3 given)

Observations so far:

1) The Manila project has no recent changes that could have caused this.
2) The error appears when attempting to attach a Cinder volume to a Nova VM.
3) The number of attachments has increased compared with the "ok" results of earlier runs. For example, it was 55 and became 98 before the job timeout was reached, so that is not even the final count of attach attempts.
4) Nova fails to allocate a device name for the volume starting with the 26th device. This limit should not be reached in a normal run.

Suspected cause: device names reserved by Nova are not released as expected, and Nova has a bug in handling the case where a device name cannot be reserved.
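
For context, a minimal sketch of why allocation fails at the 26th device, simplified from the allocation pattern in nova/virt/libvirt/blockinfo.py (referenced in the traceback), not its exact code: a single-letter suffix yields only 26 names per prefix, and the root disk already holds one of them.

    # Minimal sketch of single-letter device-name allocation, simplified
    # from nova/virt/libvirt/blockinfo.py (not the exact Nova code).
    def find_free_disk_dev(used, dev_prefix='vd'):
        for idx in range(26):                 # suffixes 'a'..'z' only
            disk_dev = dev_prefix + chr(ord('a') + idx)
            if disk_dev not in used:
                return disk_dev
        raise RuntimeError("no free disk device names for prefix %r"
                           % dev_prefix)

    used = {'vda'}                            # root disk takes the first slot
    for _ in range(25):                       # 25 more attaches succeed
        used.add(find_free_disk_dev(used))
    find_free_disk_dev(used)                  # the 26th attach fails here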

Possibly the cause: https://github.com/openstack/nova/commit/9f9711578f6cca05b3b50b934d261f9b50a1e377 ?

How to reproduce:
1) Create an instance.
Then perform the following cycle 26 times:
2) Create a volume.
3) Attach the volume from step 2 to the instance from step 1.
4) Detach the volume.
A scripted sketch of this loop is shown below.
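
A sketch of the reproduction loop, assuming authenticated python-novaclient and python-cinderclient objects; the `nova`, `cinder`, and `SERVER_ID` names below are assumptions for illustration, not part of the report.

    # Sketch of the reproduction loop. Assumes `nova` and `cinder` are
    # already-authenticated novaclient/cinderclient objects and that
    # SERVER_ID names the instance created in step 1.
    import time

    def wait_for_volume_status(volume_id, status, timeout=300):
        # Poll the volume until it reaches the wanted status.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if cinder.volumes.get(volume_id).status == status:
                return
            time.sleep(2)
        raise RuntimeError("volume %s never became %s" % (volume_id, status))

    for _ in range(26):
        vol = cinder.volumes.create(size=1)                    # step 2
        wait_for_volume_status(vol.id, 'available')
        nova.volumes.create_server_volume(SERVER_ID, vol.id)   # step 3
        wait_for_volume_status(vol.id, 'in-use')
        nova.volumes.delete_server_volume(SERVER_ID, vol.id)   # step 4
        wait_for_volume_status(vol.id, 'available')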

Revision history for this message
Matt Riedemann (mriedem) wrote :

Is this a new test in manila? If so, when was it introduced?

tags: added: libvirt volumes
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/263333

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote : Re: TypeError attaching volume to instance

It's not really clear to me what would have caused this regression in the last 24 hours. The PCI device change seems like it wouldn't be a problem, since it only updates the DB schema and doesn't actually use it yet.

The libvirt volume attach/detach operations are asynchronous, so we could just be hitting an intermittent timing issue, but that wouldn't explain why this only started spiking in the last 24 hours. I can't blame the nodes this is tested on, since it's a mix of RAX, HP and OVH nodes, which all perform differently.

Changed in nova:
assignee: Matt Riedemann (mriedem) → nobody
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/263333/ fixes the TypeError so we'll see what the device bus prefix is that is failing here.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This change adds a retry when detaching devices in the libvirt driver. I'm wondering if it could be causing some issues, but it merged a few weeks ago:

https://github.com/openstack/nova/commit/3a3fb3cfb2c41ad182545e47649ff12a4f3a743e

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/263338

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/263333
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d3a6797254da3213107a8340ddb79cb10b3bc1d3
Submitter: Jenkins
Branch: master

commit d3a6797254da3213107a8340ddb79cb10b3bc1d3
Author: Matt Riedemann <email address hidden>
Date: Mon Jan 4 08:30:59 2016 -0800

    libvirt: fix TypeError in find_disk_dev_for_disk_bus

    The NovaException class only takes the message and kwargs,
    so we can't pass substitution variables into the __init__, else
    we get a TypeError.

    This was apparently untested, so a unit test is added
    covering the error case.

    Change-Id: If5090f30be8985c4a8e99139dc5be8bf93ce1d12
    Partial-Bug: #1530847
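
A minimal sketch of the pattern this commit message describes (simplified; the real class lives in nova/exception.py): a message-plus-kwargs exception class rejects extra positional arguments, which is exactly the TypeError in the traceback above.

    # Simplified NovaException-style class: it accepts a message plus
    # keyword substitutions only, never extra positional arguments.
    class NovaException(Exception):
        msg_fmt = "An unknown exception occurred."

        def __init__(self, message=None, **kwargs):
            if message is None:
                message = self.msg_fmt % kwargs
            super(NovaException, self).__init__(message)

    # Pre-fix call site: three positional args reach __init__
    # (self, message, dev_prefix), so Python raises
    # "TypeError: __init__() takes at most 2 arguments (3 given)".
    try:
        raise NovaException("No free disk device names for prefix '%s'", 'vd')
    except TypeError as exc:
        print(exc)

    # Post-fix call site: substitute into the message before constructing.
    print(NovaException("No free disk device names for prefix '%s'" % 'vd'))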

Revision history for this message
Valeriy Ponomaryov (vponomaryov) wrote :

Latest information:

Manila CI jobs exceed the limit of devices attached to a single VM because of the following scenario:

Tempest initiates creation of a share, which in turn triggers creation of a volume, its attachment to a VM, and the mounting and formatting of the device. This operation started taking more than 500 seconds, so Tempest fails and tries to delete the created resources, but the delete is answered with "forbidden" because the resources are still in the transient "creating" state. Tempest therefore gives up on releasing the resources and runs other tests, and at some point creates a 26th volume, for which no new device name can be allocated. So the underlying problem is a sudden drop in performance.
"Ok" and "failed" nodes had the same code and the same Python and system packages.

Revision history for this message
Valeriy Ponomaryov (vponomaryov) wrote :

In my lab, with the latest code, the VM takes more than 5 minutes (300+ seconds) to reach "ACTIVE" when using the Manila share-creation command that boots up a Nova VM.

Machine:
CPU: 4 cores @ 3.2 GHz
RAM: 4 GB

Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm noticing that shortly before this fails, there are a lot of iscsiadm sessions:

http://logs.openstack.org/64/263664/5/check/gate-manila-tempest-dsvm-neutron/835bf51/logs/screen-n-cpu.txt.gz#_2016-01-06_14_05_11_976

I'm also seeing a bunch of this in the n-cpu logs:

2016-01-06 14:05:11.495 DEBUG nova.virt.libvirt.driver [req-a08d2d27-328a-4da6-b852-66c78ea505ce None None] skipping disk /dev/disk/by-path/ip-158.69.74.121:3260-iscsi-iqn.2010-10.org.openstack:volume-f1b24b44-cd61-4050-982a-22c191cabd0f-lun-1 (vdb) - unable to determine if volume _get_instance_disk_info /opt/stack/new/nova/nova/virt/libvirt/driver.py:6551

I noticed it in the libvirtd logs first, but if you look at the n-cpu logs, os-attach is called many times (so many that we exhaust the available device names).

But os-detach is only called twice:

http://logs.openstack.org/64/263664/5/check/gate-manila-tempest-dsvm-neutron/835bf51/logs/screen-n-cpu.txt.gz#_2016-01-06_13_58_01_522

And here:

http://logs.openstack.org/64/263664/5/check/gate-manila-tempest-dsvm-neutron/835bf51/logs/screen-n-cpu.txt.gz#_2016-01-06_14_13_13_062

So the devices aren't getting detached.

It looks like the tests are never calling detach besides those 2 times, but keep doing attach, and eventually that blows up.

So I'm thinking this is some issue with the way the manila job/tests are configured with tempest and running in parallel without tenant isolation, and they aren't doing the detach as expected.
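
For anyone re-checking a downloaded job log against this analysis, a small sketch that counts the two markers; the exact 'os-attach'/'os-detach' strings are an assumption based on the log excerpts linked above.

    # Count os-attach vs. os-detach occurrences in a downloaded n-cpu log.
    # The marker strings are assumptions based on the excerpts linked above.
    import sys

    attach_count = detach_count = 0
    with open(sys.argv[1]) as log:
        for line in log:
            if 'os-attach' in line:
                attach_count += 1
            if 'os-detach' in line:
                detach_count += 1
    print("os-attach: %d, os-detach: %d" % (attach_count, detach_count))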

Changed in nova:
status: In Progress → Invalid
summary: - TypeError attaching volume to instance
+ Gate jobs are not detaching volumes and eventually fail with no devices
+ left
Revision history for this message
Matt Riedemann (mriedem) wrote :

BTW, from what I was seeing in the n-cpu logs above, a volume attach operation was taking nova ~15 seconds, not >500 seconds.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is also in the tempest logs:

http://logs.openstack.org/64/263664/5/check/gate-manila-tempest-dsvm-neutron/835bf51/logs/tempest.txt.gz#_2016-01-06_13_57_08_106

2016-01-06 13:57:08.106 32541 ERROR manila_tempest_tests.tests.api.base [-] Share 'b8c53c97-94be-490a-9d9c-aabea9e71ba5' failed to be built. Trying create another.
2016-01-06 13:57:08.106 32541 ERROR manila_tempest_tests.tests.api.base [-] Request timed out
Details: Share tempest-share-name-1122195370 failed to reach available status within the required time (500 s).

Revision history for this message
Matt Riedemann (mriedem) wrote :

I suspect this is more or less a duplicate of bug 1531049 in some way or other.

Changed in manila:
assignee: nobody → Valeriy Ponomaryov (vponomaryov)
status: New → In Progress
Revision history for this message
Valeriy Ponomaryov (vponomaryov) wrote :

Matt, the root cause has been found and a fix proposed; see https://review.openstack.org/#/c/263664

Changed in manila:
importance: Undecided → Critical
milestone: none → mitaka-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to manila (master)

Reviewed: https://review.openstack.org/263664
Committed: https://git.openstack.org/cgit/openstack/manila/commit/?id=f4a00c59655ec3b7d65cec79b6e88bee8a5d601d
Submitter: Jenkins
Branch: master

commit f4a00c59655ec3b7d65cec79b6e88bee8a5d601d
Author: vponomaryov <email address hidden>
Date: Tue Jan 5 12:29:31 2016 +0200

    Fix CI Tempest jobs

    After the merge of change [1] to the devstack project, preconfigured
    creds for Tempest cannot be used anymore, but Manila requires them.
    For the moment, the latest Tempest interfaces do not allow us to use
    preconfigured creds because of bug #1524717.
    So, until that bug is fixed, we should work around the devstack
    changes and set the legacy opts explicitly.

    Also, redefine the new Neutron option 'dnsmasq_local_resolv' to
    'False'; it defaults to 'True' and was added in change [2]. The
    default value makes Nova operations initiated by Manila very slow,
    which breaks the CI.

    [1] I65b56ff681d6c27094380693c953fbc3664eceb0
    [2] I17a884f467d307432a06f67a9dd93ed2fa6081a3

    Change-Id: I516a6c9ab4396b14e3984d5f810210e4fcf7ec85
    Related-Bug: #1531049
    Closes-Bug: #1530847

Changed in manila:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/263338
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=039602b4e57b1283aa96ab2e7d98f9c087b62597
Submitter: Jenkins
Branch: master

commit 039602b4e57b1283aa96ab2e7d98f9c087b62597
Author: Matt Riedemann <email address hidden>
Date: Mon Jan 4 08:47:52 2016 -0800

    libvirt: sort block_device_list in volume_in_mapping log

    If we're going to log the block_device_list device names here
    we should sort the list, since finding which device might be
    missing in a big list, e.g. 25 entries, is not fun.

    Note that we have to filter None entries from the list before
    calling sorted because py34 will raise a TypeError if it is
    asked to compare a str to None.

    Change-Id: Id912666f3c435bd284ed40b3de29d01d070b68c4
    Related-Bug: #1530847
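
A small sketch of the Python 3 behavior this commit message refers to; this is standard-library behavior, easy to verify in an interpreter:

    # On Python 3, str cannot be compared to None, so sorting a device
    # list that contains None entries raises TypeError.
    devices = ['vdb', None, 'vda']
    try:
        sorted(devices)
    except TypeError as exc:
        print(exc)   # '<' not supported between instances of 'NoneType' and 'str'

    # Filtering the None entries first, as the commit does, is safe.
    print(sorted(d for d in devices if d is not None))   # ['vda', 'vdb']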

Revision history for this message
Matt Riedemann (mriedem) wrote :

Marking as fixed for nova because the nova-side TypeError was fixed as part of this bug.

Changed in nova:
status: Invalid → Fix Released
importance: High → Medium
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/manila 2.0.0.0b2

This issue was fixed in the openstack/manila 2.0.0.0b2 development milestone.
