LibvirtRbdEvacuateTest and LibvirtFlatEvacuateTest tests fail intermittently due to races

Bug #1806126 reported by Matt Riedemann
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Matt Riedemann
Milestone: (none)

Bug Description

Seen here:

http://logs.openstack.org/22/606122/7/check/nova-tox-functional/1f3126b/testr_results.html.gz

There are different failures, but the tests were introduced with this change:

https://review.openstack.org/#/c/591733/

For example, in nova.tests.functional.regressions.test_bug_1550919.LibvirtRbdEvacuateTest.test_evacuate_shared_instancedir I'm seeing:

2018-11-30 17:03:58,424 ERROR [nova.virt.libvirt.driver] Failed to start libvirt guest

And this (during evacuate from compute0 to compute1 in the test):

Traceback (most recent call last):
  File "nova/compute/manager.py", line 7897, in _error_out_instance_on_exception
    yield
  File "nova/compute/manager.py", line 3040, in rebuild_instance
    migration, request_spec)
  File "nova/compute/manager.py", line 3102, in _do_rebuild_instance_with_claim
    self._do_rebuild_instance(*args, **kwargs)
  File "nova/compute/manager.py", line 3265, in _do_rebuild_instance
    self._rebuild_default_impl(**kwargs)
  File "nova/compute/manager.py", line 2929, in _rebuild_default_impl
    block_device_info=new_block_device_info)
  File "nova/virt/libvirt/driver.py", line 3079, in spawn
    destroy_disks_on_failure=True)
  File "nova/virt/libvirt/driver.py", line 5648, in _create_domain_and_network
    destroy_disks_on_failure)
  File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "nova/virt/libvirt/driver.py", line 5607, in _create_domain_and_network
    self.plug_vifs(instance, network_info)
  File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/py27/local/lib/python2.7/site-packages/mock/mock.py", line 1062, in __call__
    return _mock_self._mock_call(*args, **kwargs)
  File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/py27/local/lib/python2.7/site-packages/mock/mock.py", line 1118, in _mock_call
    raise effect
TestingException

I'm guessing something is mocking global state and causing that failure; we likely need a fake virt driver that extends the libvirt driver and does what we want for these tests, so we can avoid global mocks.
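For illustration only, a minimal sketch of that idea (the class name and attribute here are hypothetical, not the eventual fix): a per-test subclass of the libvirt driver owns its own failure injection, so nothing is patched at module scope where it can leak into other tests.

    # Hypothetical sketch, assuming the test loads this subclass as its
    # compute driver: the failure injection lives on the driver class, not
    # in a module-level mock that other tests can observe.
    from nova import test
    from nova.virt.libvirt import driver as libvirt_driver


    class FakeEvacuateLibvirtDriver(libvirt_driver.LibvirtDriver):
        """Libvirt driver that can be told to fail plug_vifs for a test."""

        fail_plug_vifs_for = set()  # instance uuids the test wants to break

        def plug_vifs(self, instance, network_info):
            if instance.uuid in self.fail_plug_vifs_for:
                # Same exception the existing test raises via its mock.
                raise test.TestingException('forced vif plug failure')
            return super(FakeEvacuateLibvirtDriver, self).plug_vifs(
                instance, network_info)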

In the other test, I see a timeout:

nova.tests.functional.regressions.test_bug_1550919.LibvirtFlatEvacuateTest.test_evacuate_nonshared_instancedir

And it seems to be stuck here:

2018-11-30 17:03:05,111 INFO [nova.virt.libvirt.driver] Creating image

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22INFO%20%5Bnova.virt.libvirt.driver%5D%20Creating%20image%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

32 hits in the last 48 hours, check and gate, all failures.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/621289

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/621289
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=90d16c270a6aa9acf4bb445c4666ed2631bb1267
Submitter: Zuul
Branch: master

commit 90d16c270a6aa9acf4bb445c4666ed2631bb1267
Author: Matt Riedemann <email address hidden>
Date: Fri Nov 30 21:15:33 2018 +0000

    Revert "Add regression test for bug 1550919"

    This reverts commit bbe88786fc90c2106f9fae0156ee7b09ece9a83b.

    The new tests are racy and causing a modest amount of
    failures in the gate since the change merged, so it is
    probably best to just revert the tests so they can be
    robustified.

    Change-Id: I18bd68ba6e59aba4c450eb85e6f4450d7044b1e9
    Related-Bug: #1806126

Revision history for this message
Matthew Booth (mbooth-9) wrote :

The TestingException raised above during the evacuate from compute0 to compute1 is intended behaviour: the test itself mocks that call specifically so it raises.
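The mechanics (paraphrased from the traceback above, not the exact test code) are just a mock.patch.object with a TestingException side effect on plug_vifs, so the rebuild on the destination host fails on purpose, which is what the regression test needs to reproduce:

    # Paraphrase of the failure injection the traceback shows, not the
    # exact test code: patching plug_vifs makes the rebuild on the
    # destination compute raise TestingException deliberately.
    import mock

    from nova import test


    def break_plug_vifs(testcase, driver):
        patcher = mock.patch.object(
            driver, 'plug_vifs', side_effect=test.TestingException)
        patcher.start()
        testcase.addCleanup(patcher.stop)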

Of all the failures I looked at that didn't just look like lots of unrelated random things going wrong (e.g. http://logs.openstack.org/22/606122/7/check/nova-tox-functional/1f3126b/job-output.txt), the failure looks like a timeout during the initial instance create using the Flat backend. We call _wait_for_server_parameter() there with the default 5-second timeout. I'll bet it's just taking longer than that for some reason. It's hard to know without DEBUG enabled, though.
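For reference, the wait being described is a simple poll loop with a fixed retry budget; a simplified paraphrase (not the actual nova functional-test helper) is below, and it shows why anything slower than roughly five seconds surfaces as a test failure rather than a hang.

    # Simplified paraphrase of the wait helper, assuming 10 retries at
    # 0.5s each (~5 seconds total); not the exact nova code.
    import time


    def wait_for_server_parameter(api, server, expected, max_retries=10):
        retries = 0
        while True:
            server = api.get_server(server['id'])
            if all(server.get(k) == v for k, v in expected.items()):
                return server
            retries += 1
            if retries >= max_retries:
                raise AssertionError(
                    'Server never reached %s; last seen %s' %
                    (expected, {k: server.get(k) for k in expected}))
            time.sleep(0.5)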

Interestingly, the failures I looked at were all on the flat backend. The flat backend touches the filesystem while rbd does not, though both are mocked. Flat creates an image by touching it with:

  open(_self.path, 'a').close()

I guess that could theoretically be slow if we were experiencing horrendous IO lag, but it seems unlikely. Similarly, exists() for flat is an os.path.exists() call, whereas for Rbd it's a dict lookup; again, that seems unlikely to be slow. The only obvious difference of moderate complexity between the two is resolve_driver_format(). A condensed comparison of the two fakes follows below.
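To make the comparison concrete, here is a condensed paraphrase of the two fakes as described above (not the actual test code): only the flat variant touches the real filesystem, so only it can plausibly stall on slow IO.

    # Condensed paraphrase of the behaviour described in the comment above,
    # not the actual test fakes: flat does real filesystem IO, rbd is an
    # in-memory dict lookup.
    import os


    class FakeFlatImage(object):
        def __init__(self, path):
            self.path = path

        def create(self):
            open(self.path, 'a').close()      # real filesystem touch

        def exists(self):
            return os.path.exists(self.path)  # real stat() call


    class FakeRbdImage(object):
        volumes = {}  # stands in for the in-memory "cluster" state

        def __init__(self, name):
            self.name = name

        def create(self):
            self.volumes[self.name] = True    # pure dict update, no IO

        def exists(self):
            return self.name in self.volumes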

I obviously haven't reproduced this locally, and probably can't, but I'll see if I can profile it somehow to see where most of the time goes on my system, at least.

Matt Riedemann (mriedem)
Changed in nova:
status: Confirmed → Fix Released
assignee: nobody → Matt Riedemann (mriedem)