OSTF test 'Create volume and boot instance from it' failure on swarm

Bug #1607808 reported by Alexey. Kalashnikov on 2016-07-29
This bug affects 1 person
Affects — Importance — Assigned to:
 Fuel for OpenStack — Medium — Unassigned
 Mitaka — Medium — Unassigned
 Newton — Medium — Vladimir Khlyunev

Bug Description

Detailed bug description:
 During a swarm test run, one of the tests failed because the OSTF check had a failing test:
'Create volume and boot instance from it'
Steps to reproduce:

Expected results:
 ostf check pass
Actual result:
 Failed 1 OSTF tests; should fail 0 tests. Names of failed tests:
  - Create volume and boot instance from it (failure) Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.

ostf_log_cut
http://paste.openstack.org/show/543907/

https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/10/testReport/(root)/change_pacemaker_parameter_not_break_rabbitmq/change_pacemaker_parameter_not_break_rabbitmq/

Reproducibility:
 Unknown. When I checked it a second time, during a debug session, the test passed.
Impact:
 A swarm test failed
Description of the environment:
9.1 snapshot #64

Sergey Shevorakov (sshevorakov) wrote :

Importance = High, since it fails one Swarm test case.

tags: added: swarm-fail
Changed in fuel:
importance: Undecided → High
Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Changed in fuel:
status: New → Confirmed
Oleksiy Molchanov (omolchanov) wrote :

Marking as Incomplete; somehow the diagnostic snapshot is not available for this failure.

Dmitry Guryanov (dguryanov) wrote :

It seems there is enough information now.

Oleksiy Molchanov (omolchanov) wrote :

Aug 8 08:16:14 node-3 nova-compute: 2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager [req-5503ce08-2598-4f12-9438-e56d71a6557a 8f59b011483e4e1ab3687b32c4f66a39 3ef4407954294a398eef479c24a961a4 - - -] Instance failed network setup after 1 attempt(s)
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager Traceback (most recent call last):
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1570, in _allocate_network_async
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager bind_host_id=bind_host_id)
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 668, in allocate_for_instance
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager self._delete_ports(neutron, instance, created_port_ids)
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager self.force_reraise()
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager six.reraise(self.type_, self.value, self.tb)
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 657, in allocate_for_instance
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager security_group_ids, available_macs, dhcp_opts)
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 259, in _create_port
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager raise exception.PortBindingFailed(port_id=port_id)
2016-08-08 08:16:14.348 30765 ERROR nova.compute.manager PortBindingFailed: Binding failed for port ae087047-cb9b-40e2-ac54-5303c177ee9e, please

Dmitry Pyzhov (dpyzhov) on 2016-08-10
no longer affects: fuel/newton
Oleksiy Molchanov (omolchanov) wrote :

Nova team, please take a look.

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → MOS Nova (mos-nova)
Sergey Nikitin (snikitin) wrote :

Maybe related to https://review.openstack.org/#/c/347708/2
It looks similar to this bug:
https://bugs.launchpad.net/neutron/+bug/1606827

I think the Neutron team should look at it.

node-2/var/log/neutron/neutron-openvswitch-agent.log

2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 322, in _report_state
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent True)
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/neutron/agent/rpc.py", line 86, in report_state
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent return method(context, 'report_state', **kwargs)
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 155, in call
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent time.sleep(wait)
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent self.force_reraise()
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent six.reraise(self.type_, self.value, self.tb)
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 136, in call
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent return self._original_context.call(ctxt, method, **kwargs)
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=self.retry)
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 91, in _send
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent timeout=timeout, retry=retry)
2016-08-08 08:16:09.394 25878 ERROR neutron.plugins.ml2.drivers.openvswitch.agent...


Changed in fuel:
assignee: MOS Nova (mos-nova) → MOS Neutron (mos-neutron)
Oleg Bondarev (obondarev) wrote :

Lots of RabbitMQ failures in the logs around the time the instance was booted. PortBindingFailed (and the VM going to the ERROR state) is expected when RabbitMQ is unstable and Neutron agents go down. Reassigning to mos-oslo.

Changed in fuel:
assignee: MOS Neutron (mos-neutron) → MOS Oslo (mos-oslo)
Oleg Bondarev (obondarev) wrote :

https://review.fuel-infra.org/#/c/24209/ has "bf5b8ca Don't use exponential back-off for report_state", which might help. It was merged on Aug 10; please check whether OSTF is still failing with this issue after Aug 10.
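For context on why that change helps: with exponential back-off, an agent whose report_state calls fail during a RabbitMQ failover ends up waiting far longer than the agent-down threshold before retrying, so Neutron keeps treating it as dead and port binding fails. A minimal sketch of the difference (not the actual neutron/oslo code; the interval values are illustrative):

```python
def retry_delays(base=1.0, cap=60.0, exponential=True, attempts=6):
    """Yield the wait (seconds) before each retry of a failed report_state call."""
    delay = base
    for _ in range(attempts):
        yield delay
        if exponential:
            # double the wait each time, up to a cap: 1, 2, 4, 8, 16, 32 ...
            delay = min(delay * 2, cap)
        # with a fixed interval the agent retries promptly, so it is
        # marked alive again soon after RabbitMQ recovers

# Exponential back-off: total wait quickly exceeds a typical agent-down threshold.
exp_total = sum(retry_delays(exponential=True))     # 1+2+4+8+16+32 = 63 seconds
# Fixed interval: the agent keeps knocking and recovers fast.
fixed_total = sum(retry_delays(exponential=False))  # 6 * 1 = 6 seconds
```

This is why dropping exponential back-off specifically for the periodic state report is reasonable: state reports are cheap and idempotent, so there is little to gain from backing off.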

Alexey, Oleg Bondarev and I conducted an investigation, and we think that the fix Oleg referenced above resolves the issue. The fix was merged on Aug 10, after both occurrences referenced in this bug. Could you please check whether the issue disappears after Aug 10?

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Alexey. Kalashnikov (akalashnikov)
status: Confirmed → Incomplete
Tatyana Kuterina (tkuterina) wrote :

This issue is reproduced on CI:

   Test: Check after repeatable failover rabbit cluster is healthy
   Test Group: ha_rabbitmq_stability_check
   Trace: https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_neutron_destructive/28/consoleFull

Diagnostic snapshot:
https://drive.google.com/a/mirantis.com/file/d/0Bz15vbpS5ZPNcFk4TUhjNnlvZE0/view?usp=sharing

Changed in fuel:
assignee: Alexey. Kalashnikov (akalashnikov) → MOS Oslo (mos-oslo)

I have reviewed the latest failure provided by Tatyana. The problem is that the OSTF test waits 30 seconds for the server to be deleted, while that operation took a little longer (around 31 seconds). Here is the time when the OSTF test timed out (/var/log/ostf.log on the master node):

2016-08-15 08:30:09 ERROR (nose_storage_plugin) fuel_health.tests.smoke.test_create_volume.VolumesTest.test_create_boot_volume

The instance name was 'ost1_test-boot-volume-instance945151122' and its id was '2546a578-5958-4474-9d8f-d89d1bc64ef6'

One can see that the instance was deleted just a little after 08:30:09. Here is the latest entry from node-2/var/log/nova/nova-compute.log:

2016-08-15 08:30:10.594 30118 DEBUG oslo_concurrency.lockutils [req-4fd05c88-6027-4c7d-8d57-cd0243627d2e 6adacf51939c4e86ad5053da09aaa0d6 b0b0f4335c1246228d4299fdd4fe48d0 - - -] Lock "2546a578-5958-4474-9d8f-d89d1bc64ef6" released by "nova.compute.manager.do_terminate_instance" :: held 18.887s inner /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282

meaning the instance was deleted around 08:30:10.

Inspecting further, one can see that instance deletion took so long because OpenStack processes had to reconnect to RabbitMQ, which takes time. For instance, inspect request 'req-ff447d59-a7c4-45ad-b77a-6c25dadd743d' in ./node-3/var/log/neutron/server.log. It is a request to delete the instance's port, and it took 13 seconds to complete because neutron-server had to reconnect to RabbitMQ. There is little we can do to speed up reconnection - it takes time to discover that a network peer is down - so I suggest increasing the timeout for server deletion instead.

Fuel QA team, please increase server deletion timeout to 1 minute in OSTF test. I am pretty sure that this should be enough.
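The suggested fix amounts to polling longer before declaring failure. A minimal sketch of such a wait loop, assuming a hypothetical `client.get_server` call that returns None once the server is gone (this is not the actual fuel-ostf code, just an illustration of the timeout change):

```python
import time

def wait_for_server_deletion(client, server_id, timeout=60, interval=2):
    """Poll until the server disappears; raise if the timeout elapses.

    Raising the timeout from 30s to 60s absorbs the extra seconds
    OpenStack spends reconnecting to RabbitMQ after a failover.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        # hypothetical API: returns None once deletion has completed
        if client.get_server(server_id) is None:
            return
        time.sleep(interval)
    raise TimeoutError(
        "server %s was not deleted within %ds" % (server_id, timeout))
```

The design trade-off is the usual one for eventually-consistent cleanup: a longer timeout makes the test tolerant of transient slowness (like the 31-second deletion seen above) at the cost of taking longer to report a genuine hang.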

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Fuel QA Team (fuel-qa)
status: Incomplete → Confirmed

@Dmitry Mescheryakov,

could you please provide evidence that there is no performance degradation in your team's oslo-related code?

If there is no such issue, we will increase the timeout as you have suggested.

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Dmitry Mescheryakov (dmitrymex)

As far as I recall, 'Create volume and boot instance from it' failed on instance deletion pretty frequently when the test was run after a RabbitMQ failover. That happened regularly for me in custom destructive tests in previous releases (7.0, 8.0).

The reason is that after a RabbitMQ failover OpenStack needs a 'warm-up' load to clean up broken connections. Until that load passes, OpenStack lags a little as it clears broken connections and establishes new ones on the fly. So either the failed test serves as a 'warm-up' for later tests (it comes first in the list), or it normally runs close to the time limit and lags beyond it after failover.

Changed in fuel:
assignee: Dmitry Mescheryakov (dmitrymex) → Fuel QA Team (fuel-qa)
Nastya Urlapova (aurlapova) wrote :

@Dima, unfortunately I didn't find a way we can fix it, so why did you move it to QA?

Nastya, my suggestion in comment #14 simply got buried in the discussion. Here it is again for convenience:

Fuel QA team, please increase server deletion timeout to 1 minute in OSTF test.

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Fuel QA Team (fuel-qa)
Nastya Urlapova (aurlapova) wrote :

Moving to Incomplete; I cannot find a failure in the last three runs.

Changed in fuel:
status: Confirmed → Won't Fix
status: Won't Fix → Confirmed
Nastya Urlapova (aurlapova) wrote :

I wasn't able to find the failure in the last runs (from snapshot #299); the priority was decreased to Medium.

Changed in fuel:
importance: High → Medium

Fix proposed to branch: master
Review: https://review.openstack.org/389217

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Maksym Strukov (unbelll)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/389217
Committed: https://git.openstack.org/cgit/openstack/fuel-ostf/commit/?id=ddd691205d7dccfc9a9b52cc2681cdb367c82d2a
Submitter: Jenkins
Branch: master

commit ddd691205d7dccfc9a9b52cc2681cdb367c82d2a
Author: Maksym Strukov <email address hidden>
Date: Thu Oct 20 17:11:21 2016 +0300

    Increase delete_server timeout to 60 seconds

    Change-Id: If40772482d72267b0113b773537156b52a141f5b
    Closes-Bug: #1607808

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/390000
Committed: https://git.openstack.org/cgit/openstack/fuel-ostf/commit/?id=79e7b4fe6dc51ab42db83c41d237062cf7737f00
Submitter: Jenkins
Branch: stable/mitaka

commit 79e7b4fe6dc51ab42db83c41d237062cf7737f00
Author: Maksym Strukov <email address hidden>
Date: Thu Oct 20 17:11:21 2016 +0300

    Increase delete_server timeout to 60 seconds

    Change-Id: If40772482d72267b0113b773537156b52a141f5b
    Closes-Bug: #1607808

Not reproduced

Dmitry Pyzhov (dpyzhov) on 2017-02-01
Changed in fuel:
milestone: 10.0 → 11.0

Reviewed: https://review.openstack.org/427685
Committed: https://git.openstack.org/cgit/openstack/fuel-ostf/commit/?id=0cd75f47ce062757d109108399aa16f02e068a1c
Submitter: Jenkins
Branch: stable/newton

commit 0cd75f47ce062757d109108399aa16f02e068a1c
Author: Maksym Strukov <email address hidden>
Date: Thu Oct 20 17:11:21 2016 +0300

    Increase delete_server timeout to 60 seconds

    Change-Id: If40772482d72267b0113b773537156b52a141f5b
    Closes-Bug: #1607808
    (cherry picked from commit ddd691205d7dccfc9a9b52cc2681cdb367c82d2a)

This issue was fixed in the openstack/fuel-ostf 11.0.0.0rc1 release candidate.

Maksym Strukov (unbelll) on 2017-04-06
Changed in fuel:
assignee: Maksym Strukov (unbelll) → nobody