Several tempest tests failed because of timed-out requests

Bug #1652934 reported by Sofiia Andriichenko
This bug affects 2 people
Affects: Mirantis OpenStack
Status: Confirmed
Importance: Medium
Assigned to: MOS QA Team

Bug Description

Configuration:
    ISO: 9.2 snapshot #688
Settings:
Compute - QEMU.
Network - Neutron with VLAN segmentation.
Storage Backends - LVM
Additional services - Install Ironic, Install Sahara

In the Settings -> Compute tab, enable Nova quotas
In the Settings -> OpenStack Services tab, enable Install Ceilometer and Aodh
In the Networks -> Other tab, enable Neutron DVR

Nodes: controller, compute, ironic, cinder, Telemetry - MongoDB

Trace test_snapshot_pattern:
http://paste.openstack.org/show/593517/

Failed tests:
test_server_connectivity_reboot[compute,id-7b6860c2-afa3-4846-9522-adeb38dfbe08,network]
test_server_connectivity_rebuild[compute,id-88a529c2-1daa-4c85-9aec-d541ba3eb699,network]
test_server_connectivity_suspend_resume[compute,id-5cdf9499-541d-4923-804e-b9a60620a7f0,network]
test_hotplug_nic[compute,id-c5adff73-e961-41f1-b4a9-343614f18cfa,network]
test_mtu_sized_frames[compute,id-b158ea55-472e-4086-8fa9-c64ac0c6c1d0,network]
test_port_security_macspoofing_port[compute,id-7c0bb1a2-d053-49a4-98f9-ca1a1d849f63,network]
test_preserve_preexisting_port[compute,id-759462e1-8535-46b0-ab3a-33aa45c55aaa,network]
test_subnet_details[compute,id-d8bb918e-e2df-48b2-97cd-b73c95450980,network]
test_update_instance_port_admin_state[compute,id-f5dfcc22-45fd-409f-954c-5bd500d7890b,network]
test_dualnet_dhcp6_stateless_from_os[compute,id-76f26acd-9688-42b4-bc3e-cd134c4cb09e,network,slow]
test_dualnet_multi_prefix_dhcpv6_stateless[compute,id-cf1c4425-766b-45b8-be35-e2959728eb00,network]
test_dualnet_multi_prefix_slaac[compute,id-9178ad42-10e4-47e9-8987-e02b170cc5cd,network]
test_multi_prefix_dhcpv6_stateless[compute,id-7ab23f41-833b-4a16-a7c9-5b42fe6d4123,network,slow]
test_multi_prefix_slaac[compute,id-dec222b1-180c-4098-b8c5-cc1b8342d611,network,slow]
test_slaac_from_os[compute,id-2c92df61-29f0-4eaa-bee3-7c65bef62a43,network,slow]
test_cross_tenant_traffic[compute,id-e79f879e-debb-440c-a7e4-efeda05b6848,network]
test_in_tenant_traffic[compute,id-63163892-bbf6-4249-aa12-d5ea1f8f421b,network]
test_port_update_new_security_group[compute,id-f4d556d7-1526-42ad-bafb-6bebf48568f6,network]
test_shelve_volume_backed_instance[compute,id-c1b6318c-b9da-490b-9c67-9339b627271f,image,network,volume]
test_volume_boot_pattern[compute,id-557cd2c2-4eb8-4dce-98be-f86765ff311b,image,smoke,volume]
test_port_security_disable_security_group[compute,id-7c811dcc-263b-49a3-92d2-1b4d8405f50c,network]
test_resize_volume_backed_server_confirm[compute,id-e6c28180-7454-4b59-b188-0257af08a63b,volume]
test_server_sequence_suspend_resume[compute,id-949da7d5-72c8-4808-8802-e3d70df98e2c]
test_server_basic_ops[compute,id-7fff3fb3-91d8-4fd0-bd7d-0204f1f180ba,network,smoke]
test_snapshot_pattern[compute,id-608e604b-1d63-4a82-8e3e-91bc665c90b4,image,network]

Tags: area-qa
Changed in mos:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → MOS Neutron (mos-neutron)
tags: added: blocker-for-qa
Revision history for this message
Ann Taraday (akamyshnikova) wrote :

Please attach Neutron and Nova logs - this cannot be investigated without them.

Changed in mos:
status: Confirmed → Incomplete
assignee: MOS Neutron (mos-neutron) → Sofiia Andriichenko (sandriichenko)
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :
Changed in mos:
status: Incomplete → Confirmed
assignee: Sofiia Andriichenko (sandriichenko) → MOS Neutron (mos-neutron)
Revision history for this message
Ann Taraday (akamyshnikova) wrote :

Thanks for sharing snapshot!

Unfortunately, the snapshot does not contain tempest logs, so it is hard to correlate when the tests were executed with what happened at that time.

Did all the tests fail with http://paste.openstack.org/show/593517/?

I found several failures on the Nova side, for example http://paste.openstack.org/show/593621/, so it would be good if the Nova team took a look at this, too.

Also, there was the error http://paste.openstack.org/show/593622/, which means that some of the tests were misconfigured or cleanup did not work properly. I cannot say which test it was, as it is not clear which tests were executed.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

>>> I found several failures on the Nova side, for example http://paste.openstack.org/show/593621/, so it would be good if the Nova team took a look at this, too.

This should be OK: it is an expected error in Nova when we try to create a live snapshot without support from the guest OS (a qemu agent must be running in the guest); we then retry using a simpler approach - https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L1898-L1913
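The retry Roman describes follows a simple try-then-degrade pattern; a minimal sketch (hypothetical names and stubs, not Nova's actual code, which lives at the linked driver.py lines):

```python
class LiveSnapshotNotSupported(Exception):
    """The guest lacks the support (e.g. a running qemu agent) for a live snapshot."""

def live_snapshot(instance):
    # Hypothetical stand-in: succeeds only when the guest runs the qemu agent.
    if not instance.get("qemu_agent"):
        raise LiveSnapshotNotSupported(instance["name"])
    return ("live", instance["name"])

def cold_snapshot(instance):
    # The simpler approach that works without guest cooperation.
    return ("cold", instance["name"])

def take_snapshot(instance):
    """Try a live snapshot first; on the expected error, retry the simpler way."""
    try:
        return live_snapshot(instance)
    except LiveSnapshotNotSupported:
        return cold_snapshot(instance)
```

So the error in the paste is logged on the first attempt, then the fallback path completes the snapshot.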

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Incomplete - Sofia, please provide complete logs of the tempest run so the Neutron team can investigate and find out the cause of the issue.

Changed in mos:
status: Confirmed → Incomplete
assignee: MOS Neutron (mos-neutron) → Sofiia Andriichenko (sandriichenko)
Revision history for this message
Yury Tregubov (ytregubov) wrote :
Changed in mos:
status: Incomplete → Confirmed
Revision history for this message
Oleg Bondarev (obondarev) wrote :

The console log looks bad: http://paste.openstack.org/show/594426/ - two tests failed because of this.

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Other tests failed either with
 - "failed to reach ACTIVE status and task state "None" within the required time (300 s). Current status: BUILD. Current task state: networking."

or:
 - "failed to reach ACTIVE status and task state "None" within the required time (300 s). Current status: BUILD. Current task state: scheduling."

An example of the nova-compute log when launching server 3f70c24d-c77d-4cd5-b1c8-d9fd581be98f: http://paste.openstack.org/show/594429/ - it stops for some reason even before calling Neutron. Probably some general problem with the env.
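The 300-second timeout messages quoted above come from a status-polling waiter; a minimal sketch of that pattern (hypothetical names, not tempest's actual waiter code):

```python
import time

class TimeoutException(Exception):
    pass

def wait_for_status(get_status, wanted="ACTIVE", timeout=300, interval=1,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it returns `wanted`; raise after `timeout` seconds."""
    start = clock()
    while True:
        status = get_status()
        if status == wanted:
            return status
        if clock() - start >= timeout:
            raise TimeoutException(
                "failed to reach %s status within the required time (%d s). "
                "Current status: %s" % (wanted, timeout, status))
        sleep(interval)
```

A server stuck in BUILD (task state "networking" or "scheduling") never satisfies the condition, so the test fails with exactly the message seen above.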

Nova team please take a look.

Changed in mos:
assignee: Sofiia Andriichenko (sandriichenko) → MOS Nova (mos-nova)
Changed in mos:
assignee: MOS Nova (mos-nova) → Yury Tregubov (ytregubov)
status: Confirmed → Incomplete
assignee: Yury Tregubov (ytregubov) → MOS Nova (mos-nova)
status: Incomplete → Confirmed
assignee: MOS Nova (mos-nova) → Roman Podoliaka (rpodolyaka)
Revision history for this message
Yury Tregubov (ytregubov) wrote :
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Looks like an OVS problem (probably caused by some underlying system problem) on compute node 3: http://paste.openstack.org/show/594509/

Leads to servers being unable to spawn: http://paste.openstack.org/show/594508/

This causes tests to fail with "Server 13babaaa-a1a9-4e82-9141-fcf4bd2e1277 failed to build and is in ERROR status" or "Found multiple IPv4 addresses"

Looks similar to bug https://bugs.launchpad.net/mos/+bug/1606546 (see comment #16)

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Agreed with Oleg on this: the latest repro seems to be a different problem with OVS.

I took a look at the original diagnostic snapshot, and it looks like all the failures are concentrated on node-3. It is not entirely clear to me what went wrong around 00:52:40 on December 28, but it looks like nova-compute misbehaved after that, e.g. the periodic tasks were no longer executed. At the same time, the atop logs look fine, although there are some weird errors in syslog:

http://paste.openstack.org/show/594532/

Waiting for another repro and a live environment.

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Anna Babich (ababich) wrote :

It reproduced with https://paste.mirantis.net/show/2936/
Feel free to contact me to get access to an environment with the reproduction.

Changed in mos:
status: Incomplete → Confirmed
Revision history for this message
Oleg Bondarev (obondarev) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

In https://bugs.launchpad.net/mos/+bug/1652934/comments/7 we saw a level-2 guest (as this lab is deployed on VMs) crashing in do_async_page_fault as well.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

QA folks, can we give it a try with swap disabled on the host, or with L1 guests (i.e. the deployed OpenStack compute nodes) running with the VMX extension disabled?

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

MOS Linux, what's your take on this?

Changed in mos:
assignee: Roman Podoliaka (rpodolyaka) → MOS Linux (mos-linux)
Revision history for this message
Anna Babich (ababich) wrote :

@rpodolyaka, it will be checked with nested_kvm disabled

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

In the last run, OVS is not the only process that goes into the D state (i.e. the problem is not OVS-specific); this time it is mcollective:

http://paste.openstack.org/show/594815/
http://paste.openstack.org/show/594816/

This does not confirm for sure that it is https://www.spinics.net/lists/kvm/msg142390.html, but it very much looks like it.
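For reference, D-state processes like these can be spotted from the state field of /proc/&lt;pid&gt;/stat; a small helper, assuming the Linux /proc stat format (the function names are illustrative, not from any existing tool):

```python
import glob

def proc_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line.

    The comm field is wrapped in parentheses and may itself contain spaces
    and parentheses, so split at the last ')' rather than on whitespace.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]

def d_state_pids():
    """PIDs of all processes currently in D (uninterruptible sleep) state."""
    pids = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                if proc_state(f.read()) == "D":
                    pids.append(int(path.split("/")[2]))
        except OSError:
            pass  # process exited while we were scanning
    return pids
```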

Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Roman, I'm afraid your fears about https://www.spinics.net/lists/kvm/msg142390.html are justified.

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

There are the following ways to solve the issue:
1. Pass the 'no-kvmapf' kernel option in L1. This disables async page faults on L1, so they are no longer forwarded from L2 to L0. It will lead to some performance degradation, though.
2. There is a kernel patch https://www.spinics.net/lists/kvm/msg142390.html, but it is not merged (and we cannot guess when it will be), and we do not maintain the kernel anyway.
3. It is possible to try building the kvm module ourselves (e.g. as DKMS), but this might bring more issues than it fixes.

Solution #1 looks best; we are waiting for confirmation that it works.

If it does, we should decide whether this becomes a tunable passed to the master node from fuel-qa or some logic in fuel-agent (which generates kernel options for slave nodes). The option cannot be passed to every node, because on an L0 compute node it would lead to performance degradation. Controller nodes will not be affected (whether hardware or VM).
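Assuming a GRUB2-based L1 guest (an assumption; the Fuel image may configure boot differently), option 1 would amount to something like the following config change:

```shell
# On the L1 compute node (assumes GRUB2 and /etc/default/grub; adjust for
# the actual bootloader). Append no-kvmapf to the kernel command line:
sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 no-kvmapf"/' /etc/default/grub
update-grub   # or: grub2-mkconfig -o /boot/grub2/grub.cfg on RPM systems
reboot
# After reboot, verify the option took effect:
grep -wo no-kvmapf /proc/cmdline
```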

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

So it seems that setting 'no-kvmapf' on L1 works: tests with that option do not fail with any of the symptoms observed earlier in this bug. There are still some failures, but they do not seem related. The new failures are reproducible (they appeared in both of two runs) and look like:

-----
    Traceback (most recent call last):
      File "tempest/lib/common/utils/test_utils.py", line 84, in call_and_ignore_notfound_exc
        return func(*args, **kwargs)
      File "tempest/services/object_storage/container_client.py", line 53, in delete_container
        resp, body = self.delete(url)
      File "tempest/lib/common/rest_client.py", line 307, in delete
        return self.request('DELETE', url, extra_headers, headers, body)
      File "tempest/lib/common/rest_client.py", line 664, in request
        self._error_checker(resp, resp_body)
      File "tempest/lib/common/rest_client.py", line 776, in _error_checker
        raise exceptions.Conflict(resp_body, resp=resp)
    tempest.lib.exceptions.Conflict: An object with that identifier already exists
    Details: <html><h1>Conflict</h1><p>There was a conflict when trying to complete your request.</p></html>
-----

More log output is available [1]

[1] http://paste.openstack.org/show/595048/

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

Reassigning to fuel-qa, as the solution for this bug requires passing an additional option to the environment before deployment. This option should add 'no-kvmapf' to the kernel boot parameters on the compute nodes (L1 guests).

Changed in mos:
assignee: MOS Linux (mos-linux) → Fuel QA Team (fuel-qa)
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Dima, unfortunately the fuel-qa team is not responsible for Tempest testing, so the issue was moved to mos-qa.

Changed in mos:
assignee: Fuel QA Team (fuel-qa) → MOS QA Team (mos-qa)
Revision history for this message
Anna Babich (ababich) wrote :

@aurlapova, for test runs the Tempest jobs use environments deployed by fuel-qa scripts, not their own. And the solution @teselkin-d described requires updates precisely on the fuel-qa side.

tags: added: area-qa
removed: area-neutron
Revision history for this message
Yury Tregubov (ytregubov) wrote :

The problem has not been seen on runs for 9.2 snapshots #771 and #776 so far.
The only reasonable way to implement the solution with the 'no-kvmapf' option seems to be to add reconfiguration actions on the environment right after deployment. That can be done within the CI that runs the tempest suites.

tags: removed: blocker-for-qa
Changed in mos:
importance: High → Medium
milestone: 9.2 → 9.3