tempest-slow tests fail often

Bug #1812552 reported by Slawek Kaplonski
Affects: neutron
Status: Fix Released
Importance: Critical
Assigned to: Slawek Kaplonski

Bug Description

In most cases, tests from the tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON module are failing in the tempest-slow job.
Example of failure: http://logs.openstack.org/84/631584/2/check/tempest-slow/fcb7332/testr_results.html.gz

I think that it happens because of https://review.openstack.org/#/c/631584/ and it is probably some (yet another) race condition :/

Tags: gate-failure
Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Maybe https://review.openstack.org/#/c/631584/ isn't the culprit of this issue. I will need to investigate more if it still happens so often.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

After some more digging into the job results, it looks like this patch, https://review.openstack.org/#/c/631584/, is causing the issue.
I have no idea why it happens, but the tempest-slow job runs fine every time on the revert of this patch (https://review.openstack.org/#/c/631944/) and fails on most of the other patches.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I'm trying to reproduce it on my local env but with no luck so far :/

Revision history for this message
Miguel Lavalle (minsel) wrote :

Using the following query:

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22,%20in%20test_attach_volume_shelved_or_offload_server%5C%22&build_name:%5C%22tempest-slow%5C%22&build_status:%5C%22FAILURE%5C%22&from=7d
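
(Decoded, this query filters on message:", in test_attach_volume_shelved_or_offload_server", build_name:"tempest-slow" and build_status:"FAILURE", over the last 7 days.)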

We can see that the errors in Neutron started to show up at the time https://review.openstack.org/#/c/631584 merged. Prior to that, the failures in this test case were mostly in Nova and some in Cinder, with the failure being:

tempest.exceptions.BuildErrorException: Server 3da71d4f-e884-4487-a692-be6671fed440 failed to build and is in ERROR status

After the merge of https://review.openstack.org/#/c/631584, the failures in Neutron, Nova, Cinder and Tempest all became:

tempest.lib.exceptions.SSHTimeout: Connection to the 172.24.5.160 via SSH timed out

And we can see that the affected instance doesn't get an IP address lease from DHCP:

Starting network...
udhcpc (v1.23.2) started
Sending discover...
Sending discover...
Sending discover...

That indicates the instance is not connected to the network.

Revision history for this message
Lajos Katona (lajos-katona) wrote :

In agent.log the following exception appears several times:
"OVSFWPortNotFound: Port <> is not managed by this agent"

If I understand this patch correctly, https://review.openstack.org/630910 (bug: https://bugs.launchpad.net/neutron/+bug/1811405) handles another case, with a tag-not-found exception.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

One more piece of information I found related to this: it is always (at least in the cases I checked) failing after the instance is unshelved. It also looks from the logs like the instance is unshelved on a different host than it was on before the shelve, but I'm not sure whether that is OK or not.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/632403

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I pushed to the gate a DNM patch with some additional logging in the OVSDBMonitor class. It is here: https://review.openstack.org/#/c/632403/
The tempest-slow job failed on this patch, so I analyzed the logs for one of the instances (910177fa-711f-4c97-9bdc-d8d954d0fb32 from test test_attach_volume_shelved_or_offload_server in http://logs.openstack.org/03/632403/2/check/tempest-slow/de2a3c7/testr_results.html.gz).

The port of this instance was 74ee60ef-5570-43c8-8eb9-887a2fbf6cd9.
The instance was spawned first on the controller node and later unshelved on the compute-1 node.
So I analyzed the logs from the neutron-ovs-agent on the compute-1 node.

The port of this instance was indeed skipped by the ovsdb_monitor class here: http://logs.openstack.org/03/632403/2/check/tempest-slow/de2a3c7/compute1/logs/screen-neutron-agent.txt.gz#_Jan_22_12_04_11_868196 but there was also an event for this port with the correct ofport at almost the same time.
So it ended up that this port was in both the "added" and "removed" event dictionaries: http://logs.openstack.org/03/632403/2/check/tempest-slow/de2a3c7/compute1/logs/screen-neutron-agent.txt.gz#_Jan_22_12_04_11_870840 and that isn't in fact something very unexpected, because the neutron-ovs-agent can handle it: https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1357
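
To illustrate the scenario described above, here is a rough, self-contained sketch with made-up names (not the actual neutron code; the real handling is at the ovs_neutron_agent.py line linked above): an event for a port whose ofport is still -1 gets skipped by the monitor, another event for the same port with a valid ofport arrives almost at the same time, and the port can end up in both the "added" and "removed" sets of one polling iteration, which the agent then has to reconcile rather than drop.

# Rough illustrative sketch only (simplified, not the actual neutron code):
# how a port can land in both the "added" and "removed" sets of a single
# polling iteration, and one plausible way for the agent to reconcile that.

def collect_events(raw_events, skip_invalid_ofport=True):
    """Bucket raw interface events into 'added'/'removed' sets of port ids."""
    added, removed = set(), set()
    for event in raw_events:
        if skip_invalid_ofport and event['ofport'] == -1:
            # Simplified stand-in for the skip from the suspected patch:
            # events for a port whose ofport is still -1 are dropped.
            continue
        (added if event['action'] == 'added' else removed).add(event['port_id'])
    return added, removed

def reconcile(added, removed):
    """One plausible reconciliation (illustrative, not necessarily what the
    agent does): a port seen in both sets is treated as (re)added."""
    return added, removed - added

# The port is first reported with ofport=-1 (skipped), then removed, then
# re-added with a valid ofport, all within one polling interval.
events = [
    {'action': 'added', 'port_id': 'port-1', 'ofport': -1},
    {'action': 'removed', 'port_id': 'port-1', 'ofport': 42},
    {'action': 'added', 'port_id': 'port-1', 'ofport': 42},
]
print(reconcile(*collect_events(events)))  # -> ({'port-1'}, set())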

So I continued looking at the neutron-openvswitch-agent logs, and this port was in fact configured properly a few seconds later: http://logs.openstack.org/03/632403/2/check/tempest-slow/de2a3c7/compute1/logs/screen-neutron-agent.txt.gz#_Jan_22_12_04_11_882998

Next I checked the DHCP logs, as the VM's console log indicates that the VM didn't get its network configured properly via DHCP. What I found is that DHCP requests were sent properly from the VM and the DHCP server also replied: http://logs.openstack.org/03/632403/2/check/tempest-slow/de2a3c7/controller/logs/screen-q-dhcp.txt.gz#_Jan_22_12_04_31_177806 - this happened 3 times in 3 minutes, which matches the VM's console log too.

Based on this analysis I still have no idea why those tests are failing, why it happens so often after patch https://review.openstack.org/#/c/631584/, and whether this patch really has any impact on those failures or it's just a coincidence.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Seems to align with when this started spiking the failures in tempest-slow:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22in%20120%20sec%20and%20the%20ping%20result%20is%20unexpected%5C%22%20AND%20tags%3A%5C%22console%5C%22%20AND%20voting%3A1&from=7d
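
(That query filters on message:"in 120 sec and the ping result is unexpected", tags:"console" and voting:1, over the last 7 days.)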

The special thing about the tempest-slow job is that it runs on 2 nodes and it runs slow tests, i.e. tests that don't normally get run in other jobs like tempest-full.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/631944
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f1fa9529ee6753e2a3f47e4a6eb0286118520d25
Submitter: Zuul
Branch: master

commit f1fa9529ee6753e2a3f47e4a6eb0286118520d25
Author: Slawek Kaplonski <email address hidden>
Date: Sat Jan 19 23:14:41 2019 +0000

    Revert "Skip ovsdb_monitor events for ports with ofport=-1"

    This patch cause some race condition in neutron-ovs-agent
    and tempest-slow job was failing quite often.
    Please see related bug report for details.

    Closes-Bug: #1812552

    This reverts commit f8e0a497ad6a36fa0d9510cda9e7344c2ed73763.

    Change-Id: Id51f2abaf3c8d57abdd06f024120da526ed40185

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/632403

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0b2

This issue was fixed in the openstack/neutron 14.0.0.0b2 development milestone.

tags: removed: neutron-proactive-backport-potential