Functional job sometimes hits global 2 hour limit and fails

Bug #1567668 reported by Assaf Muller
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Jakub Libosvar
Milestone: newton-1

Bug Description

Here's an example:
http://logs.openstack.org/13/302913/1/check/gate-neutron-dsvm-functional/91dd537/console.html

Logstash query:
build_name:"gate-neutron-dsvm-functional" AND build_status:"FAILURE" AND message:"Killed timeout -s 9"

45 hits in the last 7 days.

Ihar and I checked the timing, and it started happening as we merged: https://review.openstack.org/#/c/298056/ (EDIT: After some investigating, this doesn't look like the root cause).

There are a few problems here:
1) It appears that a test is freezing up. We have a per-test timeout defined: the timeout is set by OS_TEST_TIMEOUT in tox.ini and is enforced via a fixtures.Timeout fixture set up in the oslotest base class (see the sketch after this list). It looks like that timeout doesn't always work.
2) When the global 2 hour job timeout is hit, post-test tasks such as copying over log files are not performed, which makes these problems a lot harder to troubleshoot.
3) And of course, there is likely some sort of issue with https://review.openstack.org/#/c/298056/.
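
For context on problem 1, this is roughly how the per-test timeout is wired up (a minimal sketch of the oslotest mechanism, not its exact code):

import os

import fixtures
import testtools


class BaseTestCase(testtools.TestCase):
    """Sketch of an oslotest-style base class applying OS_TEST_TIMEOUT."""

    def setUp(self):
        super(BaseTestCase, self).setUp()
        try:
            timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
        except ValueError:
            timeout = 0
        if timeout > 0:
            # gentle=True raises TimeoutException inside the test via SIGALRM
            # rather than killing the process; a test stuck somewhere the
            # signal cannot interrupt it can still outlive this fixture, which
            # may be why the per-test timeout "doesn't always work".
            self.useFixture(fixtures.Timeout(timeout, gentle=True))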

We could fix this via a revert, but that would increase the failure rate of fullstack. Since I've been unable to reproduce this issue locally, I'd like to hold off on a revert and try to get more information by tackling some combination of problems 1 and 2, and then adding more logging to figure it out.

Assaf Muller (amuller)
Changed in neutron:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/303320

Changed in neutron:
assignee: nobody → Jakub Libosvar (libosvar)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/303588

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/303594

Revision history for this message
Jakub Libosvar (libosvar) wrote :

Luckily, the issue was reproduced here and we do have logs!
http://logs.openstack.org/94/303594/2/check/gate-neutron-dsvm-functional/1f86f1d/

After looking into the logs, I think the culprit is not the mentioned https://review.openstack.org/#/c/298056/ and that patch deserves an apology. Looking at the ps output after the job finished, we can see the stuck processes; one of them has pid 30938.

The last message in the logs from this process is at 2016-04-08 20:27:59.522, where the neutron.tests.functional.agent.test_l2_ovs_agent.TestOVSAgent.test_ancillary_port_creation_and_deletion_native test failed and likely got stuck: http://logs.openstack.org/94/303594/2/check/gate-neutron-dsvm-functional/1f86f1d/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_l2_ovs_agent.TestOVSAgent.test_ancillary_port_creation_and_deletion_native_.txt.gz#_2016-04-08_20_27_59_522

The other stuck process, 30940, hit the same exception: http://logs.openstack.org/94/303594/2/check/gate-neutron-dsvm-functional/1f86f1d/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_ovs_lib.OVSBridgeTestCase.test_delete_port_if_exists_false_native_.txt.gz#_2016-04-08_20_40_05_584

In both links above, those are the last messages from the respective process.

Revision history for this message
Assaf Muller (amuller) wrote :

Looking at logstash (build_name:"gate-neutron-dsvm-functional" AND build_status:"FAILURE" AND message:"Killed timeout -s 9") one can see that the problem started 3 full days after https://review.openstack.org/#/c/298056/ merged, so it doesn't look like that patch is the root cause.

description: updated
Revision history for this message
Assaf Muller (amuller) wrote :

* 3 full days before, not after.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

I checked that we didn't bump either the eventlet or the ryu version before we started seeing this issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: master
Review: https://review.openstack.org/304198
Reason: So this wasn't the culprit. Sorry OVS2.5, you're just fine.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/304600

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/305203

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/305203
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7d5b1f8dddd0556220ea9f3baae87e3d69b3020f
Submitter: Jenkins
Branch: master

commit 7d5b1f8dddd0556220ea9f3baae87e3d69b3020f
Author: Jakub Libosvar <email address hidden>
Date: Wed Apr 13 12:40:15 2016 +0000

    Skip l2_ovs_agent functional tests

    Let's skip the flaky tests until we find the root cause. The debugging
    will happen with another patch that unblocks these tests so we still
    have reproducer.

    Change-Id: Icad4c1fb7a7ef9d6f5c34f48deb46ae286cb536b
    Related-Bug: 1567668

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: master
Review: https://review.openstack.org/303320

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/303588
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=474b1e7e2f9618a74db3d1359bf2b8f8beffc221
Submitter: Jenkins
Branch: master

commit 474b1e7e2f9618a74db3d1359bf2b8f8beffc221
Author: Assaf Muller <email address hidden>
Date: Fri Apr 8 15:16:54 2016 -0400

    Add logging for some functional tests

    Functional tests log to a file only if they inherit from
    the Sudo tests base class. This patch changes the base
    class for some test cases to make them log.

    Related-Bug: #1567668
    Change-Id: I494ad5410e48489f1fb3689cec44c5a29bbc42f3

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/303594
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8adc737cd17bbc4244e235b4c9343c6e8a08a2ab
Submitter: Jenkins
Branch: master

commit 8adc737cd17bbc4244e235b4c9343c6e8a08a2ab
Author: Assaf Muller <email address hidden>
Date: Fri Apr 8 15:34:28 2016 -0400

    Delete post_test_hook.generate_test_logs

    Log files as .txt files, don't zip them, and put them where
    they need to be instead of copy them there in the post gate
    hook. The benefit to doing this is that we'll get logs
    for tests even if the job timed out.

    Change-Id: I4bfd27534c827aed3cbd7b43d7d1289480ea4806
    Related-Bug: #1567668

Revision history for this message
Jakub Libosvar (libosvar) wrote :

Update: After investigation I finally found the root cause of why our jobs get stuck.

When using the ovsdb native interface, we use the ovs Python library, which encapsulates the connection from our code to ovsdb. Part of this connection is a keepalive feature that, during periods of inactivity, sends an "inactivity probe" (an echo request) every 5 seconds. The connection to ovsdb is an FSM which, after sending the probe, enters an idle state waiting for an echo reply from Neutron (the reply is also handled by the ovs library). In the logs I can see that no reply was received within 5 seconds, so ovsdb closes the connection.

2016-05-08T04:43:21.031Z|109380|reconnect|DBG|tcp:127.0.0.1:42822: idle 5000 ms, sending inactivity probe
2016-05-08T04:43:21.031Z|109381|reconnect|DBG|tcp:127.0.0.1:42822: entering IDLE
2016-05-08T04:43:21.031Z|109382|jsonrpc|DBG|tcp:127.0.0.1:42822: send request, method="echo", params=[], id="echo"
2016-05-08T04:43:21.032Z|109384|jsonrpc|DBG|tcp:127.0.0.1:42822: received reply, result=[], id="echo"
2016-05-08T04:43:21.032Z|109385|reconnect|DBG|tcp:127.0.0.1:42822: entering ACTIVE
2016-05-08T04:43:26.031Z|112393|reconnect|DBG|tcp:127.0.0.1:42822: idle 5000 ms, sending inactivity probe
2016-05-08T04:43:26.031Z|112394|reconnect|DBG|tcp:127.0.0.1:42822: entering IDLE
2016-05-08T04:43:26.031Z|112395|jsonrpc|DBG|tcp:127.0.0.1:42822: send request, method="echo", params=[], id="echo"
2016-05-08T04:43:31.032Z|113938|reconnect|ERR|tcp:127.0.0.1:42822: no response to inactivity probe after 5 seconds, disconnecting
2016-05-08T04:43:31.032Z|113939|reconnect|DBG|tcp:127.0.0.1:42822: connection dropped
2016-05-08T04:43:31.032Z|113940|reconnect|DBG|tcp:127.0.0.1:42822: entering VOID

This leads to blocking behavior in the ovsdb native interface (part of the Neutron code), where we wait for communication from the ovsdb server that never comes. We wait indefinitely for ovsdb, so after 180 seconds the test case correctly raises a TimeoutException. Unfortunately, we then keep using the ovsdb connection in its stuck state without any timeout fixture, which leads to the overall job timeout (writing this just for clarity on why our jobs hang).
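
To make the failure mode concrete, the two unbounded waits look roughly like this (a toy sketch, not Neutron's actual ovsdb code):

import select


def wait_for_ovsdb(conn_fd, result_queue):
    # Toy sketch of the pre-fix behaviour: once ovsdb-server has dropped the
    # connection, no data and no transaction result ever arrive, yet both
    # waits below are unbounded, so the process sits here until the
    # job-level 2 hour timeout kills it.
    select.select([conn_fd], [], [])   # no timeout: may never wake up
    return result_queue.get()          # no timeout: may never return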

I reckon there is a bug in both the ovs library and the Neutron code: the ovs library probably contains a regression in the keepalive feature, and Neutron shouldn't block when the connection to ovsdb is lost.

I'll go over ovs patches to see whether I can find the culprit.

Revision history for this message
Ryan Moats (rmoats) wrote :

There are two semi-independent options here:

First, yes, neutron should not block if the ovsdb connection is lost; it should re-establish the connection and continue.

Second, once #1 is in place, we can look at turning off the probe timer; this should prevent the connection from being declared dead.
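
A minimal sketch of option 1 (retry the transaction over a freshly re-established connection); the connection API names here are illustrative, not an existing Neutron interface:

import time


class OvsdbDisconnected(Exception):
    """Illustrative error meaning 'the ovsdb connection was lost'."""


def run_with_reconnect(connection, txn, attempts=3):
    # Instead of blocking on a dead connection, detect the disconnect,
    # re-establish the connection and re-run the transaction a bounded
    # number of times. connection.execute() and connection.reconnect()
    # are hypothetical calls used only for illustration.
    for attempt in range(attempts):
        try:
            return connection.execute(txn)
        except OvsdbDisconnected:
            connection.reconnect()
            time.sleep(min(2 ** attempt, 10))
    raise OvsdbDisconnected("gave up after %d attempts" % attempts)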

Revision history for this message
Jakub Libosvar (libosvar) wrote :

@Ryan: Re-establishing the connection is taken care of at the ovs library level (where we have this bug), so if we implement the mechanism in Neutron, we'll have it twice. Just stating a fact; I'm also in favor of re-establishing it in our code and re-running the transaction.

IIUC the probe is there to implement keepalive connections and to disconnect when the peer is dead. Do you know the implications for ovsdb if it has a connection without probes and the peer dies?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/315042

Revision history for this message
Jakub Libosvar (libosvar) wrote :

UPD: We introduced this bug by switching from ovs 2.0 to ovs 2.5, since with the former version the probe wasn't used for ptcp connections: https://github.com/openvswitch/ovs/commit/f5cd74291d02f3723e175048db87b2914cd99e34

With probes enabled and a busy server, the greenthread that processes echo requests from the server might not get scheduled in time, which means the server disconnects after not getting an echo reply.
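
A self-contained illustration of that scheduling problem under eventlet (toy numbers; in the gate the probe window is 5 seconds): a greenthread that never yields delays the one that would answer the echo request past the probe deadline.

import time

import eventlet


def echo_responder(probe_window):
    # Stand-in for the greenthread that should answer ovsdb-server's echo
    # request; it only runs when other greenthreads yield.
    start = time.time()
    eventlet.sleep(0)  # yield to the hub and wait for our next turn
    waited = time.time() - start
    print("echo reply sent after %.2fs (probe window: %.2fs)"
          % (waited, probe_window))


def busy_work(seconds):
    # Stand-in for test code doing CPU-bound work with no yield point:
    # nothing else gets scheduled until it finishes.
    end = time.time() + seconds
    while time.time() < end:
        pass


responder = eventlet.spawn(echo_responder, 0.5)
worker = eventlet.spawn(busy_work, 1.0)
responder.wait()
worker.wait()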

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

But why doesn't Neutron recover from that disconnect? Even if you disable the probes, disconnects may happen for some other reason.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/317369

Changed in neutron:
milestone: none → newton-1
Changed in neutron:
assignee: Jakub Libosvar (libosvar) → Henry Gessau (gessau)
Changed in neutron:
assignee: Henry Gessau (gessau) → Jakub Libosvar (libosvar)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/317369
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c13d722f3913137945c27fcc74371d3316129f30
Submitter: Jenkins
Branch: master

commit c13d722f3913137945c27fcc74371d3316129f30
Author: Jakub Libosvar <email address hidden>
Date: Mon May 16 18:07:32 2016 +0000

    ovsdb: Don't let block() wait indefinitely

    Poller.block() calls select() on defined file descriptors. This patch
    adds timeout to select() so it doesn't get stuck in case no fd is ready.

    Also timeout was added when reading transaction results from queue.

    Closes-Bug: 1567668
    Change-Id: I7dbddd01409430ce93d8c23f04f02c46fb2a68c4

Changed in neutron:
status: In Progress → Fix Released
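
The idea of the merged fix, in sketch form (a minimal sketch with illustrative timeout values and names, not the actual Neutron code): bound both waits so a dead connection surfaces as an error instead of hanging the whole job.

import queue
import select

POLL_TIMEOUT = 5          # seconds; illustrative, not the value in the patch
TXN_RESULT_TIMEOUT = 60   # seconds; illustrative


def block(read_fds):
    # Bounded variant of a Poller.block()-style wait: wake up even if no fd
    # ever becomes ready, e.g. because ovsdb-server dropped the connection.
    readable, _, _ = select.select(read_fds, [], [], POLL_TIMEOUT)
    return readable


def get_transaction_result(result_queue):
    # Bounded read of the transaction result queue instead of waiting forever.
    try:
        return result_queue.get(timeout=TXN_RESULT_TIMEOUT)
    except queue.Empty:
        raise RuntimeError("no response from the ovsdb connection thread")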
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

The incidence of the bug has decreased, but according to the logstash query it still seems to be happening.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b1

This issue was fixed in the openstack/neutron 9.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: master
Review: https://review.openstack.org/315042

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/304600
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/349415

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/349415
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f49a25c088392068e8a82f43cbab21d4a8981b10
Submitter: Jenkins
Branch: stable/mitaka

commit f49a25c088392068e8a82f43cbab21d4a8981b10
Author: Jakub Libosvar <email address hidden>
Date: Mon May 16 18:07:32 2016 +0000

    ovsdb: Don't let block() wait indefinitely

    Poller.block() calls select() on defined file descriptors. This patch
    adds timeout to select() so it doesn't get stuck in case no fd is ready.

    Also timeout was added when reading transaction results from queue.

    Conflicts:
     neutron/tests/functional/agent/test_l2_ovs_agent.py

    Closes-Bug: 1567668
    Change-Id: I7dbddd01409430ce93d8c23f04f02c46fb2a68c4
    (cherry picked from commit c13d722f3913137945c27fcc74371d3316129f30)

tags: added: in-stable-mitaka
tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 8.3.0

This issue was fixed in the openstack/neutron 8.3.0 release.
