Functional job sometimes hits global 2 hour limit and fails

Bug #1567668 reported by Assaf Muller
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Jakub Libosvar
Milestone: newton-1

Bug Description

Here's an example:
http://logs.openstack.org/13/302913/1/check/gate-neutron-dsvm-functional/91dd537/console.html

Logstash query:
build_name:"gate-neutron-dsvm-functional" AND build_status:"FAILURE" AND message:"Killed timeout -s 9"

45 hits in the last 7 days.

Ihar and I checked the timing, and it started happening as we merged: https://review.openstack.org/#/c/298056/ (EDIT: After some investigating, this doesn't look like the root cause).

There are a few problems here:
1) It appears that a test is freezing up. We have a per-test timeout defined: the timeout is set by OS_TEST_TIMEOUT in tox.ini and is enforced via a fixtures.Timeout fixture set up in the oslotest base class (see the sketch after this list). It looks like that timeout doesn't always work.
2) When the global 2 hour job timeout is hit, post-test tasks such as copying over log files are not performed, which makes these problems a lot harder to troubleshoot.
3) And of course, there is likely some sort of issue with https://review.openstack.org/#/c/298056/.
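
For context on problem 1, this is roughly how the per-test timeout is wired up (a minimal sketch of the oslotest mechanism, not its exact code):

import os

import fixtures
import testtools


class BaseTestCase(testtools.TestCase):
    """Sketch of an oslotest-style base class applying OS_TEST_TIMEOUT."""

    def setUp(self):
        super(BaseTestCase, self).setUp()
        try:
            timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
        except ValueError:
            timeout = 0
        if timeout > 0:
            # gentle=True raises TimeoutException inside the test via SIGALRM
            # rather than killing the process; a test stuck somewhere the
            # signal cannot interrupt it can still outlive this fixture, which
            # may be why the per-test timeout "doesn't always work".
            self.useFixture(fixtures.Timeout(timeout, gentle=True))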

We could fix this via a revert, but that would increase the failure rate of fullstack. Since I've been unable to reproduce this issue locally, I'd like to hold off on a revert and try to get more information by tackling some combination of problems 1 and 2, and then adding more logging to figure it out.

Assaf Muller (amuller)
Changed in neutron:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/303320

Changed in neutron:
assignee: nobody → Jakub Libosvar (libosvar)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/303588

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/303594

Revision history for this message
Jakub Libosvar (libosvar) wrote :

Luckily, the issue was reproduced here and we do have logs!
http://logs.openstack.org/94/303594/2/check/gate-neutron-dsvm-functional/1f86f1d/

After looking into the logs, I think the culprit is not the mentioned https://review.openstack.org/#/c/298056/ and that patch deserves an apology. Looking at the ps output after the job finished, we can see the stuck processes; one of them has pid 30938.

The last message in the logs from this process is at 2016-04-08 20:27:59.522, where the neutron.tests.functional.agent.test_l2_ovs_agent.TestOVSAgent.test_ancillary_port_creation_and_deletion_native test failed and likely got stuck: http://logs.openstack.org/94/303594/2/check/gate-neutron-dsvm-functional/1f86f1d/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_l2_ovs_agent.TestOVSAgent.test_ancillary_port_creation_and_deletion_native_.txt.gz#_2016-04-08_20_27_59_522

The other stuck process, 30940, hit the same exception: http://logs.openstack.org/94/303594/2/check/gate-neutron-dsvm-functional/1f86f1d/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_ovs_lib.OVSBridgeTestCase.test_delete_port_if_exists_false_native_.txt.gz#_2016-04-08_20_40_05_584

In both links above, those are the last messages from the respective process.

Revision history for this message
Assaf Muller (amuller) wrote :

Looking at logstash (build_name:"gate-neutron-dsvm-functional" AND build_status:"FAILURE" AND message:"Killed timeout -s 9") one can see that the problem started 3 full days after https://review.openstack.org/#/c/298056/ merged, so it doesn't look like that patch is the root cause.

description: updated
Revision history for this message
Assaf Muller (amuller) wrote :

* 3 full days before, not after.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

I checked that we didn't bump either the eventlet or the ryu version before we started seeing this issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: master
Review: https://review.openstack.org/304198
Reason: So this wasn't the culprit. Sorry OVS2.5, you're just fine.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/304600

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/305203

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/305203
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7d5b1f8dddd0556220ea9f3baae87e3d69b3020f
Submitter: Jenkins
Branch: master

commit 7d5b1f8dddd0556220ea9f3baae87e3d69b3020f
Author: Jakub Libosvar <email address hidden>
Date: Wed Apr 13 12:40:15 2016 +0000

    Skip l2_ovs_agent functional tests

    Let's skip the flaky tests until we find the root cause. The debugging
    will happen with another patch that unblocks these tests so we still
    have reproducer.

    Change-Id: Icad4c1fb7a7ef9d6f5c34f48deb46ae286cb536b
    Related-Bug: 1567668

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: master
Review: https://review.openstack.org/303320

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/303588
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=474b1e7e2f9618a74db3d1359bf2b8f8beffc221
Submitter: Jenkins
Branch: master

commit 474b1e7e2f9618a74db3d1359bf2b8f8beffc221
Author: Assaf Muller <email address hidden>
Date: Fri Apr 8 15:16:54 2016 -0400

    Add logging for some functional tests

    Functional tests log to a file only if they inherit from
    the Sudo tests base class. This patch changes the base
    class for some test cases to make them log.

    Related-Bug: #1567668
    Change-Id: I494ad5410e48489f1fb3689cec44c5a29bbc42f3

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/303594
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8adc737cd17bbc4244e235b4c9343c6e8a08a2ab
Submitter: Jenkins
Branch: master

commit 8adc737cd17bbc4244e235b4c9343c6e8a08a2ab
Author: Assaf Muller <email address hidden>
Date: Fri Apr 8 15:34:28 2016 -0400

    Delete post_test_hook.generate_test_logs

    Log files as .txt files, don't zip them, and put them where
    they need to be instead of copy them there in the post gate
    hook. The benefit to doing this is that we'll get logs
    for tests even if the job timed out.

    Change-Id: I4bfd27534c827aed3cbd7b43d7d1289480ea4806
    Related-Bug: #1567668

Revision history for this message
Jakub Libosvar (libosvar) wrote :

Update: After investigation I finally found the root cause of why our jobs get stuck.

When using the ovsdb native interface, we use the ovs Python library, which encapsulates the connection from our code to ovsdb. Part of this connection is a keepalive feature that, during periods of inactivity, sends an "inactivity probe" (an echo request) every 5 seconds. The connection to ovsdb is an FSM which, after sending the probe, enters an idle state waiting for an echo reply from Neutron (the reply is also handled by the ovs library). In the logs I can see that no reply was received within 5 seconds, so ovsdb closes the connection.

2016-05-08T04:43:21.031Z|109380|reconnect|DBG|tcp:127.0.0.1:42822: idle 5000 ms, sending inactivity probe
2016-05-08T04:43:21.031Z|109381|reconnect|DBG|tcp:127.0.0.1:42822: entering IDLE
2016-05-08T04:43:21.031Z|109382|jsonrpc|DBG|tcp:127.0.0.1:42822: send request, method="echo", params=[], id="echo"
2016-05-08T04:43:21.032Z|109384|jsonrpc|DBG|tcp:127.0.0.1:42822: received reply, result=[], id="echo"
2016-05-08T04:43:21.032Z|109385|reconnect|DBG|tcp:127.0.0.1:42822: entering ACTIVE
2016-05-08T04:43:26.031Z|112393|reconnect|DBG|tcp:127.0.0.1:42822: idle 5000 ms, sending inactivity probe
2016-05-08T04:43:26.031Z|112394|reconnect|DBG|tcp:127.0.0.1:42822: entering IDLE
2016-05-08T04:43:26.031Z|112395|jsonrpc|DBG|tcp:127.0.0.1:42822: send request, method="echo", params=[], id="echo"
2016-05-08T04:43:31.032Z|113938|reconnect|ERR|tcp:127.0.0.1:42822: no response to inactivity probe after 5 seconds, disconnecting
2016-05-08T04:43:31.032Z|113939|reconnect|DBG|tcp:127.0.0.1:42822: connection dropped
2016-05-08T04:43:31.032Z|113940|reconnect|DBG|tcp:127.0.0.1:42822: entering VOID

This leads to blocking behavior in the ovsdb native interface (part of the Neutron code), where we wait for communication from the ovsdb server that never comes. We wait indefinitely for ovsdb, so after 180 seconds the test case correctly raises a TimeoutException. Unfortunately, we then keep using the ovsdb connection in its stuck state without any timeout fixture, which leads to the overall job timeout (writing this just for clarity on why our jobs hang).
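
To make the failure mode concrete, the two unbounded waits look roughly like this (a toy sketch, not Neutron's actual ovsdb code):

import select


def wait_for_ovsdb(conn_fd, result_queue):
    # Toy sketch of the pre-fix behaviour: once ovsdb-server has dropped the
    # connection, no data and no transaction result ever arrive, yet both
    # waits below are unbounded, so the process sits here until the
    # job-level 2 hour timeout kills it.
    select.select([conn_fd], [], [])   # no timeout: may never wake up
    return result_queue.get()          # no timeout: may never return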

I reckon there is a bug in both the ovs library and the Neutron code: the ovs library probably contains a regression in the keepalive feature, and Neutron shouldn't block when the connection to ovsdb is lost.

I'll go over ovs patches to see whether I can find the culprit.

Revision history for this message
Ryan Moats (rmoats) wrote :

There are two semi-independent options here:

First, yes, neutron should not block if the ovsdb connection is lost; it should re-establish the connection and continue.

Second, once #1 is in place, we can look at turning off the probe timer; this should prevent the connection from being declared dead.
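
A minimal sketch of option 1 (retry the transaction over a freshly re-established connection); the connection API names here are illustrative, not an existing Neutron interface:

import time


class OvsdbDisconnected(Exception):
    """Illustrative error meaning 'the ovsdb connection was lost'."""


def run_with_reconnect(connection, txn, attempts=3):
    # Instead of blocking on a dead connection, detect the disconnect,
    # re-establish the connection and re-run the transaction a bounded
    # number of times. connection.execute() and connection.reconnect()
    # are hypothetical calls used only for illustration.
    for attempt in range(attempts):
        try:
            return connection.execute(txn)
        except OvsdbDisconnected:
            connection.reconnect()
            time.sleep(min(2 ** attempt, 10))
    raise OvsdbDisconnected("gave up after %d attempts" % attempts)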

Revision history for this message
Jakub Libosvar (libosvar) wrote :

@Ryan: Re-establishing the connection is taken care of at the ovs library level (where we have this bug), so if we implement the mechanism in Neutron, we'll have it twice. Just stating a fact; I'm also in favor of re-establishing it in our code and re-running the transaction.

IIUC the probe is there to implement keepalive connections and to disconnect when the peer is dead. Do you know the implications for ovsdb if it has a connection without probes and the peer dies?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/315042

Revision history for this message
Jakub Libosvar (libosvar) wrote :

UPD: We introduced this bug by switching from ovs 2.0 to ovs 2.5, since with the former version the probe wasn't used for ptcp connections: https://github.com/openvswitch/ovs/commit/f5cd74291d02f3723e175048db87b2914cd99e34

With probes enabled and a busy server, the greenthread that processes echo requests from the server might not get scheduled in time, which means the server disconnects after not getting an echo reply.
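
A self-contained illustration of that scheduling problem under eventlet (toy numbers; in the gate the probe window is 5 seconds): a greenthread that never yields delays the one that would answer the echo request past the probe deadline.

import time

import eventlet


def echo_responder(probe_window):
    # Stand-in for the greenthread that should answer ovsdb-server's echo
    # request; it only runs when other greenthreads yield.
    start = time.time()
    eventlet.sleep(0)  # yield to the hub and wait for our next turn
    waited = time.time() - start
    print("echo reply sent after %.2fs (probe window: %.2fs)"
          % (waited, probe_window))


def busy_work(seconds):
    # Stand-in for test code doing CPU-bound work with no yield point:
    # nothing else gets scheduled until it finishes.
    end = time.time() + seconds
    while time.time() < end:
        pass


responder = eventlet.spawn(echo_responder, 0.5)
worker = eventlet.spawn(busy_work, 1.0)
responder.wait()
worker.wait()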

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

But why doesn't Neutron recover from that disconnect? Even if you disable the probes, disconnects may happen for some other reason.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/317369

Changed in neutron:
milestone: none → newton-1
Changed in neutron:
assignee: Jakub Libosvar (libosvar) → Henry Gessau (gessau)
Changed in neutron:
assignee: Henry Gessau (gessau) → Jakub Libosvar (libosvar)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/317369
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c13d722f3913137945c27fcc74371d3316129f30
Submitter: Jenkins
Branch: master

commit c13d722f3913137945c27fcc74371d3316129f30
Author: Jakub Libosvar <email address hidden>
Date: Mon May 16 18:07:32 2016 +0000

    ovsdb: Don't let block() wait indefinitely

    Poller.block() calls select() on defined file descriptors. This patch
    adds timeout to select() so it doesn't get stuck in case no fd is ready.

    Also timeout was added when reading transaction results from queue.

    Closes-Bug: 1567668
    Change-Id: I7dbddd01409430ce93d8c23f04f02c46fb2a68c4

Changed in neutron:
status: In Progress → Fix Released
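
The idea of the merged fix, in sketch form (a minimal sketch with illustrative timeout values and names, not the actual Neutron code): bound both waits so a dead connection surfaces as an error instead of hanging the whole job.

import queue
import select

POLL_TIMEOUT = 5          # seconds; illustrative, not the value in the patch
TXN_RESULT_TIMEOUT = 60   # seconds; illustrative


def block(read_fds):
    # Bounded variant of a Poller.block()-style wait: wake up even if no fd
    # ever becomes ready, e.g. because ovsdb-server dropped the connection.
    readable, _, _ = select.select(read_fds, [], [], POLL_TIMEOUT)
    return readable


def get_transaction_result(result_queue):
    # Bounded read of the transaction result queue instead of waiting forever.
    try:
        return result_queue.get(timeout=TXN_RESULT_TIMEOUT)
    except queue.Empty:
        raise RuntimeError("no response from the ovsdb connection thread")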
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

The incidence of the bug has decreased, but according to the logstash query it still seems to be happening.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b1

This issue was fixed in the openstack/neutron 9.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Jakub Libosvar (<email address hidden>) on branch: master
Review: https://review.openstack.org/315042

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/304600
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/349415

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/349415
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f49a25c088392068e8a82f43cbab21d4a8981b10
Submitter: Jenkins
Branch: stable/mitaka

commit f49a25c088392068e8a82f43cbab21d4a8981b10
Author: Jakub Libosvar <email address hidden>
Date: Mon May 16 18:07:32 2016 +0000

    ovsdb: Don't let block() wait indefinitely

    Poller.block() calls select() on defined file descriptors. This patch
    adds timeout to select() so it doesn't get stuck in case no fd is ready.

    Also timeout was added when reading transaction results from queue.

    Conflicts:
     neutron/tests/functional/agent/test_l2_ovs_agent.py

    Closes-Bug: 1567668
    Change-Id: I7dbddd01409430ce93d8c23f04f02c46fb2a68c4
    (cherry picked from commit c13d722f3913137945c27fcc74371d3316129f30)

tags: added: in-stable-mitaka
tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 8.3.0

This issue was fixed in the openstack/neutron 8.3.0 release.
