Trunk scenario test test_trunk_subport_lifecycle fails from time to time

Bug #1795870 reported by Slawek Kaplonski on 2018-10-03
Affects: neutron
Importance: High
Assigned to: Miguel Lavalle

Bug Description

Example of failure: http://logs.openstack.org/85/606385/4/check/neutron-tempest-plugin-dvr-multinode-scenario/e7a983b/logs/testr_results.html.gz

As you can see in logstash: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22line%20143%2C%20in%20test_trunk_subport_lifecycle%5C%22

It's most common in the neutron-tempest-plugin-dvr-multinode-scenario job, but it's not the only failing job. Other scenario jobs are impacted too.

Miguel Lavalle (minsel) on 2018-10-09
Changed in neutron:
assignee: nobody → Miguel Lavalle (minsel)
Miguel Lavalle (minsel) wrote :

We got 33 hits over the past 7 days, out of which 29 are with job neutron-tempest-plugin-dvr-multinode-scenario. So it seems that we should focus on isolating the problem in that job.

The test case creates two instances with a trunk and a floating IP each. Before doing any operations with the trunk, the test case attempts to ssh to both instances. The failure occurs in the second ssh. Here's the debugging data I have so far: http://paste.openstack.org/show/732300/
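The connectivity check the test performs can be sketched as a simple retry loop. This is a hedged approximation only: tempest's actual checker authenticates over SSH rather than just probing the port, and the function name, timeout, and interval below are illustrative, not tempest's configured values.

```python
import socket
import time


def wait_for_ssh(host, port=22, timeout=60, interval=5):
    """Poll until the instance's SSH port accepts a TCP connection.

    Simplified stand-in for the SSH connectivity check the test runs
    against each floating IP; returns False if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In the failing runs, it is the second instance's check that never succeeds before the deadline.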

Miguel Lavalle (minsel) wrote :

For the time being, adding it to the DVR backlog as well

tags: added: l3-dvr-backlog
Miguel Lavalle (minsel) wrote :

As of October 29th, there have been 0 hits over the past 7 days. I'll continue checking it.

Miguel Lavalle (minsel) wrote :

Removing the code line number from the query in the original filing and #4, so code changes don't render the query useless:

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22,%20in%20test_trunk_subport_lifecycle%5C%22&from=7d

Fix proposed to branch: master
Review: https://review.openstack.org/624271

Changed in neutron:
status: Confirmed → In Progress
Miguel Lavalle (minsel) wrote :

Analyzing the failure of test_trunk_subport_lifecycle in http://logs.openstack.org/90/627990/1/check/neutron-tempest-plugin-dvr-multinode-scenario/c760de2/testr_results.html.gz I find the following:

1) One of the instances (instance id 0580b44c-c56d-4683-b567-0008dbbe04a1, fixed ip address: 10.1.0.5, port id: 1cb39de5-6f27-4dc9-ab0d-8709b682ebd7 bound in host: ubuntu-bionic-rax-dfw-0001460688 Compute1, router id 50456504-e7e2-45c4-8b9d-2c8a7ee93c4e) gets metadata service correctly through the metadata proxy: http://paste.openstack.org/show/740465/

2) The other instance (instance id 35814d32-0769-4904-8332-638eb729e11f, fixed ip address: 10.1.0.13, port id: a685e1a1-0d79-4c54-8cea-5bf7883fcb93, bound in host ubuntu-bionic-rax-dfw-0001460687 controller, router id 50456504-e7e2-45c4-8b9d-2c8a7ee93c4e) cannot get metadata service. Metadata proxy lines for that router cannot be found in the L3 agent log file in the controller. In fact the L3 agent log file in the controller has no lines referencing router 50456504-e7e2-45c4-8b9d-2c8a7ee93c4e. As a consequence, we find failure to contact the metadata service in the instance's console log: http://paste.openstack.org/show/740467/

3) With other routers the metadata proxy works in both nodes. For example router d258390e-a30e-4302-a26a-2a03510bb1d3, which is created by test_connectivity_through_2_routers. These are the haproxy logs for that router from the L3 agent log files in both controller and compute1: http://paste.openstack.org/show/740472/ and http://paste.openstack.org/show/740471.

Based on these findings, the next step is to investigate how the routers are being scheduled.
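The log inspection in points 1) and 2) amounts to filtering the L3 agent's log for any line mentioning the router's UUID; a minimal sketch (the helper name and sample log line are hypothetical):

```python
def lines_mentioning_router(log_lines, router_id):
    """Return L3 agent log lines that reference the given router UUID."""
    return [line for line in log_lines if router_id in line]


# Hypothetical sample: the controller's log mentions a different router
# but has no line for 50456504-..., which is the symptom in point 2.
controller_log = [
    "DEBUG neutron.agent.l3.agent Got routers updated notification "
    ":['d258390e-a30e-4302-a26a-2a03510bb1d3']",
]
print(lines_mentioning_router(
    controller_log, "50456504-e7e2-45c4-8b9d-2c8a7ee93c4e"))  # → []
```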

Fix proposed to branch: master
Review: https://review.openstack.org/630778

Miguel Lavalle (minsel) wrote :

Neutron server is losing contact with the L3 agent running in the controller node. One example is:

Jan 26 18:27:23.664199 ubuntu-bionic-rax-iad-0002168118 neutron-server[6878]: WARNING neutron.db.agents_db [None req-96b7c5e3-0c74-48ca-92a2-6b43a9ef6544 None None] Agent healthcheck: found 1 dead agents out of 8:
Jan 26 18:27:23.664199 ubuntu-bionic-rax-iad-0002168118 neutron-server[6878]: Type Last heartbeat host
Jan 26 18:27:23.664199 ubuntu-bionic-rax-iad-0002168118 neutron-server[6878]: L3 agent 2019-01-26 18:25:44 ubuntu-bionic-rax-iad-0002168118

Checking in the L3 agent log around the time the first instance of the above message is seen, we can find this traceback: http://paste.openstack.org/show/744001/. Please note that this traceback takes place at Jan 26 18:25:56.559883, whereas the Neutron server starts reporting losing contact with the L3 agent (see message above) at Jan 26 18:27:23.664199, having received the last heartbeat at 2019-01-26 18:25:44. In fact, this is the last time the L3 agent reports receiving a router update:

Jan 26 18:25:56.399748 ubuntu-bionic-rax-iad-0002168118 neutron-l3-agent[8618]: DEBUG neutron.agent.l3.agent [None req-296cf80d-5b44-4c99-914d-499ec949394b tempest-NetworkMigrationFromHA-1759813396 tempest-NetworkMigrationFromHA-1759813396] Got routers updated notification :['e6e7911c-a3e0-4331-abe4-580aaf5ba2fc'] {{(pid=8618) routers_updated /opt/stack/neutron/neutron/agent/l3/agent.py:444}}

The router with uuid e6e7911c-a3e0-4331-abe4-580aaf5ba2fc is being migrated from HA to DVR by test case NetworkMigrationFromHA:test_from_ha_to_dvr.

I have confirmed a similar pattern takes place in several occurrences of this bug. In all cases, a router is being migrated from HA to DVR or legacy.
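The "dead agents" warning above comes from Neutron's agent healthcheck, which reports an agent dead once its last heartbeat is older than the agent_down_time option (default 75 seconds). A rough sketch of that logic, using the timestamps from the log excerpt (the helper name is hypothetical, not Neutron's actual code):

```python
from datetime import datetime, timedelta

# Neutron's agent_down_time option defaults to 75 seconds; an agent whose
# last heartbeat is older than that is reported dead by the healthcheck.
AGENT_DOWN_TIME = timedelta(seconds=75)


def is_agent_dead(last_heartbeat, now):
    return now - last_heartbeat > AGENT_DOWN_TIME


# The heartbeat at 18:25:44 is 99 seconds old when the 18:27:23
# healthcheck runs, so the L3 agent is reported dead.
last = datetime(2019, 1, 26, 18, 25, 44)
check = datetime(2019, 1, 26, 18, 27, 23)
print(is_agent_dead(last, check))  # → True
```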

Next step is to dig deeper into the traceback http://paste.openstack.org/show/744001/

Fix proposed to branch: master
Review: https://review.openstack.org/639375

Akihiro Motoki (amotoki) on 2019-03-12
Changed in neutron:
milestone: none → stein-rc1

Reviewed: https://review.openstack.org/636710
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=25c432a05a57f794dcbb4f17ce224d914c65e071
Submitter: Zuul
Branch: master

commit 25c432a05a57f794dcbb4f17ce224d914c65e071
Author: Miguel Lavalle <email address hidden>
Date: Wed Feb 13 12:29:36 2019 -0600

    Add rootwrap filters to kill state change monitor

    When deleting HA routers, the keepalived state change monitor has to be
    deleted. This patch adds rootwrap filters to allow deleting the state
    change monitor.

    Change-Id: Icfb208d9b51eaa41cf01af81f1ede7420a19cc93
    Partial-Bug: #1795870
    Partial-Bug: #1789434
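Rootwrap filters are ini-style entries in files under /etc/neutron/rootwrap.d/; a KillFilter entry names the user, the executable, and the signals the agent is allowed to send. An entry permitting the L3 agent to terminate the keepalived state change monitor would look roughly like this (an illustrative sketch, not a verbatim copy of the patch; the entry names are assumptions):

```ini
# l3.filters (sketch): allow the L3 agent to send SIGTERM/SIGKILL to the
# neutron-keepalived-state-change monitor process
kill_keepalived_monitor_py: KillFilter, root, python, -15, -9
kill_keepalived_monitor_py3: KillFilter, root, python3, -15, -9
```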

Miguel Lavalle (minsel) on 2019-03-14
Changed in neutron:
status: In Progress → Fix Committed

Change abandoned by Miguel Lavalle (<email address hidden>) on branch: master
Review: https://review.openstack.org/630778

Change abandoned by Miguel Lavalle (<email address hidden>) on branch: master
Review: https://review.openstack.org/624271
Reason: Fix to bug was released in Neutron: https://review.openstack.org/#/c/636710/

tags: added: neutron-proactive-backport-potential
tags: added: neutron-easy-proactive-backport-potential

Reviewed: https://review.openstack.org/645283
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6f3620aa88feef9527f8c9599dec049a831b49fa
Submitter: Zuul
Branch: stable/queens

commit 6f3620aa88feef9527f8c9599dec049a831b49fa
Author: Miguel Lavalle <email address hidden>
Date: Wed Feb 13 12:29:36 2019 -0600

    Add rootwrap filters to kill state change monitor

    When deleting HA routers, the keepalived state change monitor has to be
    deleted. This patch adds rootwrap filters to allow deleting the state
    change monitor.

    Change-Id: Icfb208d9b51eaa41cf01af81f1ede7420a19cc93
    Partial-Bug: #1795870
    Partial-Bug: #1789434
    (cherry picked from commit 25c432a05a57f794dcbb4f17ce224d914c65e071)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/645282
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8b7955dade3e388ad030ce2291651cef72d55108
Submitter: Zuul
Branch: stable/rocky

commit 8b7955dade3e388ad030ce2291651cef72d55108
Author: Miguel Lavalle <email address hidden>
Date: Wed Feb 13 12:29:36 2019 -0600

    Add rootwrap filters to kill state change monitor

    When deleting HA routers, the keepalived state change monitor has to be
    deleted. This patch adds rootwrap filters to allow deleting the state
    change monitor.

    Change-Id: Icfb208d9b51eaa41cf01af81f1ede7420a19cc93
    Partial-Bug: #1795870
    Partial-Bug: #1789434
    (cherry picked from commit 25c432a05a57f794dcbb4f17ce224d914c65e071)

tags: added: in-stable-rocky
tags: removed: neutron-easy-proactive-backport-potential neutron-proactive-backport-potential

Reviewed: https://review.openstack.org/650255
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=672a4328a97d6dae98cec208851198040530bd35
Submitter: Zuul
Branch: stable/pike

commit 672a4328a97d6dae98cec208851198040530bd35
Author: Miguel Lavalle <email address hidden>
Date: Wed Feb 13 12:29:36 2019 -0600

    Add rootwrap filters to kill state change monitor

    When deleting HA routers, the keepalived state change monitor has to be
    deleted. This patch adds rootwrap filters to allow deleting the state
    change monitor.

    Change-Id: Icfb208d9b51eaa41cf01af81f1ede7420a19cc93
    Partial-Bug: #1795870
    Partial-Bug: #1789434
    (cherry picked from commit 25c432a05a57f794dcbb4f17ce224d914c65e071)
    (cherry picked from commit 6f3620aa88feef9527f8c9599dec049a831b49fa)

tags: added: in-stable-pike

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/639375
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
