test_network_basic_ops to fail with SSHTimeout

Bug #1623732 reported by Armando Migliaccio
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Armando Migliaccio

Bug Description

http://logs.openstack.org/63/370363/3/gate/gate-grenade-dsvm-neutron-ubuntu-trusty/5581f78/

http://logs.openstack.org/09/368709/3/gate/gate-grenade-dsvm-neutron-ubuntu-trusty/d1195ea/

Allegedly errors in network startup:

Initializing random number generator... done.
Starting acpid: OK
cirros-ds 'local' up at 5.68
no results found for mode=local. up 5.99. searched: nocloud configdrive ec2
Starting network...
udhcpc (v1.20.1) started
Sending discover...
Sending select for 10.1.0.10...
Lease of 10.1.0.10 obtained, lease time 86400
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
cirros-ds 'net' up at 7.01

Leads to usual:

tempest.lib.exceptions.SSHTimeout: Connection to the 172.24.5.9 via SSH timed out.

Tags: gate-failure
Changed in neutron:
milestone: none → newton-rc1
importance: Undecided → High
tags: added: gate-failure
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Changed in neutron:
importance: High → Critical
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This seems to be a persistent failure or a very stubborn one.

summary: - failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
+ failed: route add -net "0.0.0.0/0" gw "10.1.0.1" leads to
+ test_network_basic_ops to fail with SSHTimeout
Changed in neutron:
status: New → Confirmed
Revision history for this message
Brian Haley (brian-haley) wrote : Re: failed: route add -net "0.0.0.0/0" gw "10.1.0.1" leads to test_network_basic_ops to fail with SSHTimeout

route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"

So what was the default route that existed already? Is there some problem with the cirros image being booted?

Revision history for this message
Jakub Libosvar (libosvar) wrote :

I tracked it down to malfunctioned metadata service due to errors in l3 agent: http://logs.openstack.org/09/368709/3/check/gate-grenade-dsvm-neutron-ubuntu-trusty/dc9b80f/logs/new/screen-q-l3.txt.gz#_2016-09-15_01_17_49_627

For some reason it ignores the namespace and attempts to set mac of qg- device in a root namespace.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

I went through the code and we attach the qg- device to the namespace later, so it's expected to have the qg- at this point in the root namespace.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

It looks like it failed to create the qg- port in br-ex:
From l3-agent log:
2016-09-15 01:17:49.519 18771 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): AddPortCommand(bridge=br-ex, may_exist=False, port=qg-b131ae03-60) do_commit /opt/stack/new/neutron/neutron/agent/ovsdb/impl_idl.py:98
2016-09-15 01:17:49.519 18771 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=1): DbSetCommand(table=Interface, col_values=(('type', 'internal'), ('external_ids', {'iface-id': u'b131ae03-6030-41e6-88b4-fe6f8d26da30', 'iface-status': 'active', 'attached-mac': u'fa:16:3e:10:6c:4b'})), record=qg-b131ae03-60) do_commit /opt/stack/new/neutron/neutron/agent/ovsdb/impl_idl.py:98

vswtichd log:
2016-09-15T01:17:49.569Z|00162|netdev_linux|WARN|Dropped 5 log messages in last 12 seconds (most recently, 6 seconds ago) due to excessive rate
2016-09-15T01:17:49.569Z|00163|netdev_linux|WARN|query qg-b131ae03-60 qdisc failed (No such device)

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Nice find. At this point I am looking for potential culprits, but I have not found any yet. It's way too suspicious that the error rate of the grenade job started to spike in the last 24 hours.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Brian: Kevin mentioned that that error might be a red herring.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

So my finding led to https://bugs.launchpad.net/neutron/+bug/1618987

I'm still looking why neutron-ns-metadata-proxy for that router is not spawned.

Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
status: Confirmed → In Progress
summary: - failed: route add -net "0.0.0.0/0" gw "10.1.0.1" leads to
test_network_basic_ops to fail with SSHTimeout
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

The bot forgot to link the review: https://review.openstack.org/#/c/371032/

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Jakub: I'll track down why the proxy failed to spawn.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

If we get a failure in [1], as we do, then the AFTER_CREATE notify is not emitted, and since the metadata driver has no UPDATE callbacks, the metadata proxy is never spawned.

[1] http://git.openstack.org/cgit/openstack/neutron/tree/neutron/agent/l3/agent.py#n447

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/371085

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Armando Migliaccio (armando-migliaccio)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

To sum it up, it seems the resync logic on the L3 agent is not robust enough to handle intermittent setup failures for routers, in that metadata proxy fails to get spawned even on router resyncs. For this reason, the instance is unable to connect to the metadata server and thus unable to get the keys and ultimately unable to be ssh-ed into.

This is I believe what's happening especially on the grenade runs where metadata is still used.

Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → Kevin Benton (kevinbenton)
Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Armando Migliaccio (armando-migliaccio)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/371032
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a75ce6850f3954edafbb0c128750e39b57875743
Submitter: Jenkins
Branch: master

commit a75ce6850f3954edafbb0c128750e39b57875743
Author: Kevin Benton <email address hidden>
Date: Thu Sep 15 01:33:00 2016 -0700

    Retry setting mac address on new OVS port 10 times

    We've seen several times in the gate an OVS add port call
    succeed and then have the mac address set fail to find it
    seconds later. The vswitch log frequently shows that it
    returns milliseconds later. Until we get to the bottom of
    it, we should just retry several times before giving up
    and raising.

    Closes-Bug: #1618987
    Partial-Bug: #1623732
    Change-Id: Ia73a9be047093c02f61e3e9ce13d98dccd49dfeb

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/371085
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a60c2de881a806e041190732d4def209c583aac1
Submitter: Jenkins
Branch: master

commit a60c2de881a806e041190732d4def209c583aac1
Author: Armando Migliaccio <email address hidden>
Date: Thu Sep 15 13:06:06 2016 -0700

    Add metadata proxy router_update callback handler

    This patch implements the callback handler for router update events;
    This checks if the proxy process monitor is active, and if not, starts
    the proxy.

    This is particularly important if the metadata driver misses to receive
    a create notification due to failures, which in turn generates an update
    event because of a resync step.

    Closes-bug: #1623732

    Change-Id: I296a37daff1e5f018ae11eb8661c77ad346b8075

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0rc1

This issue was fixed in the openstack/neutron 9.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.