test_network_basic_ops fails waiting for network to become available

Bug #1224001 reported by Peter Portante on 2013-09-11
This bug affects 7 people
Affects                           Importance  Assigned to
OpenStack Object Storage (swift)  Undecided   Unassigned
neutron                           High        Salvatore Orlando
tempest                           Undecided   Unassigned

Bug Description

See http://logs.openstack.org/25/43125/4/gate/gate-tempest-devstack-vm-neutron/40f3725/console.html

2013-09-11 16:20:48.981 | ======================================================================
2013-09-11 16:20:48.982 | FAIL: tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops[gate,smoke]
2013-09-11 16:20:48.982 | tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops[gate,smoke]
2013-09-11 16:20:48.982 | ----------------------------------------------------------------------
2013-09-11 16:20:48.982 | _StringException: Empty attachments:
2013-09-11 16:20:48.982 | stderr
2013-09-11 16:20:48.982 | stdout
2013-09-11 16:20:48.982 |
2013-09-11 16:20:48.983 | pythonlogging:'': {{{
2013-09-11 16:20:48.983 | 2013-09-11 16:05:44,209 Starting new HTTP connection (1): 127.0.0.1
2013-09-11 16:20:48.983 | 2013-09-11 16:05:44,309 Starting new HTTP connection (1): 127.0.0.1
2013-09-11 16:20:48.983 | 2013-09-11 16:05:52,724 Tenant networks not configured to be reachable.
2013-09-11 16:20:48.983 | }}}
2013-09-11 16:20:48.983 |
2013-09-11 16:20:48.983 | Traceback (most recent call last):
2013-09-11 16:20:48.984 | File "tempest/scenario/test_network_basic_ops.py", line 262, in test_network_basic_ops
2013-09-11 16:20:48.984 | self._check_public_network_connectivity()
2013-09-11 16:20:48.984 | File "tempest/scenario/test_network_basic_ops.py", line 251, in _check_public_network_connectivity
2013-09-11 16:20:48.984 | self._check_vm_connectivity(ip_address, ssh_login, private_key)
2013-09-11 16:20:48.984 | File "tempest/scenario/manager.py", line 579, in _check_vm_connectivity
2013-09-11 16:20:48.984 | timeout=self.config.compute.ssh_timeout),
2013-09-11 16:20:48.984 | File "tempest/scenario/manager.py", line 569, in _is_reachable_via_ssh
2013-09-11 16:20:48.985 | return ssh_client.test_connection_auth()
2013-09-11 16:20:48.985 | File "tempest/common/ssh.py", line 148, in test_connection_auth
2013-09-11 16:20:48.985 | connection = self._get_ssh_connection()
2013-09-11 16:20:48.985 | File "tempest/common/ssh.py", line 76, in _get_ssh_connection
2013-09-11 16:20:48.985 | password=self.password)
2013-09-11 16:20:48.985 | SSHTimeout: Connection to the 172.24.4.229 via SSH timed out.
2013-09-11 16:20:48.986 | User: cirros, Password: None

Dolph Mathews (dolph) wrote :

This happens to be failing on the same test as bug 1231775, but the traceback is different (TempestConfig instance has no attribute 'debug').

Sean Dague (sdague) on 2013-09-29
summary: - test_network_basic_ops fails sporadically in swift check jobs
+ test_network_basic_ops fails waiting for network to become available
Changed in neutron:
importance: Undecided → Critical
milestone: none → havana-rc1


Hans Lindgren (hanlind) wrote :

The log referenced in the bug report also has an instance of "RuntimeError: Second simultaneous read on fileno XX detected", indicating this is the same as bug 1229475.

Changed in neutron:
assignee: nobody → Mark McClain (markmcclain)
Mark McClain (markmcclain) wrote :

The incidence of this bug started to increase after this change (https://review.openstack.org/#/c/47956/3) was merged on Sept 27th, just prior to a dramatic rise in this error.

Changed in neutron:
status: New → In Progress
Mark McClain (markmcclain) wrote :

Oops.. wrong commit. https://review.openstack.org/#/c/46314/ was merged around 2013-09-27 02:04. Failures then began to increase starting with the first at 2013-09-27 02:10:36 (http://logs.openstack.org/13/46313/2/gate/gate-tempest-devstack-vm-postgres-full/b1d50c4/console.html.gz).

Sean Dague (sdague) wrote :

Ok, my new proposed approach:

Add a devstack change to revert that in the general case - https://review.openstack.org/#/c/49143/

Turn on additional neutron jobs for tenant isolation, so the neutron team can continue to work through the issues - https://review.openstack.org/#/c/49144/

Related fix proposed to branch: master
Review: https://review.openstack.org/49424

Changed in swift:
status: New → Invalid
Changed in tempest:
status: New → Invalid
Thierry Carrez (ttx) wrote :

RC1 was signed-off, moving to havana-rc-potential

Changed in neutron:
milestone: havana-rc1 → none
tags: added: havana-rc-potential

The related job https://jenkins01.openstack.org/job/check-tempest-devstack-vm-neutron-pg-isolated/ appears not to be failing anymore. There are 3 failures in the history, and none of them is a manifestation of bug 1224001.

The failure trend stopped even before merging https://review.openstack.org/49233
I will keep investigating this bug, but I reckon we can decrease the priority to high (it should however remain rc-potential).

Changed in neutron:
importance: Critical → High
assignee: Mark McClain (markmcclain) → Salvatore Orlando (salvatore-orlando)

What I said above should be discarded if it turns out that this gate job is not actually enabling tenant isolation in tempest.

I can't find any trace of the log statement "Acquired isolated creds" in http://logs.openstack.org/24/49424/4/check/check-tempest-devstack-vm-neutron-pg-isolated/aca07fc/logs/tempest.txt.gz

And TEMPEST_ALLOW_TENANT_ISOLATION seems to be false in http://logs.openstack.org/24/49424/4/check/check-tempest-devstack-vm-neutron-pg-isolated/aca07fc/logs/devstacklog.txt.gz

Reviewed: https://review.openstack.org/49233
Committed: http://github.com/openstack/neutron/commit/a67bce77e12b5af7fd5fae7946c96d483ecede92
Submitter: Jenkins
Branch: master

commit a67bce77e12b5af7fd5fae7946c96d483ecede92
Author: Jakub Libosvar <email address hidden>
Date: Tue Oct 1 18:27:44 2013 +0000

    Switch agent report_interval option to float

    Being able to set report_interval > 0 and < 1 makes
    it easier to consistently reproduce rpc race
    conditions.

    Change-Id: I4d907159147177e98b043e7e800aea5cf8c23103
    Related-Bug: #1224001
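
With the option accepted as a float, a sub-second interval becomes possible. A hypothetical agent config fragment (the section name follows neutron's agent config convention; the value is illustrative):

```ini
[AGENT]
# Sub-second report interval, accepted now that the option is a float.
# Useful for consistently reproducing the rpc race condition.
report_interval = 0.1
```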

Changed in neutron:
importance: High → Critical
Matthew Treinish (treinish) wrote :

So with tenant isolation turned back on for the isolated jobs, the failures started up again. Here are the logs from a failure with the extra debug information from tempest:

http://logs.openstack.org/44/49644/2/check/check-tempest-devstack-vm-neutron-pg-isolated/1767d05/

In some cases we have a situation where rpc_loop or _sync_routers_task blocks. From log observations this always happens while executing subprocess.communicate, and the root cause could be this: https://github.com/eventlet/eventlet/pull/24

Which is a bit strange, since this popen.communicate is also used in common.processutils and no other blocking issue has been reported there. Perhaps neutron.agent.linux.utils.execute should leverage openstack.common.

Another thing which at the moment is hard to explain is why this would not affect the dhcp agent.

In other cases, instead, the following exception is raised (and probably it shouldn't be):

2013-10-04 12:28:21.360 1259 ERROR neutron.agent.l3_agent [-] Failed synchronizing routers
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent Traceback (most recent call last):
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 730, in _rpc_loop
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent self._process_router_delete()
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 739, in _process_router_delete
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent self._router_removed(router_id)
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 313, in _router_removed
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent ri = self.router_info[router_id]
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent KeyError: u'b17f5fe6-8354-4af7-b271-a4ab0896dcb7'
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent

This triggers a full synchronization, which has the following effects:
- it blocks the rpc loop, so the update for the floating IP is delayed. With many routers (and the tenant isolation jobs added many routers) this might mean the floating IP is applied after the tempest timeout has elapsed.
- doing many execute operations increases the chance of the thread blocking.

The current approach is to 'blindly' ignore the router_removed error so that it does not trigger the full router synchronization.
If it works, this should be regarded only as the first step of a more complex fix aimed at getting the gate going again.
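
A minimal sketch of that mitigation, assuming a hypothetical agent class (the real l3_agent code differs): look the router up defensively instead of letting a missing id raise the KeyError that forces a full resync.

```python
class L3AgentSketch:
    def __init__(self):
        self.router_info = {}   # router_id -> per-router state

    def _router_removed(self, router_id):
        # Defensive lookup: previously self.router_info[router_id]
        # raised KeyError for an already-deleted router, which in turn
        # triggered a full synchronization of all routers.
        ri = self.router_info.pop(router_id, None)
        if ri is None:
            return False        # already gone; nothing to tear down
        # ... tear down namespaces, floating IPs, etc. ...
        return True

agent = L3AgentSketch()
removed = agent._router_removed("b17f5fe6")   # unknown id: no KeyError
```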

Maru Newby (maru) wrote :

Miguel Lavalle has found that the connectivity check can be made to pass consistently if password-based auth is used instead of key-based auth. Since all cirros VMs have the same password, but the key would vary by instance, this suggests that the VM being targeted is not the one created for the test. Mark McClain suggests the issue could be ARP related, or maybe something wrong with routing.

At this stage I feel confident enough to say that the gate is failing because rpc_loop or _sync_routers_task hangs on subprocess.Popen.communicate.
This happens especially when executing arping. The actual reason is unknown.
One possibility is that arping is returning a very large output (unlikely); another is that arping itself does not complete.
subprocess.Popen.communicate will not return in this case, causing the hang.

I suggest applying the same mitigation as openstack.common does for execute(), and tweaking the configuration option so that arping is not executed on the gate.
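
The hang comes from subprocess.Popen.communicate never returning. A sketch of a guarded wrapper in that spirit (assuming Python 3; the `execute` name and timeout value are illustrative, not neutron's actual API):

```python
import subprocess
import sys

def execute(cmd, timeout=10):
    """Run a command without letting communicate() hang forever.

    Hypothetical mitigation sketch: bound the wait with a timeout and
    kill the child if it expires, instead of letting an agent loop
    such as _rpc_loop block indefinitely.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()           # reap the child so it does not linger
        proc.communicate()
        raise RuntimeError("command timed out: %s" % cmd)
    return out.decode(), err.decode(), proc.returncode

out, err, rc = execute([sys.executable, "-c", "print('ok')"])
```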

It also seems arping crashes the kernel.
Whenever arping hangs, the following stack trace is found: http://paste.openstack.org/show/48056/

Changed in neutron:
milestone: none → havana-rc2

Reviewed: https://review.openstack.org/49424
Committed: http://github.com/openstack/neutron/commit/591ee00a67fbbe5f106ba12140b9f9420dee5907
Submitter: Jenkins
Branch: master

commit 591ee00a67fbbe5f106ba12140b9f9420dee5907
Author: Salvatore Orlando <email address hidden>
Date: Wed Oct 2 12:14:14 2013 -0700

    Prevent L3 agent looping calls from hanging

    This patch adopts several measures to prevent _sync_routers_task
    and _rpc_loop from hanging because of subprocess.Popen.communicate
    not returning.

    1) Perform a sleep everytime a command is completed, similarly to
    what is done in openstack.common.processutils.execute
    2) Disable by default GARP, as kernel crashes caused by arping
    have been observed
    3) Prevent a non-critical keyerror in _router_removed from triggering
    again a full sync, which might put the system under significant load.

    This patch also adds debug log statements aimed at improving the
    ability of debugging similar failures.

    Change-Id: I003316bce0f38b7d2ea7d563b5a0a58676834398
    Partial-Bug: 1224001

Reviewed: https://review.openstack.org/50424
Committed: http://github.com/openstack/neutron/commit/bc507645f1b570e048bd0d2616d5243b4ca5534f
Submitter: Jenkins
Branch: milestone-proposed

commit bc507645f1b570e048bd0d2616d5243b4ca5534f
Author: Salvatore Orlando <email address hidden>
Date: Wed Oct 2 12:14:14 2013 -0700

    Prevent L3 agent looping calls from hanging

    This patch adopts several measures to prevent _sync_routers_task
    and _rpc_loop from hanging because of subprocess.Popen.communicate
    not returning.

    1) Perform a sleep everytime a command is completed, similarly to
    what is done in openstack.common.processutils.execute
    2) Disable by default GARP, as kernel crashes caused by arping
    have been observed
    3) Prevent a non-critical keyerror in _router_removed from triggering
    again a full sync, which might put the system under significant load.

    This patch also adds debug log statements aimed at improving the
    ability of debugging similar failures.

    Change-Id: I003316bce0f38b7d2ea7d563b5a0a58676834398
    Partial-Bug: 1224001
    (cherry picked from commit 591ee00a67fbbe5f106ba12140b9f9420dee5907)

Thierry Carrez (ttx) wrote :

Hoping it's fixed now

Changed in neutron:
status: In Progress → Fix Released

Elastic recheck is still marking some failures as bug 1224001, but the root cause appears to be different from the one we fixed.
I think we should keep this bug open and target I-1.

The 'arping' issue causing the failure has been solved with the patch we merged.

Thierry Carrez (ttx) on 2013-10-17
Changed in neutron:
milestone: havana-rc2 → 2013.2
Changed in neutron:
milestone: 2013.2 → icehouse-1
Changed in neutron:
status: Fix Released → In Progress

There are no more L3 agent issues here. The floating IP is correctly configured about 30 seconds before the timeout.
The internal router interface is also correctly set up, and configured a few seconds before the floating IP.

The only possible cause for failure of this routine is the VM booting but not getting an IP.

I checked 6 logs - 3 failing and 3 non failing.

In each failure scenario I always find the kill -HUP <dnsmasq_pid> being executed at least 1 second after the VM has booted.
In each success scenario, instead, kill -HUP is always executed before the VM boots.

The only hint this gives me is to check how the cirros instance tries to get an IP from DHCP, and whether it gives up too soon.

This failure apparently has 20 hits in a week. I would therefore decrease the priority to high.

Changed in neutron:
importance: Critical → High

Reviewed: https://review.openstack.org/57290
Committed: http://github.com/openstack/neutron/commit/70096ae4e74ec96ac18b236b547510422a3be70a
Submitter: Jenkins
Branch: master

commit 70096ae4e74ec96ac18b236b547510422a3be70a
Author: Salvatore Orlando <email address hidden>
Date: Tue Nov 19 09:32:35 2013 -0800

    Enable polling minimization

    In some cases the OVS agent might end up wiring a port for a VM
    several seconds after the VM has booted. As a result, there is a
    risk that the VM won't receive an IP address from DHCP in time.

    This patch changes the default value for agent.minimize_polling to
    True. This change should reduce the time needed for configuring
    an interface on br-int consistently.

    Change-Id: I009f606fd34a132376f1d50f8ccda9d35d064bfa
    Related-bug: 1224001
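
The behavior change from that commit amounts to flipping one agent option. A hypothetical config fragment showing the new default (section name follows neutron's agent config convention):

```ini
[AGENT]
# Monitor ovsdb for interface changes instead of polling, so a new
# port on br-int is wired promptly and the VM's DHCP request succeeds.
minimize_polling = True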

We merely reduced its occurrence; frankly, I expected that.

This bug is happening for the same reason which prevents parallel tests from working.
I expect it to go away once all the patches for enabling parallel testing are merged.

Thierry Carrez (ttx) on 2013-12-03
Changed in neutron:
milestone: icehouse-1 → icehouse-2

For the sake of discussion, the remaining manifestations of this failure are tracked under bug 1253896.
I suggest we mark this as fix committed, duplicate, or invalid, but it's not worth keeping it open.

It just makes things look worse than they are!
