test_network_basic_ops fails waiting for network to become available

Bug #1224001 reported by Peter Portante on 2013-09-11
This bug affects 7 people
Affects                           Importance  Assigned to
OpenStack Object Storage (swift)  Undecided   Unassigned
neutron                           High        Salvatore Orlando
tempest                           Undecided   Unassigned

Bug Description

See http://logs.openstack.org/25/43125/4/gate/gate-tempest-devstack-vm-neutron/40f3725/console.html

2013-09-11 16:20:48.981 | ======================================================================
2013-09-11 16:20:48.982 | FAIL: tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops[gate,smoke]
2013-09-11 16:20:48.982 | tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops[gate,smoke]
2013-09-11 16:20:48.982 | ----------------------------------------------------------------------
2013-09-11 16:20:48.982 | _StringException: Empty attachments:
2013-09-11 16:20:48.982 | stderr
2013-09-11 16:20:48.982 | stdout
2013-09-11 16:20:48.982 |
2013-09-11 16:20:48.983 | pythonlogging:'': {{{
2013-09-11 16:20:48.983 | 2013-09-11 16:05:44,209 Starting new HTTP connection (1): 127.0.0.1
2013-09-11 16:20:48.983 | 2013-09-11 16:05:44,309 Starting new HTTP connection (1): 127.0.0.1
2013-09-11 16:20:48.983 | 2013-09-11 16:05:52,724 Tenant networks not configured to be reachable.
2013-09-11 16:20:48.983 | }}}
2013-09-11 16:20:48.983 |
2013-09-11 16:20:48.983 | Traceback (most recent call last):
2013-09-11 16:20:48.984 | File "tempest/scenario/test_network_basic_ops.py", line 262, in test_network_basic_ops
2013-09-11 16:20:48.984 | self._check_public_network_connectivity()
2013-09-11 16:20:48.984 | File "tempest/scenario/test_network_basic_ops.py", line 251, in _check_public_network_connectivity
2013-09-11 16:20:48.984 | self._check_vm_connectivity(ip_address, ssh_login, private_key)
2013-09-11 16:20:48.984 | File "tempest/scenario/manager.py", line 579, in _check_vm_connectivity
2013-09-11 16:20:48.984 | timeout=self.config.compute.ssh_timeout),
2013-09-11 16:20:48.984 | File "tempest/scenario/manager.py", line 569, in _is_reachable_via_ssh
2013-09-11 16:20:48.985 | return ssh_client.test_connection_auth()
2013-09-11 16:20:48.985 | File "tempest/common/ssh.py", line 148, in test_connection_auth
2013-09-11 16:20:48.985 | connection = self._get_ssh_connection()
2013-09-11 16:20:48.985 | File "tempest/common/ssh.py", line 76, in _get_ssh_connection
2013-09-11 16:20:48.985 | password=self.password)
2013-09-11 16:20:48.985 | SSHTimeout: Connection to the 172.24.4.229 via SSH timed out.
2013-09-11 16:20:48.986 | User: cirros, Password: None

Dolph Mathews (dolph) wrote :

This happens to be failing on the same test as bug 1231775, but the traceback is different (TempestConfig instance has no attribute 'debug').

Sean Dague (sdague) on 2013-09-29
summary: - test_network_basic_ops fails sporadically in swift check jobs
+ test_network_basic_ops fails waiting for network to become available
Changed in neutron:
importance: Undecided → Critical
milestone: none → havana-rc1


Hans Lindgren (hanlind) wrote :

The log referenced in the bug report also has an instance of "RuntimeError: Second simultaneous read on fileno XX detected", indicating this is the same as bug 1229475.

Changed in neutron:
assignee: nobody → Mark McClain (markmcclain)
Mark McClain (markmcclain) wrote :

The incidence of this bug started to increase after this change (https://review.openstack.org/#/c/47956/3) was merged on Sept 27th, just prior to a dramatic rise in this error.

Changed in neutron:
status: New → In Progress
Mark McClain (markmcclain) wrote :

Oops.. wrong commit. https://review.openstack.org/#/c/46314/ was merged around 2013-09-27 02:04. Failures then began to increase starting with the first at 2013-09-27 02:10:36 (http://logs.openstack.org/13/46313/2/gate/gate-tempest-devstack-vm-postgres-full/b1d50c4/console.html.gz).

Sean Dague (sdague) wrote :

Ok, my new proposed approach:

Add a devstack change to revert that in the general case - https://review.openstack.org/#/c/49143/

Turn on additional neutron jobs for tenant isolation, so the neutron team can continue to work through the issues - https://review.openstack.org/#/c/49144/

Related fix proposed to branch: master
Review: https://review.openstack.org/49424

Changed in swift:
status: New → Invalid
Changed in tempest:
status: New → Invalid
Thierry Carrez (ttx) wrote :

RC1 was signed-off, moving to havana-rc-potential

Changed in neutron:
milestone: havana-rc1 → none
tags: added: havana-rc-potential

The related job https://jenkins01.openstack.org/job/check-tempest-devstack-vm-neutron-pg-isolated/ appears not to be failing anymore. There are 3 failures in the history, and none of them is a manifestation of bug 1224001.

The failure trend stopped even before merging https://review.openstack.org/49233
I will keep investigating this bug, but I reckon we can decrease the priority to high (it should however remain rc-potential).

Changed in neutron:
importance: Critical → High
assignee: Mark McClain (markmcclain) → Salvatore Orlando (salvatore-orlando)

What I said above should be discarded if it turns out that this gate job is not actually enabling tenant isolation in tempest.

I can't find any trace of the log statement "Acquired isolated creds" in http://logs.openstack.org/24/49424/4/check/check-tempest-devstack-vm-neutron-pg-isolated/aca07fc/logs/tempest.txt.gz

And TEMPEST_ALLOW_TENANT_ISOLATION seems to be false in http://logs.openstack.org/24/49424/4/check/check-tempest-devstack-vm-neutron-pg-isolated/aca07fc/logs/devstacklog.txt.gz

Reviewed: https://review.openstack.org/49233
Committed: http://github.com/openstack/neutron/commit/a67bce77e12b5af7fd5fae7946c96d483ecede92
Submitter: Jenkins
Branch: master

commit a67bce77e12b5af7fd5fae7946c96d483ecede92
Author: Jakub Libosvar <email address hidden>
Date: Tue Oct 1 18:27:44 2013 +0000

    Switch agent report_interval option to float

    Being able to set report_interval > 0 and < 1 makes
    it easier to consistently reproduce rpc race
    conditions.

    Change-Id: I4d907159147177e98b043e7e800aea5cf8c23103
    Related-Bug: #1224001
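
With the option accepted as a float, a sub-second interval becomes possible. A hypothetical agent config fragment (the section name follows neutron's agent config convention; the value is illustrative):

```ini
[AGENT]
# Sub-second report interval, accepted now that the option is a float.
# Useful for consistently reproducing the rpc race condition.
report_interval = 0.1
```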

Changed in neutron:
importance: High → Critical
Matthew Treinish (treinish) wrote :

So with tenant isolation turned back on for the isolated jobs, the failures started up again. Here are the logs from a failure with the extra debug information from tempest:

http://logs.openstack.org/44/49644/2/check/check-tempest-devstack-vm-neutron-pg-isolated/1767d05/

In some cases we have a situation where rpc_loop or _sync_routers_task blocks. From log observations this always happens while executing subprocess.communicate, and the root cause could be this: https://github.com/eventlet/eventlet/pull/24

Which is a bit strange, since this popen.communicate is also used in common.processutils and no other blocking issue has been reported there. Perhaps neutron.agent.linux.utils.execute should leverage openstack.common.

Another thing which at the moment is hard to explain is why this would not affect the dhcp agent.

In other cases, instead, the following exception is raised (and probably it shouldn't be):

2013-10-04 12:28:21.360 1259 ERROR neutron.agent.l3_agent [-] Failed synchronizing routers
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent Traceback (most recent call last):
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 730, in _rpc_loop
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent self._process_router_delete()
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 739, in _process_router_delete
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent self._router_removed(router_id)
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 313, in _router_removed
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent ri = self.router_info[router_id]
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent KeyError: u'b17f5fe6-8354-4af7-b271-a4ab0896dcb7'
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent

This triggers a full synchronization, which has the following effects:
- it blocks the rpc loop, so the update for the floating IP is delayed. With many routers (and the tenant isolation jobs added many routers) this might mean the floating IP is applied after the tempest timeout has elapsed.
- doing many execute operations increases the chance of the thread blocking.

The current approach is to 'blindly' ignore the router_removed error so that it does not trigger the full router synchronization.
If it works, this should be regarded only as the first step of a more complex fix aimed at getting the gate going again.
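
A minimal sketch of that mitigation, assuming a hypothetical agent class (the real l3_agent code differs): look the router up defensively instead of letting a missing id raise the KeyError that forces a full resync.

```python
class L3AgentSketch:
    def __init__(self):
        self.router_info = {}   # router_id -> per-router state

    def _router_removed(self, router_id):
        # Defensive lookup: previously self.router_info[router_id]
        # raised KeyError for an already-deleted router, which in turn
        # triggered a full synchronization of all routers.
        ri = self.router_info.pop(router_id, None)
        if ri is None:
            return False        # already gone; nothing to tear down
        # ... tear down namespaces, floating IPs, etc. ...
        return True

agent = L3AgentSketch()
removed = agent._router_removed("b17f5fe6")   # unknown id: no KeyError
```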

Maru Newby (maru) wrote :

Miguel Lavalle has found that the connectivity check can be made to pass consistently if password-based auth is used instead of key-based auth. Since all cirros VMs have the same password, but the key would vary by instance, this suggests that the VM being targeted is not the one created for the test. Mark McClain suggests the issue could be ARP related, or maybe something wrong with routing.

At this stage I feel confident enough to say that the gate is failing because rpc_loop or _sync_routers_task hangs on subprocess.Popen.communicate.
This happens especially when executing arping. The actual reason is unknown.
One possibility is that arping is returning a very large output (unlikely); another is that arping itself does not complete.
subprocess.Popen.communicate will not return in this case, causing the hang.

I suggest applying the same mitigation as openstack.common does for execute(), and tweaking the configuration option so that arping is not executed on the gate.
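
The hang comes from subprocess.Popen.communicate never returning. A sketch of a guarded wrapper in that spirit (assuming Python 3; the `execute` name and timeout value are illustrative, not neutron's actual API):

```python
import subprocess
import sys

def execute(cmd, timeout=10):
    """Run a command without letting communicate() hang forever.

    Hypothetical mitigation sketch: bound the wait with a timeout and
    kill the child if it expires, instead of letting an agent loop
    such as _rpc_loop block indefinitely.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()           # reap the child so it does not linger
        proc.communicate()
        raise RuntimeError("command timed out: %s" % cmd)
    return out.decode(), err.decode(), proc.returncode

out, err, rc = execute([sys.executable, "-c", "print('ok')"])
```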

It also seems arping crashes the kernel.
Whenever arping hangs, the following stack trace is found: http://paste.openstack.org/show/48056/

Changed in neutron:
milestone: none → havana-rc2

Reviewed: https://review.openstack.org/49424
Committed: http://github.com/openstack/neutron/commit/591ee00a67fbbe5f106ba12140b9f9420dee5907
Submitter: Jenkins
Branch: master

commit 591ee00a67fbbe5f106ba12140b9f9420dee5907
Author: Salvatore Orlando <email address hidden>
Date: Wed Oct 2 12:14:14 2013 -0700

    Prevent L3 agent looping calls from hanging

    This patch adopts several measures to prevent _sync_routers_task
    and _rpc_loop from hanging because of subprocess.Popen.communicate
    not returning.

    1) Perform a sleep everytime a command is completed, similarly to
    what is done in openstack.common.processutils.execute
    2) Disable by default GARP, as kernel crashes caused by arping
    have been observed
    3) Prevent a non-critical keyerror in _router_removed from triggering
    again a full sync, which might put the system under significant load.

    This patch also adds debug log statements aimed at improving the
    ability of debugging similar failures.

    Change-Id: I003316bce0f38b7d2ea7d563b5a0a58676834398
    Partial-Bug: 1224001

Reviewed: https://review.openstack.org/50424
Committed: http://github.com/openstack/neutron/commit/bc507645f1b570e048bd0d2616d5243b4ca5534f
Submitter: Jenkins
Branch: milestone-proposed

commit bc507645f1b570e048bd0d2616d5243b4ca5534f
Author: Salvatore Orlando <email address hidden>
Date: Wed Oct 2 12:14:14 2013 -0700

    Prevent L3 agent looping calls from hanging

    This patch adopts several measures to prevent _sync_routers_task
    and _rpc_loop from hanging because of subprocess.Popen.communicate
    not returning.

    1) Perform a sleep everytime a command is completed, similarly to
    what is done in openstack.common.processutils.execute
    2) Disable by default GARP, as kernel crashes caused by arping
    have been observed
    3) Prevent a non-critical keyerror in _router_removed from triggering
    again a full sync, which might put the system under significant load.

    This patch also adds debug log statements aimed at improving the
    ability of debugging similar failures.

    Change-Id: I003316bce0f38b7d2ea7d563b5a0a58676834398
    Partial-Bug: 1224001
    (cherry picked from commit 591ee00a67fbbe5f106ba12140b9f9420dee5907)

Thierry Carrez (ttx) wrote :

Hoping it's fixed now

Changed in neutron:
status: In Progress → Fix Released

Elastic recheck is still marking some failures as bug 1224001, but the root cause appears to be different from the one we fixed.
I think we should keep this bug open and target I-1.

The 'arping' issue causing the failure has been solved with the patch we merged.

Thierry Carrez (ttx) on 2013-10-17
Changed in neutron:
milestone: havana-rc2 → 2013.2
Changed in neutron:
milestone: 2013.2 → icehouse-1
Changed in neutron:
status: Fix Released → In Progress

There are no more L3 agent issues here. The floating IP is correctly configured about 30 seconds before the timeout.
The internal router interface is also correctly set up, and configured a few seconds before the floating IP.

The only possible cause for failure of this routine is the VM booting but not getting an IP.

I checked 6 logs - 3 failing and 3 non failing.

In each failure scenario I always find the kill -HUP <dnsmasq_pid> being executed at least 1 second after the VM has booted.
In each success scenario, instead, kill -HUP is always executed before the VM boots.

The only hint this gives me is to check how the cirros instance tries to get an IP from DHCP, and whether it gives up too soon.

This failure apparently has 20 hits in a week. I would therefore decrease the priority to high.

Changed in neutron:
importance: Critical → High

Reviewed: https://review.openstack.org/57290
Committed: http://github.com/openstack/neutron/commit/70096ae4e74ec96ac18b236b547510422a3be70a
Submitter: Jenkins
Branch: master

commit 70096ae4e74ec96ac18b236b547510422a3be70a
Author: Salvatore Orlando <email address hidden>
Date: Tue Nov 19 09:32:35 2013 -0800

    Enable polling minimization

    In some cases the OVS agent might end up wiring a port for a VM
    several seconds after the VM has booted. As a result, there is a
    risk that the VM won't receive an IP address from DHCP in time.

    This patch changes the default value for agent.minimize_polling to
    True. This change should reduce the time needed for configuring
    an interface on br-int consistently.

    Change-Id: I009f606fd34a132376f1d50f8ccda9d35d064bfa
    Related-bug: 1224001
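
The behavior change from that commit amounts to flipping one agent option. A hypothetical config fragment showing the new default (section name follows neutron's agent config convention):

```ini
[AGENT]
# Monitor ovsdb for interface changes instead of polling, so a new
# port on br-int is wired promptly and the VM's DHCP request succeeds.
minimize_polling = True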

We merely reduced its occurrence; frankly, I expected that.

This bug is happening for the same reason which prevents parallel tests from working.
I expect it to go away once all the patches for enabling parallel testing are merged.

Thierry Carrez (ttx) on 2013-12-03
Changed in neutron:
milestone: icehouse-1 → icehouse-2

For the sake of discussion, the remaining manifestations of this failure are tracked under bug 1253896.
I suggest we mark this as fix committed, duplicate, or invalid, but it's not worth keeping it open.

It just makes things look worse than they are!
