neutron

check_public_network_connectivity fails with timeout

Bug #1194026 reported by Avishay Traeger on 2013-06-24

This bug affects 14 people

Affects		Status	Importance	Assigned to	Milestone
	neutron	Fix Released	Critical	Nachi Ueno	neutron 2013.2 "havana"
	tempest	Fix Released	High	Nachi Ueno	tempest havana-3

Bug Description

2013-06-24 06:11:14.330 | ======================================================================
2013-06-24 06:11:14.330 | FAIL: tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_008_check_public_network_connectivity[gate,smoke]
2013-06-24 06:11:14.331 | ----------------------------------------------------------------------
2013-06-24 06:11:14.331 | _StringException: Traceback (most recent call last):
2013-06-24 06:11:14.331 | File "/opt/stack/new/tempest/tempest/scenario/test_network_basic_ops.py", line 255, in test_008_check_public_network_connectivity
2013-06-24 06:11:14.331 | self._check_vm_connectivity(ip_address, ssh_login, private_key)
2013-06-24 06:11:14.331 | File "/opt/stack/new/tempest/tempest/scenario/manager.py", line 455, in _check_vm_connectivity
2013-06-24 06:11:14.331 | "reachable" % ip_address)
2013-06-24 06:11:14.331 | File "/usr/lib/python2.7/unittest/case.py", line 420, in assertTrue
2013-06-24 06:11:14.331 | raise self.failureException(msg)
2013-06-24 06:11:14.331 | AssertionError: Timed out waiting for 172.24.4.229 to become reachable

http://logs.openstack.org/33227/2/gate/gate-tempest-devstack-vm-quantum/33279/console.html.gz

Tags:

Jeremy Stanley (fungi) on 2013-06-24

affects:

openstack-ci → tempest

Revision history for this message

David Kranz (david-kranz) wrote on 2013-07-02:

This has been happening a fair bit. logstash shows many hits for the error and they all seem to be in quantum runs where the non-quantum run of the test passed. So changing to neutron until proven otherwise.

affects:

tempest → neutron

Revision history for this message

xiaohanyu (xiaohanyu1988) wrote on 2013-07-03:

Same problem here, see http://logs.openstack.org/34507/3/check/gate-tempest-devstack-vm-quantum/35986/console.html

Revision history for this message

David Kranz (david-kranz) wrote on 2013-07-03:

This is failing the gate almost once per hour. Perhaps it should be given a somewhat high priority.

Revision history for this message

Clint Byrum (clint-fewbar) wrote on 2013-07-05:

47 rechecks, it definitely deserves some immediate attention!

Sean Dague (sdague) on 2013-07-09

Changed in neutron:
status:	New → Confirmed

Salvatore Orlando (salvatore-orlando) on 2013-07-09

Changed in neutron:
importance:	Undecided → High

Maru Newby (maru) on 2013-07-09

Changed in neutron:
assignee:	nobody → Maru Newby (maru)

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-09:

This could be timing bug on neutron l3-agent.

I greped the l3-agent log using 172.24.4.229.
http://logs.openstack.org/33227/2/gate/gate-tempest-devstack-vm-quantum/33279/logs/screen-q-l3.txt.gz

In this sequence of test, we use 172.24.4.229 twice times.
First one is for execise.sh, seconde is for tempest test.

This is for first one
http://paste.openstack.org/show/39852/

It working and exercise.sh succeed.

But for second time, it looks do nothing...
http://paste.openstack.org/show/39854/

I'll keep checking.

Maru Newby (maru) on 2013-07-09

Changed in neutron:
assignee:	Maru Newby (maru) → nobody

Revision history for this message

Thierry Carrez (ttx) wrote on 2013-07-09:

Would be great to nail this one down before the review rush to h2

Changed in neutron:
importance:	High → Critical
milestone:	none → havana-2

Mark McClain (markmcclain) on 2013-07-09

Changed in neutron:
assignee:	nobody → Gary Kotton (garyk)

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-11:

I'm checking logs.
I compared success one and failed one.

( I grepped 172.24.4.229)

q-srv log is same. but l3-agent log is quite different.

Also this error starts 6/24.
so I checked the commit
- which is before 6/24
- Modifies l3-agent

This commit looks related.. This commit may affect floating ip also.

Author: Jenkins <email address hidden>
Date: Wed May 29 14:40:59 2013 +0000

Merge "Configurable external gateway modes"
https://github.com/openstack/neutron/commit/59089e56f674f5f94f67c5986e9a616bb669d846

I have no proof of this, also there is still no explanation why it happens 6/24 at high rate..
(we upgraded greenlet at 6/23, this may related .. but I'm not sure..)

I'll trying to test neutron without this commit.

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-11:

I think I could reproduce this issue (if the test script is not wrong...)

Here is test script and result.
http://paste.openstack.org/show/40119/

I'm trying to fix the issue

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-11:

Sorry script is not copy correctly.

here is true one
http://paste.openstack.org/show/40133/

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-11:

#10

I think I found the possible cause of issue.

see
http://paste.openstack.org/show/40156/

line starts from **** Get is RPC message log.
line starts from [[[[[ Process is when l3-agent start processing rpc messages.
(I added rpc message number for debugging)

so we expected all RPC call is executed serialized way.
However it is executed in random order!

We are using semaphore in eventlet, but it looks we should use queue instead of queue.
We should also combine RPC messages.
Because when we get concurrent request, l3-agent processes one by one.
so sometimes, the process for the new ip will be delayed long time.

Attila Fazekas (afazekas) on 2013-07-12

Changed in tempest:
status:	New → Incomplete

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-12:

#11

I'm writing patch.

Changed in tempest:
status:	Incomplete → Invalid
Changed in neutron:
status:	Confirmed → In Progress
assignee:	Gary Kotton (garyk) → Nachi Ueno (nati-ueno)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2013-07-12: Fix proposed to neutron (master)

#12

Fix proposed to branch: master
Review: https://review.openstack.org/36890

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2013-07-16: Fix merged to neutron (master)

#13

Reviewed: https://review.openstack.org/36890
Committed: http://github.com/openstack/neutron/commit/57e1fa32648a28ae83e3c3033258fe08de2e5fca
Submitter: Jenkins
Branch: master

commit 57e1fa32648a28ae83e3c3033258fe08de2e5fca
Author: Nachi Ueno <email address hidden>
Date: Fri Jul 12 12:21:46 2013 -0700

Improve l3-agent performance and prevent races in it.

Fixes bug 1194026
Fixes bug 1200749

    Introduce a looping call for performing synchronization with
    neutron server.
    The sync will be performed only if router changes are notified
    via rpc. Only affected routers will be synchronized.

    Changes will be implemented by the l3 agent spawning a
    distinct greenthread for each router - iptables will
    be executed only once using iptables_manager.defer_apply_on.

    This patch will prevent the occurence of the following issues:
    - Out-of-order rpc message processing
    - Long processing time for router changes due to serial execution
    - Occasional and expected RPC blocks for long periods
    - Unnecessary processing of multiple requests

Change-Id: I0978d1c38ac5c38c4548e5b1877857bb5cac3b81

Changed in neutron:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2013-07-17

Changed in neutron:
status:	Fix Committed → Fix Released

Nachi Ueno (nati-ueno) on 2013-07-18

Changed in neutron:
status:	Fix Released → In Progress
milestone:	havana-2 → havana-3

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-18:

#14

スクリーンショット 2013-07-18 15.27.30.png Edit (166.3 KiB, image/png)

hmm still I'm getting errors.
It looks previous patch improves success rate, but it still failing on some percentage

(fail case)
http://logstash.openstack.org/#eyJzZWFyY2giOiJAdGFnczpjb25zb2xlLmh0bWwgQU5EIEBtZXNzYWdlOlwidGVtcGVzdC5zY2VuYXJpby50ZXN0X25ldHdvcmtfYmFzaWNfb3BzLlRlc3ROZXR3b3JrQmFzaWNPcHMudGVzdF8wMDhfY2hlY2tfcHVibGljX25ldHdvcmtfY29ubmVjdGl2aXR5W2dhdGUsc21va2VdIC4uLiBGQUlMXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzNzQxODYxOTg0MjF9

(pass case)
http://logstash.openstack.org/#eyJzZWFyY2giOiJAdGFnczpjb25zb2xlLmh0bWwgQU5EIEBtZXNzYWdlOlwidGVtcGVzdC5zY2VuYXJpby50ZXN0X25ldHdvcmtfYmFzaWNfb3BzLlRlc3ROZXR3b3JrQmFzaWNPcHMudGVzdF8wMDhfY2hlY2tfcHVibGljX25ldHdvcmtfY29ubmVjdGl2aXR5W2dhdGUsc21va2VdIC4uLiBva1wiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzc0MTg2MjgyODU3fQ==?field.comment=hmm still I'm getting errors.

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-18:

#15

before.png Edit (190.1 KiB, image/png)

This is before the patch merged

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-18:

#16

after.png Edit (193.0 KiB, image/png)

This is after the patch merged

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-18:

#17

One findings from log is
Timing of "Got routers updated notification" looks delayed..

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-22:

#18

OK New fix looks working 10 times without this failure.

On previous patch it was working 10 times, but after the test, I added _rcp_loop..
and the _rpc_loop looks stopped (or may be delayed to execute ) on the gate failure.

In this patch I remove _rpc_loop

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-07-24:

#19

It looks like timeout of tempest side looks also short.
(it is 20 sec)

Changed in tempest:
status:	Invalid → Confirmed
assignee:	nobody → Nachi Ueno (nati-ueno)
importance:	Undecided → High

OpenStack Infra (hudson-openstack) on 2013-07-24

Changed in tempest:
status:	Confirmed → In Progress

Nachi Ueno (nati-ueno) on 2013-07-24

Changed in tempest:
milestone:	none → havana-3

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2013-07-25: Fix merged to tempest (master)

#20

Reviewed: https://review.openstack.org/38513
Committed: http://github.com/openstack/tempest/commit/6d580bee934199c4935e4d754cbafb67cc859961
Submitter: Jenkins
Branch: master

commit 6d580bee934199c4935e4d754cbafb67cc859961
Author: Nachi Ueno <email address hidden>
Date: Wed Jul 24 10:58:11 2013 -0700

Increase ping timeout on scenario testing

Some gating may fail because ping timeout is short (20s).
In this commit, we will increase up to 60s.

- Added ping_timeout for compute config with default 60s
- Replaced hardcorded ssh timeout value with ssh_timeout

Fixes bug 1194026

Change-Id: If4e64aff17fc9aea1b6de03c684dff145ef5e6f2

Changed in tempest:
status:	In Progress → Fix Released

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2013-08-05:

#21

Gating is going back to normal again.
(see graph https://gist.github.com/nati/6158178)

We could start vorting again.

I don't see this errors in 48 hours run.

so IMO, We can close this bug.
# if we found this bug again, let's re-open this one.

Changed in neutron:
importance:	Critical → Medium
importance:	Medium → Critical
status:	In Progress → Fix Committed

Revision history for this message

Chmouel Boudjnah (chmouel) wrote on 2013-08-06:

#22

@Nachi we just got a new one again in https://review.openstack.org/#/c/40084/

Revision history for this message

Chmouel Boudjnah (chmouel) wrote on 2013-08-06:

#23

Sorry I mean this one https://review.openstack.org/#/c/38739/

Mark McClain (markmcclain) on 2013-09-05

tags:

added: tempest

Thierry Carrez (ttx) on 2013-09-05

Changed in neutron:
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2013-10-17

Changed in neutron:
milestone:	havana-3 → 2013.2

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

Bug #1197476

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.