check_public_network_connectivity fails with timeout

Bug #1194026 reported by Avishay Traeger
64
This bug affects 14 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Nachi Ueno
tempest
Fix Released
High
Nachi Ueno

Bug Description

2013-06-24 06:11:14.330 | ======================================================================
2013-06-24 06:11:14.330 | FAIL: tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_008_check_public_network_connectivity[gate,smoke]
2013-06-24 06:11:14.331 | ----------------------------------------------------------------------
2013-06-24 06:11:14.331 | _StringException: Traceback (most recent call last):
2013-06-24 06:11:14.331 | File "/opt/stack/new/tempest/tempest/scenario/test_network_basic_ops.py", line 255, in test_008_check_public_network_connectivity
2013-06-24 06:11:14.331 | self._check_vm_connectivity(ip_address, ssh_login, private_key)
2013-06-24 06:11:14.331 | File "/opt/stack/new/tempest/tempest/scenario/manager.py", line 455, in _check_vm_connectivity
2013-06-24 06:11:14.331 | "reachable" % ip_address)
2013-06-24 06:11:14.331 | File "/usr/lib/python2.7/unittest/case.py", line 420, in assertTrue
2013-06-24 06:11:14.331 | raise self.failureException(msg)
2013-06-24 06:11:14.331 | AssertionError: Timed out waiting for 172.24.4.229 to become reachable

http://logs.openstack.org/33227/2/gate/gate-tempest-devstack-vm-quantum/33279/console.html.gz

Tags: tempest
Jeremy Stanley (fungi)
affects: openstack-ci → tempest
Revision history for this message
David Kranz (david-kranz) wrote :

This has been happening a fair bit. logstash shows many hits for the error and they all seem to be in quantum runs where the non-quantum run of the test passed. So changing to neutron until proven otherwise.

affects: tempest → neutron
Revision history for this message
xiaohanyu (xiaohanyu1988) wrote :
Revision history for this message
David Kranz (david-kranz) wrote :

This is failing the gate almost once per hour. Perhaps it should be given a somewhat high priority.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

47 rechecks, it definitely deserves some immediate attention!

Sean Dague (sdague)
Changed in neutron:
status: New → Confirmed
Changed in neutron:
importance: Undecided → High
Maru Newby (maru)
Changed in neutron:
assignee: nobody → Maru Newby (maru)
Revision history for this message
Nachi Ueno (nati-ueno) wrote :

This could be timing bug on neutron l3-agent.

I greped the l3-agent log using 172.24.4.229.
http://logs.openstack.org/33227/2/gate/gate-tempest-devstack-vm-quantum/33279/logs/screen-q-l3.txt.gz

In this sequence of test, we use 172.24.4.229 twice times.
First one is for execise.sh, seconde is for tempest test.

This is for first one
http://paste.openstack.org/show/39852/

It working and exercise.sh succeed.

But for second time, it looks do nothing...
http://paste.openstack.org/show/39854/

I'll keep checking.

Maru Newby (maru)
Changed in neutron:
assignee: Maru Newby (maru) → nobody
Revision history for this message
Thierry Carrez (ttx) wrote :

Would be great to nail this one down before the review rush to h2

Changed in neutron:
importance: High → Critical
milestone: none → havana-2
Changed in neutron:
assignee: nobody → Gary Kotton (garyk)
Revision history for this message
Nachi Ueno (nati-ueno) wrote :

I'm checking logs.
I compared success one and failed one.

( I grepped 172.24.4.229)

q-srv log is same. but l3-agent log is quite different.

Also this error starts 6/24.
so I checked the commit
- which is before 6/24
- Modifies l3-agent

This commit looks related.. This commit may affect floating ip also.

Author: Jenkins <email address hidden>
Date: Wed May 29 14:40:59 2013 +0000

    Merge "Configurable external gateway modes"
https://github.com/openstack/neutron/commit/59089e56f674f5f94f67c5986e9a616bb669d846

I have no proof of this, also there is still no explanation why it happens 6/24 at high rate..
(we upgraded greenlet at 6/23, this may related .. but I'm not sure..)

I'll trying to test neutron without this commit.

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

Hi

I think I could reproduce this issue (if the test script is not wrong...)

Here is test script and result.
http://paste.openstack.org/show/40119/

I'm trying to fix the issue

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

Sorry script is not copy correctly.

here is true one
http://paste.openstack.org/show/40133/

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

I think I found the possible cause of issue.

see
http://paste.openstack.org/show/40156/

line starts from **** Get is RPC message log.
line starts from [[[[[ Process is when l3-agent start processing rpc messages.
(I added rpc message number for debugging)

so we expected all RPC call is executed serialized way.
However it is executed in random order!

We are using semaphore in eventlet, but it looks we should use queue instead of queue.
We should also combine RPC messages.
Because when we get concurrent request, l3-agent processes one by one.
so sometimes, the process for the new ip will be delayed long time.

Changed in tempest:
status: New → Incomplete
Revision history for this message
Nachi Ueno (nati-ueno) wrote :

I'm writing patch.

Changed in tempest:
status: Incomplete → Invalid
Changed in neutron:
status: Confirmed → In Progress
assignee: Gary Kotton (garyk) → Nachi Ueno (nati-ueno)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/36890

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/36890
Committed: http://github.com/openstack/neutron/commit/57e1fa32648a28ae83e3c3033258fe08de2e5fca
Submitter: Jenkins
Branch: master

commit 57e1fa32648a28ae83e3c3033258fe08de2e5fca
Author: Nachi Ueno <email address hidden>
Date: Fri Jul 12 12:21:46 2013 -0700

    Improve l3-agent performance and prevent races in it.

    Fixes bug 1194026
    Fixes bug 1200749

    Introduce a looping call for performing synchronization with
    neutron server.
    The sync will be performed only if router changes are notified
    via rpc. Only affected routers will be synchronized.

    Changes will be implemented by the l3 agent spawning a
    distinct greenthread for each router - iptables will
    be executed only once using iptables_manager.defer_apply_on.

    This patch will prevent the occurence of the following issues:
    - Out-of-order rpc message processing
    - Long processing time for router changes due to serial execution
    - Occasional and expected RPC blocks for long periods
    - Unnecessary processing of multiple requests

    Change-Id: I0978d1c38ac5c38c4548e5b1877857bb5cac3b81

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Nachi Ueno (nati-ueno)
Changed in neutron:
status: Fix Released → In Progress
milestone: havana-2 → havana-3
Revision history for this message
Nachi Ueno (nati-ueno) wrote :

hmm still I'm getting errors.
It looks previous patch improves success rate, but it still failing on some percentage

(fail case)
http://logstash.openstack.org/#eyJzZWFyY2giOiJAdGFnczpjb25zb2xlLmh0bWwgQU5EIEBtZXNzYWdlOlwidGVtcGVzdC5zY2VuYXJpby50ZXN0X25ldHdvcmtfYmFzaWNfb3BzLlRlc3ROZXR3b3JrQmFzaWNPcHMudGVzdF8wMDhfY2hlY2tfcHVibGljX25ldHdvcmtfY29ubmVjdGl2aXR5W2dhdGUsc21va2VdIC4uLiBGQUlMXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzNzQxODYxOTg0MjF9

(pass case)
http://logstash.openstack.org/#eyJzZWFyY2giOiJAdGFnczpjb25zb2xlLmh0bWwgQU5EIEBtZXNzYWdlOlwidGVtcGVzdC5zY2VuYXJpby50ZXN0X25ldHdvcmtfYmFzaWNfb3BzLlRlc3ROZXR3b3JrQmFzaWNPcHMudGVzdF8wMDhfY2hlY2tfcHVibGljX25ldHdvcmtfY29ubmVjdGl2aXR5W2dhdGUsc21va2VdIC4uLiBva1wiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzc0MTg2MjgyODU3fQ==?field.comment=hmm still I'm getting errors.

(fail case)
http://logstash.openstack.org/#eyJzZWFyY2giOiJAdGFnczpjb25zb2xlLmh0bWwgQU5EIEBtZXNzYWdlOlwidGVtcGVzdC5zY2VuYXJpby50ZXN0X25ldHdvcmtfYmFzaWNfb3BzLlRlc3ROZXR3b3JrQmFzaWNPcHMudGVzdF8wMDhfY2hlY2tfcHVibGljX25ldHdvcmtfY29ubmVjdGl2aXR5W2dhdGUsc21va2VdIC4uLiBGQUlMXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzNzQxODYxOTg0MjF9

(pass case)
http://logstash.openstack.org/#eyJzZWFyY2giOiJAdGFnczpjb25zb2xlLmh0bWwgQU5EIEBtZXNzYWdlOlwidGVtcGVzdC5zY2VuYXJpby50ZXN0X25ldHdvcmtfYmFzaWNfb3BzLlRlc3ROZXR3b3JrQmFzaWNPcHMudGVzdF8wMDhfY2hlY2tfcHVibGljX25ldHdvcmtfY29ubmVjdGl2aXR5W2dhdGUsc21va2VdIC4uLiBva1wiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzc0MTg2MjgyODU3fQ==

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

This is before the patch merged

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

This is after the patch merged

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

One findings from log is
Timing of "Got routers updated notification" looks delayed..

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

OK New fix looks working 10 times without this failure.

On previous patch it was working 10 times, but after the test, I added _rcp_loop..
and the _rpc_loop looks stopped (or may be delayed to execute ) on the gate failure.

In this patch I remove _rpc_loop

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

It looks like timeout of tempest side looks also short.
(it is 20 sec)

Changed in tempest:
status: Invalid → Confirmed
assignee: nobody → Nachi Ueno (nati-ueno)
importance: Undecided → High
Changed in tempest:
status: Confirmed → In Progress
Nachi Ueno (nati-ueno)
Changed in tempest:
milestone: none → havana-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/38513
Committed: http://github.com/openstack/tempest/commit/6d580bee934199c4935e4d754cbafb67cc859961
Submitter: Jenkins
Branch: master

commit 6d580bee934199c4935e4d754cbafb67cc859961
Author: Nachi Ueno <email address hidden>
Date: Wed Jul 24 10:58:11 2013 -0700

    Increase ping timeout on scenario testing

    Some gating may fail because ping timeout is short (20s).
    In this commit, we will increase up to 60s.

    - Added ping_timeout for compute config with default 60s
    - Replaced hardcorded ssh timeout value with ssh_timeout

    Fixes bug 1194026

    Change-Id: If4e64aff17fc9aea1b6de03c684dff145ef5e6f2

Changed in tempest:
status: In Progress → Fix Released
Revision history for this message
Nachi Ueno (nati-ueno) wrote :

Gating is going back to normal again.
(see graph https://gist.github.com/nati/6158178)

We could start vorting again.

I don't see this errors in 48 hours run.

so IMO, We can close this bug.
# if we found this bug again, let's re-open this one.

Changed in neutron:
importance: Critical → Medium
importance: Medium → Critical
status: In Progress → Fix Committed
Revision history for this message
Chmouel Boudjnah (chmouel) wrote :

@Nachi we just got a new one again in https://review.openstack.org/#/c/40084/

Revision history for this message
Chmouel Boudjnah (chmouel) wrote :
tags: added: tempest
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: havana-3 → 2013.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.