sporadic test failures due to VMs not getting a DHCP lease

Bug #1404743 reported by Angus Salkeld
This bug affects 3 people
Affects: neutron
Status: Invalid
Importance: High
Assigned to: Ryan Moats
Milestone: —

Bug Description

http://logs.openstack.org/01/141001/4/gate/gate-tempest-dsvm-neutron-full/0fcd5ec/console.html.gz

2014-12-19 14:01:31.371 | SSHTimeout: Connection to the 172.24.4.69 via SSH timed out.

Revision history for this message
Henry Gessau (gessau) wrote :

I suspect this might be a duplicate of Bug #1253896.

tags: added: gate-failure
Changed in neutron:
importance: Undecided → High
Revision history for this message
yalei wang (yalei-wang) wrote :

Could not reproduce it on the master branch with 'tox -e py27' in tempest.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

That happens from time to time in the gate, but apparently it has little to do with the load balancer test. Most probably it's some issue with floating IPs and accessing the VM.

Revision history for this message
Wei Hu (huwei-xtu) wrote :

My patch failed to pass Jenkins because of this issue.
Failing case: tempest.scenario.test_load_balancer_basic.TestLoadBalancerBasic.test_load_balancer_basic[compute,network].
Console log: http://logs.openstack.org/99/144399/6/check/check-tempest-dsvm-neutron-full/05caba4/console.html.
From the log information, I think this results from cirros not getting an IP address. See the cirros console log (ifconfig -a).

2015-01-05 04:27:07.992 | === system information ===
2015-01-05 04:27:07.992 | Platform: OpenStack Foundation OpenStack Nova
2015-01-05 04:27:07.992 | Container: none
2015-01-05 04:27:07.992 | Arch: x86_64
2015-01-05 04:27:07.992 | CPU(s): 1 @ 2593.733 MHz
2015-01-05 04:27:07.992 | Cores/Sockets/Threads: 1/1/1
2015-01-05 04:27:07.992 | Virt-type: AMD-V
2015-01-05 04:27:07.992 | RAM Size: 49MB
2015-01-05 04:27:07.993 | Disks:
2015-01-05 04:27:07.993 | NAME MAJ:MIN SIZE LABEL MOUNTPOINT
2015-01-05 04:27:07.993 | vda 253:0 25165824 cirros-rootfs /
2015-01-05 04:27:07.993 | sr0 11:0 419840 config-2
2015-01-05 04:27:07.993 | === sshd host keys ===
2015-01-05 04:27:07.993 | -----BEGIN SSH HOST KEY KEYS-----
2015-01-05 04:27:07.993 | ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgmrXHV1aEK2K+fy6cdj/OAw1SBDNZEJt0QwelbXK+nfTYPdtkzUTiWrifbDOtvZXcFUXkCj+uiH8EjlTkd1MOxcDwobz+auNy9O3DTB3RDzwFQ0EkWST1LTgRh8bWcPK5zFNOxdxAT2/l+j5Q0NLx/dDG1JV0KvjpGCUPLZMIHDaeXs= root@server1
2015-01-05 04:27:07.993 | ssh-dss AAAAB3NzaC1kc3MAAACBAK9X11hJOjTfL0yhKqZjqtkNAb//Nil96drqY8iHE7YYCMaXoWIfI8FcE8BKWFTIJQApPP+62y2ocEgeooZkpHOxiA9hAaixaJBjVYHt7++lpojlMmDYPzU+v6UH/BN9epHLKR78FKHyjgfoIVAeKvJB2iAUztFR72j+rcP62IMPAAAAFQCNWnIeBJF8zFcXKsrdTbP6ziCkdQAAAIAZcSyBWIZ+bQ76Wt0CDTo9CjA9WC0A+Cvu1IxvyUlsgUZAwOS9e7BlLg7rMSJMh9S2td3ormSf0LAafF7vucDxNCx3AaR1ZrWmOUqzMl4krgxQGBazobwZFg3xLogwPCRB6Gfqu9ihO6IbX1P2CWD2b4WjpUeSjTfkMJz7pyooGgAAAIA7Dv6NWRs1pHSgRdrjho86wLKZv/8A+4/q75R8nrjkFDZOMsvfQPmwwVN5zAI4CMS0KXdx0+eq/vauT6F6Uf66J2RWlvCwZvF/oBHIDNLx/4Oi1btixfOSekp+BPZVBSqV3BSvGkRdT6C5VMTgSwA4QSZefuy9SBlq7kOssTaPMg== root@server1
2015-01-05 04:27:07.993 | -----END SSH HOST KEY KEYS-----
2015-01-05 04:27:07.994 | === network info ===
2015-01-05 04:27:07.994 | if-info: lo,up,127.0.0.1,8,::1
2015-01-05 04:27:07.994 | if-info: eth0,up,,8,fe80::f816:3eff:fe5a:b2a
2015-01-05 04:27:07.994 | === datasource: configdrive local ===
2015-01-05 04:27:07.994 | instance-id: 6d7a1f8f-2b76-42d9-8d01-1512e6e3c9cf
2015-01-05 04:27:07.994 | name: server1
2015-01-05 04:27:07.994 | availability-zone: nova
2015-01-05 04:27:07.994 | local-hostname: server1.novalocal
2015-01-05 04:27:07.994 | launch-index: 0
2015-01-05 04:27:07.994 | === cirros: current=0.3.2 uptime=195.93 ===
2015-01-05 04:27:07.994 | route: fscanf
2015-01-05 04:27:07.994 | === pinging gateway failed, debugging connection ===
2015-01-05 04:27:07.995 | ############ debug start ##############
2015-01-05 04:27:07.995 | ### /etc/init.d/sshd start
2015-01-05 04:27:07.995 | Starting dropbear sshd: OK
2015-01-05 04:27:07.995 | route: fscanf
2015-01-05 04:27:07.995 | ### ifconfig -a
2015-01-05 04:27:07.995 | ...

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

Exactly - the reason for the failure is that the VM is not getting an IP from DHCP. A more relevant part of the cirros log is below:

http://logs.openstack.org/99/144399/6/check/check-tempest-dsvm-neutron-full/05caba4/console.html

2015-01-05 04:27:08.148 | info: initramfs: up at 3.24
2015-01-05 04:27:08.148 | info: copying initramfs to /dev/vda
2015-01-05 04:27:08.148 | info: initramfs loading root from /dev/vda
2015-01-05 04:27:08.148 | info: /etc/init.d/rc.sysinit: up at 6.86
2015-01-05 04:27:08.148 | info: container: none
2015-01-05 04:27:08.148 | Starting logging: OK
2015-01-05 04:27:08.148 | Initializing random number generator... done.
2015-01-05 04:27:08.148 | Starting acpid: OK
2015-01-05 04:27:08.148 | cirros-ds 'local' up at 8.81
2015-01-05 04:27:08.148 | found datasource (configdrive, local)
2015-01-05 04:27:08.148 | Starting network...
2015-01-05 04:27:08.149 | udhcpc (v1.20.1) started
2015-01-05 04:27:08.149 | Sending discover...
2015-01-05 04:27:08.149 | Sending discover...
2015-01-05 04:27:08.149 | Sending discover...
---> 2015-01-05 04:27:08.149 | No lease, failing
2015-01-05 04:27:08.149 | WARN: /etc/rc3.d/S40-network failed
2015-01-05 04:27:08.149 | Starting dropbear sshd: generating rsa key... generating dsa key... OK
2015-01-05 04:27:08.149 | no userdata for datasource
...
2015-01-05 04:27:08.151 | === network info ===
2015-01-05 04:27:08.151 | if-info: lo,up,127.0.0.1,8,::1
2015-01-05 04:27:08.151 | if-info: eth0,up,,8,fe80::f816:3eff:fe5a:b2a

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :
summary: - test_load_balancer_basic fails with ssh timeout
+ sporadic test failures due to VMs not getting a DHCP lease
Changed in neutron:
status: New → Confirmed
tags: added: l3-ipam-dhcp sg-fw
Revision history for this message
SHENGJIAO CAO (caosj05) wrote :

I have met this issue too, on both Juno and Icehouse.

Sometimes I get "No lease, failing" in the nova console-log when I boot a new instance, which leaves the neutron port associated with the VM "DOWN".

And I discovered that by restarting the nova-compute and neutron-plugin-openvswitch-agent services on the compute node and rebooting the VM, the port always becomes active and DHCP succeeds: "Lease of XXXX obtained"

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

Just checking in: 311 hits this week from the logstash query linked from comment #6. This is down quite a bit from January but still not gone. I did not look deeper; this is the kind of bug that could be caused by any number of problems. Needs more investigation.

Revision history for this message
Kyle Mestery (mestery) wrote :

The logstash query from comment #6 now shows 475 hits, so it's gone up a bit since Carl last checked in. Needs more investigation.

Revision history for this message
Ryan Moats (rmoats) wrote :
Revision history for this message
Kyle Mestery (mestery) wrote :

Bumping to critical for now based on Ryan's comments in #10, and assigning to Ryan for further triage.

Changed in neutron:
importance: High → Critical
assignee: nobody → Ryan Moats (rmoats)
Revision history for this message
Ryan Moats (rmoats) wrote :

It looks like this error has multiple sources, which will have to be dealt with.

The first suspicious thing I'm seeing is signatures like the following: http://pastebin.com/PHnkYPRC - at that point the test tries to reach the floating IP address (172.24.5.161) and gives up after 16 tries.

Revision history for this message
Ryan Moats (rmoats) wrote :

The second suspicious thing I'm seeing is a signature like the following: http://pastebin.com/gKUqERBx

In this case, the floating-ip create has a port ID that doesn't show up in the later port list, so I'm not 100% sure how the floating IP ends up with the fixed IP address that it does.

Revision history for this message
Ryan Moats (rmoats) wrote :

In the third failure signature, the test creates a floating IP whose status is DOWN and then doesn't wait for it to change before trying to run the tests.

Revision history for this message
Ryan Moats (rmoats) wrote :

Some more clarifications - in the first case, Nova returns 202 when the FIP is attached to the instance. 202 typically means the request was accepted but hasn't finished, so I'm thinking the test case should have some code added to double-check that the status of the FIP is active before trying to SSH to it. That would segment the issue.

I'm now thinking the second and third cases are the same - here again, I suspect the test code is hitting a situation it isn't equipped to deal with (multiple/stale ports with the same fixed IP), but I'll need to look at the code [the next step] to try to understand the why a bit more.
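
As a hedged illustration of detecting that multiple/stale-port situation (assuming python-neutronclient; the client credentials are placeholders and this is not the actual test code):

    from collections import defaultdict

    from neutronclient.v2_0 import client as neutron_client

    # Hypothetical client setup -- the credentials are placeholders.
    neutron = neutron_client.Client(
        username='admin', password='secret', tenant_name='admin',
        auth_url='http://127.0.0.1:5000/v2.0')

    # Group ports by fixed IP; more than one port claiming the same
    # address suggests a stale port is still hanging around.
    by_ip = defaultdict(list)
    for port in neutron.list_ports()['ports']:
        for fixed in port['fixed_ips']:
            by_ip[fixed['ip_address']].append(port['id'])

    for ip, port_ids in by_ip.items():
        if len(port_ids) > 1:
            print('fixed IP %s is claimed by ports: %s' % (ip, port_ids))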

Revision history for this message
Ryan Moats (rmoats) wrote :

The first case is confirmed. The test case (TestVolumeBootPattern:test_volume_boot_pattern) does indeed use the nova client to create and associate the floating IP (lines 99-109), but does not check that the floating IP is active before trying to connect. The proposed patch is to replace the use of the nova client with the neutron client and add the necessary logic to ensure the FIP is active before trying to connect.
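
For illustration, a minimal sketch of that wait, assuming python-neutronclient; the helper name, credentials, and timeouts are hypothetical, not the actual tempest patch:

    import time

    from neutronclient.v2_0 import client as neutron_client

    # Hypothetical client setup -- the credentials are placeholders.
    neutron = neutron_client.Client(
        username='admin', password='secret', tenant_name='admin',
        auth_url='http://127.0.0.1:5000/v2.0')

    def wait_for_fip_active(fip_id, timeout=60, interval=2):
        """Poll Neutron until the floating IP reports status ACTIVE."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            fip = neutron.show_floatingip(fip_id)['floatingip']
            if fip['status'] == 'ACTIVE':
                return fip
            time.sleep(interval)
        raise RuntimeError('FIP %s not ACTIVE after %ss' % (fip_id, timeout))

The test would call wait_for_fip_active() between associating the FIP and opening the SSH connection.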

Revision history for this message
Ryan Moats (rmoats) wrote :

There appears to be a common thread to all of these error cases: the FIP GW agent port is bound to an agent, but the agent never requests the status of the port, never attaches it to a bridge, and never transitions it to ACTIVE. This in turn leads to the GARP failing (the failures show up in the L3 agent log), which means the FIP is not reachable, either by SSH or ping.
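
A hedged diagnostic sketch for spotting those stuck gateway ports (assuming python-neutronclient and the DVR device_owner value; the credentials are placeholders):

    from neutronclient.v2_0 import client as neutron_client

    # Hypothetical client setup -- the credentials are placeholders.
    neutron = neutron_client.Client(
        username='admin', password='secret', tenant_name='admin',
        auth_url='http://127.0.0.1:5000/v2.0')

    # FIP agent gateway ports that were bound to a host but never
    # transitioned to ACTIVE match the failure signature described above.
    ports = neutron.list_ports(
        device_owner='network:floatingip_agent_gateway')['ports']
    for port in ports:
        if port['status'] != 'ACTIVE':
            print('stuck port %s on host %s: status=%s'
                  % (port['id'], port.get('binding:host_id'), port['status']))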

Declaring this to now be triaged

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

it looks like this may no longer be triage-able.

Revision history for this message
Ryan Moats (rmoats) wrote :

Yes, the two logstash queries above show only 1 hit in the last seven days. I believe this bug can be closed for now.

Revision history for this message
Kyle Mestery (mestery) wrote :

Closing for now per Ryan's comments in #19.

Changed in neutron:
status: In Progress → Invalid
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/210378
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=30b121dfa4d8542665b2209d21ece805d8957aa5
Submitter: Jenkins
Branch: master

commit 30b121dfa4d8542665b2209d21ece805d8957aa5
Author: Oleg Bondarev <email address hidden>
Date: Fri Aug 7 15:44:29 2015 +0300

    Ensure l3 agent receives notification about added router

    Currently router_added (and other) notifications are sent
    to agents with an RPC cast() method which does not ensure that
    the message is actually delivered to the recipient.
    If the message is lost (due to instability of messaging system
    in some failover scenarios for example) neither server nor agent
    will be aware of that and router will be "lost" till next agent
    resync. Resync will only happen in case of errors on agent side
    or restart.
    The fix makes server use call() to notify agents about added routers
    thus ensuring no routers will be lost.

    This also unifies reschedule_router() method to avoid code duplication
    between legacy and dvr agent schedulers.

    Closes-Bug: #1482630
    Related-Bug #1404743
    Change-Id: Id08764ba837d8f47a28649d081a5876797fe369e
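
As a rough illustration of the cast()/call() distinction the fix relies on (assuming oslo.messaging; the topic, method, and payload names are illustrative, not the actual neutron RPC plumbing):

    import oslo_messaging as messaging
    from oslo_config import cfg

    transport = messaging.get_transport(cfg.CONF)
    target = messaging.Target(topic='l3_agent', version='1.0')
    client = messaging.RPCClient(transport, target)
    ctxt = {}  # placeholder request context
    router_id = 'example-router-id'  # placeholder

    # cast() is fire-and-forget: if the message is lost, neither side
    # notices and the router stays "lost" until the next agent resync.
    client.cast(ctxt, 'router_added_to_agent', payload=[router_id])

    # call() blocks for a reply and raises MessagingTimeout on failure,
    # so the server finds out when a notification was not delivered.
    client.call(ctxt, 'router_added_to_agent', payload=[router_id])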

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/252852

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/252852
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1eb40b49f3da95e7960860da1033ee15932929ef
Submitter: Jenkins
Branch: stable/liberty

commit 1eb40b49f3da95e7960860da1033ee15932929ef
Author: Oleg Bondarev <email address hidden>
Date: Thu Dec 3 12:52:51 2015 +0300

    Ensure l3 agent receives notification about added router

    Currently router_added (and other) notifications are sent
    to agents with an RPC cast() method which does not ensure that
    the message is actually delivered to the recipient.
    If the message is lost (due to instability of messaging system
    in some failover scenarios for example) neither server nor agent
    will be aware of that and router will be "lost" till next agent
    resync. Resync will only happen in case of errors on agent side
    or restart.
    The fix makes server use call() to notify agents about added routers
    thus ensuring no routers will be lost.
    This also unifies reschedule_router() method to avoid code duplication
    between legacy and dvr agent schedulers.

    Closes-Bug: #1482630
    Related-Bug #1404743
    Change-Id: Id08764ba837d8f47a28649d081a5876797fe369e
    (cherry picked from commit 30b121dfa4d8542665b2209d21ece805d8957aa5)

tags: added: in-stable-liberty