Bug #1404743 “sporadic test failures due to VMs not getting a DH...” : Bugs : neutron

Revision history for this message

Henry Gessau (gessau) wrote on 2014-12-21:

#1

I suspect this might be a duplicate of Bug #1253896.

Eugene Nikanorov (enikanorov) on 2014-12-22

tags:	added: gate-failure
Changed in neutron:
importance:	Undecided → High

Revision history for this message

yalei wang (yalei-wang) wrote on 2014-12-29:

#2

could not reproduce it in the master branch with 'tox -e py27' in tempest.

Revision history for this message

Eugene Nikanorov (enikanorov) wrote on 2014-12-29:

#3

That happens from time to time in the gate.
But apparently that has little to do with load balancer test.
Most probably it's some issue with floating ips and accessing VM.

Revision history for this message

Wei Hu (huwei-xtu) wrote on 2015-01-05:

#4

Download full text (5.1 KiB)

My patch failed to pass jenkins by this issue.
Failure case: tempest.scenario.test_load_balancer_basic.TestLoadBalancerBasic.test_load_balancer_basic[compute,network].
Consolo-log: http://logs.openstack.org/99/144399/6/check/check-tempest-dsvm-neutron-full/05caba4/console.html.
And from the log informations. I think this can be result by cirros can not get an ip address. See cirros console-log(ifconfig -a).

2015-01-05 04:27:07.992 | === system information ===
2015-01-05 04:27:07.992 | Platform: OpenStack Foundation OpenStack Nova
2015-01-05 04:27:07.992 | Container: none
2015-01-05 04:27:07.992 | Arch: x86_64
2015-01-05 04:27:07.992 | CPU(s): 1 @ 2593.733 MHz
2015-01-05 04:27:07.992 | Cores/Sockets/Threads: 1/1/1
2015-01-05 04:27:07.992 | Virt-type: AMD-V
2015-01-05 04:27:07.992 | RAM Size: 49MB
2015-01-05 04:27:07.993 | Disks:
2015-01-05 04:27:07.993 | NAME MAJ:MIN SIZE LABEL MOUNTPOINT
2015-01-05 04:27:07.993 | vda 253:0 25165824 cirros-rootfs /
2015-01-05 04:27:07.993 | sr0 11:0 419840 config-2
2015-01-05 04:27:07.993 | === sshd host keys ===
2015-01-05 04:27:07.993 | -----BEGIN SSH HOST KEY KEYS-----
2015-01-05 04:27:07.993 | ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgmrXHV1aEK2K+fy6cdj/OAw1SBDNZEJt0QwelbXK+nfTYPdtkzUTiWrifbDOtvZXcFUXkCj+uiH8EjlTkd1MOxcDwobz+auNy9O3DTB3RDzwFQ0EkWST1LTgRh8bWcPK5zFNOxdxAT2/l+j5Q0NLx/dDG1JV0KvjpGCUPLZMIHDaeXs= root@server1
2015-01-05 04:27:07.993 | ssh-dss AAAAB3NzaC1kc3MAAACBAK9X11hJOjTfL0yhKqZjqtkNAb//Nil96drqY8iHE7YYCMaXoWIfI8FcE8BKWFTIJQApPP+62y2ocEgeooZkpHOxiA9hAaixaJBjVYHt7++lpojlMmDYPzU+v6UH/BN9epHLKR78FKHyjgfoIVAeKvJB2iAUztFR72j+rcP62IMPAAAAFQCNWnIeBJF8zFcXKsrdTbP6ziCkdQAAAIAZcSyBWIZ+bQ76Wt0CDTo9CjA9WC0A+Cvu1IxvyUlsgUZAwOS9e7BlLg7rMSJMh9S2td3ormSf0LAafF7vucDxNCx3AaR1ZrWmOUqzMl4krgxQGBazobwZFg3xLogwPCRB6Gfqu9ihO6IbX1P2CWD2b4WjpUeSjTfkMJz7pyooGgAAAIA7Dv6NWRs1pHSgRdrjho86wLKZv/8A+4/q75R8nrjkFDZOMsvfQPmwwVN5zAI4CMS0KXdx0+eq/vauT6F6Uf66J2RWlvCwZvF/oBHIDNLx/4Oi1btixfOSekp+BPZVBSqV3BSvGkRdT6C5VMTgSwA4QSZefuy9SBlq7kOssTaPMg== root@server1
2015-01-05 04:27:07.993 | -----END SSH HOST KEY KEYS-----
2015-01-05 04:27:07.994 | === network info ===
2015-01-05 04:27:07.994 | if-info: lo,up,127.0.0.1,8,::1
2015-01-05 04:27:07.994 | if-info: eth0,up,,8,fe80::f816:3eff:fe5a:b2a
2015-01-05 04:27:07.994 | === datasource: configdrive local ===
2015-01-05 04:27:07.994 | instance-id: 6d7a1f8f-2b76-42d9-8d01-1512e6e3c9cf
2015-01-05 04:27:07.994 | name: server1
2015-01-05 04:27:07.994 | availability-zone: nova
2015-01-05 04:27:07.994 | local-hostname: server1.novalocal
2015-01-05 04:27:07.994 | launch-index: 0
2015-01-05 04:27:07.994 | === cirros: current=0.3.2 uptime=195.93 ===
2015-01-05 04:27:07.994 | route: fscanf
2015-01-05 04:27:07.994 | === pinging gateway failed, debugging connection ===
2015-01-05 04:27:07.995 | ############ debug start ##############
2015-01-05 04:27:07.995 | ### /etc/init.d/sshd start
2015-01-05 04:27:07.995 | Starting dropbear sshd: OK
2015-01-05 04:27:07.995 | route: fscanf
2015-01-05 04:27:07.995 | ### ifconfig -a
2015-01-05 04:27:07.995 | ...

My patch failed to pass jenkins by this issue.
Failure case: tempest.scenario.test_load_balancer_basic.TestLoadBalancerBasic.test_load_balancer_basic[compute,network].
Consolo-log: http://logs.openstack.org/99/144399/6/check/check-tempest-dsvm-neutron-full/05caba4/console.html.
And from the log informations. I think this can be result by cirros can not get an ip address. See cirros console-log(ifconfig -a).

2015-01-05 04:27:07.992 |     === system information ===
2015-01-05 04:27:07.992 |     Platform: OpenStack Foundation OpenStack Nova
2015-01-05 04:27:07.992 |     Container: none
2015-01-05 04:27:07.992 |     Arch: x86_64
2015-01-05 04:27:07.992 |     CPU(s): 1 @ 2593.733 MHz
2015-01-05 04:27:07.992 |     Cores/Sockets/Threads: 1/1/1
2015-01-05 04:27:07.992 |     Virt-type: AMD-V
2015-01-05 04:27:07.992 |     RAM Size: 49MB
2015-01-05 04:27:07.993 |     Disks:
2015-01-05 04:27:07.993 |     NAME MAJ:MIN     SIZE LABEL         MOUNTPOINT
2015-01-05 04:27:07.993 |     vda  253:0   25165824 cirros-rootfs /
2015-01-05 04:27:07.993 |     sr0   11:0     419840 config-2      
2015-01-05 04:27:07.993 |     === sshd host keys ===
2015-01-05 04:27:07.993 |     -----BEGIN SSH HOST KEY KEYS-----
2015-01-05 04:27:07.993 |     ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgmrXHV1aEK2K+fy6cdj/OAw1SBDNZEJt0QwelbXK+nfTYPdtkzUTiWrifbDOtvZXcFUXkCj+uiH8EjlTkd1MOxcDwobz+auNy9O3DTB3RDzwFQ0EkWST1LTgRh8bWcPK5zFNOxdxAT2/l+j5Q0NLx/dDG1JV0KvjpGCUPLZMIHDaeXs= root@server1
2015-01-05 04:27:07.993 |     ssh-dss AAAAB3NzaC1kc3MAAACBAK9X11hJOjTfL0yhKqZjqtkNAb//Nil96drqY8iHE7YYCMaXoWIfI8FcE8BKWFTIJQApPP+62y2ocEgeooZkpHOxiA9hAaixaJBjVYHt7++lpojlMmDYPzU+v6UH/BN9epHLKR78FKHyjgfoIVAeKvJB2iAUztFR72j+rcP62IMPAAAAFQCNWnIeBJF8zFcXKsrdTbP6ziCkdQAAAIAZcSyBWIZ+bQ76Wt0CDTo9CjA9WC0A+Cvu1IxvyUlsgUZAwOS9e7BlLg7rMSJMh9S2td3ormSf0LAafF7vucDxNCx3AaR1ZrWmOUqzMl4krgxQGBazobwZFg3xLogwPCRB6Gfqu9ihO6IbX1P2CWD2b4WjpUeSjTfkMJz7pyooGgAAAIA7Dv6NWRs1pHSgRdrjho86wLKZv/8A+4/q75R8nrjkFDZOMsvfQPmwwVN5zAI4CMS0KXdx0+eq/vauT6F6Uf66J2RWlvCwZvF/oBHIDNLx/4Oi1btixfOSekp+BPZVBSqV3BSvGkRdT6C5VMTgSwA4QSZefuy9SBlq7kOssTaPMg== root@server1
2015-01-05 04:27:07.993 |     -----END SSH HOST KEY KEYS-----
2015-01-05 04:27:07.994 |     === network info ===
2015-01-05 04:27:07.994 |     if-info: lo,up,127.0.0.1,8,::1
2015-01-05 04:27:07.994 |     if-info: eth0,up,,8,fe80::f816:3eff:fe5a:b2a
2015-01-05 04:27:07.994 |     === datasource: configdrive local ===
2015-01-05 04:27:07.994 |     instance-id: 6d7a1f8f-2b76-42d9-8d01-1512e6e3c9cf
2015-01-05 04:27:07.994 |     name: server1
2015-01-05 04:27:07.994 |     availability-zone: nova
2015-01-05 04:27:07.994 |     local-hostname: server1.novalocal
2015-01-05 04:27:07.994 |     launch-index: 0
2015-01-05 04:27:07.994 |     === cirros: current=0.3.2 uptime=195.93 ===
2015-01-05 04:27:07.994 |     route: fscanf
2015-01-05 04:27:07.994 |     === pinging gateway failed, debugging connection ===
2015-01-05 04:27:07.995 |     ############ debug start ##############
2015-01-05 04:27:07.995 |     ### /etc/init.d/sshd start
2015-01-05 04:27:07.995 |     Starting dropbear sshd: OK
2015-01-05 04:27:07.995 |     route: fscanf
2015-01-05 04:27:07.995 |     ### ifconfig -a
2015-01-05 04:27:07.995 |     eth0      Link encap:Ethernet  HWaddr FA:16:3E:5A:0B:2A  
2015-01-05 04:27:07.995 |               inet6 addr: fe80::f816:3eff:fe5a:b2a/64 Scope:Link
2015-01-05 04:27:07.995 |               UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
2015-01-05 04:27:07.995 |               RX packets:136 errors:0 dropped:0 overruns:0 frame:0
2015-01-05 04:27:07.995 |               TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
2015-01-05 04:27:07.995 |               collisions:0 txqueuelen:1000 
2015-01-05 04:27:07.995 |               RX bytes:7284 (7.1 KiB)  TX bytes:1114 (1.0 KiB)
2015-01-05 04:27:07.996 |     
2015-01-05 04:27:07.996 |     lo        Link encap:Local Loopback  
2015-01-05 04:27:07.996 |               inet addr:127.0.0.1  Mask:255.0.0.0
2015-01-05 04:27:07.996 |               inet6 addr: ::1/128 Scope:Host
2015-01-05 04:27:07.996 |               UP LOOPBACK RUNNING  MTU:16436  Metric:1
2015-01-05 04:27:07.996 |               RX packets:12 errors:0 dropped:0 overruns:0 frame:0
2015-01-05 04:27:07.996 |               TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
2015-01-05 04:27:07.996 |               collisions:0 txqueuelen:0 
2015-01-05 04:27:07.996 |               RX bytes:1020 (1020.0 B)  TX bytes:1020 (1020.0 B)
2015-01-05 04:27:07.996 |     
2015-01-05 04:27:07.996 |     ### route -n
2015-01-05 04:27:07.996 |     Kernel IP routing table
2015-01-05 04:27:07.997 |     Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
2015-01-05 04:27:07.997 |     route: fscanf
2015-01-05 04:27:07.997 |     ### cat /etc/resolv.conf
2015-01-05 04:27:07.997 |     cat: can't open '/etc/resolv.conf': No such file or directory
2015-01-05 04:27:07.997 |     ### gateway not found
2015-01-05 04:27:07.997 |     /sbin/cirros-status: line 1: can't open /etc/resolv.conf: no such file
2015-01-05 04:27:07.997 |     ### pinging nameservers
2015-01-05 04:27:07.997 |     ### uname -a
2015-01-05 04:27:07.997 |     Linux server1 3.2.0-60-virtual #91-Ubuntu SMP Wed Feb 19 04:13:28 UTC 2014 x86_64 GNU/Linux

Revision history for this message

Pavlo Shchelokovskyy (pshchelo) wrote on 2015-01-08:

#5

exactly, the reason of failure is VM is not getting an IP from DHCP - a bit more relevant part of cirros log below:

http://logs.openstack.org/99/144399/6/check/check-tempest-dsvm-neutron-full/05caba4/console.html

2015-01-05 04:27:08.148 | info: initramfs: up at 3.24
2015-01-05 04:27:08.148 | info: copying initramfs to /dev/vda
2015-01-05 04:27:08.148 | info: initramfs loading root from /dev/vda
2015-01-05 04:27:08.148 | info: /etc/init.d/rc.sysinit: up at 6.86
2015-01-05 04:27:08.148 | info: container: none
2015-01-05 04:27:08.148 | Starting logging: OK
2015-01-05 04:27:08.148 | Initializing random number generator... done.
2015-01-05 04:27:08.148 | Starting acpid: OK
2015-01-05 04:27:08.148 | cirros-ds 'local' up at 8.81
2015-01-05 04:27:08.148 | found datasource (configdrive, local)
2015-01-05 04:27:08.148 | Starting network...
2015-01-05 04:27:08.149 | udhcpc (v1.20.1) started
2015-01-05 04:27:08.149 | Sending discover...
2015-01-05 04:27:08.149 | Sending discover...
2015-01-05 04:27:08.149 | Sending discover...
---> 2015-01-05 04:27:08.149 | No lease, failing
2015-01-05 04:27:08.149 | WARN: /etc/rc3.d/S40-network failed
2015-01-05 04:27:08.149 | Starting dropbear sshd: generating rsa key... generating dsa key... OK
2015-01-05 04:27:08.149 | no userdata for datasource
...
2015-01-05 04:27:08.151 | === network info ===
2015-01-05 04:27:08.151 | if-info: lo,up,127.0.0.1,8,::1
2015-01-05 04:27:08.151 | if-info: eth0,up,,8,fe80::f816:3eff:fe5a:b2a

Revision history for this message

Pavlo Shchelokovskyy (pshchelo) wrote on 2015-01-12:

#6

I'm tempted to change the title as I regularly encounter such failure reason on various tests.

Logstash query
message:"No lease, failing" AND tags:console

1,380 hits for last 7 days

http://logstash.openstack.org/#eyJzZWFyY2giOiIgbWVzc2FnZTpcIsKgwqDCoMKgTm8gbGVhc2UsIGZhaWxpbmdcIiBBTkQgdGFnczpjb25zb2xlIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDIxMDgwNzMyODQ2LCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

Pavlo Shchelokovskyy (pshchelo) on 2015-01-12

summary:

- test_load_balancer_basic fails with ssh timeout
+ sporadic test failures due to VMs not getting a DHCP lease

Eugene Nikanorov (enikanorov) on 2015-01-27

Changed in neutron:
status:	New → Confirmed
tags:	added: l3-ipam-dhcp sg-fw

Revision history for this message

SHENGJIAO CAO (caosj05) wrote on 2015-04-24:

#7

I have met this issue two. On both Juno or Icehouse.

Sometimes I get the "no lease, failing" in the nova console-log when I boot a new instance, which leads to the neutron port associated with VM "DOWN".

And I discovered that by restarting the nova-compute and neutron-plugin-openvswitch-agent service on the compute node and reboot the VM, the port always become active and the DHCP is successfull: "Lease of XXXX obtained"

Revision history for this message

Carl Baldwin (carl-baldwin) wrote on 2015-07-08:

#8

Just checking in. 311 hits this week from the logstash query linked from comment #6. This is down quite a bit from January but still not gone. I did not look deeper. This is the kind of bug which would be caused by any number of problems. Needs more investigation.

Revision history for this message

Kyle Mestery (mestery) wrote on 2015-08-11:

#9

The log stash query from comment #6 how shows 475 hits, so it's gone a bit since Carl last checked in. Needs more investigation.

Revision history for this message

Ryan Moats (rmoats) wrote on 2015-08-12:

#10

A slightly different logstash query shows things aren't so good:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRVJST1IgdGVtcGVzdC5zY2VuYXJpby5tYW5hZ2VyIFNTSFRpbWVvdXQ6XCIgQU5EIGJ1aWxkX25hbWU6ICpuZXV0cm9uKiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQzOTM5MjQ0ODMwMH0=

~1400 hits in 7 days

Revision history for this message

Kyle Mestery (mestery) wrote on 2015-08-12:

#11

Bumping to critical for now based on Ryan's comments in #10. Also, assigning to Ryan for now for further triage.

Changed in neutron:
importance:	High → Critical
assignee:	nobody → Ryan Moats (rmoats)

Revision history for this message

Ryan Moats (rmoats) wrote on 2015-08-12:

#12

It looks like this error has multiple sources, which will have to be dealt with.

The first suspicious thing I'm seeing is signatures like the following: http://pastebin.com/PHnkYPRC - at which point the test tries to reach the floating ip address (172.24.5.161) and gives up after 16 tries.

Revision history for this message

Ryan Moats (rmoats) wrote on 2015-08-12:

#13

The second suspicious thing I'm seeing is a signature like the following: http://pastebin.com/gKUqERBx

In this case, the floating-ip create has a port ID that doesn't show up in the later port list, so I'm not 100% sure how the floating-ip ends up with an the fixed IP address that it does.

Revision history for this message

Ryan Moats (rmoats) wrote on 2015-08-13:

#14

The third failure signature I'm seeing creates a floating IP, whose status is DOWN and then doesn't wait for it to change before trying to run the tests.

Revision history for this message

Ryan Moats (rmoats) wrote on 2015-08-13:

#15

some more clarifications - in the first case, Nova is returning 202 when the FIP is being attached to the instance. 202 typically means that the request is accepted, but hasn't been finished and so I'm thinking that the test case should have some code added to double check that the status of the FIP is active before trying to SSH to it. That would segment the issue.

I'm now thinking the second and third cases are the same - here again, I suspect the test code is hitting a situation it isn't equipped to deal with (multiple/stale ports with the same fixed IP), but I'll need to look at code [the next step] to try and understand the why a bit more

Revision history for this message

Ryan Moats (rmoats) wrote on 2015-08-13:

#16

The first case is confirmed. The test case (TestVolumeBootPattern:test_volume_boot_pattern) does indeed use nova client to create and associate the floating ip (lines 99-109), but does not check that the floating IP is active before trying to connect. The proposed patch is to replace the use of nova client with neutron client and add the necessary logic to ensure the FIP is active before trying to connect.

Revision history for this message

Ryan Moats (rmoats) wrote on 2015-08-13:

#17

There appears to be a common thread to all of these error cases: the FIP GW agent port is bound to an agent, but the agent does not ever request the status for the patch, nor does it attach it to a bridge, nor does it transition that port to ACTIVE. This in turn leads to the GARP failing (the failures show up in the L3 agent log) which means the FIP is not reachable, either by SSH or ping.

Declaring this to now be triaged

Changed in neutron:
status:	Confirmed → In Progress

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2015-08-17:

#18

it looks like this may no longer be triage-able.

Revision history for this message

Ryan Moats (rmoats) wrote on 2015-08-20:

#19

Yes, the two logstash queries above show only 1 hit in the last seven days. I believe this bug can be closed for now.

Revision history for this message

Kyle Mestery (mestery) wrote on 2015-08-21:

#20

Closing for now per Ryan's comments in #19.

Changed in neutron:
status:	In Progress → Invalid
importance:	Critical → High

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-11-20: Related fix merged to neutron (master)

#21

Reviewed: https://review.openstack.org/210378
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=30b121dfa4d8542665b2209d21ece805d8957aa5
Submitter: Jenkins
Branch: master

commit 30b121dfa4d8542665b2209d21ece805d8957aa5
Author: Oleg Bondarev <email address hidden>
Date: Fri Aug 7 15:44:29 2015 +0300

Ensure l3 agent receives notification about added router

    Currently router_added (and other) notifications are sent
    to agents with an RPC cast() method which does not ensure that
    the message is actually delivered to the recipient.
    If the message is lost (due to instability of messaging system
    in some failover scenarios for example) neither server nor agent
    will be aware of that and router will be "lost" till next agent
    resync. Resync will only happen in case of errors on agent side
    or restart.
    The fix makes server use call() to notify agents about added routers
    thus ensuring no routers will be lost.

This also unifies reschedule_router() method to avoid code duplication
between legacy and dvr agent schedulers.

    Closes-Bug: #1482630
    Related-Bug #1404743
    Change-Id: Id08764ba837d8f47a28649d081a5876797fe369e

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-12-03: Related fix proposed to neutron (stable/liberty)

#22

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/252852

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-12-09: Related fix merged to neutron (stable/liberty)

#23

Reviewed: https://review.openstack.org/252852
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1eb40b49f3da95e7960860da1033ee15932929ef
Submitter: Jenkins
Branch: stable/liberty

commit 1eb40b49f3da95e7960860da1033ee15932929ef
Author: Oleg Bondarev <email address hidden>
Date: Thu Dec 3 12:52:51 2015 +0300

Ensure l3 agent receives notification about added router

    Currently router_added (and other) notifications are sent
    to agents with an RPC cast() method which does not ensure that
    the message is actually delivered to the recipient.
    If the message is lost (due to instability of messaging system
    in some failover scenarios for example) neither server nor agent
    will be aware of that and router will be "lost" till next agent
    resync. Resync will only happen in case of errors on agent side
    or restart.
    The fix makes server use call() to notify agents about added routers
    thus ensuring no routers will be lost.
    This also unifies reschedule_router() method to avoid code duplication
    between legacy and dvr agent schedulers.

    Closes-Bug: #1482630
    Related-Bug #1404743
    Change-Id: Id08764ba837d8f47a28649d081a5876797fe369e
    (cherry picked from commit 30b121dfa4d8542665b2209d21ece805d8957aa5)

tags:

added: in-stable-liberty

neutron

sporadic test failures due to VMs not getting a DHCP lease

Bug Description

Other bug subscribers

Remote bug watches