Potential regression on handing over DHCP addresses to VMs

Bug #1599329 reported by Armando Migliaccio
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Kevin Benton
Tags: l3-ipam-dhcp
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

As the traces show:

info: initramfs: up at 2.08
modprobe: module virtio_blk not found in modules.dep
modprobe: module virtio_net not found in modules.dep
info: copying initramfs to /dev/vda
info: initramfs loading root from /dev/vda
info: /etc/init.d/rc.sysinit: up at 4.50
info: container: none
Starting logging: OK
modprobe: module virtio_blk not found in modules.dep
modprobe: module virtio_net not found in modules.dep
WARN: /etc/rc3.d/S10-load-modules failed
Initializing random number generator... done.
Starting acpid: OK
cirros-ds 'local' up at 5.90
found datasource (configdrive, local)
Starting network...
udhcpc (v1.20.1) started
Sending discover...
Sending discover...

2016-07-05 22:37:40,487 28998 INFO [tempest.lib.common.rest_client] Request (TestNetworkBasicOps:_run_cleanups): 204 DELETE http://127.0.0.1:9696/v2.0/floatingips/fa755c01-9f6d-47cb-bfa5-27bba175624f 0.247s
2016-07-05 22:37:40,487 28998 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
        Body: None
    Response - Headers: {'x-openstack-request-id': 'req-947ba53e-e239-43f3-ace9-c068f224f8ee', 'content-length': '0', 'date': 'Tue, 05 Jul 2016 22:37:40 GMT', 'content-location': 'http://127.0.0.1:9696/v2.0/floatingips/fa755c01-9f6d-47cb-bfa5-27bba175624f', 'connection': 'close', 'status': '204'}
        Body:
2016-07-05 22:37:40,635 28998 INFO [tempest.lib.common.rest_client] Request (TestNetworkBasicOps:_run_cleanups): 204 DELETE http://127.0.0.1:8774/v2.1/servers/0747306e-8b5d-415e-8e70-faa19fe83c1c 0.146s
2016-07-05 22:37:40,635 28998 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
        Body: None
    Response - Headers: {'vary': 'X-OpenStack-Nova-API-Version', 'x-compute-request-id': 'req-48e30348-70c5-4c9e-8668-33768e43a89d', 'date': 'Tue, 05 Jul 2016 22:37:40 GMT', 'x-openstack-nova-api-version': '2.1', 'content-length': '0', 'connection': 'close', 'content-type': 'application/json', 'content-location': 'http://127.0.0.1:8774/v2.1/servers/0747306e-8b5d-415e-8e70-faa19fe83c1c', 'status': '204', 'openstack-api-version': 'compute 2.1'}
        Body:
2016-07-05 22:37:40,781 28998 INFO [tempest.lib.common.rest_client] Request (TestNetworkBasicOps:_run_cleanups): 200 GET http://127.0.0.1:8774/v2.1/servers/0747306e-8b5d-415e-8e70-faa19fe83c1c 0.145s
2016-07-05 22:37:40,781 28998 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
        Body: None
    Response - Headers: {'vary': 'X-Ope

No DHCP lease is received.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Kevin Benton (kevinbenton)
tags: added: l3-bgp
tags: added: l3-ipam-dhcp
removed: l3-bgp
Revision history for this message
Kevin Benton (kevinbenton) wrote :

The cause of this is that multiple ports are getting the same ofport assignment in OVS. This can be seen in the vswitchd log:

2016-07-05T22:35:33.365Z|01321|netdev_linux|INFO|ioctl(SIOCGIFHWADDR) on tap72cf6622-f4 device failed: No such device
2016-07-05T22:35:34.508Z|01322|bridge|INFO|bridge br-int: added interface tapc576ea66-4a on port 372
2016-07-05T22:35:38.347Z|01323|bridge|INFO|bridge br-int: added interface qvocec23074-06 on port 372
2016-07-05T22:35:38.355Z|01324|netdev_linux|INFO|ioctl(SIOCGIFHWADDR) on tapc576ea66-4a device failed: No such device
2016-07-05T22:35:41.467Z|01325|bridge|INFO|bridge br-int: added interface tapc576ea66-4a on port 372
2016-07-05T22:35:41.483Z|01326|bridge|INFO|bridge br-int: added interface tapf227de76-76 on port 378
2016-07-05T22:35:41.683Z|01327|netdev_linux|INFO|ioctl(SIOCGIFHWADDR) on tapf227de76-76 device failed: No such device
2016-07-05T22:35:42.608Z|01328|bridge|INFO|bridge br-int: added interface qvocec23074-06 on port 372
2016-07-05T22:35:42.621Z|01329|netdev_linux|INFO|ioctl(SIOCGIFHWADDR) on tapc576ea66-4a device failed: No such device
2016-07-05T22:35:43.171Z|01330|bridge|INFO|bridge br-int: added interface tapc576ea66-4a on port 372
2016-07-05T22:35:46.147Z|01331|bridge|INFO|bridge br-int: added interface qvocec23074-06 on port 372

My suspicion is that this is a direct result of us ignoring the guideline, outlined in http://lists.openstack.org/pipermail/openstack-dev/2015-February/056834.html and https://bugzilla.redhat.com/show_bug.cgi?id=1160340, not to use OVS internal devices that get moved into network namespaces.
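
To make the duplicate assignment easier to spot than grepping the vswitchd log, here is a minimal sketch (not part of Neutron or the proposed fix; the bridge name and required privileges are assumptions) that lists the interfaces on br-int and flags any ofport number shared by more than one of them:

#!/usr/bin/env python
# Hedged diagnostic sketch: flag duplicate ofport numbers on one bridge.
# Assumes ovs-vsctl is on PATH and the script can reach the local OVSDB.
import collections
import subprocess

BRIDGE = 'br-int'  # assumption: the integration bridge from the log above

def iface_ofports(bridge):
    names = subprocess.check_output(
        ['ovs-vsctl', 'list-ports', bridge]).decode().split()
    return {name: subprocess.check_output(
                ['ovs-vsctl', 'get', 'Interface', name, 'ofport']
            ).decode().strip()
            for name in names}

def duplicates(mapping):
    by_ofport = collections.defaultdict(list)
    for name, ofport in mapping.items():
        by_ofport[ofport].append(name)
    # '-1' or '[]' just mean "not assigned yet", not a real clash
    return {o: n for o, n in by_ofport.items()
            if len(n) > 1 and o not in ('-1', '[]')}

if __name__ == '__main__':
    for ofport, names in sorted(duplicates(iface_ofports(BRIDGE)).items()):
        print('ofport %s shared by: %s' % (ofport, ', '.join(names)))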

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/344545

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/344731

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

It looks like a bug that may be related to the lack of netns support for internal ports in OVS. If that's the case, an alternative could be moving to veths, although that is not desirable (especially for router ports).

With a single-host, CPU-pegged test run [1]:

I've seen that the overhead of using a veth is around +10% in CPU consumption. Please note that the above test passes traffic through two veth pairs (so it's 110% * 110% = 121%, i.e. ~21% more CPU, or a ~21% decrease in bandwidth: 20.5 Gbps down to 16 Gbps for TCP_STREAM, or 93k conn/sec TCP_RR to 93k conn/sec TCP_RR).
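
For clarity, the compounding above in one tiny worked example (the throughput figures are simply the ones quoted in the previous paragraph):

# Worked arithmetic for the two-veth-pair path described above.
per_hop_cpu = 1.10                # ~+10% CPU cost per veth pair
total_cpu = per_hop_cpu ** 2      # two pairs in the path -> 1.21, i.e. ~21% more CPU
print('CPU cost factor: %.2f' % total_cpu)

# Measured effect on the CPU-bound TCP_STREAM run quoted above:
print('bandwidth ratio: %.2f' % (16.0 / 20.5))   # ~0.78, matching the ~21% decrease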

I've talked to some OVS developers and they're willing to fix this (even if it is related to netns usage on internal ports, which is not supported), so I'm continuing kevinbenton's work [2] and trying to put together a bash reproducer.

[1] https://github.com/mangelajo/ovs-experiments/blob/master/veth_vs_intport/veth_vs_intport_speed.sh

[2] https://gist.github.com/kevinbenton/e8a323b61ac6145f9293ef90dbdec0b1

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/344731

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

OK, I spent most of the day looking at how things work in Open vSwitch under the hood.

It seems that the kernel datapath assigns a port number of its own, and then userspace assigns another number (the ofport) and maps it to the first one [1].

In the middle of that, it tries to fetch the netdev details. I suspect this is again netns related: if the device has been moved into a netns, that mechanism fails (it can't find the device and finish the mapping [1]). We had code to avoid this by fetching the ofport number with retries [2]; maybe the ovsdb interface switch broke it.

Btw, the second competing port (qvo) is a veth and not in a namespace, so the whole picture still isn't clear to me.

[1] https://github.com/openvswitch/ovs/blob/branch-2.0/ofproto/ofproto.c#L1714

[2] https://github.com/openstack/neutron/blob/master/neutron/agent/common/ovs_lib.py#L230
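
As a rough illustration of the retry approach behind [2] (a simplified sketch, not the actual ovs_lib code; the function name and timings are made up), the idea is to poll OVSDB until the ofport becomes a valid positive number, rather than trusting a value read while vswitchd is still finishing the kernel-to-userspace mapping:

# Simplified, illustrative sketch of an ofport fetch with retries.
import subprocess
import time

def get_ofport_with_retries(iface, retries=10, interval=0.5):
    for _ in range(retries):
        out = subprocess.check_output(
            ['ovs-vsctl', 'get', 'Interface', iface, 'ofport']).decode().strip()
        if out not in ('[]', '-1'):   # '[]' / '-1' mean "not assigned yet"
            return int(out)
        time.sleep(interval)
    raise RuntimeError('ofport for %s never became valid' % iface)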

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Terry Wilson (otherwiseguy)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Ajo: are you thinking that the switch to ovsdb native might have induced the fault?

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Failures as far as the eye can see:

http://paste.openstack.org/show/539671/

Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

I observe a similar log on devstack while running the dsvm-functional job:

tox -e dsvm-functional neutron.tests.functional.agent.test_l2_ovs_agent.TestOVSAgent.test_ancillary_port_creation_and_deletion

results in multiple same-port / "No such device" errors in vswitchd.log:

2016-07-26T13:33:15.071Z|00066|bridge|INFO|bridge br-intf218d1f17: added interface int-br-phd56447 on port 1
2016-07-26T13:33:15.134Z|00067|bridge|INFO|bridge br-phys6c5ab5cf: added interface phy-br-phd56447 on port 1
2016-07-26T13:33:15.409Z|00072|bridge|INFO|bridge br-intf218d1f17: added interface tf218d1f17-patch-tun on port 2
2016-07-26T13:33:15.457Z|00073|bridge|INFO|bridge br-tunf805184a6: added interface nf805184a6-patch-int on port 1
2016-07-26T13:33:17.039Z|00074|bridge|INFO|bridge test-br81758532: added interface tap25323a581e8 on port 1
2016-07-26T13:33:17.207Z|00075|netdev_linux|INFO|ioctl(SIOCGIFHWADDR) on tap25323a581e8 device failed: No such device
2016-07-26T13:33:17.613Z|00076|bridge|INFO|bridge test-br81758532: added interface tap74091c71013 on port 2

Full log here - http://paste.openstack.org/show/542112/

Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

My bad, the ports are actually different across the bridges. It seems my observation is not relevant, sorry.

Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

@armando-migliaccio, I analysed Kevin's list of test failures, and all of them run the openvswitch agent with ovsdb_interface = vsctl.

See for yourself (not all of the test logs are still available, but still): http://paste.openstack.org/show/542548/

Changed in neutron:
assignee: Terry Wilson (otherwiseguy) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/344545
Reason: abandoning for now assuming Terry's fix addressed it. Will watch for gate failures to confirm.

Revision history for this message
Brian Haley (brian-haley) wrote :

I've seen this in the multinode DVR jobs lately (VM not getting a DHCP response). Take a look at the logs from the initial report; it's a similar issue:

http://logs.openstack.org/92/320092/43/check/gate-tempest-dsvm-neutron-dvr/cf32111/logs/screen-q-dhcp.txt.gz?#_2016-07-05_22_39_17_220

RuntimeError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: Network is unreachable

This is another occurrence I saw just yesterday:

http://logs.openstack.org/51/337851/19/check/gate-tempest-dsvm-neutron-dvr-multinode-full/c944b3d/logs/screen-q-dhcp.txt.gz

So maybe there's a case where the IP isn't getting added to the DHCP namespace properly, so trying to add a route via it fails? I'm assuming the interface itself is already there.
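
A quick way to check that hypothesis on an affected node is sketched below (hedged; the namespace name and gateway are placeholders, not values from these logs). If no address covering the gateway's subnet is configured inside the qdhcp namespace yet, the route add fails with exactly this "RTNETLINK answers: Network is unreachable" error, because the gateway is not on-link:

# Hedged diagnostic sketch; NETNS and GATEWAY are illustrative placeholders.
import subprocess

NETNS = 'qdhcp-00000000-0000-0000-0000-000000000000'
GATEWAY = '10.100.0.1'

def netns_exec(*cmd):
    return subprocess.check_output(('ip', 'netns', 'exec', NETNS) + cmd).decode()

# 1) Which IPv4 addresses are actually configured inside the namespace?
print(netns_exec('ip', '-o', '-4', 'addr', 'show'))

# 2) The kind of route add the agent attempts; it can only succeed once
#    step 1 shows an on-link address covering GATEWAY.
rc = subprocess.call(['ip', 'netns', 'exec', NETNS,
                      'ip', 'route', 'replace', 'default', 'via', GATEWAY])
print('route add exit code: %d' % rc)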

Miguel Lavalle (minsel)
Changed in neutron:
assignee: nobody → Miguel Lavalle (minsel)
Revision history for this message
Miguel Lavalle (minsel) wrote :

I can still see 28 occurrences of this bug over the past 7 days. Using the following query:

message:"AssertionError: False is not true : Timed out waiting for" AND message:"in check_vm_connectivity" AND build_queue:"check" AND project:"openstack/neutron"

I traced this failure: http://paste.openstack.org/show/567665/

The network id is aef3be78-6d90-46cf-b1ea-b27e947eebc3

I couldn't find errors related to this network in the DHCP agent screen log:

http://logs.openstack.org/91/359591/2/check/gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial/0e1f116/logs/screen-q-dhcp.txt.gz#_2016-09-07_10_38_07_190

Will keep digging into it.

Revision history for this message
Miguel Lavalle (minsel) wrote :

We have a condition where the data plane is not ready when Neutron's control plane and the instance deem the port ready. As a consequence, the instance's DHCPDISCOVER messages are never received by the corresponding dnsmasq instance, because the data plane is not ready yet. However, at the end, dnsmasq receives a DHCPRELEASE for an unknown lease from the port's instance, when presumably the data plane is ready.

Out of 35 occurrences over the past 7 days, 34 are with gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial. The other one is with gate-grenade-dsvm-neutron-dvr-multinode.

This is an example that I traced all the way (patchset 367191):

1) Test case that failed:

tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops

2) Tenant network tempest-network-smoke--949101106, id 51b4c899-f196-4561-839b-773303f73fd0 is created:
http://logs.openstack.org/91/367191/2/check/gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial/7fd0e33/logs/tempest.txt.gz#_2016-09-08_14_14_10_998

3) DHCP agent creates and reloads allocations including instance's port. There are no DHCP agent failures related to this network in the log file:
http://logs.openstack.org/91/367191/2/check/gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial/7fd0e33/logs/screen-q-dhcp.txt.gz#_2016-09-08_14_14_25_297

4) VM gets address:
http://logs.openstack.org/91/367191/2/check/gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial/7fd0e33/logs/tempest.txt.gz#_2016-09-08_14_14_33_449

5) Test failure:
http://logs.openstack.org/91/367191/2/check/gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial/7fd0e33/logs/tempest.txt.gz#_2016-09-08_14_16_38_313

The instance doesn't get a response to its DHCPDISCOVER:

udhcpc (v1.20.1) started
Sending discover...
Sending discover...
Sending discover...

6) The dnsmasq instance never receives a DHCPDISCOVER from the instance. However, it receives a DHCPRELEASE for an unknown lease from the instance:

http://logs.openstack.org/91/367191/2/check/gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial/7fd0e33/logs/syslog.txt.gz#_Sep_08_14_16_41

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Miguel: As of today the port is marked active only when both L2 and DHCP are set up properly. Your comment baffles me a bit. Digging...

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Interestingly there is a kernel iptables drop message that mentions the instance's MAC on the node with the dhcp server. Could be related. It's the only drop on br-ex in the whole log.

http://logs.openstack.org/91/367191/2/check/gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial/7fd0e33/logs/syslog.txt.gz#_Sep_08_14_15_35

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Actually, br-ex shouldn't even see this broadcast. It should be encapsulated in a VXLAN tunnel to br-int, since this is a tenant network.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

I believe this is because the VNI being used by infra is the same as the one allocated to the tenant network. So when the network node receives the packet, it demultiplexes it out of the VNI onto the incorrect bridge, which is why we are seeing the br-ex drops. A patch to infra will confirm.
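
If that theory is right, the collision should be visible simply by comparing the VNI the CI's overlay uses with Neutron's tenant VNI ranges. A hedged sketch follows (the config path and the infra VNI value are assumptions for illustration; vni_ranges is the standard ml2_type_vxlan option):

# Hedged check: does a given infra overlay VNI fall inside Neutron's vni_ranges?
import configparser

ML2_CONF = '/etc/neutron/plugins/ml2/ml2_conf.ini'  # typical devstack location
INFRA_VNI = 1002                                    # hypothetical infra overlay VNI

cfg = configparser.ConfigParser()
cfg.read(ML2_CONF)
ranges = cfg.get('ml2_type_vxlan', 'vni_ranges', fallback='')

for chunk in filter(None, (r.strip() for r in ranges.split(','))):
    lo, hi = (int(x) for x in chunk.split(':'))
    if lo <= INFRA_VNI <= hi:
        print('Collision: infra VNI %d sits inside tenant range %s'
              % (INFRA_VNI, chunk))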

Changed in neutron:
assignee: Miguel Lavalle (minsel) → Kevin Benton (kevinbenton)
Revision history for this message
Miguel Lavalle (minsel) wrote :

The fix (https://review.openstack.org/#/c/367716/) merged 2 days ago. Using this query, I don't see any occurrences since it merged:

message:"AssertionError: False is not true : Timed out waiting for" AND message:"in check_vm_connectivity" AND project:"openstack/neutron"

Will keep an eye on it for a few more days

Miguel Lavalle (minsel)
Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Miguel Lavalle (minsel) wrote :

Nope... occurrences went away for almost a month and have been showing up again since 10/3. Will take a look.

Changed in neutron:
status: Fix Released → In Progress
Revision history for this message
Kevin Benton (kevinbenton) wrote :

I looked into all of the failures and they are either caused by the patch they appeared on (which broke lots of other things as well), or they occurred on the linux bridge job. There may be a separate bug for linux bridge, but it's definitely not the same issue as this one. We can't keep re-opening this bug every time there is a connectivity issue.

Changed in neutron:
status: In Progress → Fix Released