DHCP connectivity after migration/resize not working

Bug #1850557 reported by Slawek Kaplonski on 2019-10-29
This bug affects 1 person
Affects: neutron | Importance: High | Assigned to: Slawek Kaplonski
Slawek Kaplonski (slaweq) wrote :

I was checking this issue today using the example from https://storage.gra1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_3ae/682483/1/check/tempest-slow-py3/3ae72c6/testr_results.html.gz

Instance id was "c706aa83-8e19-40be-8162-e33e847e3468".

The instance was first spawned on the "controller" node and its IP was configured properly there.
Then the instance was resized and, because of that, migrated to the "compute1" node.
After that the instance didn't get an IP address from DHCP.
The migration finished at 17:45:18.715284:

Oct 26 17:45:18.715284 ubuntu-bionic-rax-iad-0012515704 nova-compute[19743]: DEBUG nova.virt.libvirt.driver [None req-84dcdf58-d2fd-4eff-957e-0f27b22e328b tempest-TestNetworkAdvancedServerOps-947101307 tempest-TestNetworkAdvancedServerOps-947101307] [instance: c706aa83-8e19-40be-8162-e33e847e3468] finish_migration finished successfully. {{(pid=19743) finish_migration /opt/stack/nova/nova/virt/libvirt/driver.py:9939}}

Then the port of this instance was configured on the compute node by neutron-ovs-agent, and the port update was also noticed by the DHCP agent. This happened at 17:45:17.057947:

Oct 26 17:45:17.057947 ubuntu-bionic-rax-iad-0012515703 neutron-dhcp-agent[11839]: INFO neutron.agent.dhcp.agent [-] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=ubuntu-bionic-rax-iad-0012515704, binding:profile=, binding:vif_details=bridge_name=br-int, connectivity=l2, datapath_type=system, ovs_hybrid_plug=False, port_filter=True, binding:vif_type=ovs, binding:vnic_type=normal, created_at=2019-10-26T17:44:10Z, description=, device_id=c706aa83-8e19-40be-8162-e33e847e3468, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{'subnet_id': '06b556d1-b47e-4c44-b8f9-3c9b50e092c5', 'ip_address': '10.1.0.12'}], id=e630dfc6-1d61-4451-9b93-f44d91f5be9f, mac_address=fa:16:3e:a0:fd:b9, name=, network_id=edf3abeb-95cc-4b3b-93ac-dbc9c5c068f4, port_security_enabled=True, project_id=7f5516f5c8b84e27b8b50cca898dbdfd, qos_policy_id=None, resource_request=None, revision_number=7, security_groups=['f833fa53-cd9d-4e37-8bc3-e8229f314af5'], status=DOWN, tags=[], tenant_id=7f5516f5c8b84e27b8b50cca898dbdfd, updated_at=2019-10-26T17:45:16Z

As you can see above, the port binding has changed to the node named "compute1" in this test job.

After this migration, the instance was started and tried to configure its network through DHCP. In the DHCP agent's logs we can see:

Oct 26 17:45:36.692901 ubuntu-bionic-rax-iad-0012515703 dnsmasq-dhcp[13765]: DHCPDISCOVER(tapd87044de-e1) fa:16:3e:a0:fd:b9
Oct 26 17:45:36.692928 ubuntu-bionic-rax-iad-0012515703 dnsmasq-dhcp[13765]: DHCPOFFER(tapd87044de-e1) 10.1.0.12 fa:16:3e:a0:fd:b9
...
Oct 26 17:46:36.775236 ubuntu-bionic-rax-iad-0012515703 dnsmasq-dhcp[13765]: DHCPDISCOVER(tapd87044de-e1) fa:16:3e:a0:fd:b9
Oct 26 17:46:36.775268 ubuntu-bionic-rax-iad-0012515703 dnsmasq-dhcp[13765]: DHCPOFFER(tapd87044de-e1) 10.1.0.12 fa:16:3e:a0:fd:b9

So DHCP requests from the instance reached the DHCP server, and a lease with the correct IP address was offered to the VM.
The problem is that we can't tell whether those DHCPOFFER packets reached the "compute1" node or were dropped somewhere between br-tun and br-int.
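One way to confirm delivery would be to capture traffic on compute1 and filter for replies addressed to the VM's MAC; a minimal sketch that scans tcpdump-style lines (the capture lines below are illustrative samples, not real job output):

```python
VM_MAC = 'fa:16:3e:a0:fd:b9'  # the instance's MAC from the logs above

# Illustrative `tcpdump -e` style lines; real input would come from a
# capture on the relevant br-int/tap interface on compute1.
capture = [
    'fa:16:3e:11:22:33 > ff:ff:ff:ff:ff:ff, ... BOOTP/DHCP, Request',
    'fa:16:3e:aa:bb:cc > fa:16:3e:a0:fd:b9, ... BOOTP/DHCP, Reply',
]

# A DHCPOFFER reached the node only if a DHCP reply is addressed to VM_MAC.
offers = [line for line in capture if f'> {VM_MAC}' in line and 'Reply' in line]
print(bool(offers))
```

If `offers` is empty while dnsmasq on the network node logs DHCPOFFER, the packets are being dropped somewhere in between.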

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
Changed in neutron:
importance: Medium → High
Slawek Kaplonski (slaweq) wrote :

I was able to reproduce this issue locally.
After some debugging, here is what I found:

* the VM used in the test has MAC address fa:16:3e:e2:4e:06
* the VM was migrated from the allinone node to the compute node
* the allinone node also hosts the DHCP server for the network used by this VM
* after the migration, the following OF rule was left in br-int on the allinone node:

     cookie=0x90daec9e07fd1af6, duration=1858.683s, table=73, n_packets=240, n_bytes=27852, priority=100,reg6=0x26,dl_dst=fa:16:3e:e2:4e:06 actions=load:0x96->NXM_NX_REG5[],resubmit(,81)

This rule caused problems because packets with destination MAC fa:16:3e:e2:4e:06 were not pushed to br-tun and on to the other node.
After removing the rule manually, the VM had DHCP connectivity again.

Now I need to find out where such a rule should be created and when it should be removed.
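A leftover rule like the one above can be spotted programmatically by scanning `ovs-ofctl dump-flows br-int` output for table=73 entries whose dl_dst belongs to a VM that no longer runs on the node. A minimal sketch (the flow text is the one quoted above; the helper is hypothetical, for illustration only):

```python
import re

# One line of `ovs-ofctl dump-flows br-int` output, as quoted above.
FLOW = ("cookie=0x90daec9e07fd1af6, duration=1858.683s, table=73, n_packets=240, "
        "n_bytes=27852, priority=100,reg6=0x26,dl_dst=fa:16:3e:e2:4e:06 "
        "actions=load:0x96->NXM_NX_REG5[],resubmit(,81)")

def stale_dl_dst_macs(flows, migrated_macs):
    """Return MACs of table=73 flows whose dl_dst matches a VM that has
    been migrated away from this node (hypothetical helper)."""
    stale = []
    for flow in flows:
        table = re.search(r"table=(\d+)", flow)
        dl_dst = re.search(r"dl_dst=([0-9a-f:]{17})", flow)
        if table and dl_dst and int(table.group(1)) == 73 \
                and dl_dst.group(1) in migrated_macs:
            stale.append(dl_dst.group(1))
    return stale

print(stale_dl_dst_macs([FLOW], {"fa:16:3e:e2:4e:06"}))
# a match means a leftover local-delivery rule for a migrated VM
```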

Slawek Kaplonski (slaweq) wrote :

I did a detailed analysis of one failed test in https://b4ee658ece9317c00235-25614151543ec2e702e81ba3282ddc61.ssl.cf5.rackcdn.com/695479/3/check/tempest-slow-py3/2060c35/testr_results.html.gz

VM ID: b99494b0-cfbe-4ced-bf0f-040759e6fdf5
Port id: 7ae157c7-0f96-467c-8707-00d30f65a9a4

The VM was first spawned on the controller node and then unshelved on compute1.

Here is what happened on the controller node while the VM's interface was being removed:

The ovsdb monitor noticed that the port was removed:
Nov 25 16:51:57.914160 ubuntu-bionic-rax-dfw-0013031003 neutron-openvswitch-agent[11632]: DEBUG neutron.agent.common.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["6493a059-28c4-4d1b-8d6f-80c93682f043","delete","tap7ae157c7-0f",-1,["map",[["attached-mac","fa:16:3e:92:8b:ff"],["iface-id","7ae157c7-0f96-467c-8707-00d30f65a9a4"],["iface-status","active"],["vm-id","b99494b0-cfbe-4ced-bf0f-040759e6fdf5"]]]]],"headings":["row","action","name","ofport","external_ids"]} {{(pid=11632) _read_stdout /opt/stack/neutron/neutron/agent/common/async_process.py:262}}

But at that time an rpc_loop iteration was already running in which this port was treated as updated:

Nov 25 16:51:57.886502 ubuntu-bionic-rax-dfw-0013031003 neutron-openvswitch-agent[11632]: DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-dffe9755-afe2-4a99-8126-3816edb50b2b None None] Starting to process devices in:{'added': set(), 'removed': set(), 'current': {'7ae157c7-0f96-467c-8707-00d30f65a9a4', 'd25f6c96-c47d-4337-9bb8-4742749ae807', '52b47a4a-18fe-45d4-811d-d75b8190c08f', '3f98c244-3e31-479b-a31e-303d9f944f09', 'dd303d88-c2c0-4141-8e11-b395d78fd23d', 'ad668826-f453-4798-b672-5452bce056c3', 'de189df6-6ad5-44ad-a4d7-885b6ce37390', '1ca74f7a-67d6-4eee-be50-237ab4d32e53', 'c667b8dd-2d9f-4fd5-99cd-54b36d743d09'}, 'updated': {'7ae157c7-0f96-467c-8707-00d30f65a9a4'}} {{(pid=11632) rpc_loop /opt/stack/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2438}}

Unfortunately, as the port had already been removed from the OVS bridge, its ofport was -1, so it wasn't processed properly by the OVS firewall driver:
Nov 25 16:51:57.905049 ubuntu-bionic-rax-dfw-0013031003 neutron-openvswitch-agent[11632]: INFO neutron.agent.linux.openvswitch_firewall.firewall [None req-dffe9755-afe2-4a99-8126-3816edb50b2b None None] port 7ae157c7-0f96-467c-8707-00d30f65a9a4 does not exist in ovsdb: Port 7ae157c7-0f96-467c-8707-00d30f65a9a4 is not managed by this agent..

It was then added to the skipped ports:
Nov 25 16:51:57.910390 ubuntu-bionic-rax-dfw-0013031003 neutron-openvswitch-agent[11632]: INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-dffe9755-afe2-4a99-8126-3816edb50b2b None None] Ports {'7ae157c7-0f96-467c-8707-00d30f65a9a4'} skipped, changing status to down

In this iteration the port was among the skipped devices, so it was removed from port_info['current']. Because of that, in the next iteration it wasn't treated as deleted, even though the ovsdb-monitor event about its removal had been received.
That led to the firewall rules for this port not being cleaned from br-int, and traffic which has dest_...
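The bookkeeping race described above can be modelled with plain sets; this is a simplified sketch of the agent's per-iteration logic, not the actual neutron code:

```python
# Simplified model of the ovs agent's port bookkeeping across two
# rpc_loop iterations. In iteration N the port is "updated" but its
# ofport is already -1, so it lands in the skipped set and is dropped
# from port_info['current'].

PORT = '7ae157c7-0f96-467c-8707-00d30f65a9a4'
registered = {PORT}                 # previous iteration's 'current'

# Iteration N: port update arrives, but the ovsdb lookup fails -> skipped.
skipped = {PORT}
current = registered - skipped      # skipped ports removed from 'current'

# Iteration N+1: ovsdb-monitor has reported the deletion; "deleted" is
# computed as the previous 'current' minus what is on the bridge now.
on_bridge = set()
deleted = current - on_bridge       # PORT is not in 'current' -> never deleted

print(PORT in deleted)  # False: firewall-rule cleanup is never triggered
```

Because the port never appears in the deleted set, the cleanup path that would remove its br-int flows is never taken, leaving the stale table=73 rule behind.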


Fix proposed to branch: master
Review: https://review.opendev.org/696794

Changed in neutron:
status: Confirmed → In Progress

Reviewed: https://review.opendev.org/696794
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b01e0c2aa98866df9c25e20a66c02fccccdc7885
Submitter: Zuul
Branch: master

commit b01e0c2aa98866df9c25e20a66c02fccccdc7885
Author: Slawek Kaplonski <email address hidden>
Date: Wed Nov 27 10:44:19 2019 +0100

    [OVS FW] Clean port rules if port not found in ovsdb

    During e.g. migration or shelve of a VM it may happen that
    a port update event is sent to the OVS agent and, at almost
    the same time, the port is removed from br-int.
    In such a case, during the update_port_filter method the
    openvswitch firewall driver will not find the port in br-int
    and will do nothing with it. That leaves leftover rules for
    this port in br-int.

    So this patch calls the remove_port_filter() method if the
    port was not found in br-int, to make sure that no leftovers
    from the port remain in br-int.

    Change-Id: I06036ce5fe15d91aa440dc340a70dd27ae078c53
    Closes-Bug: #1850557
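The shape of the fix can be sketched as follows; the class and exception names here are illustrative stand-ins, not the exact neutron symbols:

```python
class PortNotFound(Exception):
    """Raised when a port cannot be found on br-int (illustrative)."""

class FirewallDriver:
    """Toy stand-in for the OVS firewall driver, showing the fix pattern."""

    def __init__(self):
        self.rules = {}                      # port_id -> installed rules

    def _lookup_port(self, port_id, on_bridge):
        if port_id not in on_bridge:
            raise PortNotFound(port_id)
        return port_id

    def remove_port_filter(self, port_id):
        self.rules.pop(port_id, None)        # drop any leftover rules

    def update_port_filter(self, port_id, on_bridge):
        try:
            self._lookup_port(port_id, on_bridge)
        except PortNotFound:
            # The fix: if the port is already gone from br-int, clean up
            # its rules instead of silently doing nothing.
            self.remove_port_filter(port_id)
            return
        self.rules[port_id] = ['install rules here']

fw = FirewallDriver()
fw.rules['p1'] = ['stale rule']              # leftover from before migration
fw.update_port_filter('p1', on_bridge=set())  # port no longer on the bridge
print(fw.rules)  # {}: no leftover rules
```

Before the patch, the `except` branch effectively returned without cleanup, which is exactly what left the stale table=73 rule in br-int.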

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/697021
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ef689b284c079fba497723fb779603aa99d8afff
Submitter: Zuul
Branch: stable/queens

commit ef689b284c079fba497723fb779603aa99d8afff
Author: Slawek Kaplonski <email address hidden>
Date: Wed Nov 27 10:44:19 2019 +0100

    [OVS FW] Clean port rules if port not found in ovsdb

    During e.g. migration or shelve of a VM it may happen that
    a port update event is sent to the OVS agent and, at almost
    the same time, the port is removed from br-int.
    In such a case, during the update_port_filter method the
    openvswitch firewall driver will not find the port in br-int
    and will do nothing with it. That leaves leftover rules for
    this port in br-int.

    So this patch calls the remove_port_filter() method if the
    port was not found in br-int, to make sure that no leftovers
    from the port remain in br-int.

    Conflicts:
        neutron/agent/linux/openvswitch_firewall/firewall.py

    Change-Id: I06036ce5fe15d91aa440dc340a70dd27ae078c53
    Closes-Bug: #1850557
    (cherry picked from commit b01e0c2aa98866df9c25e20a66c02fccccdc7885)

tags: added: in-stable-queens
tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/697020
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=50a02ebc0699a0a4906fc10d15924cd059eab057
Submitter: Zuul
Branch: stable/rocky

commit 50a02ebc0699a0a4906fc10d15924cd059eab057
Author: Slawek Kaplonski <email address hidden>
Date: Wed Nov 27 10:44:19 2019 +0100

    [OVS FW] Clean port rules if port not found in ovsdb

    During e.g. migration or shelve of a VM it may happen that
    a port update event is sent to the OVS agent and, at almost
    the same time, the port is removed from br-int.
    In such a case, during the update_port_filter method the
    openvswitch firewall driver will not find the port in br-int
    and will do nothing with it. That leaves leftover rules for
    this port in br-int.

    So this patch calls the remove_port_filter() method if the
    port was not found in br-int, to make sure that no leftovers
    from the port remain in br-int.

    Conflicts:
        neutron/agent/linux/openvswitch_firewall/firewall.py

    Change-Id: I06036ce5fe15d91aa440dc340a70dd27ae078c53
    Closes-Bug: #1850557
    (cherry picked from commit b01e0c2aa98866df9c25e20a66c02fccccdc7885)

Reviewed: https://review.opendev.org/697018
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8958395145a13c7de4c9db9febb6bb73ec26b1d6
Submitter: Zuul
Branch: stable/stein

commit 8958395145a13c7de4c9db9febb6bb73ec26b1d6
Author: Slawek Kaplonski <email address hidden>
Date: Wed Nov 27 10:44:19 2019 +0100

    [OVS FW] Clean port rules if port not found in ovsdb

    During e.g. migration or shelve of a VM it may happen that
    a port update event is sent to the OVS agent and, at almost
    the same time, the port is removed from br-int.
    In such a case, during the update_port_filter method the
    openvswitch firewall driver will not find the port in br-int
    and will do nothing with it. That leaves leftover rules for
    this port in br-int.

    So this patch calls the remove_port_filter() method if the
    port was not found in br-int, to make sure that no leftovers
    from the port remain in br-int.

    Change-Id: I06036ce5fe15d91aa440dc340a70dd27ae078c53
    Closes-Bug: #1850557
    (cherry picked from commit b01e0c2aa98866df9c25e20a66c02fccccdc7885)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/697017
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cf59430df5f7ee3fc4e7a0eb409f534d9d6eeffb
Submitter: Zuul
Branch: stable/train

commit cf59430df5f7ee3fc4e7a0eb409f534d9d6eeffb
Author: Slawek Kaplonski <email address hidden>
Date: Wed Nov 27 10:44:19 2019 +0100

    [OVS FW] Clean port rules if port not found in ovsdb

    During e.g. migration or shelve of a VM it may happen that
    a port update event is sent to the OVS agent and, at almost
    the same time, the port is removed from br-int.
    In such a case, during the update_port_filter method the
    openvswitch firewall driver will not find the port in br-int
    and will do nothing with it. That leaves leftover rules for
    this port in br-int.

    So this patch calls the remove_port_filter() method if the
    port was not found in br-int, to make sure that no leftovers
    from the port remain in br-int.

    Change-Id: I06036ce5fe15d91aa440dc340a70dd27ae078c53
    Closes-Bug: #1850557
    (cherry picked from commit b01e0c2aa98866df9c25e20a66c02fccccdc7885)

tags: added: in-stable-train

This issue was fixed in the openstack/neutron 13.0.6 release.

This issue was fixed in the openstack/neutron 14.0.4 release.

This issue was fixed in the openstack/neutron 15.0.1 release.
