Router interfaces report being in BUILD state - l3ha vrrp+LinuxBridge

Bug #1590845 reported by Miguel Alejandro Cantu
This bug affects 7 people
Affects   Status         Importance   Assigned to    Milestone
neutron   Fix Released   Medium       venkata anil

Bug Description

I'm running a Liberty environment with two network hosts using the L3HA VRRP driver.
I also have L2pop on and am using the ML2 LinuxBridge driver.

When we programmatically attach subnets and/or ports to routers (we attach one interface every 60 seconds), some interfaces report back stuck in the BUILD state. Take this interface, for example:
neutron port-show 98b55b89-a002-496f-a5d4-8de598613da8
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| Field | Value |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| admin_state_up | True |
| allowed_address_pairs | |
| binding:host_id | dn3usoskctl03_neutron_agents_container-e64e37d6 |
| binding:profile | {} |
| binding:vif_details | {"port_filter": true} |
| binding:vif_type | bridge |
| binding:vnic_type | normal |
| device_id | 5838c5de-e87a-4e5e-b61f-a3f068fa7726 |
| device_owner | network:router_interface |
| dns_assignment | {"hostname": "host-10-169-160-1", "ip_address": "10.169.160.1", "fqdn": "host-10-169-160-1.openstacklocal."} |
| dns_name | |
| extra_dhcp_opts | |
| fixed_ips | {"subnet_id": "bc3a8d37-6cd7-4d57-b0c9-2b35743b0a0b", "ip_address": "10.169.160.1"} |
| id | 98b55b89-a002-496f-a5d4-8de598613da8 |
| mac_address | fa:16:3e:b9:7a:1d |
| name | |
| network_id | 535c3336-202c-4dab-b517-2232c4ce1481 |
| security_groups | |
| status | BUILD |
| tenant_id | 3ccf712795c44edcbc8ffcc331a59853 |
+-----------------------+--------------------------------------------------------------------------------------------------------------+

It's reporting itself in the BUILD state, but when I check the router namespace, its Linux networking counterpart seems to be functioning just fine:

8: qr-98b55b89-a0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:b9:7a:1d brd ff:ff:ff:ff:ff:ff
    inet 10.169.160.1/23 scope global qr-98b55b89-a0
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:feb9:7a1d/64 scope link
       valid_lft forever preferred_lft forever

I can even ping the address with no problem once I open up the security group rules.

Note: The problem doesn't appear when L3HA is turned off. It appears only when L3HA with the VRRP keepalived driver is in use.

Where would be a good place to start debugging this?

Thanks!

tags: added: l3-ha
tags: added: linuxbridge
Assaf Muller (amuller)
tags: added: l2-pop
Miguel Angel Ajo (mangelajo) wrote :

Could somebody confirm if this is also happening on master?

Miguel Angel Ajo (mangelajo) wrote :

Miguel, did you find any relevant traces in the neutron-server(s) or linuxbridge-agent where the router becomes master?

Miguel Alejandro Cantu (miguel-cantu) wrote :

Miguel Angel, I do see some logs saying that a router's state is being transitioned to "master" or "backup". I've linked my neutron-l3-agent.log:
https://gist.github.com/alextricity25/c1066b36f8e4ee101fb30f0bc91a6b74

Aside from that...I don't see those messages anywhere else.

I've captured the logs while running a script (really using Ansible modules) that attaches subnets to a router every 60 seconds. Unfortunately, some interfaces are still reporting in the BUILD state after running it this morning.

neutron.log:
https://gist.github.com/alextricity25/4a04bde4615ff1dd2a39e4b151066e89
neutron-linuxbridge-agent.log:
https://gist.github.com/alextricity25/7b94147bf8132e05894600a1af51282d
neutron-l3-agent.log:
https://gist.github.com/alextricity25/c1066b36f8e4ee101fb30f0bc91a6b74
neutron-ha-tool.log:
https://gist.github.com/alextricity25/e4b85074342605cef909a23a1596e223

Bjoern (bjoern-t) wrote :

Per https://bugs.launchpad.net/neutron/+bug/1591386 I have the same issue without L2pop, hence I removed l2pop from the title here.
The problem persists in any case.

summary: - Router interfaces report being in BUILD state - l3ha
- vrrp+L2pop+LinuxBridge
+ Router interfaces report being in BUILD state - l3ha vrrp+LinuxBridge
Assaf Muller (amuller) wrote :

What is the impact of this bug? Are we talking solely about the state of the port as reported to users, or are there other implications?

Bjoern (bjoern-t) wrote :

AFAIK, it is currently a state issue; the interface seems to have been created just fine. But port states are used quite often to monitor agent health, so for us it is more than just a "display/state" issue.
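
A minimal sketch of the kind of health check this refers to, assuming the monitoring side already has the port list as plain dictionaries with the `id`, `device_owner` and `status` fields shown in the `neutron port-show` output above. The function name `stuck_router_ports` and the two extra sample ports are hypothetical, made up for illustration.

```python
# Toy monitoring helper; it works on plain dicts and does not call any Neutron API.
ROUTER_INTERFACE_OWNER = "network:router_interface"


def stuck_router_ports(ports):
    """Return the IDs of router interface ports whose status is BUILD."""
    return [
        port["id"]
        for port in ports
        if port.get("device_owner") == ROUTER_INTERFACE_OWNER
        and port.get("status") == "BUILD"
    ]


# The first entry mirrors the port from the bug description;
# the other two are hypothetical sample data.
ports = [
    {"id": "98b55b89-a002-496f-a5d4-8de598613da8",
     "device_owner": "network:router_interface", "status": "BUILD"},
    {"id": "hypothetical-port-2",
     "device_owner": "network:router_interface", "status": "ACTIVE"},
    {"id": "hypothetical-port-3",
     "device_owner": "network:dhcp", "status": "BUILD"},
]

print(stuck_router_ports(ports))
```

A check like this would flag the port from the description even though its interface passes traffic, which is why a purely cosmetic state problem still matters to operators.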

Miguel Alejandro Cantu (miguel-cantu) wrote :

Assaf,

Bjoern is right. Network traffic still passes through the interface, and as stated in the bug description, the Linux component counterpart is functional and ping-able. This seems to only be an issue with how the state is reported.

Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
Changed in neutron:
importance: Undecided → Medium
venkata anil (anil-venkata) wrote :
Changed in neutron:
assignee: venkata anil (anil-venkata) → Sindhu Devale (sindhu-devale-3)
James Denton (james-denton) wrote :

Just to add to this, we have seen that if the port is in the BUILD state and l2pop is enabled, the FDB and ARP tables will not be updated across nodes. This results in a lack of connectivity to/from the qr ports of the router. The fix in this case is to convert the router back to standalone.

Kevin Carter (kevin-carter) wrote :

+1 -- I just had to address this issue for just about all of the routers in the "cloud1.osic.org" environment. If there's anything that needs to be done, testing or otherwise, please let me know.

Changed in neutron:
assignee: Sindhu Devale (sindhu-devale-3) → venkata anil (anil-venkata)
status: New → In Progress
venkata anil (anil-venkata) wrote :

Another simple solution would be to call update_port_status from update_device_up for HA router ports as well, as is done for DVR, i.e.:

https://review.openstack.org/#/c/282874/11/neutron/plugins/ml2/rpc.py L213-L215

Changed in neutron:
assignee: venkata anil (anil-venkata) → Kevin Benton (kevinbenton)
Xiang Wang (wangxian) wrote :

We are also seeing this issue in stable/mitaka. As James Denton mentioned in comment #9, when a router interface is stuck in BUILD, the ARP and FDB tables do not get updated, which causes connectivity issues. We are able to reproduce the issue by failing over the node where the master router is hosted.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/346323
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2325e2aea86ddc28bc0e1573d4954518991cad19
Submitter: Jenkins
Branch: master

commit 2325e2aea86ddc28bc0e1573d4954518991cad19
Author: Kevin Benton <email address hidden>
Date: Sat Jul 23 00:07:17 2016 -0700

    Skip DHCP provisioning block for network ports

    Network ports created via internal core plugin calls
    (e.g. dhcp ports and router interfaces) don't generate
    DHCP notifications to the DHCP agent so the agent never
    clears the DHCP provisioning block. This patch just skips
    adding DHCP provisioning blocks for network owned ports
    since they don't depend on DHCP anyway.

    Closes-Bug: #1590845
    Closes-Bug: #1605955
    Change-Id: I0111de79d9259ada3b1c06a087d0eaeb8f3cb158
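
The rule described in this commit message can be sketched roughly as follows. This is a toy illustration, not the code from the patch: the function name `needs_dhcp_provisioning_block` and the constant `NETWORK_OWNER_PREFIX` are hypothetical, and the assumption is that "network owned" ports are those whose `device_owner` starts with `network:`, as in the `network:router_interface` port from the bug description.

```python
# Hypothetical constant: assumes network-owned ports share the "network:" prefix.
NETWORK_OWNER_PREFIX = "network:"


def needs_dhcp_provisioning_block(port):
    """Return True if a DHCP provisioning block should be added for this port.

    Network-owned ports (DHCP ports, router interfaces) are skipped because,
    per the commit message, nothing would ever clear the block for them.
    """
    owner = port.get("device_owner") or ""
    return not owner.startswith(NETWORK_OWNER_PREFIX)


print(needs_dhcp_provisioning_block({"device_owner": "network:router_interface"}))
print(needs_dhcp_provisioning_block({"device_owner": "compute:nova"}))
```

Under these assumptions the router interface port is skipped, while an instance port still gets a block and waits for the DHCP agent.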

Changed in neutron:
status: In Progress → Fix Released
Miguel Alejandro Cantu (miguel-cantu) wrote :

Will this fix be backported to mitaka and/or liberty?

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/346493
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d8952e2a6964f9d768ad19d07dca32d82468ad59
Submitter: Jenkins
Branch: master

commit d8952e2a6964f9d768ad19d07dca32d82468ad59
Author: Kevin Benton <email address hidden>
Date: Sat Jul 23 22:36:37 2016 -0700

    Add API tests for router and DHCP port status

    Add API tests that ensure DHCP ports and router interface ports
    become active.

    Router gateway ports were excluded because deployments using
    'external_network_bridge = br-ex' will always have their external
    interface in the DOWN state.

    Related-Bug: #1590845
    Related-Bug: #1605955
    Change-Id: I843f9217a3c401e8221c9dd42cbd4ea55dcd7a81

Matthew Thode (prometheanfire) wrote :

Here's another vote for a backport if we can...

Xiang Wang (wangxian) wrote :

We also think this defect should be changed to High importance and would like to see it backported to stable/mitaka. Please post any progress on this here.

Sangeetha Srikanth (ssrikant) wrote :

Please backport to stable/mitaka

tags: added: mitaka-backport-potential
Kevin Benton (kevinbenton) wrote :

Sorry, I just noticed the issues around stable/mitaka. The provisioning blocks were clearly not the only cause of this behavior because they didn't exist in Mitaka.

Changed in neutron:
status: Fix Released → Confirmed
assignee: Kevin Benton (kevinbenton) → nobody
Kevin Benton (kevinbenton) wrote :

@Venkata, provisioning blocks cannot be the cause of this. The reporter is on Liberty and people are experiencing this in Mitaka.

Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
Changed in neutron:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0b3

This issue was fixed in the openstack/neutron 9.0.0.0b3 development milestone.

Sangeetha Srikanth (ssrikant) wrote :

Can this be backported to the Mitaka release?

Brad Behle (behle) wrote :

Venkata, are you still working on this? There are fixes provided, but then a comment from Kevin Benton says that these fixes couldn't have completely fixed the problem.

venkata anil (anil-venkata) wrote :

Can someone try this patch: https://review.openstack.org/285773 ?

Sangeetha Srikanth (ssrikant) wrote :

We are seeing more occurrences of this issue. Any update?

Sangeetha Srikanth (ssrikant) wrote :

Can the priority of this bug be raised? We are seeing more of this issue in our production stack, and it affects the network connectivity of customer guest VMs.

venkata anil (anil-venkata) wrote :

This change is the fix for https://bugs.launchpad.net/neutron/+bug/1590845

For example, I have two hosts, host1 and host2, hosting an HA router. host1 is master and host2 is backup. Both hosts try to wire up the router interface port.
Consider the following sequence:
1) Host1 calls get_devices_details_list, setting the status to BUILD.
2) Host1 calls update_device_up, setting the status to ACTIVE.
3) Host2 then calls get_devices_details_list, which sets the status back to BUILD because (without this patch) we are not passing the host.
4) Host2 calls update_device_up, but the plugin won't change the status to ACTIVE because the port_bound_to_host check fails.
Hence the port remains in the BUILD state. To fix this bug for the linuxbridge agent, we have to pass the host like we do in OVS.
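
The sequence above can be replayed with a toy model. This is a sketch for illustration only, not Neutron code: the `ToyPlugin` class and the `wire_up` helper are hypothetical, and the model assumes that once the host is passed, the server leaves the status alone for a host the port is not bound to.

```python
class ToyPlugin:
    """Toy stand-in for the server side of the two RPC calls."""

    def __init__(self, bound_host):
        self.bound_host = bound_host
        self.status = "DOWN"

    def get_devices_details_list(self, host=None):
        # Assumption: with no host given, the server cannot tell who is
        # asking, so any caller moves the port to BUILD.
        if host is None or host == self.bound_host:
            self.status = "BUILD"

    def update_device_up(self, host):
        # Models the port_bound_to_host check: only the bound host
        # may move the port to ACTIVE.
        if host == self.bound_host:
            self.status = "ACTIVE"


def wire_up(plugin, hosts, pass_host):
    """Let each host wire up the port in turn and return the final status."""
    for host in hosts:
        plugin.get_devices_details_list(host if pass_host else None)
        plugin.update_device_up(host)
    return plugin.status


# Port bound to host1 (master); host2 (backup) wires it up afterwards.
print(wire_up(ToyPlugin("host1"), ["host1", "host2"], pass_host=False))  # prints BUILD
print(wire_up(ToyPlugin("host1"), ["host1", "host2"], pass_host=True))   # prints ACTIVE
```

In this model the outcome depends on ordering: if host2 happens to go first, the port still ends up ACTIVE even without the host, which is consistent with only some interfaces getting stuck.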

venkata anil (anil-venkata) wrote :

https://review.openstack.org/#/c/397062/ is the fix for this issue.

Kevin Benton (kevinbenton) wrote :

With that back-port, I think we can safely mark this as fix released.

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/285773
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/440341

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/440341
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1353927d12ae6d5970d435e68a73403030785b7e
Submitter: Jenkins
Branch: master

commit 1353927d12ae6d5970d435e68a73403030785b7e
Author: Kevin Benton <email address hidden>
Date: Thu Mar 2 02:11:42 2017 -0800

    Remove network port special-case in provisioning block

    Since the merge of I607635601caff0322fd0c80c9023f5c4f663ca25,
    DHCP agents now receive all port update/create events so we
    no longer need to special-case network ports in this function.

    Related-Bug: #1621345
    Related-Bug: #1605955
    Related-Bug: #1590845
    Change-Id: I4b1cfcfee7441e63370ff3e61f75c119b34cc0fd

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/456748

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/ocata)

Change abandoned by Brian Haley (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/456748

Dilip Renkila (dilip-renkila278) wrote :

Hi all, I am also experiencing the same on OpenStack Rocky (L3 VRRP + LinuxBridge). Router interfaces are stuck in the BUILD state. Has a fix been committed, or is debugging still ongoing?
