Comment 9 for bug 1523031

Miguel Alejandro Cantu (miguel-cantu) wrote :

I think I'm running into this problem in my hyper-converged environment with two network nodes. This is ultimately affecting the network architecture in our production environment (we see the same behavior there). We keep seeing multiple router interfaces getting stuck in the "BUILD" state. I'm not sure if it's because l2pop is not sending that FDB entry, but right now I'm not spinning up any instances, just making networks and attaching them to routers. Is there a way I could check whether this is caused by a missing FDB entry right now? In the meantime, I think it's safe to say that my problem is related to this bug.
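A rough sketch of one way such a check might look, assuming the FDB of the VXLAN device can be dumped with `bridge fdb show dev <device>` on the network node. The device name, the MAC address, and the peer VTEP address 172.29.243.53 below are made-up placeholders; only the local_ip 172.29.243.52 comes from my config. The parser just looks for a `dst <ip>` pair on each line, so treat it as an illustration rather than a complete diagnostic.

```python
import subprocess


def parse_fdb(output):
    """Map each MAC address to the set of remote VTEP IPs listed for it."""
    entries = {}
    for line in output.splitlines():
        tokens = line.split()
        # Only VXLAN forwarding entries carry a "dst <ip>" pair.
        if not tokens or "dst" not in tokens:
            continue
        idx = tokens.index("dst")
        if idx + 1 >= len(tokens):
            continue
        entries.setdefault(tokens[0].lower(), set()).add(tokens[idx + 1])
    return entries


def has_fdb_entry(output, mac, vtep_ip):
    """Return True if `mac` is forwarded to `vtep_ip` in the FDB dump."""
    return vtep_ip in parse_fdb(output).get(mac.lower(), set())


def read_fdb(device):
    """Dump the FDB of one device (needs iproute2 on the node)."""
    result = subprocess.run(
        ["bridge", "fdb", "show", "dev", device],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Hypothetical sample dump; real output will differ per node.
SAMPLE = """\
00:00:00:00:00:00 dst 172.29.243.53 self permanent
fa:16:3e:00:00:01 dst 172.29.243.53 self permanent
fa:16:3e:00:00:02 vlan 1 master brq-example permanent
"""
```

On a real node, something like `has_fdb_entry(read_fdb("vxlan-<VNI>"), port_mac, peer_ip)` would be the call to try, with the VNI, port MAC, and peer tunnel IP filled in for the router port stuck in BUILD.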
Below is a summary of the relevant configuration in my environment.

Two network nodes with these configurations in l3_agent.ini:
# HA failover
ha_confs_path = /var/lib/neutron/ha_confs
ha_vrrp_advert_int = 2
ha_vrrp_auth_password = 498d3d5737ca7273167c8badee533e9f2cf18e4db30676450
ha_vrrp_auth_type = PASS
handle_internal_only_routers = False
send_arp_for_ha = 3

In plugins/ml2/ml2_conf.ini:
mechanism_drivers = linuxbridge,l2population
l2_population = True

In plugins/ml2/linuxbridge_agent.ini:
[linux_bridge]

physical_interface_mappings = flat:eth12,vlan:eth11

# Linux bridge agent VXLAN networks
[vxlan]

enable_vxlan = True
vxlan_group = 239.1.1.1
# VXLAN local tunnel endpoint
local_ip = 172.29.243.52
l2_population = True

# Agent
[agent]
prevent_arp_spoofing = False

# Security groups
[securitygroup]
firewall_driver = neutron.agent.linux.iptables_firewall.IptablesFirewallDriver
enable_security_group = True

As stated above, when we create our private tenant networks and then attach them to the router, some of the router interfaces stay in the BUILD state until we bounce them (bring them down, then back up). Sometimes the router gateway interfaces stay in the BUILD state too. Bouncing an interface doesn't always bring it back up the first time; sometimes we have to bounce it multiple times before it finally reports as "ACTIVE".
Keep in mind this is in an environment with 10 or so subnets being created rapidly. We are creating a pair of routers and many subnets using Ansible. All of this is done within the span of 30 seconds. Maybe we are creating them too fast?
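The manual bouncing described above could be scripted roughly as follows. This is only a sketch of the retry loop, not a fix: `get_status` and `bounce` are hypothetical callables that would wrap whatever client call reads the port status and toggles its admin state, and the attempt count and wait time are arbitrary guesses.

```python
import time


def bounce_until_active(get_status, bounce, attempts=5, wait=2.0,
                        sleep=time.sleep):
    """Bounce a port until it reports ACTIVE.

    get_status: hypothetical callable returning the port status string.
    bounce:     hypothetical callable that brings the port down, then up.
    Returns the number of bounces that were needed, or None if the port
    never became ACTIVE within `attempts` bounces.
    """
    for done in range(attempts + 1):
        if get_status() == "ACTIVE":
            return done
        if done < attempts:
            bounce()
            # Give the agent some time to process the change.
            sleep(wait)
    return None
```

A return value of None would mean the port is still stuck after the given number of bounces, which is the case we sometimes hit by hand as well.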

Anyway, maybe someone here could help me figure out a workaround for now? Unfortunately, this isn't acceptable for our production environment.

Many thanks, and if there is anything I can do to help with this bug, please hit me up on IRC.