I've triaged this bug myself. You can reproduce it by:
1) starting 2 or 3 network nodes, and setting up HA routers
2) creating a few HA routers (10 would suffice)
3) stopping the ovs-agent, l3-agent and dhcp-agent on all the nodes for T > agent_down_time
4) starting them all at once.
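A rough sketch of steps 3) and 4) as a small Python helper, in case it helps anyone reproducing this. It only builds and prints the shell commands; it does not run them. The systemd unit names, the ssh/sudo access, the node names and the 15 second margin are all assumptions of mine, and 75 is just an example value for agent_down_time, so check your own neutron.conf.

```python
import shlex

# Assumed systemd unit names; they differ between distributions and deployments.
AGENTS = ["neutron-openvswitch-agent", "neutron-l3-agent", "neutron-dhcp-agent"]


def repro_plan(nodes, agent_down_time, margin=15):
    """Return the shell commands for steps 3) and 4), without executing them."""
    units = " ".join(AGENTS)
    # Step 3: stop all three agents on every network node.
    plan = [f"ssh {shlex.quote(n)} sudo systemctl stop {units}" for n in nodes]
    # Keep them down for longer than agent_down_time (margin is arbitrary).
    plan.append(f"sleep {agent_down_time + margin}")
    # Step 4: start everything at once, one background job per node.
    plan += [f"ssh {shlex.quote(n)} sudo systemctl start {units} &" for n in nodes]
    plan.append("wait")
    return plan


if __name__ == "__main__":
    # "net1" and "net2" are hypothetical node names.
    for cmd in repro_plan(["net1", "net2"], agent_down_time=75):
        print(cmd)
```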
About 50% of the time:
1) The l3-agent tries to rebind some of the router ports before any ovs-agent has reported itself (via heartbeat) as UP.
2) As a result, the port is moved into binding failed status.
3) Then the ovs-agent boots up and puts those ports on the dead internal VLAN (4095).
4) This recovers if you restart the l3-agent again, because it retries binding the port, and some agent is up by then.
[5) I'm not sure now whether you also needed to restart the OVS agent or not]
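The ordering problem above can be illustrated with a toy model. This is not Neutron code: `ToyServer`, `Port`, `ovs_agent_wire` and `run` are hypothetical names made up for the illustration, and the only thing it encodes is the assumption that a bind attempt fails when no ovs-agent has sent a heartbeat yet.

```python
from dataclasses import dataclass
from typing import Optional

DEAD_VLAN = 4095  # the dead internal VLAN from step 3)


@dataclass
class Port:
    name: str
    status: str = "unbound"  # "unbound", "bound" or "binding_failed"
    vlan: Optional[int] = None


class ToyServer:
    """Hypothetical stand-in for the server side; not a real Neutron class."""

    def __init__(self):
        self.ovs_agent_up = False

    def heartbeat(self):
        # An ovs-agent reports itself as UP.
        self.ovs_agent_up = True

    def bind(self, port):
        # Assumption: binding only succeeds if some ovs-agent is known to be UP.
        port.status = "bound" if self.ovs_agent_up else "binding_failed"


def ovs_agent_wire(port, local_vlan=1):
    # Toy version of step 3): ports that are not bound go to the dead VLAN.
    port.vlan = local_vlan if port.status == "bound" else DEAD_VLAN


def run(l3_first):
    server = ToyServer()
    port = Port("ha-port")
    if l3_first:
        server.bind(port)   # steps 1) and 2): rebind before any heartbeat
        server.heartbeat()
    else:
        server.heartbeat()
        server.bind(port)
    ovs_agent_wire(port)    # step 3)
    return server, port


if __name__ == "__main__":
    server, port = run(l3_first=True)
    print(port)
    server.bind(port)       # step 4): the retry works once an agent is UP
    print(port)
```

In the bad ordering the port ends up in binding failed on VLAN 4095, and a second bind attempt succeeds because an agent is UP by then. Whether the real ovs-agent then re-wires the port without a restart is the open question in 5); the toy model does not answer it.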