Incorrect state of the Openflow table

Bug #1754695 reported by Przemyslaw Wernicki
This bug affects 4 people

Bug Description

When provisioning a large number of VMs, several percent of them start up without network connectivity. We found that the cause of the faulty networking is an incorrect state in the OpenFlow table: there is no connectivity over VXLAN between the affected compute nodes and the controllers.

A correct OpenFlow table shows a complete list of VXLAN interfaces to all compute nodes and controllers:
cookie=0x98572d2e5f45dc06, duration=2639.513s, table=22, n_packets=212, n_bytes=57108, priority=1,dl_vlan=10 actions=strip_vlan,load:0x30->NXM_NX_TUN_ID[],output:"vxlan-0afe0c74",output:"vxlan-0afe0c80",output:"vxlan-0afe0c0b",output:"vxlan-0afe0c0c",output:"vxlan-0afe0c0d",output:"vxlan-0afe0c7d",output:"vxlan-0afe0c66",output:"vxlan-0afe0c81",output:"vxlan-0afe0c6d",output:"vxlan-0afe0c6c",output:"vxlan-0afe0c69",output:"vxlan-0afe0c7a",output:"vxlan-0afe0c79",output:"vxlan-0afe0c78",output:"vxlan-0afe0c7f",output:"vxlan-0afe0c7e",output:"vxlan-0afe0c67",output:"vxlan-0afe0c7c",output:"vxlan-0afe0c83",output:"vxlan-0afe0c86",output:"vxlan-0afe0c87",output:"vxlan-0afe0c76",output:"vxlan-0afe0c84",output:"vxlan-0afe0c85",output:"vxlan-0afe0c75",output:"vxlan-0afe0c72",output:"vxlan-0afe0c73",output:"vxlan-0afe0c71",output:"vxlan-0afe0c6f",output:"vxlan-0afe0c7b",output:"vxlan-0afe0c6b",output:"vxlan-0afe0c6a",output:"vxlan-0afe0c6e",output:"vxlan-0afe0c77",output:"vxlan-0afe0c65",output:"vxlan-0afe0c70"

An incorrect OpenFlow table shows that the VXLAN interfaces to the controllers are missing:
cookie=0xeee71baa637a6dde, duration=754.490s, table=22, n_packets=147, n_bytes=39834, priority=1,dl_vlan=10 actions=strip_vlan,load:0x30->NXM_NX_TUN_ID[],output:"vxlan-0afe0c74",output:"vxlan-0afe0c80",output:"vxlan-0afe0c7d",output:"vxlan-0afe0c66",output:"vxlan-0afe0c81",output:"vxlan-0afe0c6d",output:"vxlan-0afe0c6c",output:"vxlan-0afe0c69",output:"vxlan-0afe0c7a",output:"vxlan-0afe0c79",output:"vxlan-0afe0c78",output:"vxlan-0afe0c7f",output:"vxlan-0afe0c7e",output:"vxlan-0afe0c67",output:"vxlan-0afe0c7c",output:"vxlan-0afe0c86",output:"vxlan-0afe0c87",output:"vxlan-0afe0c76",output:"vxlan-0afe0c84",output:"vxlan-0afe0c85",output:"vxlan-0afe0c75",output:"vxlan-0afe0c72",output:"vxlan-0afe0c73",output:"vxlan-0afe0c71",output:"vxlan-0afe0c6f",output:"vxlan-0afe0c7b",output:"vxlan-0afe0c6b",output:"vxlan-0afe0c6a",output:"vxlan-0afe0c6e",output:"vxlan-0afe0c65",output:"vxlan-0afe0c70"

Restarting the neutron_openvswitch_agent container fixes the problem on the affected compute node by adding the missing VXLAN ports.
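
The comparison of the two table 22 entries can be automated. The following is only a minimal sketch, assuming the flow lines have the same textual form as the dumps in this report; the function names and the shortened sample lines are hypothetical and exist only for illustration.

```python
import re

# Matches every output:"<port>" action in a dumped flow line.
OUTPUT_RE = re.compile(r'output:"([^"]+)"')


def output_ports(flow_line):
    """Return the set of port names that a flow entry outputs to."""
    return set(OUTPUT_RE.findall(flow_line))


def missing_ports(good_flow, bad_flow):
    """Ports present in the healthy entry but absent from the faulty one."""
    return sorted(output_ports(good_flow) - output_ports(bad_flow))


# Shortened, illustrative entries (not the full lines from the report).
good = ('table=22, priority=1,dl_vlan=10 actions=strip_vlan,'
        'load:0x30->NXM_NX_TUN_ID[],output:"vxlan-0afe0c74",'
        'output:"vxlan-0afe0c0b",output:"vxlan-0afe0c0c"')
bad = ('table=22, priority=1,dl_vlan=10 actions=strip_vlan,'
       'load:0x30->NXM_NX_TUN_ID[],output:"vxlan-0afe0c74"')

print(missing_ports(good, bad))
```

Entries taken from two different nodes may also differ for reasons unrelated to this bug, so the result should be read as a hint rather than a diagnosis.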

Piotr Misiak (piotr-misiak) wrote :

Missing output ports:

output:"vxlan-0afe0c0b" - ctrl1
output:"vxlan-0afe0c0c" - ctrl2
output:"vxlan-0afe0c0d" - ctrl3

When the issue arises, all controllers are always missing from the output port list.
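
To map port names back to hosts, the hexadecimal suffix can be decoded. This sketch assumes the suffix is the remote tunnel endpoint's IPv4 address written in hex, which matches the naming pattern seen in this report; the helper name is hypothetical.

```python
import ipaddress


def tunnel_port_to_ip(port_name):
    """Decode a name such as 'vxlan-0afe0c0b' into a dotted IPv4 string.

    Assumes the eight hex digits after 'vxlan-' encode the remote
    endpoint's IPv4 address.
    """
    prefix, _, hex_part = port_name.partition("-")
    if prefix != "vxlan" or len(hex_part) != 8:
        raise ValueError("unexpected tunnel port name: %r" % port_name)
    return str(ipaddress.IPv4Address(int(hex_part, 16)))


for name in ("vxlan-0afe0c0b", "vxlan-0afe0c0c", "vxlan-0afe0c0d"):
    print(name, tunnel_port_to_ip(name))
```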

Brian Haley (brian-haley) wrote :

You added the 'in-stable-pike' tag, does this mean you don't see the problem in Queens or master? Thanks.

Piotr Misiak (piotr-misiak) wrote :

Hi Brian,

We haven't tested this issue on Queens or master.
Currently we are focusing on the Pike version, because we need this working on Pike.
This is a cloud with 38 compute nodes and 3 controllers.
When we deploy 100 VMs at once, there is usually a problem on one or two compute nodes, and none of the VMs spawned on those compute nodes have network connectivity. That is expected, because this table entry is shared by all VMs connected to a particular private network on a particular compute node.

After extensive debugging, I now suspect that the problem is the absence of VXLAN ports in br-tun rather than an issue with this table entry itself. I suspect the table entry is incomplete because no VXLAN ports to the controller nodes are configured in the br-tun bridge.
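
One way to test this hypothesis is to compare the ports that actually exist on br-tun with the ports expected for the controllers. The sketch below assumes `ovs-vsctl list-ports br-tun` is available on the compute node and prints one port name per line; the function names and the sample output are hypothetical.

```python
import subprocess


def parse_port_list(text):
    """Turn one-port-per-line text into a set of port names."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def list_bridge_ports(bridge="br-tun"):
    """Read the bridge's ports with ovs-vsctl (needs OVS on this host)."""
    result = subprocess.run(
        ["ovs-vsctl", "list-ports", bridge],
        check=True, capture_output=True, text=True,
    )
    return parse_port_list(result.stdout)


def missing_tunnel_ports(expected, actual):
    """Expected tunnel ports that are not present on the bridge."""
    return sorted(set(expected) - set(actual))


# Controller tunnel ports named in this report.
controller_ports = ["vxlan-0afe0c0b", "vxlan-0afe0c0c", "vxlan-0afe0c0d"]

# Illustrative output only; on a real node use list_bridge_ports().
sample_output = "patch-int\nvxlan-0afe0c74\nvxlan-0afe0c80\n"
print(missing_tunnel_ports(controller_ports, parse_port_list(sample_output)))
```

If the controller ports are missing from the bridge itself, the incomplete table 22 entry would be a symptom rather than the cause.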

Piotr Misiak (piotr-misiak) wrote :

It's worth mentioning we use DVR and L3_HA in this cloud.

Jakub Libosvar (libosvar) wrote :

Do you use l2 population in your cloud?

Piotr Misiak (piotr-misiak) wrote :

Yes, we use l2population.

zhaobo (zhaobo6)
tags: added: l3-dvr-backlog l3-ha needs-attention
Piotr Misiak (piotr-misiak) wrote :

We also have an issue with some of the L3_HA routers in the same environment.

Some central SNAT routers are not working. Most of them are provisioned by Heat. I've found that for those routers there are no keepalived processes running and they have status STANDBY on all of our L3 agents:

# neutron l3-agent-list-hosting-router 60a0771b-a65b-46bb-9da6-30a2f3f36216
| id | host | admin_state_up | alive | ha_state |
| 12622333-5dba-4a5f-bd97-11258fc9ab5a | ctrl2 | True | :-) | standby |
| 734d6178-240a-4025-a60c-ba773eadfe67 | ctrl1 | True | :-) | standby |
| fad0e95e-1865-40b5-a6b4-ab634cba4191 | ctrl3 | True | :-) | standby |

To be clear, these two issues do not appear in the same networks. Networks with the OVS table issue have a running router instance. But they may share a single root cause, because both seem to appear during massive network resource provisioning with Heat or when spawning 100 VMs at a time from CLI/Horizon/Heat.
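
Routers in this state can be spotted by checking whether any hosting agent reports an active HA state. This is a rough sketch that assumes the pipe-delimited table layout shown in the listing above; the parser and function names are hypothetical.

```python
def parse_table(table_text):
    """Parse pipe-delimited rows into dicts keyed by the header row.

    Lines that do not start with '|' (for example border lines) are skipped.
    """
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def has_active_instance(agents):
    """True if at least one hosting agent reports ha_state 'active'."""
    return any(agent.get("ha_state") == "active" for agent in agents)


listing = """
| id | host | admin_state_up | alive | ha_state |
| 12622333-5dba-4a5f-bd97-11258fc9ab5a | ctrl2 | True | :-) | standby |
| 734d6178-240a-4025-a60c-ba773eadfe67 | ctrl1 | True | :-) | standby |
| fad0e95e-1865-40b5-a6b4-ab634cba4191 | ctrl3 | True | :-) | standby |
"""

agents = parse_table(listing)
print([a["host"] for a in agents], has_active_instance(agents))
```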

Armando Migliaccio (armando-migliaccio) wrote :

Thanks for the details provided so far. That said, are there any errors in the logs that may pinpoint why the flows are missing? Without a more detailed list of steps to reproduce, the scale of the environment, the exact nature of the network topology, etc., it's hard to provide any feedback.

Marking incomplete until more details are provided.

tags: removed: in-stable-pike
Changed in neutron:
status: New → Incomplete
Piotr Misiak (piotr-misiak) wrote :

I created a separate bug for the L3-HA router issue to distinguish it from this one:

There are no errors in the logs that indicate a problem. Even the debug logs don't show any issue, yet table 22 is still not set up correctly.

Currently I'm focusing on debugging issue #1757188, because there is no easy way to fix it.

I will get back to you when I return to debugging this issue.

Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
Gaëtan Trellu (goldyfruit) wrote :

Piotr, any news about this issue?
We are running into much the same issue, and we suspect the native interface.

Gaëtan Trellu (goldyfruit) wrote :

Seems to be fixed in the latest Pike version (and Queens/master branches).
