ovn-controller doesn't update new flows

Bug #1969354 reported by Nguyen Thanh Cong
Affects: neutron | Status: Invalid | Importance: Undecided | Assigned to: Unassigned

Bug Description

Problem: ovn-controller is reported down and does not install new flows

Description:
- Running OpenStack Victoria, deployed by kolla-ansible
- Two compute nodes (A & B), with ~70 VMs and > 80k flows per compute
- I live-migrated some VMs from compute A to compute B
- `openstack network agent list` reports the ovn-controller agent on compute B as down
- On the OVN NB DB, `ovn-nbctl list NB_Global` shows hv_cfg < nb_cfg
- On the OVN SB DB, `ovn-sbctl list Chassis` shows that nb_cfg of compute A equals nb_cfg of the NB DB, but nb_cfg of compute B < nb_cfg of the NB DB
==> Neutron therefore reports the ovn-controller on compute B as down
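For reference, the liveness check described above can be reproduced from the command line. This is a sketch: it assumes the ovn-nbctl/ovn-sbctl tools on the node can reach the NB/SB databases, and the column selection is only to keep the output readable.

```shell
# On the NB DB: hv_cfg lags nb_cfg when at least one chassis has not yet
# acknowledged the latest northbound sequence number.
ovn-nbctl --columns=nb_cfg,hv_cfg list NB_Global

# On the SB DB: compare each chassis' nb_cfg against the global value.
# The chassis whose nb_cfg stays behind is the one Neutron reports as down.
ovn-sbctl --columns=hostname,nb_cfg list Chassis
```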

Workaround:
- Flush all flows
- Restart ovn-controller

Reproduction:
- Could not reproduce it with live migration

Logs:
1. ovn-controller on compute B:
2022-04-15T09:53:22.156Z|12616545|timeval|WARN|Unreasonably long 9077ms poll interval (8944ms user, 131ms system)
2022-04-15T09:53:22.156Z|12616546|timeval|WARN|faults: 69006 minor, 0 major
2022-04-15T09:53:22.156Z|12616547|timeval|WARN|disk: 0 reads, 8 writes
2022-04-15T09:53:22.156Z|12616548|timeval|WARN|context switches: 0 voluntary, 19 involuntary
2022-04-15T09:53:22.156Z|12616549|poll_loop|INFO|Dropped 280 log messages in last 10 seconds (most recently, 9 seconds ago) due to excessive rate
2022-04-15T09:53:22.156Z|12616550|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (127.0.0.1:30242<->127.0.0.1:6640) at lib/stream-fd.c:157 (100% CPU usage)
2022-04-15T09:53:22.156Z|12616551|jsonrpc|WARN|tcp:127.0.0.1:6640: send error: Broken pipe
2022-04-15T09:53:22.156Z|12616552|jsonrpc|WARN|tcp:172.19.20.11:6642: send error: Broken pipe
2022-04-15T09:53:22.160Z|12616553|reconnect|WARN|tcp:127.0.0.1:6640: connection dropped (Broken pipe)
2022-04-15T09:53:22.160Z|12616554|reconnect|WARN|tcp:172.19.20.11:6642: connection dropped (Broken pipe)
2022-04-15T09:53:23.160Z|12616555|reconnect|INFO|tcp:127.0.0.1:6640: connecting...
2022-04-15T09:53:23.160Z|12616556|reconnect|INFO|tcp:172.19.20.10:6642: connecting...
2022-04-15T09:53:23.163Z|12616557|reconnect|INFO|tcp:127.0.0.1:6640: connected
2022-04-15T09:53:23.163Z|12616558|reconnect|INFO|tcp:172.19.20.10:6642: connected
2022-04-15T09:53:23.165Z|12616559|main|INFO|OVNSB commit failed, force recompute next time.
2022-04-15T09:53:31.991Z|12616569|timeval|WARN|Unreasonably long 8805ms poll interval (8708ms user, 94ms system)
2022-04-15T09:53:31.991Z|12616570|timeval|WARN|faults: 69251 minor, 0 major
2022-04-15T09:53:31.991Z|12616571|timeval|WARN|context switches: 0 voluntary, 19 involuntary
2022-04-15T09:53:31.991Z|12616572|poll_loop|INFO|Dropped 406 log messages in last 10 seconds (most recently, 9 seconds ago) due to excessive rate
2022-04-15T09:53:31.991Z|12616573|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (127.0.0.1:30540<->127.0.0.1:6640) at lib/stream-fd.c:157 (100% CPU usage)
2022-04-15T09:53:31.991Z|12616574|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (172.19.20.100:8744<->172.19.20.10:6642) at lib/stream-fd.c:157 (100% CPU usage)

2. ovsdb-server on compute B:
2022-04-15T09:15:50.489Z|79819|jsonrpc|WARN|tcp:127.0.0.1:31970: receive error: Connection reset by peer
2022-04-15T09:15:50.489Z|79820|reconnect|WARN|tcp:127.0.0.1:31970: connection dropped (Connection reset by peer)
2022-04-15T09:17:50.763Z|79821|reconnect|ERR|tcp:127.0.0.1:32952: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:19:19.271Z|79822|reconnect|ERR|tcp:127.0.0.1:34554: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:19:50.806Z|79823|reconnect|ERR|tcp:127.0.0.1:35858: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:20:49.662Z|79824|reconnect|ERR|tcp:127.0.0.1:36300: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:21:21.746Z|79825|reconnect|ERR|tcp:127.0.0.1:36722: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:22:01.375Z|79826|reconnect|ERR|tcp:127.0.0.1:36914: no response to inactivity probe after 5.01 seconds, disconnecting
2022-04-15T09:22:57.564Z|79827|reconnect|ERR|tcp:127.0.0.1:37582: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:23:44.051Z|79828|reconnect|ERR|tcp:127.0.0.1:38234: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:24:27.935Z|79829|reconnect|ERR|tcp:127.0.0.1:39244: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:25:03.414Z|79830|reconnect|ERR|tcp:127.0.0.1:39484: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:25:58.860Z|79831|reconnect|ERR|tcp:127.0.0.1:39700: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:27:34.469Z|79832|reconnect|ERR|tcp:127.0.0.1:40224: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:29:07.546Z|79833|reconnect|ERR|tcp:127.0.0.1:41346: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:29:52.822Z|79834|reconnect|ERR|tcp:127.0.0.1:41930: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:30:40.555Z|79835|reconnect|ERR|tcp:127.0.0.1:42470: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:31:19.660Z|79836|reconnect|ERR|tcp:127.0.0.1:42872: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:32:07.061Z|79837|reconnect|ERR|tcp:127.0.0.1:43214: no response to inactivity probe after 5 seconds, disconnecting

3. ovn-sb-db on controller openstack:
2022-04-15T09:55:37.516Z|322313|jsonrpc|WARN|tcp:172.19.20.12:63278: receive error: Connection reset by peer
2022-04-15T09:55:37.516Z|322314|reconnect|WARN|tcp:172.19.20.12:63278: connection dropped (Connection reset by peer)
2022-04-15T09:55:46.857Z|322315|reconnect|ERR|tcp:172.19.20.11:35978: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:55:46.974Z|322316|reconnect|ERR|tcp:172.19.20.12:22952: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:55:49.367Z|322317|reconnect|ERR|tcp:172.19.20.101:12810: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:55:50.302Z|322318|reconnect|ERR|tcp:172.19.20.31:59604: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:55:55.287Z|322319|reconnect|ERR|tcp:172.19.20.102:61964: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:56:30.760Z|322320|reconnect|ERR|tcp:172.19.20.102:62190: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:56:48.494Z|322321|reconnect|ERR|tcp:172.19.20.105:53684: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:56:48.765Z|322322|reconnect|ERR|tcp:172.19.20.10:22338: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:57:05.311Z|322323|reconnect|ERR|tcp:172.19.20.102:62422: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:57:13.860Z|322324|reconnect|ERR|tcp:172.19.20.30:38406: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:57:39.726Z|322325|reconnect|ERR|tcp:172.19.20.102:62642: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:58:12.186Z|322326|reconnect|ERR|tcp:172.19.20.12:32468: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:58:14.320Z|322327|reconnect|ERR|tcp:172.19.20.102:62820: no response to inactivity probe after 5 seconds, disconnecting

4. ovn-sb-db on controller:
2022-04-15T01:28:50.733Z|22383|jsonrpc|WARN|tcp:172.19.20.12:25398: receive error: Connection reset by peer
2022-04-15T01:28:50.733Z|22384|reconnect|WARN|tcp:172.19.20.12:25398: connection dropped (Connection reset by peer)
2022-04-15T03:47:16.056Z|22385|jsonrpc|WARN|tcp:172.19.20.11:14554: receive error: Connection reset by peer
2022-04-15T03:47:16.056Z|22386|reconnect|WARN|tcp:172.19.20.11:14554: connection dropped (Connection reset by peer)
2022-04-15T03:47:31.910Z|22387|jsonrpc|WARN|tcp:172.19.20.11:15666: receive error: Connection reset by peer
2022-04-15T03:47:31.910Z|22388|reconnect|WARN|tcp:172.19.20.11:15666: connection dropped (Connection reset by peer)
2022-04-15T04:06:03.251Z|22389|jsonrpc|WARN|tcp:172.19.20.10:60386: receive error: Connection reset by peer
2022-04-15T04:06:03.251Z|22390|reconnect|WARN|tcp:172.19.20.10:60386: connection dropped (Connection reset by peer)
2022-04-15T05:54:59.704Z|22391|jsonrpc|WARN|tcp:172.19.20.12:59450: receive error: Connection reset by peer
2022-04-15T05:54:59.704Z|22392|reconnect|WARN|tcp:172.19.20.12:59450: connection dropped (Connection reset by peer)
2022-04-15T07:16:21.438Z|22393|jsonrpc|WARN|tcp:172.19.20.12:59508: receive error: Connection reset by peer
2022-04-15T07:16:21.438Z|22394|reconnect|WARN|tcp:172.19.20.12:59508: connection dropped (Connection reset by peer)
2022-04-15T07:16:31.218Z|22395|jsonrpc|WARN|tcp:172.19.20.10:43826: receive error: Connection reset by peer
2022-04-15T07:16:31.218Z|22396|reconnect|WARN|tcp:172.19.20.10:43826: connection dropped (Connection reset by peer)
2022-04-15T07:16:31.356Z|22397|jsonrpc|WARN|tcp:172.19.20.11:50460: receive error: Connection reset by peer
2022-04-15T07:16:31.356Z|22398|reconnect|WARN|tcp:172.19.20.11:50460: connection dropped (Connection reset by peer)
2022-04-15T08:20:10.313Z|22399|jsonrpc|WARN|tcp:172.19.20.10:25824: receive error: Connection reset by peer
2022-04-15T08:20:10.313Z|22400|reconnect|WARN|tcp:172.19.20.10:25824: connection dropped (Connection reset by peer)
2022-04-15T08:20:19.600Z|22401|jsonrpc|WARN|tcp:172.19.20.12:28378: receive error: Connection reset by peer
2022-04-15T08:20:19.600Z|22402|reconnect|WARN|tcp:172.19.20.12:28378: connection dropped (Connection reset by peer)
2022-04-15T08:52:14.280Z|22403|jsonrpc|WARN|tcp:172.19.20.11:19300: receive error: Connection reset by peer
2022-04-15T08:52:14.280Z|22404|reconnect|WARN|tcp:172.19.20.11:19300: connection dropped (Connection reset by peer)

Questions:
1. ovsdb connections are dropped after 5 s of inactivity. How can I fix this?
2. ovn-controller logs "OVNSB commit failed, force recompute next time." What does this mean?
3. ovn-controller logs 100% CPU usage and appears to use only one CPU for processing. Maybe that CPU of the compute node is also being used by a VM. How can I fix this? Can ovn-controller use more CPUs?
4. Why does live migration bring ovn-controller down? Is it a flow conflict? Is there any way to deal with this situation?

Revision history for this message
Jakub Libosvar (libosvar) wrote :

This looks more like a configuration or scaling problem in your environment than a bug in ovn-controller. If it is disconnected, it cannot process new flows.

Questions:
1. ovsdb connections are dropped after 5 s of inactivity. How can I fix this?
That is the default probe interval; you can increase it by running 'ovs-vsctl set open . external_ids:ovn-remote-probe-interval="60000"'.
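The suggested fix as a sketch: the value is in milliseconds, it configures ovn-controller's client-side probe toward the SB DB, and it must be applied on each chassis (whether 60 s is enough depends on your load).

```shell
# Raise ovn-controller's probe toward the southbound DB to 60 s
# (value in milliseconds; 0 disables the probe entirely).
ovs-vsctl set open . external_ids:ovn-remote-probe-interval="60000"

# Confirm the setting took effect on this chassis.
ovs-vsctl get open . external_ids:ovn-remote-probe-interval
```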

2. ovn-controller logs "OVNSB commit failed, force recompute next time." What does this mean?
It means ovn-controller attempted a write to the OVN SB database and failed. It will recompute all OpenFlow flows on its next iteration.

3. ovn-controller logs 100% CPU usage and appears to use only one CPU. Can it use more CPUs?
No, ovn-controller is a single-process, single-threaded daemon.

4. Why does live migration bring ovn-controller down? Is it a flow conflict? Is there any way to deal with this situation?
The whole environment looks unstable; you should check whether the southbound database is overloaded.

Changed in neutron:
status: New → Invalid
Nguyen Thanh Cong (congnt95) wrote (last edit ):

Hello Jakub Libosvar,
1. The OVN SB log on the controller and the ovsdb-server log on the compute nodes still show disconnects after 5 s, even after I ran 'ovs-vsctl set open . external_ids:ovn-remote-probe-interval="60000"' on every compute and controller node. What is causing this error?

/var/log/kolla/openvswitch/ovn-sb-db.log
2022-04-21T09:24:32.040Z|355953|reconnect|ERR|tcp:172.19.20.105:16424: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:24:34.918Z|355954|reconnect|ERR|tcp:172.19.20.104:20134: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:00.579Z|355955|reconnect|ERR|tcp:172.19.20.100:7800: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:11.165Z|355956|reconnect|ERR|tcp:172.19.20.106:44522: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:13.475Z|355957|reconnect|ERR|tcp:172.19.20.101:62388: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:14.339Z|355958|reconnect|ERR|tcp:172.19.20.102:49252: no response to inactivity probe after 5 seconds, disconnecting

/var/log/kolla/openvswitch/ovsdb-server.log
2022-04-21T09:24:40.443Z|87372|reconnect|ERR|tcp:127.0.0.1:29416: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:24:58.558Z|87373|reconnect|ERR|tcp:127.0.0.1:29526: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:16.633Z|87374|reconnect|ERR|tcp:127.0.0.1:29918: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:29.628Z|87375|reconnect|ERR|tcp:127.0.0.1:30132: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:40.831Z|87376|reconnect|ERR|tcp:127.0.0.1:30400: no response to inactivity probe after 5 seconds, disconnecting

https://paste.ubuntu.com/p/BHHJwqwWGx/

2. I deployed OpenStack with kolla-ansible. OVN runs as a cluster, but the OVN SB database changes leader regularly. Is that normal?

Jakub Libosvar (libosvar) wrote :

1) This could also be caused by the probe running on the DB side. You can bump it by changing inactivity_probe in the Connection tables. These commands do that:

ovn-nbctl set connection . inactivity_probe=60000
ovn-sbctl set connection . inactivity_probe=60000
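To check whether the server-side value actually took effect, the Connection tables can be inspected directly. A sketch, assuming the NB/SB databases have rows in their Connection tables (not every deployment configures its listeners through that table):

```shell
# Show each configured listener and its server-side probe interval (ms).
ovn-nbctl --columns=target,inactivity_probe list connection
ovn-sbctl --columns=target,inactivity_probe list connection
```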

2) I don't know how the OVN cluster is managed by your tooling, i.e. whether you use RAFT active/active or a pacemaker-managed active/backup solution, but it sounds like the DBs are loaded enough not to answer monitoring probes in time. For the pacemaker solution you can try changing the monitor interval and timeouts on the master and slave roles. Again, I don't know how it's managed in your environment, but these commands can be handy:

pcs resource update <your_ovn_db_resource> op monitor interval=60s role=Master timeout=180s enabled=true
pcs resource update <your_ovn_db_resource> op monitor interval=120s role=Slave timeout=180s enabled=true

That all said, this looks like a misconfiguration of the cloud leading to a scaling issue rather than a bug in Neutron.

Nguyen Thanh Cong (congnt95) wrote (last edit ):

Hi Jakub,

I increased inactivity_probe in the NB & SB databases, but ovsdb-server still reports:
2022-07-14T02:46:24.519Z|06705|reconnect|ERR|tcp:127.0.0.1:36624: no response to inactivity probe after 5 seconds, disconnecting

I also hit a bug with a VLAN network:
- A compute node has no VMs on that VLAN network
- I create a VM on that compute attached to the VLAN network
- The VM has no connectivity on that network
- After 1 or 2 hours, it gains connectivity

Debugging:
- ovsdb-server still reports disconnects after 5 seconds
- In Open vSwitch, I cannot see port patch-provnet-xxxxxx-to-br-int in br-ex or port patch-br-int-to-provnet-xxxx in br-int
- `ovs-ofctl dump-flows br-int table=0 | grep vlan` shows no flows for this VLAN for 1 or 2 hours

==> I think OVS cannot process the flows within 5 s, so it cannot add the patch ports to br-ex and br-int
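The missing patch ports can be checked directly on the chassis. A sketch; the 'provnet' fragment in the grep patterns matches the truncated provider-network ids quoted above, and br-ex is assumed to be the provider bridge as in this deployment:

```shell
# ovn-controller creates a patch-port pair for each provider (VLAN) network;
# both ends should exist once the network is wired up on this chassis.
ovs-vsctl list-ports br-ex  | grep patch-provnet
ovs-vsctl list-ports br-int | grep patch-br-int-to-provnet

# Flows for the VLAN in table 0 of the integration bridge.
ovs-ofctl dump-flows br-int table=0 | grep vlan
```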

Note: I use OpenStack Victoria and OVN 20.03, deployed by kolla-ansible
