ovn-controller doesn't update new flows

Bug #1969354 reported by Nguyen Thanh Cong
Affects: neutron | Status: Invalid | Importance: Undecided | Assigned to: Unassigned

Bug Description

Problem: ovn-controller is reported down and does not install new flows

Description:
- Running OpenStack Victoria, deployed by kolla-ansible
- Two compute nodes (A & B), with ~70 VMs and > 80k flows per compute
- I live-migrated some VMs from compute A to compute B
- `openstack network agent list` reports the ovn-controller agent on compute B as down
- On the OVN NB DB, `ovn-nbctl list NB_Global` shows hv_cfg < nb_cfg
- On the OVN SB DB, `ovn-sbctl list Chassis` shows that nb_cfg of compute A equals nb_cfg of the NB DB, but nb_cfg of compute B < nb_cfg of the NB DB
==> Neutron therefore reports the ovn-controller on compute B as down
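For reference, the liveness check described above can be reproduced from the command line. This is a sketch: it assumes the ovn-nbctl/ovn-sbctl tools on the node can reach the NB/SB databases, and the column selection is only to keep the output readable.

```shell
# On the NB DB: hv_cfg lags nb_cfg when at least one chassis has not yet
# acknowledged the latest northbound sequence number.
ovn-nbctl --columns=nb_cfg,hv_cfg list NB_Global

# On the SB DB: compare each chassis' nb_cfg against the global value.
# The chassis whose nb_cfg stays behind is the one Neutron reports as down.
ovn-sbctl --columns=hostname,nb_cfg list Chassis
```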

Workaround:
- Flush all flows
- Restart ovn-controller

Reproduction:
- Could not reproduce it with live migration

Logs:
1. ovn-controller on compute B:
2022-04-15T09:53:22.156Z|12616545|timeval|WARN|Unreasonably long 9077ms poll interval (8944ms user, 131ms system)
2022-04-15T09:53:22.156Z|12616546|timeval|WARN|faults: 69006 minor, 0 major
2022-04-15T09:53:22.156Z|12616547|timeval|WARN|disk: 0 reads, 8 writes
2022-04-15T09:53:22.156Z|12616548|timeval|WARN|context switches: 0 voluntary, 19 involuntary
2022-04-15T09:53:22.156Z|12616549|poll_loop|INFO|Dropped 280 log messages in last 10 seconds (most recently, 9 seconds ago) due to excessive rate
2022-04-15T09:53:22.156Z|12616550|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (127.0.0.1:30242<->127.0.0.1:6640) at lib/stream-fd.c:157 (100% CPU usage)
2022-04-15T09:53:22.156Z|12616551|jsonrpc|WARN|tcp:127.0.0.1:6640: send error: Broken pipe
2022-04-15T09:53:22.156Z|12616552|jsonrpc|WARN|tcp:172.19.20.11:6642: send error: Broken pipe
2022-04-15T09:53:22.160Z|12616553|reconnect|WARN|tcp:127.0.0.1:6640: connection dropped (Broken pipe)
2022-04-15T09:53:22.160Z|12616554|reconnect|WARN|tcp:172.19.20.11:6642: connection dropped (Broken pipe)
2022-04-15T09:53:23.160Z|12616555|reconnect|INFO|tcp:127.0.0.1:6640: connecting...
2022-04-15T09:53:23.160Z|12616556|reconnect|INFO|tcp:172.19.20.10:6642: connecting...
2022-04-15T09:53:23.163Z|12616557|reconnect|INFO|tcp:127.0.0.1:6640: connected
2022-04-15T09:53:23.163Z|12616558|reconnect|INFO|tcp:172.19.20.10:6642: connected
2022-04-15T09:53:23.165Z|12616559|main|INFO|OVNSB commit failed, force recompute next time.
2022-04-15T09:53:31.991Z|12616569|timeval|WARN|Unreasonably long 8805ms poll interval (8708ms user, 94ms system)
2022-04-15T09:53:31.991Z|12616570|timeval|WARN|faults: 69251 minor, 0 major
2022-04-15T09:53:31.991Z|12616571|timeval|WARN|context switches: 0 voluntary, 19 involuntary
2022-04-15T09:53:31.991Z|12616572|poll_loop|INFO|Dropped 406 log messages in last 10 seconds (most recently, 9 seconds ago) due to excessive rate
2022-04-15T09:53:31.991Z|12616573|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (127.0.0.1:30540<->127.0.0.1:6640) at lib/stream-fd.c:157 (100% CPU usage)
2022-04-15T09:53:31.991Z|12616574|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (172.19.20.100:8744<->172.19.20.10:6642) at lib/stream-fd.c:157 (100% CPU usage)

2. ovsdb-server on compute B:
2022-04-15T09:15:50.489Z|79819|jsonrpc|WARN|tcp:127.0.0.1:31970: receive error: Connection reset by peer
2022-04-15T09:15:50.489Z|79820|reconnect|WARN|tcp:127.0.0.1:31970: connection dropped (Connection reset by peer)
2022-04-15T09:17:50.763Z|79821|reconnect|ERR|tcp:127.0.0.1:32952: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:19:19.271Z|79822|reconnect|ERR|tcp:127.0.0.1:34554: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:19:50.806Z|79823|reconnect|ERR|tcp:127.0.0.1:35858: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:20:49.662Z|79824|reconnect|ERR|tcp:127.0.0.1:36300: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:21:21.746Z|79825|reconnect|ERR|tcp:127.0.0.1:36722: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:22:01.375Z|79826|reconnect|ERR|tcp:127.0.0.1:36914: no response to inactivity probe after 5.01 seconds, disconnecting
2022-04-15T09:22:57.564Z|79827|reconnect|ERR|tcp:127.0.0.1:37582: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:23:44.051Z|79828|reconnect|ERR|tcp:127.0.0.1:38234: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:24:27.935Z|79829|reconnect|ERR|tcp:127.0.0.1:39244: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:25:03.414Z|79830|reconnect|ERR|tcp:127.0.0.1:39484: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:25:58.860Z|79831|reconnect|ERR|tcp:127.0.0.1:39700: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:27:34.469Z|79832|reconnect|ERR|tcp:127.0.0.1:40224: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:29:07.546Z|79833|reconnect|ERR|tcp:127.0.0.1:41346: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:29:52.822Z|79834|reconnect|ERR|tcp:127.0.0.1:41930: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:30:40.555Z|79835|reconnect|ERR|tcp:127.0.0.1:42470: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:31:19.660Z|79836|reconnect|ERR|tcp:127.0.0.1:42872: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:32:07.061Z|79837|reconnect|ERR|tcp:127.0.0.1:43214: no response to inactivity probe after 5 seconds, disconnecting

3. ovn-sb-db on controller openstack:
2022-04-15T09:55:37.516Z|322313|jsonrpc|WARN|tcp:172.19.20.12:63278: receive error: Connection reset by peer
2022-04-15T09:55:37.516Z|322314|reconnect|WARN|tcp:172.19.20.12:63278: connection dropped (Connection reset by peer)
2022-04-15T09:55:46.857Z|322315|reconnect|ERR|tcp:172.19.20.11:35978: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:55:46.974Z|322316|reconnect|ERR|tcp:172.19.20.12:22952: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:55:49.367Z|322317|reconnect|ERR|tcp:172.19.20.101:12810: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:55:50.302Z|322318|reconnect|ERR|tcp:172.19.20.31:59604: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:55:55.287Z|322319|reconnect|ERR|tcp:172.19.20.102:61964: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:56:30.760Z|322320|reconnect|ERR|tcp:172.19.20.102:62190: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:56:48.494Z|322321|reconnect|ERR|tcp:172.19.20.105:53684: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:56:48.765Z|322322|reconnect|ERR|tcp:172.19.20.10:22338: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:57:05.311Z|322323|reconnect|ERR|tcp:172.19.20.102:62422: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:57:13.860Z|322324|reconnect|ERR|tcp:172.19.20.30:38406: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:57:39.726Z|322325|reconnect|ERR|tcp:172.19.20.102:62642: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:58:12.186Z|322326|reconnect|ERR|tcp:172.19.20.12:32468: no response to inactivity probe after 5 seconds, disconnecting
2022-04-15T09:58:14.320Z|322327|reconnect|ERR|tcp:172.19.20.102:62820: no response to inactivity probe after 5 seconds, disconnecting

4. ovn-sb-db on controller:
2022-04-15T01:28:50.733Z|22383|jsonrpc|WARN|tcp:172.19.20.12:25398: receive error: Connection reset by peer
2022-04-15T01:28:50.733Z|22384|reconnect|WARN|tcp:172.19.20.12:25398: connection dropped (Connection reset by peer)
2022-04-15T03:47:16.056Z|22385|jsonrpc|WARN|tcp:172.19.20.11:14554: receive error: Connection reset by peer
2022-04-15T03:47:16.056Z|22386|reconnect|WARN|tcp:172.19.20.11:14554: connection dropped (Connection reset by peer)
2022-04-15T03:47:31.910Z|22387|jsonrpc|WARN|tcp:172.19.20.11:15666: receive error: Connection reset by peer
2022-04-15T03:47:31.910Z|22388|reconnect|WARN|tcp:172.19.20.11:15666: connection dropped (Connection reset by peer)
2022-04-15T04:06:03.251Z|22389|jsonrpc|WARN|tcp:172.19.20.10:60386: receive error: Connection reset by peer
2022-04-15T04:06:03.251Z|22390|reconnect|WARN|tcp:172.19.20.10:60386: connection dropped (Connection reset by peer)
2022-04-15T05:54:59.704Z|22391|jsonrpc|WARN|tcp:172.19.20.12:59450: receive error: Connection reset by peer
2022-04-15T05:54:59.704Z|22392|reconnect|WARN|tcp:172.19.20.12:59450: connection dropped (Connection reset by peer)
2022-04-15T07:16:21.438Z|22393|jsonrpc|WARN|tcp:172.19.20.12:59508: receive error: Connection reset by peer
2022-04-15T07:16:21.438Z|22394|reconnect|WARN|tcp:172.19.20.12:59508: connection dropped (Connection reset by peer)
2022-04-15T07:16:31.218Z|22395|jsonrpc|WARN|tcp:172.19.20.10:43826: receive error: Connection reset by peer
2022-04-15T07:16:31.218Z|22396|reconnect|WARN|tcp:172.19.20.10:43826: connection dropped (Connection reset by peer)
2022-04-15T07:16:31.356Z|22397|jsonrpc|WARN|tcp:172.19.20.11:50460: receive error: Connection reset by peer
2022-04-15T07:16:31.356Z|22398|reconnect|WARN|tcp:172.19.20.11:50460: connection dropped (Connection reset by peer)
2022-04-15T08:20:10.313Z|22399|jsonrpc|WARN|tcp:172.19.20.10:25824: receive error: Connection reset by peer
2022-04-15T08:20:10.313Z|22400|reconnect|WARN|tcp:172.19.20.10:25824: connection dropped (Connection reset by peer)
2022-04-15T08:20:19.600Z|22401|jsonrpc|WARN|tcp:172.19.20.12:28378: receive error: Connection reset by peer
2022-04-15T08:20:19.600Z|22402|reconnect|WARN|tcp:172.19.20.12:28378: connection dropped (Connection reset by peer)
2022-04-15T08:52:14.280Z|22403|jsonrpc|WARN|tcp:172.19.20.11:19300: receive error: Connection reset by peer
2022-04-15T08:52:14.280Z|22404|reconnect|WARN|tcp:172.19.20.11:19300: connection dropped (Connection reset by peer)

Questions:
1. ovsdb connections are dropped after 5 s of inactivity. How can I fix this?
2. ovn-controller logs "OVNSB commit failed, force recompute next time." What does this mean?
3. ovn-controller logs 100% CPU usage and appears to use only one CPU for processing. Maybe that CPU of the compute node is also being used by a VM. How can I fix this? Can ovn-controller use more CPUs?
4. Why does live migration bring ovn-controller down? Is it a flow conflict? Is there any way to deal with this situation?

Revision history for this message
Jakub Libosvar (libosvar) wrote :

This looks more like a configuration or scaling problem in your environment than a bug in ovn-controller. If it is disconnected, it cannot process new flows.

Questions:
1. ovsdb connections are dropped after 5 s of inactivity. How can I fix this?
That is the default probe interval; you can increase it by running 'ovs-vsctl set open . external_ids:ovn-remote-probe-interval="60000"'.
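The suggested fix as a sketch: the value is in milliseconds, it configures ovn-controller's client-side probe toward the SB DB, and it must be applied on each chassis (whether 60 s is enough depends on your load).

```shell
# Raise ovn-controller's probe toward the southbound DB to 60 s
# (value in milliseconds; 0 disables the probe entirely).
ovs-vsctl set open . external_ids:ovn-remote-probe-interval="60000"

# Confirm the setting took effect on this chassis.
ovs-vsctl get open . external_ids:ovn-remote-probe-interval
```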

2. ovn-controller logs "OVNSB commit failed, force recompute next time." What does this mean?
It means ovn-controller attempted a write to the OVN SB database and failed. It will recompute all OpenFlow flows on its next iteration.

3. ovn-controller logs 100% CPU usage and appears to use only one CPU. Can it use more CPUs?
No, ovn-controller is a single-process, single-threaded daemon.

4. Why does live migration bring ovn-controller down? Is it a flow conflict? Is there any way to deal with this situation?
The whole environment looks unstable; you should check whether the southbound database is overloaded.

Changed in neutron:
status: New → Invalid
Nguyen Thanh Cong (congnt95) wrote (last edit ):

Hello Jakub Libosvar,
1. The OVN SB log on the controller and the ovsdb-server log on the compute nodes still show disconnects after 5 s, even after I ran 'ovs-vsctl set open . external_ids:ovn-remote-probe-interval="60000"' on every compute and controller node. What is causing this error?

/var/log/kolla/openvswitch/ovn-sb-db.log
2022-04-21T09:24:32.040Z|355953|reconnect|ERR|tcp:172.19.20.105:16424: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:24:34.918Z|355954|reconnect|ERR|tcp:172.19.20.104:20134: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:00.579Z|355955|reconnect|ERR|tcp:172.19.20.100:7800: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:11.165Z|355956|reconnect|ERR|tcp:172.19.20.106:44522: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:13.475Z|355957|reconnect|ERR|tcp:172.19.20.101:62388: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:14.339Z|355958|reconnect|ERR|tcp:172.19.20.102:49252: no response to inactivity probe after 5 seconds, disconnecting

/var/log/kolla/openvswitch/ovsdb-server.log
2022-04-21T09:24:40.443Z|87372|reconnect|ERR|tcp:127.0.0.1:29416: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:24:58.558Z|87373|reconnect|ERR|tcp:127.0.0.1:29526: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:16.633Z|87374|reconnect|ERR|tcp:127.0.0.1:29918: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:29.628Z|87375|reconnect|ERR|tcp:127.0.0.1:30132: no response to inactivity probe after 5 seconds, disconnecting
2022-04-21T09:25:40.831Z|87376|reconnect|ERR|tcp:127.0.0.1:30400: no response to inactivity probe after 5 seconds, disconnecting

https://paste.ubuntu.com/p/BHHJwqwWGx/

2. I deployed OpenStack with kolla-ansible. OVN runs as a cluster, but the OVN SB database changes leader regularly. Is that normal?

Jakub Libosvar (libosvar) wrote :

1) This could also be caused by the probe running on the DB side. You can bump it by changing inactivity_probe in the Connection tables. These commands do that:

ovn-nbctl set connection . inactivity_probe=60000
ovn-sbctl set connection . inactivity_probe=60000
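To check whether the server-side value actually took effect, the Connection tables can be inspected directly. A sketch, assuming the NB/SB databases have rows in their Connection tables (not every deployment configures its listeners through that table):

```shell
# Show each configured listener and its server-side probe interval (ms).
ovn-nbctl --columns=target,inactivity_probe list connection
ovn-sbctl --columns=target,inactivity_probe list connection
```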

2) I don't know how the OVN cluster is managed by your tooling, i.e. whether you use RAFT active/active or a pacemaker-managed active/backup solution, but it sounds like the DBs are loaded enough not to answer monitoring probes in time. For the pacemaker solution you can try changing the monitor interval and timeouts on the master and slave roles. Again, I don't know how it's managed in your environment, but these commands can be handy:

pcs resource update <your_ovn_db_resource> op monitor interval=60s role=Master timeout=180s enabled=true
pcs resource update <your_ovn_db_resource> op monitor interval=120s role=Slave timeout=180s enabled=true

That all said, this looks like a misconfiguration of the cloud leading to a scaling issue rather than a bug in Neutron.

Nguyen Thanh Cong (congnt95) wrote (last edit ):

Hi Jakub,

I increased inactivity_probe in the NB & SB databases, but ovsdb-server still reports:
2022-07-14T02:46:24.519Z|06705|reconnect|ERR|tcp:127.0.0.1:36624: no response to inactivity probe after 5 seconds, disconnecting

I also hit a bug with a VLAN network:
- A compute node has no VMs on that VLAN network
- I create a VM on that compute attached to the VLAN network
- The VM has no connectivity on that network
- After 1 or 2 hours, it gains connectivity

Debugging:
- ovsdb-server still reports disconnects after 5 seconds
- In Open vSwitch, I cannot see port patch-provnet-xxxxxx-to-br-int in br-ex or port patch-br-int-to-provnet-xxxx in br-int
- `ovs-ofctl dump-flows br-int table=0 | grep vlan` shows no flows for this VLAN for 1 or 2 hours

==> I think OVS cannot process the flows within 5 s, so it cannot add the patch ports to br-ex and br-int
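The missing patch ports can be checked directly on the chassis. A sketch; the 'provnet' fragment in the grep patterns matches the truncated provider-network ids quoted above, and br-ex is assumed to be the provider bridge as in this deployment:

```shell
# ovn-controller creates a patch-port pair for each provider (VLAN) network;
# both ends should exist once the network is wired up on this chassis.
ovs-vsctl list-ports br-ex  | grep patch-provnet
ovs-vsctl list-ports br-int | grep patch-br-int-to-provnet

# Flows for the VLAN in table 0 of the integration bridge.
ovs-ofctl dump-flows br-int table=0 | grep vlan
```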

Note: I use OpenStack Victoria and OVN 20.03, deployed by kolla-ansible
