Hi @bcafarel, I checked all raft-related commits between 2.16.0 and 2.17.0, and commit bf07cc9 looks suspicious.
$ git cherry v2.16.0 v2.17.0 -v |grep raft
+ 0de882954032aa37dc943bafd72c33324aa0c95a raft: Don't keep full json objects in memory if no longer needed.
+ bf07cc9cdb2f37fede8c0363937f1eb9f4cfd730 raft: Only allow followers to snapshot.
However, after checking the data, it does not seem to be caused by that commit.
(old leader, 6-lxd-24, 0f3c) - sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log
2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
2022-08-18T17:52:53.367Z|82383|raft|INFO|rejected append_reply (not leader)
2022-08-18T17:52:53.378Z|82384|raft|INFO|server 9b15 is leader for term 297
(follower, 5-lxd-23: f801) - sosreport-juju-2752e1-5-lxd-23-xxx-2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log
2022-08-18T17:52:53.379Z|32327|raft|INFO|server 9b15 is leader for term 297
(new leader, 7-lxd-27: 9b15)
$ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'received leadership transfer' -A2 |tail -n3
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92893|raft|INFO|received leadership transfer from 0f3c in term 296
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92894|raft|INFO|term 297: starting election
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.378Z|92895|raft|INFO|term 297: elected leader by 2+ of 3 servers
We see that the leader-to-be (7-lxd-27) receives the leadership transfer, initiates the election, and immediately afterwards starts a snapshot that takes up to 0.353 seconds (17:52:53.378 - 17:52:53.025). During this time, the follower (5-lxd-23) votes for 7-lxd-27, electing it as cluster leader, but 7-lxd-27 doesn't effectively become leader until it finishes snapshotting, which essentially keeps the cluster without a leader for up to 0.353 seconds. So the snapshot itself actually takes less than 0.353 seconds, which also means that the data to be compressed in the snapshot is not large.
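The 0.353 second figure can be reproduced from the two log lines on 7-lxd-27. The following is only a minimal sketch for checking the arithmetic: the helper name `parse_ovs_log_time` is made up for illustration, and it assumes the timestamp is always the field before the first `|`, as in the log lines quoted above.

```python
from datetime import datetime


def parse_ovs_log_time(line: str) -> datetime:
    """Parse the leading timestamp of a log line (hypothetical helper).

    Assumes the format seen in the quoted logs:
    2022-08-18T17:52:53.025Z|<seq>|<module>|<level>|<message>
    """
    stamp = line.split("|", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


# The two lines from the new leader's ovsdb-server-sb.log quoted above.
start = "2022-08-18T17:52:53.025Z|92894|raft|INFO|term 297: starting election"
end = "2022-08-18T17:52:53.378Z|92895|raft|INFO|term 297: elected leader by 2+ of 3 servers"

# Time between starting the election and being reported as elected leader.
gap = (parse_ovs_log_time(end) - parse_ovs_log_time(start)).total_seconds()
print(f"{gap:.3f}")  # prints 0.353
```

This only measures the distance between the two log entries, so it is an upper bound on the snapshot duration, not a direct measurement of it.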
For reference, the cluster members and their roles in term 297:
5-lxd-23: f801 100.94.0.99:6644 follower term=297
6-lxd-24: 0f3c 100.94.0.158:6644 follower term=297
7-lxd-27: 9b15 100.94.0.204:6644 leader term=297 leader