Comment 4 for bug 1996594

Hua Zhang (zhhuabj) wrote :

Hi @bcafarel, I checked all raft-related commits between 2.16.0 and 2.17.0; commit bf07cc9 looks suspicious.

$ git cherry v2.16.0 v2.17.0 -v |grep raft
+ 0de882954032aa37dc943bafd72c33324aa0c95a raft: Don't keep full json objects in memory if no longer needed.
+ bf07cc9cdb2f37fede8c0363937f1eb9f4cfd730 raft: Only allow followers to snapshot.

However, after checking the data, that commit does not seem to be the cause.

5-lxd-23: f801 100.94.0.99:6644 follower term=297
6-lxd-24: 0f3c 100.94.0.158:6644 follower term=297
7-lxd-27: 9b15 100.94.0.204:6644 leader term=297 leader

(old leader, 6-lxd-24, 0f3c) - sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log
2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
2022-08-18T17:52:53.367Z|82383|raft|INFO|rejected append_reply (not leader)
2022-08-18T17:52:53.378Z|82384|raft|INFO|server 9b15 is leader for term 297

(follower, 5-lxd-23: f801) - sosreport-juju-2752e1-5-lxd-23-xxx-2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log
2022-08-18T17:52:53.379Z|32327|raft|INFO|server 9b15 is leader for term 297

(new leader, 7-lxd-27: 9b15)
$ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'received leadership transfer' -A2 |tail -n3
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92893|raft|INFO|received leadership transfer from 0f3c in term 296
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92894|raft|INFO|term 297: starting election
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.378Z|92895|raft|INFO|term 297: elected leader by 2+ of 3 servers

We see that the new leader-to-be (7-lxd-27) receives the leadership transfer, initiates the election, and immediately afterwards starts a snapshot that takes 0.353 seconds (17:52:53.378 - 17:52:53.025). During this time, the follower (5-lxd-23) votes for 7-lxd-27, electing it as cluster leader, but 7-lxd-27 doesn't effectively become leader until it finishes snapshotting, essentially leaving the cluster without a leader for up to 0.353 seconds. So this snapshot actually took less than 0.353 seconds, which also means that the data to be compressed in the snapshot is not large.
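
For reference, the 0.353 second gap can be recomputed from the two log lines on 7-lxd-27 quoted above. This is only a minimal sketch: the helper name parse_ovs_log_line is made up for illustration, and it assumes the log lines keep the "timestamp|seq|module|level|message" layout seen in these excerpts.

```python
from datetime import datetime, timedelta


def parse_ovs_log_line(line):
    """Split a 'timestamp|seq|module|level|message' log line (hypothetical helper)."""
    timestamp, _seq, module, level, message = line.split("|", 4)
    # Timestamps in the excerpts look like 2022-08-18T17:52:53.025Z (UTC, milliseconds).
    when = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return when, module, level, message


# The two lines from the ovsdb-server-sb.log of 7-lxd-27 quoted in this comment.
transfer = ("2022-08-18T17:52:53.025Z|92893|raft|INFO|"
            "received leadership transfer from 0f3c in term 296")
elected = ("2022-08-18T17:52:53.378Z|92895|raft|INFO|"
           "term 297: elected leader by 2+ of 3 servers")

# Time between receiving the transfer and being elected, in whole milliseconds.
gap = parse_ovs_log_line(elected)[0] - parse_ovs_log_line(transfer)[0]
gap_ms = gap // timedelta(milliseconds=1)
print(f"leaderless window: at most {gap_ms} ms")
```

The same subtraction could be applied to any other pair of raft log lines to bound how long a leadership transfer left the cluster without a leader.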