Comment 5 for bug 2016002

DUFOUR Olivier (odufourc) wrote :

This is something I have been able to reproduce fairly frequently in my lab when testing OVN charm upgrades on a hyper-converged deployment.
In one case, I even had two ovn-central charm upgrades stuck until I manually rebooted the units.

My environment is:
* Ubuntu 20.04 (Focal)
* OpenStack Ussuri
* OVN charm upgraded from 20.03/stable to 22.03/stable
* Juju 2.9.42
* the election timer was also set to 30 seconds before upgrading the charm

I'm guessing this happens because:
* when upgrading the charm, it takes a while for pip to build all the dependencies from the wheelhouse (a few minutes on my servers)
* after that, the charm runs some hooks (coordinator-relation-changed) which need to communicate with the Southbound database
* (this part is a guess on my side) since the charm upgrade can take some time to complete, the unit falsely believes it is hosting the leader of the Southbound database, and without checking first it launches an ovs-appctl command that never times out
On this point, I noticed in juju status that two units frequently report they are hosting the leader for ovnsb_db.
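The first check above could be done by parsing the `cluster/status` output of the Southbound ovsdb-server before issuing any other command. A minimal sketch of the parsing side, assuming the output contains a `Role:` line as in current OVN releases (the helper name `is_raft_leader` is mine, not the charm's):

```python
# Sketch: decide whether this unit is the Raft leader from the text that
# `ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound`
# prints. The socket path and exact output format are assumptions here.

def is_raft_leader(status_output: str) -> bool:
    """Return True only if the cluster/status output reports 'Role: leader'."""
    for line in status_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Role":
            return value.strip() == "leader"
    # No Role line at all: assume we are not the leader rather than guessing.
    return False
```

A charm hook could call this before any leader-only database command, instead of trusting possibly stale state from before the upgrade.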

This needs more investigation, but I think the charm should either:
* systematically check that it really is the leader before issuing *any* command to the database
or
* apply a timeout to the command so the unit can never get stuck completely.
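The second mitigation can be sketched with Python's standard subprocess timeout, which is what a charm hook could use so an unanswered appctl call raises instead of hanging the hook forever. The wrapper name `run_ctl` and the 30-second default (matching the election timer above) are my own illustrative choices:

```python
import subprocess

def run_ctl(cmd, timeout=30):
    """Run a control command, raising subprocess.TimeoutExpired instead of
    blocking indefinitely if the daemon never answers."""
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=True
    )
    return result.stdout
```

For example, something like `run_ctl(["ovs-appctl", "-t", "/var/run/ovn/ovnsb_db.ctl", "cluster/status", "OVN_Southbound"])` would then fail fast in the stuck-upgrade scenario instead of leaving the unit blocked until a reboot.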

I'm attaching part of the logs from /var/log/juju/unit-ovn-central-0.log from when this happens.
The last line repeats like this indefinitely until the reboot.