Upgrading ovn-central fails trying to run nb or sb commands on juju leader unit that is not the ovnnb_db or ovnsb_db leader

Bug #2007847 reported by Diko Parvanov
This bug affects 1 person
Affects            Status         Importance  Assigned to    Milestone
charm-ovn-central  Fix Committed  Undecided   Martin Kalcok
22.03              Fix Released   Undecided   Martin Kalcok
22.09              Fix Released   Undecided   Martin Kalcok
23.03              Fix Released   Undecided   Martin Kalcok

Bug Description

In an environment, upgrading ovn-central from charmstore revision 16 to charmhub channel 22.03/stable (rev 57) fails.

The Juju leader unit (ovn-central/1) is trying to run an ovn-sbctl command on itself, but it does not hold the nb or sb database leadership:

ovn-central/0 active idle 3/lxd/18 10.11.2.147 6641/tcp,6642/tcp Unit is ready (leader: ovnnb_db, ovnsb_db)
ovn-central/1* error idle 4/lxd/18 10.11.2.173 6641/tcp,6642/tcp hook failed: "config-changed"
ovn-central/2 active idle 5/lxd/19 10.11.2.79 6641/tcp,6642/tcp Unit is ready
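
The mismatch visible in the status lines above can be checked mechanically. The following is a minimal sketch, not part of the charm: it assumes the unit name is the first column, that a trailing `*` marks the Juju leader, and that the workload message ends with `(leader: ...)` on the unit holding the OVN database leaderships, as in the output shown here. The function names are hypothetical.

```python
import re

# Matches the tail of a workload message such as "(leader: ovnnb_db, ovnsb_db)".
LEADER_RE = re.compile(r"\(leader: ([^)]*)\)")


def parse_status_line(line):
    """Return (unit, is_juju_leader, db_leaderships) for one unit status line."""
    unit = line.split()[0]
    is_juju_leader = unit.endswith("*")  # juju status marks the leader with '*'
    match = LEADER_RE.search(line)
    dbs = set()
    if match:
        dbs = {name.strip() for name in match.group(1).split(",")}
    return unit.rstrip("*"), is_juju_leader, dbs


def leadership_mismatch(lines):
    """Return the DB names whose leader is reported on a non-Juju-leader unit."""
    juju_leader_dbs = set()
    all_dbs = set()
    for line in lines:
        _, is_juju_leader, dbs = parse_status_line(line)
        all_dbs |= dbs
        if is_juju_leader:
            juju_leader_dbs |= dbs
    return sorted(all_dbs - juju_leader_dbs)


# The three unit lines from this report.
STATUS = [
    "ovn-central/0 active idle 3/lxd/18 10.11.2.147 6641/tcp,6642/tcp "
    "Unit is ready (leader: ovnnb_db, ovnsb_db)",
    'ovn-central/1* error idle 4/lxd/18 10.11.2.173 6641/tcp,6642/tcp '
    'hook failed: "config-changed"',
    "ovn-central/2 active idle 5/lxd/19 10.11.2.79 6641/tcp,6642/tcp "
    "Unit is ready",
]

print(leadership_mismatch(STATUS))  # ['ovnnb_db', 'ovnsb_db']
```

A non-empty result means the Juju leader and the OVN database leaders are on different units, which is the situation in which the hook failed here.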

unit-ovn-central-1: 11:50:37 WARNING unit.ovn-central/1.config-changed Removed /etc/systemd/system/ovn-central.service.
unit-ovn-central-1: 11:50:37 WARNING unit.ovn-central/1.config-changed Removed /etc/systemd/system/ovn-ovsdb-server-nb.service.
unit-ovn-central-1: 11:50:38 WARNING unit.ovn-central/1.config-changed Removed /etc/systemd/system/ovn-ovsdb-server-sb.service.
unit-ovn-central-1: 11:50:38 WARNING unit.ovn-central/1.config-changed 2023-02-20T11:50:38Z|00001|jsonrpc|WARN|unix:/var/run/ovn/ovnsb_db.sock: receive error: Connection reset by peer
unit-ovn-central-1: 11:50:38 WARNING unit.ovn-central/1.config-changed 2023-02-20T11:50:38Z|00002|reconnect|WARN|unix:/var/run/ovn/ovnsb_db.sock: connection dropped (Connection reset by peer)
unit-ovn-central-1: 11:50:38 WARNING unit.ovn-central/1.config-changed ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed (Connection reset by peer)
unit-ovn-central-1: 11:50:38 ERROR unit.ovn-central/1.juju-log Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-ovn-central-1/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-ovn-central-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-ovn-central-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-ovn-central-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-ovn-central-1/charm/reactive/ovn_central_handlers.py", line 261, in render
    ovn_charm.configure_ovn(
  File "/var/lib/juju/agents/unit-ovn-central-1/charm/lib/charm/openstack/ovn_central.py", line 692, in configure_ovn
    self.configure_ovn_listener(
  File "/var/lib/juju/agents/unit-ovn-central-1/charm/lib/charm/openstack/ovn_central.py", line 552, in configure_ovn_listener
    for connection in connections.find(
  File "/var/lib/juju/agents/unit-ovn-central-1/.venv/lib/python3.8/site-packages/charmhelpers/contrib/network/ovs/ovsdb.py", line 230, in _find_tbl
    output = utils._run(*cmd)
  File "/var/lib/juju/agents/unit-ovn-central-1/.venv/lib/python3.8/site-packages/charmhelpers/contrib/network/ovs/utils.py", line 26, in _run
    return subprocess.check_output(args, universal_newlines=True)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('ovn-sbctl', '-f', 'json', 'find', 'connection', 'target="pssl:16642"')' returned non-zero exit status 1.
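
The traceback shows `subprocess.check_output` raising as soon as `ovn-sbctl` exits non-zero while the database socket is being reset. As an illustration only, and not what the charm or charmhelpers does, the sketch below wraps such a call in a small retry loop. `run_with_retry` and its parameters are hypothetical, and whether a retry would have been enough in this case is an assumption.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=1.0, runner=subprocess.check_output):
    """Run cmd, retrying on a non-zero exit status.

    `runner` is injectable so the helper can be exercised without OVN
    installed. The last CalledProcessError is re-raised if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return runner(cmd, universal_newlines=True)
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            time.sleep(delay)  # give the ovsdb-server time to come back


# The command from the traceback above, expressed as an argument list.
FIND_CONNECTION = [
    "ovn-sbctl", "-f", "json", "find", "connection", 'target="pssl:16642"',
]
# Example call (requires a running OVN southbound database):
# output = run_with_retry(FIND_CONNECTION, attempts=5, delay=2.0)
```

A retry only papers over a transient restart; it does not help if the local database stays down for the whole hook run.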

Diko Parvanov (dparv) wrote :

This was resolved by pausing one unit and restarting the nb and sb services on the other two until Raft placed the nb and sb leaders on the Juju leader.
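
When applying a workaround like this, it helps to confirm which role the local database server holds after each restart. The sketch below assumes the clustered ovsdb-server reports a `Role:` line in its `cluster/status` output and that the control sockets live next to the database sockets under `/var/run/ovn`; the `.ctl` paths in the comment and the helper names are assumptions, not taken from this report.

```python
import re
import subprocess


def parse_role(cluster_status_text):
    """Extract the value of the 'Role:' line from cluster/status output."""
    match = re.search(r"^Role:\s*(\S+)", cluster_status_text, re.MULTILINE)
    return match.group(1) if match else None


def local_role(ctl_path, db_name):
    """Ask the local ovsdb-server for its role in the given clustered DB."""
    output = subprocess.check_output(
        ["ovn-appctl", "-t", ctl_path, "cluster/status", db_name],
        universal_newlines=True,
    )
    return parse_role(output)


# Example calls on an ovn-central unit (paths are assumed):
# local_role("/var/run/ovn/ovnsb_db.ctl", "OVN_Southbound")
# local_role("/var/run/ovn/ovnnb_db.ctl", "OVN_Northbound")
```

Running both checks on the Juju leader unit after each restart shows whether it has also become the leader for both databases.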

description: updated
Steven Parker (sbparke)
Changed in charm-ovn-central:
status: New → Incomplete
Diko Parvanov (dparv)
Changed in charm-ovn-central:
status: Incomplete → New
Changed in charm-ovn-central:
assignee: nobody → Martin Kalcok (martin-kalcok)
status: New → In Progress
Martin Kalcok (martin-kalcok) wrote :

The root of this issue was that the charm upgrade inadvertently triggered an upgrade of the OVN packages, which caused service restarts and an unexpected leadership change.

PR to better fence against OVN upgrades during charm upgrade: https://review.opendev.org/c/x/charm-ovn-central/+/888289

(This will require backporting to 22.03/stable after merge to master)

Martin Kalcok (martin-kalcok) wrote :

My previous statement about the inadvertent OVN upgrades was wrong. The package upgrade is expected, as a new charm release usually carries an updated `source` config option (which is expected to trigger a package upgrade).

The fact remains that when an OVN package upgrade happens during the charm-upgrade process, some of the OVN services don't survive.

I'm investigating why the OVN services are getting killed off.

Martin Kalcok (martin-kalcok) wrote :

A new PR has been opened that avoids this problem by not masking OVN services during charm upgrades: https://review.opendev.org/c/x/charm-ovn-central/+/888761

Frode Nordahl (fnordahl)
Changed in charm-ovn-central:
status: In Progress → Fix Committed
Martin Kalcok (martin-kalcok) wrote :

The fix was backported to:

* 22.03
* 22.09
* 23.03
