Bug #2016002 “charm upgrade fron 20.03 to 22.03 never completes” : Bugs : charm-ovn-central

Revision history for this message

Edward Hope-Morley (hopem) wrote on 2023-04-12:

#1

SB cluster membership seems ok:

# ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
fb8c
Name: OVN_Southbound
Cluster ID: 9625 (96251ba3-e94f-418a-871a-3f09aa9af559)
Server ID: fb8c (fb8cd694-fcac-4d73-9658-52de4b929680)
Address: ssl:10.5.2.11:6644
Status: cluster member
Role: follower
Term: 6
Leader: 1fcc
Vote: unknown

Election timer: 4000
Log: [2, 106]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->3f54 ->1fcc <-3f54 <-1fcc
Servers:
    fb8c (fb8c at ssl:10.5.2.11:6644) (self)
    3f54 (3f54 at ssl:10.5.1.42:6644)
    1fcc (1fcc at ssl:10.5.3.76:6644)

Revision history for this message

Edward Hope-Morley (hopem) wrote on 2023-04-12:

#2

Ok so a node reboot did fix this. I assume that the following means it tried to connect to one of its peers and couldn't so timed out and the service exited:

systemd journal has:

Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00682|raft|INFO|received leadership transfer from 3f54 in term 1
Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00683|raft|INFO|term 2: starting election
Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00684|reconnect|INFO|ssl:10.5.1.42:6644: connection closed by peer
Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00685|raft|INFO|term 2: elected leader by 2+ of 3 servers
Apr 12 11:31:47 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00686|reconnect|INFO|ssl:10.5.1.42:6644: connecting...
Apr 12 11:31:47 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00687|reconnect|INFO|ssl:10.5.1.42:6644: connection attempt failed (Connection refused)
Apr 12 11:31:47 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00688|reconnect|INFO|ssl:10.5.1.42:6644: waiting 2 seconds before reconnect
Apr 12 11:31:48 juju-34a1ff-ovntest-10 systemd[1]: ovn-ovsdb-server-sb.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Apr 12 11:31:49 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00689|reconnect|INFO|ssl:10.5.1.42:6644: connecting...
Apr 12 11:31:49 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00690|reconnect|INFO|ssl:10.5.1.42:6644: connection attempt failed (Connection refused)
Apr 12 11:31:49 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00691|reconnect|INFO|ssl:10.5.1.42:6644: waiting 4 seconds before reconnect
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: ovn-ovsdb-server-sb.service: Found left-over process 45917 (ovsdb-server) in control group while starting unit. Ignoring.
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: Started Open vSwitch database server for OVN Southbound database.
Apr 12 11:31:51 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00001|fatal_signal(urcu1)|WARN|terminating with signal 15 (Terminated)
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: ovn-ovsdb-server-sb.service: Succeeded.
Apr 12 11:55:24 juju-34a1ff-ovntest-10 systemd[1]: Started Open vSwitch database server for OVN Southbound database.

Ok so a node reboot did fix this. I assume that the following means it tried to connect to one of its peers and couldn't so timed out and the service exited:

systemd journal has:

Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00682|raft|INFO|received leadership transfer from 3f54 in term 1
Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00683|raft|INFO|term 2: starting election
Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00684|reconnect|INFO|ssl:10.5.1.42:6644: connection closed by peer
Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00685|raft|INFO|term 2: elected leader by 2+ of 3 servers
Apr 12 11:31:47 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00686|reconnect|INFO|ssl:10.5.1.42:6644: connecting...
Apr 12 11:31:47 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00687|reconnect|INFO|ssl:10.5.1.42:6644: connection attempt failed (Connection refused)
Apr 12 11:31:47 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00688|reconnect|INFO|ssl:10.5.1.42:6644: waiting 2 seconds before reconnect
Apr 12 11:31:48 juju-34a1ff-ovntest-10 systemd[1]: ovn-ovsdb-server-sb.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Apr 12 11:31:49 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00689|reconnect|INFO|ssl:10.5.1.42:6644: connecting...
Apr 12 11:31:49 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00690|reconnect|INFO|ssl:10.5.1.42:6644: connection attempt failed (Connection refused)
Apr 12 11:31:49 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00691|reconnect|INFO|ssl:10.5.1.42:6644: waiting 4 seconds before reconnect
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: ovn-ovsdb-server-sb.service: Found left-over process 45917 (ovsdb-server) in control group while starting unit. Ignoring.
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: Started Open vSwitch database server for OVN Southbound database.
Apr 12 11:31:51 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00001|fatal_signal(urcu1)|WARN|terminating with signal 15 (Terminated)
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: ovn-ovsdb-server-sb.service: Succeeded.
Apr 12 11:55:24 juju-34a1ff-ovntest-10 systemd[1]: Started Open vSwitch database server for OVN Southbound database.

Revision history for this message

Edward Hope-Morley (hopem) wrote on 2023-04-12:

#3

the last lime is where I manually restarted ovn-ovsdb-server-sb

Revision history for this message

Edward Hope-Morley (hopem) wrote on 2023-05-11:

#4

I am still getting this behaviour after a charm upgrade (20.03 to 22.03). journal says:

May 11 10:21:14 juju-5b471f-ovn-upgrade-test-13 ovsdb-server[20703]: ovs|00718|stream_ssl|WARN|SSL_accept: system error (Success)
May 11 10:21:14 juju-5b471f-ovn-upgrade-test-13 ovsdb-server[20703]: ovs|00719|jsonrpc|WARN|ssl:127.0.0.1:51952: receive error: Protocol error
May 11 10:21:14 juju-5b471f-ovn-upgrade-test-13 ovsdb-server[20703]: ovs|00720|reconnect|WARN|ssl:127.0.0.1:51952: connection dropped (Protocol error)
May 11 10:21:28 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Current command vanished from the unit file, execution of the command list won't be resumed.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Found left-over process 20703 (ovsdb-server) in control group while starting unit. Ignoring.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: Started Open vSwitch database server for OVN Southbound database.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Succeeded.

systemd says:

# systemctl status ovn-ovsdb-server-sb
● ovn-ovsdb-server-sb.service - Open vSwitch database server for OVN Southbound database
     Loaded: loaded (/lib/systemd/system/ovn-ovsdb-server-sb.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Thu 2023-05-11 10:21:30 UTC; 31min ago
    Process: 107358 ExecStart=/usr/share/ovn/scripts/ovn-ctl run_sb_ovsdb $OVN_CTL_OPTS (code=exited, status=0/SUCCESS)
    Process: 107503 ExecStop=/usr/share/ovn/scripts/ovn-ctl stop_sb_ovsdb (code=exited, status=0/SUCCESS)
   Main PID: 107358 (code=exited, status=0/SUCCESS)

May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Found left-over process 20703 (ovsdb-server) in control group while starting unit. Ignoring.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: Started Open vSwitch database server for OVN Southbound database.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Succeeded.

A manual restart of ovn-ovsdb-server-sb does resolve the issue.

For what its worth I also have the election-timer set high (30s) in this env.

I am still getting this behaviour after a charm upgrade (20.03 to 22.03). journal says:

May 11 10:21:14 juju-5b471f-ovn-upgrade-test-13 ovsdb-server[20703]: ovs|00718|stream_ssl|WARN|SSL_accept: system error (Success)
May 11 10:21:14 juju-5b471f-ovn-upgrade-test-13 ovsdb-server[20703]: ovs|00719|jsonrpc|WARN|ssl:127.0.0.1:51952: receive error: Protocol error
May 11 10:21:14 juju-5b471f-ovn-upgrade-test-13 ovsdb-server[20703]: ovs|00720|reconnect|WARN|ssl:127.0.0.1:51952: connection dropped (Protocol error)
May 11 10:21:28 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Current command vanished from the unit file, execution of the command list won't be resumed.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Found left-over process 20703 (ovsdb-server) in control group while starting unit. Ignoring.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: Started Open vSwitch database server for OVN Southbound database.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Succeeded.

systemd says:

# systemctl status ovn-ovsdb-server-sb
● ovn-ovsdb-server-sb.service - Open vSwitch database server for OVN Southbound database
     Loaded: loaded (/lib/systemd/system/ovn-ovsdb-server-sb.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Thu 2023-05-11 10:21:30 UTC; 31min ago
    Process: 107358 ExecStart=/usr/share/ovn/scripts/ovn-ctl run_sb_ovsdb $OVN_CTL_OPTS (code=exited, status=0/SUCCESS)
    Process: 107503 ExecStop=/usr/share/ovn/scripts/ovn-ctl stop_sb_ovsdb (code=exited, status=0/SUCCESS)
   Main PID: 107358 (code=exited, status=0/SUCCESS)

May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Found left-over process 20703 (ovsdb-server) in control group while starting unit. Ignoring.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: Started Open vSwitch database server for OVN Southbound database.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Succeeded.

A manual restart of ovn-ovsdb-server-sb does resolve the issue.

For what its worth I also have the election-timer set high (30s) in this env.

Revision history for this message

DUFOUR Olivier (odufourc) wrote on 2023-06-22:

#5

ovn-central-debug.log Edit (19.8 KiB, text/plain)

This is something I have been able to reproduce fairly frequently on my lab when testing charm upgrades of OVN on a hyper-converged deployment.
In one case, I even had 2 ovn-central charm upgrade being stucked until I manually rebooted the units.

My environment is :
* Ubuntu Focal 20.04
* Openstack Ussuri
* OVN charm from 20.03/stable to 22.03/stable
* Juju 2.9.42
* election timer is set to 30 as well before upgrading the charm.

I'm guessing this happens because :
* when upgrading the charm, it takes a while for pip to build all the dependencies from the wheelhouse. (it takes a few minutes on my servers)
* after that the charm will run some hooks (coordinator-relation-changed) which needs to communicate with the Southbound database
* (This is a guess from my side) However since the charm upgrade can take some time to complete, the unit thinks falsely that it is hosting as the leader the Southbound database and doesn't check before launch a command with ovs-appctl which never expires/time-out.
On this part, I notice from juju status, that frequently 2 units would display they are hosting the leader for ovnsb_db.

This needs more investigation but I think the charm should either :
* Check systematically if it is really the leader before issuing *any* command to the database
or
* Have a timeout to the command to prevent the unit from being stuck completely.

I'm attaching a part of the logs from /var/log/juju/unit-ovn-central-0.log when this happens.
The last line would be like this indefinitely until the reboot.

charm-ovn-central

charm upgrade fron 20.03 to 22.03 never completes

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches