charm upgrade from 20.03 to 22.03 never completes

Bug #2016002 reported by Edward Hope-Morley
This bug affects 2 people
Affects: charm-ovn-central
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I deployed from 20.03/stable then upgraded to 22.03/stable and 1/3 units is stuck as follows:

$ juju status ovn-central
Model Controller Cloud/Region Version SLA Timestamp
ovntest hopem stsstack/stsstack 2.9.37 unsupported 11:52:17Z

App Version Status Scale Charm Channel Rev Exposed Message
ovn-central 20.03.2 active 3 ovn-central 22.03/stable 57 no Unit is ready

Unit Workload Agent Machine Public address Ports Message
ovn-central/0 active executing 10 10.5.2.11 6641/tcp,6642/tcp (upgrade-charm) Unit is ready
ovn-central/1 active idle 11 10.5.3.76 6641/tcp,6642/tcp Unit is ready (leader: ovnsb_db)
ovn-central/2* active idle 12 10.5.1.42 6641/tcp,6642/tcp Unit is ready (leader: ovnnb_db)

Machine State Address Inst id Series AZ Message
10 started 10.5.2.11 56f71304-56f7-4f92-bfd8-93c701161665 focal nova ACTIVE
11 started 10.5.3.76 743003bc-bf92-47a8-ad54-05f0a9c171ef focal nova ACTIVE
12 started 10.5.1.42 9e1d3966-d78d-4b81-8a12-a166d44cc1a1 focal nova ACTIVE

ovn-central/0 log shows:

# egrep "ovn-ovsdb-server-(s|n)b" /var/log/juju/unit-ovn-central-0.log
2023-04-11 17:29:23 WARNING unit.ovn-central/0.ovsdb-peer-relation-changed logger.go:60 Failed to start ovn-ovsdb-server-nb.service: Unit ovn-ovsdb-server-nb.service is masked.
2023-04-11 17:29:23 WARNING unit.ovn-central/0.ovsdb-peer-relation-changed logger.go:60 Failed to start ovn-ovsdb-server-sb.service: Unit ovn-ovsdb-server-sb.service is masked.
2023-04-11 17:29:25 WARNING unit.ovn-central/0.ovsdb-peer-relation-changed logger.go:60 Removed /etc/systemd/system/ovn-ovsdb-server-nb.service.
2023-04-11 17:29:25 WARNING unit.ovn-central/0.ovsdb-peer-relation-changed logger.go:60 Created symlink /etc/systemd/system/ovn-nb-ovsdb.service → /lib/systemd/system/ovn-ovsdb-server-nb.service.
2023-04-11 17:29:26 WARNING unit.ovn-central/0.ovsdb-peer-relation-changed logger.go:60 Removed /etc/systemd/system/ovn-ovsdb-server-sb.service.
2023-04-11 17:29:26 WARNING unit.ovn-central/0.ovsdb-peer-relation-changed logger.go:60 Created symlink /etc/systemd/system/ovn-sb-ovsdb.service → /lib/systemd/system/ovn-ovsdb-server-sb.service.
2023-04-12 11:31:49 WARNING unit.ovn-central/0.upgrade-charm logger.go:60 Removed /etc/systemd/system/ovn-ovsdb-server-nb.service.
2023-04-12 11:31:50 WARNING unit.ovn-central/0.upgrade-charm logger.go:60 Removed /etc/systemd/system/ovn-ovsdb-server-sb.service.

And it appears the southbound DB unit is missing:

# systemctl list-units| grep ovn
  ovn-central.service loaded active exited Open Virtual Network central components
  ovn-northd.service loaded active running Open Virtual Network central control daemon
  ovn-ovsdb-server-nb.service loaded active running Open vSwitch database server for OVN Northbound database

It is enabled though:

# systemctl list-unit-files| grep ovn
ovn-central.service enabled enabled
ovn-nb-ovsdb.service enabled enabled
ovn-northd.service static enabled
ovn-ovsdb-server-nb.service enabled enabled
ovn-ovsdb-server-sb.service enabled enabled
ovn-sb-ovsdb.service enabled enabled

A manual restart does bring it back, but ovn-sbctl commands hang. Logs show:

2023-04-12T11:56:40.400Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-sb.log
2023-04-12T11:56:40.426Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.8
2023-04-12T11:56:40.427Z|00003|reconnect|INFO|ssl:10.5.1.42:6644: connecting...
2023-04-12T11:56:40.428Z|00004|reconnect|INFO|ssl:10.5.3.76:6644: connecting...
2023-04-12T11:56:40.437Z|00005|reconnect|INFO|ssl:10.5.1.42:6644: connected
2023-04-12T11:56:40.441Z|00006|reconnect|INFO|ssl:10.5.3.76:6644: connected
2023-04-12T11:56:41.025Z|00007|raft|INFO|server 1fcc is leader for term 6
2023-04-12T11:56:41.042Z|00008|raft|INFO|ssl:10.5.1.42:49886: learned server ID 3f54
2023-04-12T11:56:41.042Z|00009|raft|INFO|ssl:10.5.1.42:49886: learned remote address ssl:10.5.1.42:6644
2023-04-12T11:56:41.050Z|00010|raft|INFO|ssl:10.5.3.76:43938: learned server ID 1fcc
2023-04-12T11:56:41.050Z|00011|raft|INFO|ssl:10.5.3.76:43938: learned remote address ssl:10.5.3.76:6644
2023-04-12T11:56:50.427Z|00012|memory|INFO|9528 kB peak resident set size after 10.0 seconds
2023-04-12T11:56:50.427Z|00013|memory|INFO|cells:3011 monitors:0 raft-connections:4 raft-log:104
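
For reference, the manual restart mentioned above is just the following; the timeout wrapper is only there so the check returns instead of hanging the way a bare ovn-sbctl call does here:

# bring the missing southbound DB unit back
systemctl restart ovn-ovsdb-server-sb

# quick probe that the southbound DB answers; bounded so it cannot hang
timeout 10 ovn-sbctl show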

Edward Hope-Morley (hopem) wrote:

SB cluster membership seems ok:

# ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
fb8c
Name: OVN_Southbound
Cluster ID: 9625 (96251ba3-e94f-418a-871a-3f09aa9af559)
Server ID: fb8c (fb8cd694-fcac-4d73-9658-52de4b929680)
Address: ssl:10.5.2.11:6644
Status: cluster member
Role: follower
Term: 6
Leader: 1fcc
Vote: unknown

Election timer: 4000
Log: [2, 106]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->3f54 ->1fcc <-3f54 <-1fcc
Servers:
    fb8c (fb8c at ssl:10.5.2.11:6644) (self)
    3f54 (3f54 at ssl:10.5.1.42:6644)
    1fcc (1fcc at ssl:10.5.3.76:6644)

Edward Hope-Morley (hopem) wrote:

OK, so a node reboot did fix this. I assume the following means it tried to connect to one of its peers and couldn't, so it timed out and the service exited:

2023-04-12T11:31:47.869Z|00686|reconnect|INFO|ssl:10.5.1.42:6644: connecting...
2023-04-12T11:31:47.869Z|00687|reconnect|INFO|ssl:10.5.1.42:6644: connection attempt failed (Connection refused)
2023-04-12T11:31:47.869Z|00688|reconnect|INFO|ssl:10.5.1.42:6644: waiting 2 seconds before reconnect
2023-04-12T11:31:49.869Z|00689|reconnect|INFO|ssl:10.5.1.42:6644: connecting...
2023-04-12T11:31:49.871Z|00690|reconnect|INFO|ssl:10.5.1.42:6644: connection attempt failed (Connection refused)
2023-04-12T11:31:49.871Z|00691|reconnect|INFO|ssl:10.5.1.42:6644: waiting 4 seconds before reconnect

systemd journal has:

Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00682|raft|INFO|received leadership transfer from 3f54 in term 1
Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00683|raft|INFO|term 2: starting election
Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00684|reconnect|INFO|ssl:10.5.1.42:6644: connection closed by peer
Apr 12 11:31:46 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00685|raft|INFO|term 2: elected leader by 2+ of 3 servers
Apr 12 11:31:47 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00686|reconnect|INFO|ssl:10.5.1.42:6644: connecting...
Apr 12 11:31:47 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00687|reconnect|INFO|ssl:10.5.1.42:6644: connection attempt failed (Connection refused)
Apr 12 11:31:47 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00688|reconnect|INFO|ssl:10.5.1.42:6644: waiting 2 seconds before reconnect
Apr 12 11:31:48 juju-34a1ff-ovntest-10 systemd[1]: ovn-ovsdb-server-sb.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Apr 12 11:31:49 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00689|reconnect|INFO|ssl:10.5.1.42:6644: connecting...
Apr 12 11:31:49 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00690|reconnect|INFO|ssl:10.5.1.42:6644: connection attempt failed (Connection refused)
Apr 12 11:31:49 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00691|reconnect|INFO|ssl:10.5.1.42:6644: waiting 4 seconds before reconnect
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: ovn-ovsdb-server-sb.service: Found left-over process 45917 (ovsdb-server) in control group while starting unit. Ignoring.
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: Started Open vSwitch database server for OVN Southbound database.
Apr 12 11:31:51 juju-34a1ff-ovntest-10 ovsdb-server[45917]: ovs|00001|fatal_signal(urcu1)|WARN|terminating with signal 15 (Terminated)
Apr 12 11:31:51 juju-34a1ff-ovntest-10 systemd[1]: ovn-ovsdb-server-sb.service: Succeeded.
Apr 12 11:55:24 juju-34a1ff-ovntest-10 systemd[1]: Started Open vSwitch database server for OVN Southbound database.

Edward Hope-Morley (hopem) wrote:

The last line is where I manually restarted ovn-ovsdb-server-sb.

Edward Hope-Morley (hopem) wrote:

I am still getting this behaviour after a charm upgrade (20.03 to 22.03). The journal says:

May 11 10:21:14 juju-5b471f-ovn-upgrade-test-13 ovsdb-server[20703]: ovs|00718|stream_ssl|WARN|SSL_accept: system error (Success)
May 11 10:21:14 juju-5b471f-ovn-upgrade-test-13 ovsdb-server[20703]: ovs|00719|jsonrpc|WARN|ssl:127.0.0.1:51952: receive error: Protocol error
May 11 10:21:14 juju-5b471f-ovn-upgrade-test-13 ovsdb-server[20703]: ovs|00720|reconnect|WARN|ssl:127.0.0.1:51952: connection dropped (Protocol error)
May 11 10:21:28 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Current command vanished from the unit file, execution of the command list won't be resumed.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Found left-over process 20703 (ovsdb-server) in control group while starting unit. Ignoring.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: Started Open vSwitch database server for OVN Southbound database.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Succeeded.

systemd says:

# systemctl status ovn-ovsdb-server-sb
● ovn-ovsdb-server-sb.service - Open vSwitch database server for OVN Southbound database
     Loaded: loaded (/lib/systemd/system/ovn-ovsdb-server-sb.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Thu 2023-05-11 10:21:30 UTC; 31min ago
    Process: 107358 ExecStart=/usr/share/ovn/scripts/ovn-ctl run_sb_ovsdb $OVN_CTL_OPTS (code=exited, status=0/SUCCESS)
    Process: 107503 ExecStop=/usr/share/ovn/scripts/ovn-ctl stop_sb_ovsdb (code=exited, status=0/SUCCESS)
   Main PID: 107358 (code=exited, status=0/SUCCESS)

May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Found left-over process 20703 (ovsdb-server) in control group while starting unit. Ignoring.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: Started Open vSwitch database server for OVN Southbound database.
May 11 10:21:30 juju-5b471f-ovn-upgrade-test-13 systemd[1]: ovn-ovsdb-server-sb.service: Succeeded.

A manual restart of ovn-ovsdb-server-sb does resolve the issue.

For what it's worth, I also have the election timer set high (30s) in this env.
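
In case it helps with reproducing, this is roughly how the timer is set and checked in my env; the ovsdb-server-election-timer option name is from memory, and the cluster/status line is the same command shown in an earlier comment:

# raise the raft election timer via the charm (value in seconds);
# option name from memory, double-check against the charm's config
juju config ovn-central ovsdb-server-election-timer=30

# confirm what the servers actually use (reported in milliseconds)
ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | grep 'Election timer'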

DUFOUR Olivier (odufourc) wrote:

This is something I have been able to reproduce fairly frequently in my lab when testing OVN charm upgrades on a hyper-converged deployment.
In one case I even had two ovn-central charm upgrades stuck until I manually rebooted the units.

My environment is:
* Ubuntu Focal 20.04
* OpenStack Ussuri
* OVN charm from 20.03/stable to 22.03/stable
* Juju 2.9.42
* election timer also set to 30s before upgrading the charm

I'm guessing this happens because:
* when upgrading the charm, it takes a while for pip to build all the dependencies from the wheelhouse (a few minutes on my servers);
* after that the charm runs some hooks (coordinator-relation-changed) which need to communicate with the Southbound database;
* (this is a guess on my side) since the charm upgrade can take some time to complete, the unit wrongly believes it is hosting the Southbound database leader, doesn't check before issuing a command with ovs-appctl, and that command never times out.
On this last point, I notice from juju status that frequently two units display that they are hosting the leader for ovnsb_db.

This needs more investigation, but I think the charm should either:
* systematically check whether it really is the leader before issuing *any* command to the database, or
* put a timeout on the command to prevent the unit from getting stuck completely (see the sketch below).
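
As a rough illustration of the second option (and of the leadership check), something along these lines, sketched here in shell with the same ovs-appctl call shown earlier; the 30-second bound and the grep on the Role line are arbitrary choices on my part:

# only treat this unit as the SB leader if its own ovsdb-server says so,
# and bound the call so a wedged server cannot hang the hook forever
if timeout 30 ovs-appctl -t /var/run/ovn/ovnsb_db.ctl \
        cluster/status OVN_Southbound | grep -q '^Role: leader'; then
    echo "unit is the OVN_Southbound raft leader"
else
    echo "not leader, or the status query timed out"
fi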

I'm attaching part of the logs from /var/log/juju/unit-ovn-central-0.log from when this happens.
The last line stays like this indefinitely until the reboot.
