charm hangs waiting for ovsdb

Bug #1911225 reported by Alexander Balderson
Affects: charm-ovn-chassis
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

While testing OpenStack Charmers Next, there was an instance of the OVN chassis charm hanging while waiting for ovsdb, reporting that the relation was incomplete.

The logs show that the database is connected and that the charm then goes on to run a series of hooks. Here is a snippet of output near the end of the log:

2021-01-12 11:16:53 DEBUG jujuc server.go:211 running hook tool "juju-log" for octavia-ovn-chassis/1-ovsdb-relation-changed-4797203550333522695
2021-01-12 11:16:53 INFO juju-log ovsdb:196: CompletedProcess(args=('ovs-vsctl', 'set', 'open', '.', 'external-ids:ovn-remote=ssl:10.246.64.207:6642,ssl:10.246.64.233:6642,ssl:10.246.64.226:6642'), returncode=0, stdout='')
2021-01-12 11:16:53 DEBUG jujuc server.go:211 running hook tool "juju-log" for octavia-ovn-chassis/1-ovsdb-relation-changed-4797203550333522695
2021-01-12 11:16:53 DEBUG jujuc server.go:211 running hook tool "juju-log" for octavia-ovn-chassis/1-ovsdb-relation-changed-4797203550333522695
2021-01-12 11:16:53 DEBUG juju-log ovsdb:196: Writing file /etc/openvswitch/system-id.conf root:root 640
2021-01-12 11:16:53 DEBUG jujuc server.go:211 running hook tool "juju-log" for octavia-ovn-chassis/1-ovsdb-relation-changed-4797203550333522695
2021-01-12 11:16:53 DEBUG juju-log ovsdb:196: tracer: set flag config.rendered
2021-01-12 11:16:53 DEBUG jujuc server.go:211 running hook tool "juju-log" for octavia-ovn-chassis/1-ovsdb-relation-changed-4797203550333522695
2021-01-12 11:16:53 INFO juju-log ovsdb:196: Invoking reactive handler: reactive/ovn_chassis_charm_handlers.py:76:configure_bridges
2021-01-12 11:16:53 DEBUG jujuc server.go:211 running hook tool "juju-log" for octavia-ovn-chassis/1-ovsdb-relation-changed-4797203550333522695
2021-01-12 11:16:53 DEBUG juju-log ovsdb:196: BridgePortInterfaceMap: "dict_items([])"
2021-01-12 11:16:53 DEBUG jujuc server.go:211 running hook tool "juju-log" for octavia-ovn-chassis/1-ovsdb-relation-changed-4797203550333522695
2021-01-12 11:16:53 INFO juju-log ovsdb:196: Creating bridge br-int
2021-01-12 11:16:54 WARNING ovsdb-relation-changed 2021-01-12T11:16:54Z|00002|jsonrpc|WARN|unix:/var/run/openvswitch/db.sock: receive error: Connection reset by peer
2021-01-12 11:16:54 WARNING ovsdb-relation-changed 2021-01-12T11:16:54Z|00003|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection dropped (Connection reset by peer)
2021-01-12 11:17:26 DEBUG juju.worker.uniter.remotestate watcher.go:636 update status timer triggered for octavia-ovn-chassis/1

From here, the unit spends 4 hours running the update-status hook without ever realizing that the database is connected.

Here is the version from juju status:
octavia-ovn-chassis 20.03.1 waiting 3 ovn-chassis jujucharms 46 ubuntu

The crashdump can be found at: https://oil-jenkins.canonical.com/artifacts/796144c4-9c22-4325-b0d1-a6527d730069/generated/generated/openstack/juju-crashdump-openstack-2021-01-12-15.20.48.tar.gz

and the testrun is available at https://solutions.qa.canonical.com/testruns/testRun/796144c4-9c22-4325-b0d1-a6527d730069

Frode Nordahl (fnordahl) wrote:

Looking at the charm log I see:
2021-01-12 11:16:54 WARNING ovsdb-relation-changed 2021-01-12T11:16:54Z|00002|jsonrpc|WARN|unix:/var/run/openvswitch/db.sock: receive error: Connection reset by peer
2021-01-12 11:16:54 WARNING ovsdb-relation-changed 2021-01-12T11:16:54Z|00003|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection dropped (Connection reset by peer)

And looking at the Open vSwitch switch log:
2021-01-12T11:16:58.809Z|00002|daemon_unix|ERR|fork child died before signaling startup (killed (Aborted), core dumped)
2021-01-12T11:16:58.809Z|00003|daemon_unix|EMER|could not detach from foreground session

I also see that this is a case of the charm running in a container, so this is a duplicate of bug 1906280.

Jeffrey Chang (modern911) wrote:

Solutions QA has still seen occurrences of this issue since April 2022 (https://solutions.qa.canonical.com/bugs/bugs/bug/1911225), while bug 1906280 was fixed and released in 2021.

This will need more troubleshooting.

Billy Olsen (billy-olsen) wrote:

@modern911 - hmm, the symptoms are similar here, but there was distinct evidence in the previous version of this report and in comment #1 indicating that this was clearly bug 1906280. That problem was introduced by a change in systemd behavior whereby mlock was no longer able to allocate the required memory. It caused the openvswitch-vswitchd process to die and fail to start up due to limitations within the container.
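
For context, here is a minimal, hypothetical sketch of the container-aware mitigation that the bug 1906280 fix belongs to. --no-mlockall is a real ovs-ctl option and is_container() is a real charm-helpers call, but the exact way the charm wires this up is an assumption here, not the shipped fix:

from charmhelpers.core.host import is_container

def render_ovs_defaults(path='/etc/default/openvswitch-switch'):
    """Write OVS defaults, skipping mlockall() inside containers."""
    opts = []
    if is_container():
        # Under newer systemd, RLIMIT_MEMLOCK inside containers is too
        # small for ovs-vswitchd's mlockall(), so the daemon dies at
        # startup ("fork child died before signaling startup" above).
        opts.append('--no-mlockall')
    with open(path, 'w') as f:
        f.write('OVS_CTL_OPTS="{}"\n'.format(' '.join(opts)))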

However, the data you've provided presents similar symptoms (the 'ovsdb' relation reported incomplete) but an altogether different problem. This is quite unfortunate, as we will likely see the same symptom in other scenarios as well, whenever the <relation-name> relation is not considered complete.

For this bug, we should spin out a new bug rather than reuse this one, since the initial data provided with this report has positively been identified as a duplicate. @modern911, if you could please raise a new bug, that would be extremely helpful.

As for the data analysis: the single unit ovn-chassis/2 is waiting on the ovsdb.available flag to be set for the reactive handlers, but only the ovsdb.connected flag is set. On a cursory glance, this indicates that the expected number of remote units is not yet present [0][1][2], which depends on Juju's goal state returning the correct target. This appears to work on the other ovn-chassis units, but something about this particular unit prevents it from seeing the right number of units; a sketch of the gating logic follows the links below. Unfortunately, logging (FFDC) is sparse in this particular code path.

[0] - https://opendev.org/x/charm-interface-ovsdb/src/branch/master/src/ovsdb/requires.py#L42
[1] - https://opendev.org/x/charm-interface-ovsdb/src/branch/master/src/lib/ovsdb.py#L150
[2] - https://github.com/juju/charm-helpers/blob/master/charmhelpers/core/hookenv.py#L594
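
To make the gating concrete, here is a minimal, hypothetical sketch of how a reactive endpoint can gate its '.available' flag on Juju's goal state, loosely modeled on the code linked above. Endpoint, set_flag/clear_flag, and hookenv.expected_related_units are real charms.reactive / charm-helpers APIs; the class and method names are illustrative assumptions, not the exact charm-interface-ovsdb implementation:

from charms.reactive import Endpoint, set_flag, clear_flag
from charmhelpers.core import hookenv

class OVSDBRequires(Endpoint):
    """Illustrative requires-side endpoint; not the shipped interface."""

    def expected_units_present(self):
        """True once every unit promised by Juju's goal state has joined."""
        try:
            expected = len(list(
                hookenv.expected_related_units(self.endpoint_name)))
        except NotImplementedError:
            # Goal state is unsupported on very old Juju; fall back to
            # counting whatever has joined so far.
            expected = len(self.all_joined_units)
        return len(self.all_joined_units) >= expected

    def update_flags(self):
        # '.connected' only means the relation exists; '.available' is
        # set once the unit count promised by goal state is reached.
        set_flag(self.expand_name('{endpoint_name}.connected'))
        if self.expected_units_present():
            set_flag(self.expand_name('{endpoint_name}.available'))
        else:
            clear_flag(self.expand_name('{endpoint_name}.available'))

With logic like this, a unit for which goal state never reports the reached target (or reports a stale one) keeps '.connected' set but never gains '.available', leaving the charm in 'waiting' across hours of update-status runs, exactly as observed here.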

Jeffrey Chang (modern911) wrote:

@billy-olsen I totally agree a new bug should be filed here, and Solutions QA should improve how we link known bugs to similar test failures.
