charm hook may hang indefinitely querying payload if payload does not run

Bug #1912820 reported by Alexander Balderson
This bug affects 1 person
Affects: charm-ovn-central
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

On openstack charmers next, the deployment hangs (after relating the LMA stack) waiting for the upgrade-charm hook to finish executing. The hook ran for 4 hours before we stopped the run.

The test run can be found at:
https://solutions.qa.canonical.com/testruns/testRun/8b124e98-942a-4b56-b4ee-17f3499ab61c

bundle at:
https://oil-jenkins.canonical.com/artifacts/8b124e98-942a-4b56-b4ee-17f3499ab61c/generated/generated/openstack/bundle.yaml

and crashdump at:
https://oil-jenkins.canonical.com/artifacts/8b124e98-942a-4b56-b4ee-17f3499ab61c/generated/generated/lmacmr/juju-crashdump-openstack-2021-01-22-16.26.57.tar.gz

Marking this as a release blocker

Changed in charm-ovn-central:
status: New → Triaged
assignee: nobody → Alex Kavanagh (ajkavanagh)
Alex Kavanagh (ajkavanagh) wrote :

So it looks like the ovs-sbctl command hung in the charm. Looking at the syslog for ovn-central/0 it looks like the sb process was offline when the command was run and this may have contributed to the hang. Bug https://bugzilla.redhat.com/show_bug.cgi?id=1622051 might be what is responsible.

Does this happen on every run?
Do you know what triggered the upgrade-charm hook (was it a resource addition)?

I don't think the title is correct, as the line below appears in the log for the unit. It was in the upgrade-charm hook, but had finished 'upgrading the charm'; it hung in the render handler.

2021-01-22 12:18:52 INFO juju-log Invoking reactive handler: reactive/ovn_central_handlers.py:157:render

My suspicion is:

- Upgrading the payload resulted in the ovn-ovsdb-server-sb service dying or being stopped.
- The ovs-sbctl command hung waiting to connect to the service (which wasn't running).
- I don't know why the ovn-ovsdb-server-sb process didn't come back (it may have, and the ovs-sbctl command may just have hung).

I wonder if the charm should time out the command and retry it a few times instead of relying on it working the first time?
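
The timeout-and-retry idea could be sketched roughly as follows. This is only an illustration, not the charm's actual code: the helper name `run_with_timeout_and_retry` and the default timeout, attempt count, and delay are all hypothetical, and the command to run is passed in by the caller.

```python
import subprocess
import time


def run_with_timeout_and_retry(cmd, timeout=10, attempts=3, delay=5):
    """Run cmd, giving up on each try after `timeout` seconds.

    Hypothetical helper: name and defaults are assumptions for illustration.
    Returns the command's stdout on success; re-raises the last error
    (timeout or non-zero exit) once all attempts are used up.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child process if the timeout expires,
            # so a command stuck waiting on a dead service cannot block the
            # hook forever.
            result = subprocess.run(
                cmd,
                check=True,
                timeout=timeout,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                universal_newlines=True,
            )
            return result.stdout
        except (subprocess.TimeoutExpired,
                subprocess.CalledProcessError) as e:
            last_error = e
            if attempt < attempts:
                # Give the service a chance to come back before retrying.
                time.sleep(delay)
    raise last_error
```

With something like this, the render handler would get an exception after a bounded time (at most roughly `attempts * timeout` plus the delays) and could defer the work to a later hook instead of hanging for hours.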

Alex Kavanagh (ajkavanagh) wrote :

Also, does this happen on every run, or is it a one-off or race condition?

Changed in charm-ovn-central:
status: Triaged → Incomplete
Alexander Balderson (asbalderson) wrote :

This doesn't happen on every run, but it has happened more than once.
You can see all the times it has happened here:
https://solutions.qa.canonical.com/bugs/bugs/bug/1912820

Currently, the deployment process in which we came across this issue is:
- Deploy openstack and bring it fully up with charm actions etc.
- Deploy lma in a separate model.
- Relate the lma charms to openstack by redeploying openstack with saas relations to the lma bits (telegraf, filebeat, landscape client).

I also noticed that 2 of the cases where it happened are on the stable charms.
With some further digging, it looks like we're missing the version pinning during the lma re-deploy, so the charm is upgrading from revision 2 to revision 3.

Also removing the release blocker tag, since upgrade isn't part of the release gate.

tags: removed: cdo-release-blocker
Changed in charm-ovn-central:
status: Incomplete → New
tags: added: charm-upgrade
Changed in charm-ovn-central:
assignee: Alex Kavanagh (ajkavanagh) → nobody
status: New → Triaged
importance: Undecided → High
Frode Nordahl (fnordahl)
summary: - Charm never clears upgrade-charm flag and runs forvever
+ charm hook may hang indefinitely querying payload if payload does not
+ run