Application status changes right after juju wait-for timeout

Bug #1992666 reported by Bas de Bruijne
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

In testrun https://solutions.qa.canonical.com/v2/testruns/2c103621-e4fe-455e-afd1-3e2ff05de21a, a `juju wait-for` command fails on a timeout error:

```
2022-10-08-08:43:07 root ERROR [localhost] Command failed: juju wait-for unit -m foundations-maas:openstack --timeout 3600s vault/2 --query 'workload-message=="Vault needs to be initialized" || workload-status == "active"'
2022-10-08-08:43:07 root ERROR [localhost] STDOUT follows:
properties:
  workload-message: configuring Nagios checks
  workload-status: maintenance
```

The weird thing is that if we look at the status, vault/2 is actually in the expected state:
```
vault/2 blocked idle 8 10.246.166.160 8200/tcp Vault needs to be initialized
  canonical-livepatch/1 active idle 10.246.166.160 Running kernel 4.15.0-193.204-generic, patchState: nothing-to-apply (source version/commit dad6199)
  filebeat/1 active idle 10.246.166.160 Filebeat ready.
  hacluster-vault/1 active idle 10.246.166.160 Unit is ready and clustered
  landscape-client/1 maintenance idle 10.246.166.160 Need computer-title and juju-info to proceed
  nrpe/1 active idle 10.246.166.160 icmp,5666/tcp Ready
  ntp/1 active idle 10.246.166.160 123/udp chrony: Ready, OK: offset is 0.000003
  prometheus-grok-exporter/1 active idle 10.246.166.160 9144/tcp Unit is ready
  telegraf/1 active idle 10.246.166.160 9103/tcp Monitoring vault/2 (source version/commit 76901fd)
```

Looking at the status log in the crashdump, vault changes state about 5 seconds before the timeout error is thrown (08:43:02Z in the status log versus 08:43:07 in the error output):
```
08 Oct 2022 07:20:29Z juju-unit executing running certificates-relation-changed hook for aodh/2
08 Oct 2022 07:20:36Z juju-unit executing running certificates-relation-changed hook for keystone/0
08 Oct 2022 07:20:45Z juju-unit executing running certificates-relation-joined hook for glance/2
08 Oct 2022 07:20:52Z juju-unit executing running etcd-relation-changed hook for etcd/2
08 Oct 2022 07:21:01Z juju-unit executing running certificates-relation-changed hook for glance/2
08 Oct 2022 07:21:11Z juju-unit executing running shared-db-relation-changed hook for mysql/1
08 Oct 2022 07:21:20Z juju-unit executing running certificates-relation-changed hook for glance/0
08 Oct 2022 07:21:24Z workload waiting 'shared-db' incomplete
08 Oct 2022 07:21:27Z juju-unit idle
08 Oct 2022 07:21:31Z juju-unit executing running shared-db-relation-changed hook for mysql/0
08 Oct 2022 07:21:36Z workload maintenance configuring Nagios checks
08 Oct 2022 07:22:06Z juju-unit executing running shared-db-relation-changed hook for mysql/1
08 Oct 2022 07:22:14Z juju-unit idle
08 Oct 2022 07:23:22Z juju-unit executing running shared-db-relation-changed hook for mysql/2
08 Oct 2022 07:23:29Z juju-unit idle
08 Oct 2022 07:26:46Z juju-unit executing running certificates-relation-changed hook for cinder/1
08 Oct 2022 07:26:52Z juju-unit idle
08 Oct 2022 07:29:48Z juju-unit executing running certificates-relation-changed hook for keystone/1
08 Oct 2022 07:29:54Z juju-unit idle
08 Oct 2022 08:43:02Z workload blocked Vault needs to be initialized
```

This could be a weird coincidence, except that something very similar happened in testrun https://solutions.qa.canonical.com/testruns/testRun/609b06bc-e9d4-4298-b9c4-74c42a7dfa21, where the same problem shows up on a keystone unit.

Crashdumps and configs can be found here:
https://oil-jenkins.canonical.com/artifacts/2c103621-e4fe-455e-afd1-3e2ff05de21a/index.html

Tags: cdo-qa
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
importance: Medium → High
Joseph Phillips (manadart) wrote :

Looking at strategy.go, it doesn't look right for the run method to fire off a new goroutine that uses time.After.

This should be initiated outside, and the channel passed into the method.

Looks like there are places for racy behaviour to live here.

tags: added: solutions-qa-expired
Changed in juju:
status: Triaged → Invalid
Changed in juju:
status: Invalid → New
tags: added: cdo-qa
removed: solutions-qa-expired
Harry Pidcock (hpidcock)
Changed in juju:
milestone: none → 2.9-next
status: New → Triaged
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9-next → none