unit reports as being in "executing" but there is no jujud or logs

Bug #1891076 reported by Jason Hobbs
This bug affects 2 people
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

This is on focal with jujud 2.8.1.

heat-mysql-router/1 has been stuck with its agent in "executing" and the message "(install) installing charm software" for over 5 hours:

ubuntu@production-cpe-07d59674-95e9-4406-9d05-8f25282eb455:~/project$ juju status heat-mysql-router
Model Controller Cloud/Region Version SLA Timestamp
openstack foundations-maas maas_cloud/default 2.8.1 unsupported 18:54:25Z

App Version Status Scale Charm Store Rev OS Notes
hacluster-heat waiting 0 hacluster jujucharms 143 ubuntu
heat 14.0.0 waiting 3 heat jujucharms 357 ubuntu
heat-mysql-router 8.0.21 waiting 3 mysql-router jujucharms 21 ubuntu
logrotate waiting 0 logrotate-charm jujucharms 5 ubuntu
public-policy-routing waiting 0 advanced-routing jujucharms 5 ubuntu

Unit Workload Agent Machine Public address Ports Message
heat/0 waiting idle 3/lxd/5 10.244.40.226 8000/tcp,8004/tcp Incomplete relations: database, identity
  hacluster-heat/1 waiting allocating 10.244.40.226 agent initializing
  heat-mysql-router/1 maintenance executing 10.244.40.226 (install) installing charm software
  logrotate/42 waiting allocating 10.244.40.226 agent initializing
  public-policy-routing/34 waiting allocating 10.244.40.226 agent initializing
heat/1 waiting idle 4/lxd/11 10.244.41.2 8000/tcp,8004/tcp Incomplete relations: database, identity
  hacluster-heat/2 waiting idle 10.244.41.2 Resource: res_heat_8afed2c_vip not yet configured
  heat-mysql-router/2 waiting idle 10.244.41.2 'db-router' incomplete, MySQL Router not yet bootstrapped
  logrotate/50 active idle 10.244.41.2 Unit is ready.
  public-policy-routing/41 active idle 10.244.41.2 Unit is ready
heat/2* waiting idle 5/lxd/6 10.244.40.205 8000/tcp,8004/tcp Incomplete relations: database, identity
  hacluster-heat/0* waiting idle 10.244.40.205 Resource: res_heat_8afed2c_vip not yet configured
  heat-mysql-router/0* waiting idle 10.244.40.205 'db-router' incomplete, MySQL Router not yet bootstrapped
  logrotate/12 active idle 10.244.40.205 Unit is ready.
  public-policy-routing/3 active idle 10.244.40.205 Unit is ready

Machine State DNS Inst id Series AZ Message
3 started 10.244.40.108 armaldo focal zone1 Deployed
3/lxd/5 started 10.244.40.226 juju-05dab1-3-lxd-5 focal zone1 Container started
4 started 10.244.40.111 spearow focal zone2 Deployed
4/lxd/11 started 10.244.41.2 juju-05dab1-4-lxd-11 focal zone2 Container started
5 started 10.244.40.106 beartic focal zone3 Deployed
5/lxd/6 started 10.244.40.205 juju-05dab1-5-lxd-6 focal zone3 Container started

ubuntu@production-cpe-07d59674-95e9-4406-9d05-8f25282eb455:~/project$ juju show-status-log heat-mysql-router/1
Time Type Status Message
07 Aug 2020 13:06:30Z workload waiting waiting for machine
07 Aug 2020 13:06:30Z juju-unit allocating
07 Aug 2020 13:06:30Z workload waiting installing agent
07 Aug 2020 13:06:33Z workload waiting agent initializing
07 Aug 2020 13:06:46Z workload maintenance installing charm software
07 Aug 2020 13:06:46Z juju-unit executing running install hook
ubuntu@production-cpe-07d59674-95e9-4406-9d05-8f25282eb455:~/project$ date
Mon Aug 10 18:56:18 UTC 2020

However, there is no jujud for it:
ubuntu@juju-05dab1-3-lxd-5:~$ ps auxf | grep jujud
ubuntu 33523 0.0 0.0 8164 660 pts/1 S+ 18:56 0:00 \_ grep --color=auto jujud
root 313 0.0 0.0 8696 3424 ? Ss 14:59 0:00 bash /etc/systemd/system/jujud-machine-3-lxd-5-exec-start.sh
root 347 0.0 0.0 828256 78016 ? SLl 14:59 0:05 \_ /var/lib/juju/tools/machine-3-lxd-5/jujud machine --data-dir /var/lib/juju --machine-id 3/lxd/5 --debug
root 315 0.0 0.0 8696 3432 ? Ss 14:59 0:00 bash /etc/systemd/system/jujud-unit-heat-0-exec-start.sh
root 342 0.0 0.0 829024 80224 ? SLl 14:59 0:13 \_ /var/lib/juju/tools/unit-heat-0/jujud unit --data-dir /var/lib/juju --unit-name heat/0 --debug

And no unit logs either.

Machine log:
http://paste.ubuntu.com/p/378wmfKmGk/

controller crashdump:
http://people.canonical.com/~jhobbs/juju-crashdump-controller-2020-08-10-19.18.04.tar.gz

description: updated
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Is there a crash dump of the openstack model available?

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

We need more information to resolve this.

Has this been reproduced? Output at an increased logging level would be helpful, as would a crash dump from the model, and a tarball of /var/lib/juju and /var/log on the machine whose unit has the issue, if they are not included in the model crash dump.

Changed in juju:
status: New → Triaged
importance: Undecided → High
status: Triaged → Incomplete
Revision history for this message
Pen Gale (pengale) wrote :

New theory about this. It is speculative, but based on troubleshooting we did today, it is possible that the scripts that bootstrap the juju agent on a machine can get into a state where they stall out, without exiting, while downloading the charm. Running juju sosreport seems to unstick things, which might have to do with it restarting some dbus-related services.

We might need more timeouts, or otherwise better error detection, for when something goes wrong with comms in the underlying system and the early stages of downloading and placing the agents and charm run into trouble.
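
A minimal sketch of that idea, assuming a plain HTTP fetch of the charm archive (the real agent bootstrap path is more involved and this function is only illustrative): bound the download with a deadline so a stalled connection surfaces as an error instead of hanging forever.

// Hypothetical illustration in Go: fetch a charm archive with a hard deadline
// so a stalled transport returns an error rather than blocking the installer.
package charmfetch

import (
    "context"
    "fmt"
    "io"
    "net/http"
    "os"
    "time"
)

func fetchCharm(ctx context.Context, url, dest string) error {
    // Bound the whole download; a wedged connection now fails with
    // "context deadline exceeded" instead of stalling indefinitely.
    ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return fmt.Errorf("downloading charm: %w", err)
    }
    defer resp.Body.Close()

    f, err := os.Create(dest)
    if err != nil {
        return err
    }
    defer f.Close()
    _, err = io.Copy(f, resp.Body)
    return err
}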

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Per comment #3, an issue which appears to be the same occurred yesterday with juju 2.6.8. Part of upgrading the software versions running in the model included adding new subordinate units to long-running machines. A few of the new subordinates got stuck with the agent in allocating.

Looking at additional logs, the problem seen yesterday happens before the download stage.

2020-08-25 18:35:30 ERROR juju.worker.dependency engine.go:663 "unit-agent-deployer" manifold worker returned unexpected error: dbus link request failed for application "jujud-unit-docker-9": Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
2020-08-25 18:36:01 ERROR juju.service.systemd service.go:148 dbus link request failed for application "jujud-unit-docker-9": Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
2020-08-25 18:36:01 ERROR juju.service.systemd service.go:460 failed to install service "jujud-unit-docker-9": dbus link request failed for application "jujud-unit-docker-9": Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)

We're trying to link the agent files but it keeps failing on this call:
https://github.com/juju/juju/blob/71ee73f42e5574f14b5c93273adaffb0b29bcc91/service/systemd/service.go#L593
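
For reference, the failing step is essentially a LinkUnitFiles request to systemd over the system bus. A minimal sketch, assuming the go-systemd dbus bindings rather than Juju's actual wrapper, and with the unit file path made up for illustration:

// Hypothetical illustration using the go-systemd bindings. If
// org.freedesktop.systemd1 cannot be activated, dbus gives up after its
// service_start_timeout (25s by default), which is the error in the log
// lines above.
package linkunit

import (
    sd "github.com/coreos/go-systemd/v22/dbus"
)

func linkUnitFile(unitPath string) error {
    conn, err := sd.New() // connect to systemd via the system bus
    if err != nil {
        return err
    }
    defer conn.Close()

    // Roughly equivalent to `systemctl link <unitPath>`.
    _, err = conn.LinkUnitFiles([]string{unitPath}, false, true)
    return err
}

When that call fails, a restart of the worker just repeats the same request and hits the same 25s timeout, which matches the error loop described below.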

The unit-agent-deployer worker is stuck in an error loop, continually restarting.

We also know that dbus got wedged on the machine, which may explain why restarting the jujud services did not resolve the issue but rebooting did.

Interestingly, a script was able to unwedge one of the machines this was seen on, after which the unit agent did install.
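
One way to make the "wedged dbus" case visible quickly would be a bounded liveness probe against the bus before (or alongside) the link request. A sketch assuming the godbus bindings; the helper is hypothetical, not anything Juju currently has:

// Hypothetical probe: ping the dbus daemon itself with a short deadline so a
// wedged bus shows up as an explicit error rather than a silent stall.
package dbuscheck

import (
    "context"
    "fmt"
    "time"

    godbus "github.com/godbus/dbus/v5"
)

func dbusResponsive(timeout time.Duration) error {
    conn, err := godbus.SystemBus()
    if err != nil {
        return fmt.Errorf("connecting to system bus: %w", err)
    }

    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    // org.freedesktop.DBus is owned by the bus daemon itself, so this does
    // not trigger any service activation.
    obj := conn.Object("org.freedesktop.DBus", "/org/freedesktop/DBus")
    if call := obj.CallWithContext(ctx, "org.freedesktop.DBus.Peer.Ping", 0); call.Err != nil {
        return fmt.Errorf("dbus appears wedged: %w", call.Err)
    }
    return nil
}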

Revision history for this message
John A Meinel (jameinel) wrote :

Is this a case where we were looking in the wrong location for the logs? E.g. machine-1 is failing to start unit-foo-1, but we were looking for unit-foo-1.log, which doesn't exist yet, and we didn't check machine-1.log.

We certainly could record errors of this sort against the unit that we are failing to start (so we have more detailed information about the failure than 'allocating').
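
A sketch of that idea, with every name hypothetical (this is not Juju's actual deployer API): when the machine agent fails to install a unit's service, write the failure into that unit's status so juju status shows something more useful than 'allocating'.

// Hypothetical sketch only: statusSetter and SetUnitAgentStatus stand in for
// whatever facade the deployer actually has; the point is where the error goes.
package deployer

import "fmt"

type statusSetter interface {
    SetUnitAgentStatus(unit, status, message string) error
}

func deployUnit(s statusSetter, unit string, install func() error) error {
    if err := install(); err != nil {
        // Record the real failure against the unit instead of leaving it
        // stuck in "allocating" with no explanation.
        msg := fmt.Sprintf("installing unit service: %v", err)
        if serr := s.SetUnitAgentStatus(unit, "error", msg); serr != nil {
            return fmt.Errorf("%v (and failed to record status: %v)", err, serr)
        }
        return err
    }
    return nil
}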

Revision history for this message
John A Meinel (jameinel) wrote :

It does seem more like a case where we just need to raise visibility when we run into problems talking to dbus, rather than one where Juju should be trying to fix dbus directly.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Juju has no control over dbus errors. However, we can surface the error better in juju status.

Changed in juju:
status: Incomplete → Triaged
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot