2.8.1 juju machine agent restarting itself

Bug #1891586 reported by Haw Loeung
This bug affects 1 person
Affects: Canonical Juju | Status: Triaged | Importance: Low | Assigned to: Unassigned | Milestone: none

Bug Description

Hi,

We recently upgraded a set of controllers from 2.7.6 to 2.8.1. On one of them, it looks like the machine agent crashed and restarted itself.

| root 11195 0.0 0.0 22084 3524 ? Ss 05:46 0:00 bash /etc/systemd/system/jujud-machine-1-exec-start.sh
| root 11209 1.5 1.3 825828 113204 ? Sl 05:46 0:12 \_ /var/lib/juju/tools/machine-1/jujud machine --data-dir

Last log lines:

| 2020-08-14 05:45:27 ERROR juju.worker.dependency engine.go:671 "firewaller" manifold worker returned unexpected error: machine 2 not provisioned
| 2020-08-14 05:45:28 ERROR juju.worker.dependency engine.go:671 "firewaller" manifold worker returned unexpected error: cannot respond to units changes for "machine-0": instances not found
| 2020-08-14 05:45:42 ERROR juju.worker.dependency engine.go:671 "environ-tracker" manifold worker returned unexpected error: cannot create environ: authentication failed.: authentication failed
| caused by: requesting token: Unauthorised URL https://keystone.bos01.canonistack.canonical.com:5000/v3/auth/tokens
| caused by: request (https://keystone.bos01.canonistack.canonical.com:5000/v3/auth/tokens) returned unexpected status: 401; error info: Failed: 401 error: The request you have made requires authentication.
| 2020-08-14 05:45:45 ERROR juju.worker.dependency engine.go:671 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: read tcp 10.48.128.113:40964->10.48.128.113:37017: i/o timeout
| 2020-08-14 05:45:45 ERROR juju.apiserver.instancepoller instancepoller.go:174 link layer device merge attempt for machine 4 failed due to error: read tcp 127.0.0.1:43758->127.0.0.1:37017: i/o timeout; waiting until next instance-poller run to retry
| 2020-08-14 05:45:53 ERROR juju.worker.dependency engine.go:671 "environ-tracker" manifold worker returned unexpected error: cannot create environ: authentication failed.: authentication failed
| caused by: requesting token: Unauthorised URL https://keystone.bos01.canonistack.canonical.com:5000/v3/auth/tokens
| caused by: request (https://keystone.bos01.canonistack.canonical.com:5000/v3/auth/tokens) returned unexpected status: 401; error info: Failed: 401 error: The request you have made requires authentication.
| 2020-08-14 05:45:56 ERROR juju.state status.go:431 failed to update status history: read tcp 127.0.0.1:44126->127.0.0.1:37017: i/o timeout
| 2020-08-14 05:46:21 INFO juju.cmd supercommand.go:91 running jujud [2.8.1 0 16439b3d1c528b7a0e019a16c2122ccfcf6aa41f gc go1.14.4]
| 2020-08-14 05:46:21 DEBUG juju.cmd supercommand.go:92 args: []string{"/var/lib/juju/tools/machine-1/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "1", "--debug"}
| 2020-08-14 05:46:21 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 4
| 2020-08-14 05:46:21 DEBUG juju.agent agent.go:583 read agent config, format "2.0"
| 2020-08-14 05:46:21 INFO juju.cmd.jujud agent.go:138 setting logging config to "<root>=WARNING;unit=DEBUG"
| 2020-08-14 05:46:45 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [39f5e5] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
| 2020-08-14 05:46:50 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [39f5e5] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
| 2020-08-14 05:46:55 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [39f5e5] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
| 2020-08-14 05:47:01 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [39f5e5] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
| 2020-08-14 05:47:10 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [39f5e5] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
| 2020-08-14 05:47:19 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [39f5e5] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
| 2020-08-14 05:47:31 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [39f5e5] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
| 2020-08-14 05:47:44 ERROR juju.service.systemd service.go:127 failed to list services for application "juju-db": List failed (/bin/systemctl list-unit-files --no-legend --no-page -l -t service | grep -o -P '^\w[\S]*(?=\.service)'): error executing "/bin/systemctl": Failed to list unit files: Connection timed out;
| 2020-08-14 05:47:44 ERROR juju.worker.dependency engine.go:671 "state" manifold worker returned unexpected error: failed to list services for application "juju-db": List failed (/bin/systemctl list-unit-files --no-legend --no-page -l -t service | grep -o -P '^\w[\S]*(?=\.service)'): error executing "/bin/systemctl": Failed to list unit files: Connection timed out;
| 2020-08-14 05:47:56 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [39f5e5] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
| 2020-08-14 05:48:11 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [39f5e5] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
| 2020-08-14 05:48:33 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [39f5e5] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
| 2020-08-14 05:49:44 ERROR juju.worker.raft.rafttransport streamlayer.go:121 streamLayer.Addr timed out waiting for API address
| 2020-08-14 05:49:44 ERROR juju.worker.dependency engine.go:671 "raft-transport" manifold worker returned unexpected error: timed out waiting for API address
| 2020-08-14 05:49:44 ERROR juju.apiserver.pubsub pubsub.go:104 pubsub receive error: read tcp 10.48.128.113:17070->10.48.128.131:52492: use of closed network connection
| 2020-08-14 05:49:44 ERROR juju.apiserver.pubsub pubsub.go:104 pubsub receive error: read tcp 10.48.128.113:17070->10.48.128.127:52750: use of closed network connection
| 2020-08-14 05:49:59 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [39f5e5] "machine-1" cannot open api: can't reset agent password: updating machine "1": read tcp 10.48.128.113:41626->10.48.128.113:37017: i/o timeout
| 2020-08-14 05:50:44 WARNING juju.worker.httpserver worker.go:214 timeout waiting for apiserver shutdown
| debug info written to /var/log/juju/apiserver-debug.log

apiserver-debug.log available via link below:

| https://private-fileshare.canonical.com/~hloeung/tmp/lmO4x2mI.log

Achilleas Anagnostopoulos (achilleasa) wrote :

This log line is particularly interesting:

2020-08-14 05:47:44 ERROR juju.service.systemd service.go:127 failed to list services for application "juju-db": List failed (/bin/systemctl list-unit-files --no-legend --no-page -l -t service | grep -o -P '^\w[\S]*(?=\.service)'): error executing "/bin/systemctl": Failed to list unit files: Connection timed out;

Is this on a focal machine, by the way? If so, can you also check the journalctl output for any auth-related errors raised by systemd? I've seen errors like this with nested lxd containers running focal (in those cases, the culprit seems to be apparmor).

Haw Loeung (hloeung) wrote :

We saw this on the Canonistack BOS01 JAAS controllers; they're running Bionic. No nested lxd containers.

Running the command above works:

| ubuntu@juju-3e49f3-controller-1:~$ /bin/systemctl list-unit-files --no-legend --no-page -l -t service | grep -o -P '^\w[\S]*(?=\.service)' | wc -l
| 224

Could it be related to load at the time? Is that 'Connection timed out' from systemctl itself? Or is it Juju waiting for output and timing out (Go channels/goroutines or whatever it is)?

Ian Booth (wallyworld) wrote :

That "Failed to list unit files: Connection timed out;" would be coming from systemctl itself trying to run the list-unit-files command. Juju uses that command at controller startup to check that mongo etc is installed, as well as when deploying units.

You could guess that it would be related to load on the machine. The fact that the timeout causes the agent to restart is down to the design of how Juju manages its worker routines: it is considered better to restart if there's an error rather than to try to maintain state and recover. There's perhaps an argument to be made that I/O-timeout-type errors should result in the operation being retried after a backoff. However, that would involve being able to cleanly identify the root-cause error in the hundreds of places where errors can occur, and adding code in all of those places to handle the retry. It's more feasible to do the agent restart, but back off at that point to avoid contributing to the load on the machine that is causing the issue.
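
For illustration only, here is a minimal sketch of that retry-with-backoff alternative, assuming a plain shell-out to the same systemctl command seen in the log above (this is hypothetical and not how Juju currently behaves; Juju restarts the agent instead):

package main

import (
	"fmt"
	"os/exec"
	"time"
)

// listServiceUnits shells out to systemctl and retries with an
// exponential backoff on failure, rather than letting the error
// bubble up and restart the whole agent. Hypothetical sketch only.
func listServiceUnits() ([]byte, error) {
	const attempts = 5
	delay := time.Second
	var lastErr error
	for i := 0; i < attempts; i++ {
		out, err := exec.Command("/bin/systemctl",
			"list-unit-files", "--no-legend", "--no-page", "-l", "-t", "service").Output()
		if err == nil {
			return out, nil
		}
		lastErr = err
		time.Sleep(delay)
		delay *= 2 // back off so retries don't add to the load that caused the timeout
	}
	return nil, fmt.Errorf("listing unit files failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	out, err := listServiceUnits()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("got %d bytes of unit file listing\n", len(out))
}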

John A Meinel (jameinel) wrote :

Interestingly, looking at the actual code, it is shelling out to systemd because of a lack in the API, but the code comment mentions the API that it would like to have, which was actually implemented back in 2015. I'm pretty sure we could rely on that being available everywhere that we care about (xenial+), which would let us work around the CLI at least. I don't know that it makes the request any more robust.

Is the issue that this:

a) happened once, or
b) keeps happening and we are never able to recover?

I don't think we've seen a failure talking to systemd before.

John
=:->

Tom Haddon (mthaddon)
tags: added: canonical-is-upgrades
John A Meinel (jameinel) wrote :

It would be good to understand the severity of this bug, and what it is blocking for you.
Juju should be restarting "gracefully" on a failure such as this. If it is continually failing and not able to progress, then it is certainly something worth investigating, though the fact that "systemctl" is returning a timeout isn't something we can control particularly well.
If the issue is that this sort of failure is preventing Juju from cleanly restarting, then that is a different thing (and also very much worth investigating).

Changed in juju:
importance: Undecided → High
status: New → Incomplete
John A Meinel (jameinel)
Changed in juju:
milestone: none → 2.8.4
Tom Haddon (mthaddon)
tags: removed: canonical-is-upgrades
Joel Sing (jsing) wrote :

I've looked over the logs for these controllers, and the only occurrence of this appears to be at the time of the upgrade. I suspect that this is load-related and likely outside of Juju's control. We'll leave this bug as Incomplete and keep an eye out for this happening again.

Changed in juju:
milestone: 2.8.4 → 2.8.5
Changed in juju:
milestone: 2.8.5 → 2.8.6
John A Meinel (jameinel)
Changed in juju:
milestone: 2.8.6 → none
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Benjamin Allot (ballot) wrote :

We also had this on 2020-12-27 around 15:52. I've uploaded a juju-crashdump of the model controller to https://private-fileshare.canonical.com/~ballot/juju-crashdump-79cc120e-f1c3-4307-ae7b-8852a895c755.tar.xz

Haw Loeung (hloeung)
Changed in juju:
status: Expired → New
Pen Gale (pengale) wrote :

Triaging as medium, as this does appear to be a legitimate bug, and Juju should probably be handling the callout to systemd better. But it is intermittent, and does not put the controller into a blocked state.

Changed in juju:
importance: High → Medium
status: New → Triaged
Joseph Phillips (manadart) wrote :

Some time ago I had a WIP to list unit files via the DBus API instead of shelling out.

There was some obstacle to this, which I can't recall off the top of my head, but I think I have the branch on my desktop.

We updated the go-systemd dependency to a reasonably recent version when moving to go mod. I can revisit this and page it back in.
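
For reference, a minimal sketch of what that could look like with the go-systemd dbus package (purely illustrative; this is an assumption about the approach, not the WIP branch itself):

package main

import (
	"fmt"
	"path/filepath"
	"strings"

	"github.com/coreos/go-systemd/v22/dbus"
)

func main() {
	// Talk to systemd over the system D-Bus instead of shelling out
	// to /bin/systemctl.
	conn, err := dbus.NewSystemConnection()
	if err != nil {
		fmt.Println("connecting to systemd:", err)
		return
	}
	defer conn.Close()

	files, err := conn.ListUnitFiles()
	if err != nil {
		fmt.Println("listing unit files:", err)
		return
	}

	// Roughly what the grep in the log above does: keep only
	// *.service units, with the suffix stripped.
	for _, f := range files {
		name := filepath.Base(f.Path)
		if strings.HasSuffix(name, ".service") {
			fmt.Println(strings.TrimSuffix(name, ".service"))
		}
	}
}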

Canonical Juju QA Bot (juju-qa-bot) wrote :

This Medium-priority bug has not been updated in 60 days, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot