juju agent fails to reconnect to API server

Bug #1810712 reported by Joel Sing
This bug affects 2 people
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: none

Bug Description

The juju 2.4.7 agent (and likely newer) can get itself into a situation where it fails to reconnect to the API server. Usually this happens when the HA controllers are unavailable for a period (e.g. taken offline for repair/recovery).

The agent log currently finishes with:

2018-12-30 05:55:09 ERROR juju.worker.dependency engine.go:632 "migration-inactive-flag" manifold worker returned unexpected error: migration lookup failed: no reachable servers
2018-12-30 05:55:12 ERROR juju.worker.dependency engine.go:632 "migration-minion" manifold worker returned unexpected error: migration lookup: migration lookup failed: no reachable servers
2018-12-30 05:55:45 ERROR juju.worker.dependency engine.go:632 "upgrader" manifold worker returned unexpected error: cannot set agent version: cannot get machine 74: no reachable servers
2018-12-30 05:55:45 ERROR juju.worker.dependency engine.go:632 "migration-inactive-flag" manifold worker returned unexpected error: migration lookup failed: no reachable servers
2018-12-30 05:55:45 ERROR juju.worker.dependency engine.go:632 "migration-minion" manifold worker returned unexpected error: migration lookup: migration lookup failed: no reachable servers
2018-12-30 05:56:16 ERROR juju.worker.dependency engine.go:632 "migration-inactive-flag" manifold worker returned unexpected error: migration lookup failed: no reachable servers
2018-12-30 05:56:25 ERROR juju.worker.dependency engine.go:632 "migration-minion" manifold worker returned unexpected error: migration lookup: migration lookup failed: no reachable servers
2018-12-30 05:56:25 ERROR juju.worker.dependency engine.go:632 "upgrader" manifold worker returned unexpected error: cannot set agent version: cannot get machine 74: no reachable servers
2018-12-30 05:56:50 ERROR juju.worker.dependency engine.go:632 "migration-inactive-flag" manifold worker returned unexpected error: migration lookup failed: no reachable servers
2018-12-30 05:57:27 ERROR juju.worker.dependency engine.go:632 "migration-inactive-flag" manifold worker returned unexpected error: migration lookup failed: no reachable servers
2018-12-30 05:57:29 ERROR juju.worker.dependency engine.go:632 "migration-minion" manifold worker returned unexpected error: migration lookup: migration lookup failed: no reachable servers
2018-12-30 05:57:35 ERROR juju.worker.dependency engine.go:632 "upgrader" manifold worker returned unexpected error: cannot set agent version: cannot get machine 74: no reachable servers

Even though controllers are reachable:

$ nc -vz 10.25.2.111 17070
Connection to 10.25.2.111 17070 port [tcp/*] succeeded!

Restarting the juju-machine-X service via systemctl results in the agent immediately re-establishing a connection.
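
For reference, a minimal sketch of that workaround; the exact systemd unit name here is an assumption (Juju 2.x machine agents typically run as jujud-machine-<id>), so adjust it to match the unit actually present on the machine:

$ sudo systemctl restart jujud-machine-74
$ sudo journalctl -u jujud-machine-74 -f    # confirm the agent re-establishes its API connection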

The juju_goroutines output is attached.

Revision history for this message
Joel Sing (jsing) wrote :

Full logs from juju-machine-74 machine.log are available at https://private-fileshare.canonical.com/~jsing/lp1810712/juju-machine-74.log.gz

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1810712] Re: juju agent fails to reconnect to API server

These particular lines seem suspicious:
2018-12-30 01:09:29 ERROR juju.worker.dependency engine.go:632 "migration-inactive-flag" manifold worker returned unexpected error: txn watcher restarted
2018-12-30 01:09:29 ERROR juju.api.watcher watcher.go:92 error trying to stop watcher: txn watcher restarted
2018-12-30 01:09:29 ERROR juju.api.watcher watcher.go:92 error trying to stop watcher: txn watcher restarted
2018-12-30 01:09:29 ERROR juju.api.watcher watcher.go:92 error trying to stop watcher: txn watcher restarted
2018-12-30 01:09:29 ERROR juju.api.watcher watcher.go:92 error trying to stop watcher: txn watcher restarted

That looks like it got an error, and then tried to stop and restart cleanly internally (without a process restart), but everything is getting errors while trying to restart.
It then gets stuck in this loop:
2018-12-30 01:10:36 ERROR juju.worker.dependency engine.go:632 "upgrader" manifold worker returned unexpected error: cannot set agent version: cannot get machine 74: no reachable servers
2018-12-30 01:10:39 ERROR juju.worker.dependency engine.go:632 "migration-inactive-flag" manifold worker returned unexpected error: migration lookup failed: no reachable servers
2018-12-30 01:10:41 ERROR juju.worker.dependency engine.go:632 "migration-minion" manifold worker returned unexpected error: migration lookup: migration lookup failed: no reachable servers

Note there is also a huge time gap before this:
2018-12-29 17:42:25 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [d3ab9c] "machine-74" cannot open api: try again (try again)
2018-12-30 01:03:35 ERROR juju.api.watcher watcher.go:92 error trying to stop watcher: txn watcher restarted

Now, it may be that it successfully connected, and because the default logging level is WARNING, we just don't see the success and happy messages.

AFAICT, it seems the 'api-caller' worker, which is responsible for maintaining the connection to the controller, didn't actually reconnect, and the errors that the other workers were getting weren't causing it to retry.

Is machine-74 actually a controller and not just a worker node? Because messages like:

./vendor/gopkg.in/mgo.v2/cluster.go:603:
        return nil, errors.New("no reachable servers")

would indicate that we were failing to connect to Mongo, not failing to connect to the API server. (I don't see the string "no reachable servers" elsewhere in the Juju codebase.)
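
For what it's worth, a quick way to confirm where that string comes from, run in a Juju source checkout (the single hit shown is the mgo call quoted above):

$ grep -rn "no reachable servers" --include='*.go' .
./vendor/gopkg.in/mgo.v2/cluster.go:603:        return nil, errors.New("no reachable servers")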

Also, I *think* the txn watcher is only run on controllers, but it might be possible that over the API we had been exposing the failure.

Revision history for this message
Joel Sing (jsing) wrote :

This is a machine in a non-controller model. Given the repeated controller failures were primarily due to mongo becoming desynced, it could make sense that the controller was failing to connect to mongo - if so, this would be fairly easy to reproduce/test (i.e. shut down mongo on the controllers).
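
A rough sketch of that reproduction, with a couple of assumptions: that the controllers' MongoDB runs under the juju-db systemd unit (the usual name on Juju 2.x controller machines) and that the agent log lives at /var/log/juju/machine-74.log:

# on each HA controller machine
$ sudo systemctl stop juju-db

# on the affected machine, watch for the "no reachable servers" loop
$ tail -f /var/log/juju/machine-74.log

# afterwards, restore mongo on the controllers
$ sudo systemctl start juju-db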

Revision history for this message
Richard Harding (rharding) wrote :

This seems like something where we can dig into better messaging, and possibly double-check the failure case of a controller getting desync'd and how the agents talking to that controller recover.

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.6-beta1
Changed in juju:
milestone: 2.6-beta1 → 2.6-beta2
Changed in juju:
milestone: 2.6-beta2 → 2.6-rc1
Changed in juju:
milestone: 2.6-rc1 → 2.6-rc2
Changed in juju:
milestone: 2.6-rc2 → 2.6.1
Changed in juju:
milestone: 2.6.1 → 2.6.2
Changed in juju:
milestone: 2.6.2 → 2.6.3
Changed in juju:
milestone: 2.6.3 → 2.6.4
Changed in juju:
milestone: 2.6.4 → 2.6.5
Changed in juju:
milestone: 2.6.5 → 2.6.6
Changed in juju:
milestone: 2.6.6 → 2.6.7
Changed in juju:
milestone: 2.6.7 → 2.7-beta1
Changed in juju:
milestone: 2.7-beta1 → 2.7-rc1
Changed in juju:
milestone: 2.7-rc1 → none
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot