Comment 3 for bug 1810712

John A Meinel (jameinel) wrote : Re: [Bug 1810712] Re: juju agent fails to reconnect to API server

These particular lines seem suspicious:
2018-12-30 01:09:29 ERROR juju.worker.dependency engine.go:632
"migration-inactive-flag" manifold worker returned unexpected error: txn
watcher restarted
2018-12-30 01:09:29 ERROR juju.api.watcher watcher.go:92 error trying to
stop watcher: txn watcher restarted
2018-12-30 01:09:29 ERROR juju.api.watcher watcher.go:92 error trying to
stop watcher: txn watcher restarted
2018-12-30 01:09:29 ERROR juju.api.watcher watcher.go:92 error trying to
stop watcher: txn watcher restarted
2018-12-30 01:09:29 ERROR juju.api.watcher watcher.go:92 error trying to
stop watcher: txn watcher restarted

It looks like the agent hit an error and then tried to stop and restart
cleanly internally (without a process restart), but everything it tried to
restart was getting errors as well.
It then gets stuck in this loop:
2018-12-30 01:10:36 ERROR juju.worker.dependency engine.go:632 "upgrader"
manifold worker returned unexpected error: cannot set agent version: cannot
get machine 74: no reachable servers
2018-12-30 01:10:39 ERROR juju.worker.dependency engine.go:632
"migration-inactive-flag" manifold worker returned unexpected error:
migration lookup failed: no reachable servers
2018-12-30 01:10:41 ERROR juju.worker.dependency engine.go:632
"migration-minion" manifold worker returned unexpected error: migration
lookup: migration lookup failed: no reachable servers

Note there is also a huge time gap before this:
2018-12-29 17:42:25 ERROR juju.worker.dependency engine.go:632 "api-caller"
manifold worker returned unexpected error: [d3ab9c] "machine-74" cannot
open api: try again (try again)
2018-12-30 01:03:35 ERROR juju.api.watcher watcher.go:92 error trying to
stop watcher: txn watcher restarted

Now, it may be that it successfully reconnected and, because the default
logging level is WARNING, we just don't see the success and happy messages.
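If the reconnect really is succeeding silently, raising the agent log level
should make it visible. A sketch, assuming a live model you can reach with
the juju CLI (the `juju.api=TRACE` module override here is illustrative):

```shell
# Turn up agent logging so successful (re)connects are logged,
# not just warnings and errors.
juju model-config logging-config="<root>=DEBUG;juju.api=TRACE"
```

After that, a successful API reconnect should show up in machine.log rather
than being swallowed below the WARNING threshold.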

AFAICT, the 'api-caller' worker, which is responsible for maintaining the
connection to the controller, didn't actually reconnect, and the errors the
other workers were getting weren't causing it to retry.

Is machine-74 actually a controller and not just a worker node? Messages
like:
./vendor/gopkg.in/mgo.v2/cluster.go:603:
return nil, errors.New("no reachable servers")

would indicate that we were failing to connect to Mongo, not failing to
connect to the API server. (I don't see the string "no reachable servers"
anywhere else in the Juju codebase.)
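One quick way to check that claim is to search the tree for the string; in
my reading it only appears in the vendored mgo driver. A sketch, assuming
`$JUJU_SRC` points at a juju source checkout (defaulting to the current
directory):

```shell
# Find every occurrence of the error string in Go sources.
# "no reachable servers" comes from mgo's cluster.go, so any hit
# outside vendor/ would mean juju itself emits the message.
grep -rn --include='*.go' 'no reachable servers' "${JUJU_SRC:-.}"
```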

Also, I *think* the txn watcher only runs on controllers, but it may be
that the failure was being surfaced to agents over the API.

On Mon, Jan 7, 2019 at 10:45 AM Joel Sing <email address hidden>
wrote:

> Full logs from juju-machine-74 machine.log are available at
> https://private-fileshare.canonical.com/~jsing/lp1810712/juju-machine-74.log.gz