api-server and http-server get stuck in "state: stopping"

Bug #1813261 reported by Christian Muirhead
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
Critical
Ian Booth

Bug Description

In a large 2.5.0 controller, the agents can get themselves into a state where they are running but the API is unavailable. There are lots of log messages complaining that workers in the controller can't connect to the local API:

2019-01-23 08:23:09 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [bfe052] "machine-2" cannot open api: unable to connect to API: read tcp 127.0.0.1:38658->127.0.0.1:17071: i/o timeout
2019-01-23 08:23:09 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [f4830b] "machine-2" cannot open api: unable to connect to API: read tcp 127.0.0.1:38660->127.0.0.1:17071: i/o timeout
2019-01-23 08:23:10 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [129152] "machine-2" cannot open api: unable to connect to API: read tcp 127.0.0.1:38666->127.0.0.1:17071: i/o timeout

Running juju_engine_report on the controller shows both http-server and api-server have "state: stopping", which doesn't change.

Before the system goes into this bad state we can see this error in the log:
2019-01-23 07:46:17 ERROR juju.worker.dependency engine.go:632 "state" manifold worker returned unexpected error: state ping failed: read tcp 127.0.0.1:59338->127.0.0.1:37017: i/o timeout

This error is caused by mongo load, particularly at startup. It causes the state worker to restart, which in turn causes other workers including http-server and api-server. They get stuck at stopping for reasons we haven't determined yet.

At the moment the mitigation for this is:
* firewall the controllers off (by blocking 17070 from non-controller machines)
* restarting the controllers
* once the connections and goroutines have stabilised for the controllers, then we can open the ports for the controller machines that aren't the mongo primary (in the hope of not causing mongo timeouts to trigger the problem).

Revision history for this message
Christian Muirhead (2-xtian) wrote :

We've also seen situations where only the http-server was stuck in stopping. That seems to be caused by the raft-transport bouncing - the api-server doesn't depend on it, so it wasn't restarting, which meant that the http-server would mux.Wait indefinitely.

This PR fixes that problem (but not the api-server issue yet).
https://github.com/juju/juju/pull/9675

Revision history for this message
Uros Jovanovic (uros-jovanovic) wrote :

Same error here: ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [a7f617] "machine-2" cannot open api: unable to connect to API: dial tcp 127.0.0.1:443: connect: connection refused

First it was just the mongodb connection failure, but restarting jujud eneded up with the error message above.

Note that this controller was mostly idle, not much traffic at all.

Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Revision history for this message
Ian Booth (wallyworld) wrote :

Hard to track down the cause - all workers which use the http server mux which could block exit are supposed to close when the apiserver dying channel is closed. Perhaps some workers are blocked on reads.

We'll land a PR to exit the http server after 1 minute of waitong with the hope that closing the sockets will unblock any readers and they will also exit. In such a case, we dump the goroutines for further analysis.

https://github.com/juju/juju/pull/9684

Revision history for this message
Ian Booth (wallyworld) wrote :

I'll mark as fix committed pending an observed re-occurrence now that a stuck http-server work is terminated.

Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Christian Muirhead (2-xtian) wrote :

It seems like this could be caused by this bug: https://bugs.launchpad.net/juju/+bug/1815397

If so it is fixed in: https://github.com/juju/juju/pull/9730

Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.