httpserver worker restart with controller-api-port gets stuck

Bug #1803484 reported by Tim Penhey on 2018-11-15
This bug affects 3 people
Affects: juju
Importance: Critical
Assigned to: Tim Penhey

Bug Description

If the httpserver gets bounced due to an accept error, then on restart it opens the controller-api-port and waits for the peergrouper to publish an event before it opens the api-port.

However, the peergrouper hasn't been restarted, and the values haven't changed, so it doesn't publish the event.

The fix is to have the httpserver publish a peergrouper status event, which will cause the peergrouper to publish the event.
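The stuck state and the proposed fix can be sketched with a toy publish/subscribe hub. This is an illustration only, not the Juju code: the hub, the topic names, and the types `peergrouper` and `httpserver` below are all hypothetical stand-ins, and the real workers are asynchronous.

```go
package main

import "fmt"

// hub is a tiny synchronous publish/subscribe bus (illustrative stand-in,
// not Juju's real pubsub hub).
type hub struct {
	subs map[string][]func(string)
}

func newHub() *hub { return &hub{subs: make(map[string][]func(string))} }

func (h *hub) subscribe(topic string, fn func(string)) {
	h.subs[topic] = append(h.subs[topic], fn)
}

func (h *hub) publish(topic, payload string) {
	for _, fn := range h.subs[topic] {
		fn(payload)
	}
}

// Hypothetical topic names, chosen for this sketch.
const (
	topicDetails = "details"
	topicRequest = "details-request"
)

// peergrouper publishes details when they change, or when asked to.
type peergrouper struct {
	hub  *hub
	last string
}

func newPeergrouper(h *hub) *peergrouper {
	p := &peergrouper{hub: h}
	// A request makes the peergrouper republish what it already knows.
	h.subscribe(topicRequest, func(string) {
		if p.last != "" {
			h.publish(topicDetails, p.last)
		}
	})
	return p
}

func (p *peergrouper) update(details string) {
	if details == p.last {
		return // values unchanged, so no event is published
	}
	p.last = details
	p.hub.publish(topicDetails, details)
}

// httpserver opens the api-port only after it has seen a details event.
type httpserver struct {
	apiPortOpen bool
}

// startHTTPServer models a (re)start; requestDetails selects the fixed behaviour.
func startHTTPServer(h *hub, requestDetails bool) *httpserver {
	s := &httpserver{}
	h.subscribe(topicDetails, func(string) { s.apiPortOpen = true })
	if requestDetails {
		h.publish(topicRequest, "")
	}
	return s
}

// restartOpensAPIPort reports whether a restarted httpserver gets its api-port open.
func restartOpensAPIPort(requestDetails bool) bool {
	h := newHub()
	p := newPeergrouper(h)
	startHTTPServer(h, requestDetails)
	p.update("machine-0,machine-1,machine-2") // first publish opens the port

	restarted := startHTTPServer(h, requestDetails) // bounced after an accept error
	p.update("machine-0,machine-1,machine-2")       // unchanged: nothing published
	return restarted.apiPortOpen
}

func main() {
	fmt.Println("without request:", restartOpensAPIPort(false))
	fmt.Println("with request:", restartOpensAPIPort(true))
}
```

In the sketch, the restarted server only opens its api-port when it asks for the details itself, because the peergrouper never republishes unchanged values on its own.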

To confirm the fix, we should add a wrench into the heldlistener Accept call.

Tim Penhey (thumper) wrote :

As a temporary workaround, restarting a controller that is in this state will unblock it.

Richard Harding (rharding) wrote :

The controller-api-port cannot be unset in any way I can find. As a temporary workaround I wanted to avoid using the secondary port, but could not find any method to unset this config value.

Jamon Camisso (jamon) wrote :

Another data point after hitting this issue again today.

Restarting each controller unit in our HA cluster didn't appear to help. Each would intermittently flap and throw errors about being unable to reach itself on port 17071 (the controller-api-port):

2018-11-15 16:14:18 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: cannot open api: unable to connect to API: read tcp 127.0.0.1:57918->127.0.0.1:17071: i/o timeout

2018-11-15 16:14:18 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: cannot open api: unable to connect to API: read tcp 127.0.0.1:57920->127.0.0.1:17071: i/o timeout

2018-11-15 16:14:36 ERROR juju.worker.raft worker.go:358 last leader contact earlier than 1 minute ago

2018-11-15 16:14:36 ERROR juju.worker.dependency engine.go:632 "raft" manifold worker returned unexpected error: timed out waiting for leader contact

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ netstat -tupln |grep 1707
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp6 0 0 :::17070 :::* LISTEN -
tcp6 0 0 :::17071 :::* LISTEN -

Stopping all the jujud-machine-* agents, firewalling off port 17070, then starting the agents back up and waiting for the logs to show things were quiet before unfirewalling appears to have got things working again.

Tim Penhey (thumper) wrote :

There is currently no --reset option for controller-config, but the controller port can be disabled by using

  juju controller-config controller-api-port=0

I'm also a bit at a loss as to why a controller would get an i/o timeout connecting to itself, unless it was while the httpserver was restarting, in which case that is entirely expected.

I would expect to see a few messages during startup about not being able to connect, and then for that message to stop as it successfully connects. It usually takes a few hundred milliseconds for the server to be up and available, and the client is usually a bit antsy about it.

On Fri, Nov 16, 2018 at 8:50 AM Tim Penhey <email address hidden> wrote:

> https://github.com/juju/juju/pull/9472

Tim Penhey (thumper) on 2018-11-19
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released