Canonical Juju

httpserver worker restart with controller-api-port gets stuck

Bug #1803484 reported by Tim Penhey on 2018-11-15

This bug affects 3 people

Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  Critical    Tim Penhey   Canonical Juju 2.4.7

Bug Description

If the httpserver gets bounced due to an accept error, when the httpserver restarts it opens the controller-api-port and waits for the peergrouper to publish an event in order to open the api-port.

However, the peergrouper hasn't been restarted, and the values haven't changed, so it doesn't publish the event.

The fix is to have the httpserver publish a peergrouper status event, which will cause the peergrouper to publish the event the httpserver is waiting for.
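The stuck state and the fix can be sketched with a toy in-process publish/subscribe hub. This is only an illustration under stated assumptions: the hub, the topic names, and the types below are hypothetical stand-ins, not Juju's actual pubsub package or worker code.

```go
package main

import (
	"fmt"
	"sync"
)

// hub is a toy publish/subscribe bus (hypothetical, not juju's pubsub package).
type hub struct {
	mu   sync.Mutex
	subs map[string][]func(string)
}

func newHub() *hub { return &hub{subs: make(map[string][]func(string))} }

func (h *hub) subscribe(topic string, fn func(string)) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.subs[topic] = append(h.subs[topic], fn)
}

func (h *hub) publish(topic, data string) {
	// Copy the handlers under the lock, then call them without it,
	// so a handler may publish in turn.
	h.mu.Lock()
	handlers := append([]func(string){}, h.subs[topic]...)
	h.mu.Unlock()
	for _, fn := range handlers {
		fn(data)
	}
}

// Hypothetical topic names used only for this sketch.
const (
	topicDetails        = "details"
	topicDetailsRequest = "details-request"
)

// peergrouper publishes its details when they change, or when asked to.
type peergrouper struct {
	hub           *hub
	details       string
	lastPublished string
}

func newPeergrouper(h *hub, details string) *peergrouper {
	p := &peergrouper{hub: h, details: details}
	// Answer explicit requests even if nothing has changed.
	h.subscribe(topicDetailsRequest, func(string) {
		h.publish(topicDetails, p.details)
	})
	return p
}

// maybePublish publishes only if the details differ from the last publish.
func (p *peergrouper) maybePublish() {
	if p.details == p.lastPublished {
		return
	}
	p.lastPublished = p.details
	p.hub.publish(topicDetails, p.details)
}

// httpserver opens the api-port only after it has seen a details event.
type httpserver struct {
	apiPortOpen bool
}

func startHTTPServer(h *hub, requestOnStart bool) *httpserver {
	s := &httpserver{}
	h.subscribe(topicDetails, func(string) { s.apiPortOpen = true })
	if requestOnStart {
		// The fix: ask for the details instead of waiting for a change.
		h.publish(topicDetailsRequest, "")
	}
	return s
}

// restartOpensAPIPort reports whether a restarted httpserver opens its api-port.
func restartOpensAPIPort(requestOnStart bool) bool {
	h := newHub()
	p := newPeergrouper(h, "machine-0,machine-1,machine-2")
	startHTTPServer(h, requestOnStart)
	p.maybePublish() // initial publish reaches the first server

	restarted := startHTTPServer(h, requestOnStart) // server bounced
	p.maybePublish()                                // nothing changed, so no event
	return restarted.apiPortOpen
}

func main() {
	fmt.Println("api-port opens without request:", restartOpensAPIPort(false))
	fmt.Println("api-port opens with request:", restartOpensAPIPort(true))
}
```

In the sketch, the restarted server never opens its api-port unless it publishes the request on startup, which mirrors the behaviour described above.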

To confirm the fix, we should add a wrench into the heldlistener Accept call.
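One way such a fault injection could look is a listener wrapper whose Accept fails while a toggle is active. This is a minimal sketch, not the actual heldlistener code: the wrenchListener type, the wrenchActive callback, and errWrench are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// errWrench is the injected failure (hypothetical name).
var errWrench = errors.New("wrench: injected accept failure")

// wrenchListener wraps a net.Listener and fails Accept while
// wrenchActive reports true. It stands in for the real listener.
type wrenchListener struct {
	net.Listener
	wrenchActive func() bool
}

func (l *wrenchListener) Accept() (net.Conn, error) {
	if l.wrenchActive() {
		// Simulate the accept error that bounces the httpserver.
		return nil, errWrench
	}
	return l.Listener.Accept()
}

func main() {
	inner, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer inner.Close()

	active := true
	l := &wrenchListener{Listener: inner, wrenchActive: func() bool { return active }}

	// With the wrench active, Accept fails immediately.
	_, err = l.Accept()
	fmt.Println("wrench active:", err)

	// With the wrench removed, Accept passes through to the real listener.
	active = false
	go func() {
		if c, err := net.Dial("tcp", inner.Addr().String()); err == nil {
			c.Close()
		}
	}()
	conn, err := l.Accept()
	if conn != nil {
		conn.Close()
	}
	fmt.Println("wrench inactive, accept error:", err)
}
```

Triggering the injected error should bounce the server in the same way a real accept error does, so the restart path can be exercised on demand.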


Tim Penhey (thumper) wrote on 2018-11-15:

As a temporary workaround, restarting a controller that is in this state will unblock it.


Richard Harding (rharding) wrote on 2018-11-15:

I cannot find any way to unset controller-api-port. As a temporary workaround I wanted to avoid using the secondary port, but could not find any method to unset this config value.


Jamon Camisso (jamon) wrote on 2018-11-15:

Another data point after hitting this issue again today.

Restarting each controller unit in our HA cluster didn't appear to help. Each would intermittently flap and throw errors about being unable to reach itself on port 17071 (the controller-api-port):

2018-11-15 16:14:18 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: cannot open api: unable to connect to API: read tcp 127.0.0.1:57918->127.0.0.1:17071: i/o timeout

2018-11-15 16:14:36 ERROR juju.worker.raft worker.go:358 last leader contact earlier than 1 minute ago

2018-11-15 16:14:36 ERROR juju.worker.dependency engine.go:632 "raft" manifold worker returned unexpected error: timed out waiting for leader contact

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1:~$ netstat -tupln |grep 1707
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp6 0 0 :::17070 :::* LISTEN -
tcp6 0 0 :::17071 :::* LISTEN -

Stopping all the jujud-machine-* agents, firewalling off port 17070, then starting the agents back up and waiting for the logs to show things were quiet before unfirewalling appears to have got things working again.


Tim Penhey (thumper) wrote on 2018-11-15:

The controller port can be disabled by using

juju controller-config controller-api-port=0

as there is currently no --reset option for controller-config.

I'm also a bit at a loss as to why a controller would get an i/o timeout connecting to itself, unless it happened while the httpserver was restarting, in which case that is entirely expected.


Tim Penhey (thumper) wrote on 2018-11-16:

https://github.com/juju/juju/pull/9472


John A Meinel (jameinel) wrote on 2018-11-16: Re: [Bug 1803484] Re: httpserver worker restart with controller-api-port gets stuck

I would expect to see a few messages during startup about not being able to
connect, and then for that message to stop as it successfully connects. It
usually takes a few hundred milliseconds for the server to be up and available and the
client is usually a bit antsy about it.


Tim Penhey (thumper) on 2018-11-19

Changed in juju:
status:	In Progress → Fix Committed

Canonical Juju QA Bot (juju-qa-bot) on 2018-11-29

Changed in juju:
status:	Fix Committed → Fix Released
