Bug #1792299 “[2.4.3] API server unresponsive in HA env” : Bugs : Canonical Juju

Revision history for this message

Paul Gear (paulgear) wrote on 2018-09-13:

#1

https://private-fileshare.canonical.com/~paulgear/lp1792299/ has controller logs which hopefully cover the period of non-functionality and initial startup after the rolling restart, along with an engine report for each controller after rolling restart.

tags:

added: canonical-is

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2018-09-13:

#2

in machine 1 log, this is kind of interesting and different from the rest of it..

2018-09-13 01:24:02 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-09-13 01:24:02 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: dial tcp 127.0.0.1:17070: context deadline exceeded
2018-09-13 01:24:02 WARNING juju.worker.httpserver log.go:172 http: TLS handshake error from 127.0.0.1:45226: EOF

2018-09-13 01:24:03 WARNING juju.worker.httpserver log.go:172 http: TLS handshake error from 127.0.0.1:45220: EOF

Revision history for this message

Richard Harding (rharding) wrote on 2018-09-27:

#3

With the "too many open files" and the EOF error were the systems out of file descriptors or the like? It seems very similar to when I've run out of kernel resources due to too many lxd containers to the like.

Changed in juju:
status:	New → Triaged
status:	Triaged → Incomplete

Revision history for this message

Paul Gear (paulgear) wrote on 2018-09-28:

#4

There were no other apparent issues with kernel resources experienced when investigating this issue. MongoDB appeared to be operating normally. As far as @thumper and I were able to tell, this was purely a jujud problem.

Changed in juju:
status:	Incomplete → Triaged

Revision history for this message

Junien F (axino) wrote on 2018-10-04:

#5

This happened again today. 2 controllers out of 3 were constantly logging :

ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)

Please add some debug information to the error above so we can root cause the problem, thanks !

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2018-10-04:

#6

Proxy updater should not panic. I'll have a look at it today.

I am pretty sure that this may not be the cause of the API connectivity issue but it looks like a starting point to un-wind a thread. Meanwhile, @Junien Fridrick (axino), are there any logs for today's incident you are describing?

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2018-10-05:

#7

Proxy-updater panic is coming from the dead tomb according to the https://github.com/go-tomb/tomb/blob/v2/tomb.go#L154 [1]

On our side, the line causes the panic is https://github.com/juju/juju/blob/2.4/apiserver/common/watch.go#L112 which the call to the tomb.

What I am not sure about is what caused the tomb to die here and why it happens sometimes but not always. There is some context missing and logs are necessary. If something was done on the environment by the operators, especially if it's related to proxy updates, it'll be useful to know too.

One of the speculations is that for some reason the worker may have restarted while the reference to the watcher was still being used?.. The logs would have more info for this mystery.

[1]
Note that this is the current implementation of the tomb. Hence, the difference in the line count. The error message and the switch statement is the same as the version of tomb dependency that is deployed/ is running with Juju 2.4.

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2018-10-05:

#8

On closer examination, https://github.com/juju/juju/blob/develop/worker/proxyupdater/proxyupdater.go#L62, does not have a "gopkg.in/juju/worker.v1/catacomb.Catacomb" which is how we usually tie lifecycles of internal watchers that workers use to that of a lifecycle of the worker itself. In other words, when the worker is stopped, the watchers are stopped too.

Revision history for this message

Junien F (axino) wrote on 2018-10-05:

#9

The 2 impacted controllers were just logging the same line (that I pasted) over and over again.

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2018-10-08:

#10

After brainstorming the observation and the code, Ian and Tim came to the conclusion that this is a fallout from a subtle change in behavior between tomb.v1 and tomb.v2, so all code that calls tomb.Go within a loop needs to be reconsidered.

The fix in this case will be to change the call at https://github.com/juju/juju/blob/2.4/apiserver/common/watch.go#L112 to not go through the tomb but be a simple Go routine.

Other watchers maybe affected.

Changed in juju:
milestone:	none → 2.5-beta1
importance:	Undecided → High

Revision history for this message

Tim Penhey (thumper) wrote on 2018-10-08:

#11

What we are proposing is a partial fix, at least dealing with the proxy updater.

https://github.com/juju/juju/pull/9285

Still working on a general fix for things coming up.

Revision history for this message

Tim Penhey (thumper) wrote on 2018-10-09:

#12

Conversation around this started on https://discourse.jujucharms.com/t/stable-controller-startup-under-heavy-agent-load/296

Revision history for this message

John A Meinel (jameinel) wrote on 2018-10-10: Re: [Bug 1792299] Re: [2.4.3] API server unresponsive in HA env

#13

404 Page Not Found seems very suspicious. It looks a lot like there is some
other HTTP server giving responses rather than the controller responding.
Most of our responses are not 404 (we generally give a permission denied
when looking for something that doesn't seem to be there).

Tim, do you think its actually because of something else going on in the
backend that causes the HTTP layer to start just giving 404?

On Tue, Oct 9, 2018 at 10:01 AM Anastasia <email address hidden>
wrote:

> ** Changed in: juju/2.4
> Assignee: Ian Booth (wallyworld) => (unassigned)
>
> ** Changed in: juju/2.4
> Assignee: (unassigned) => Tim Penhey (thumper)
>
> ** Changed in: juju/2.4
> Status: Triaged => In Progress
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1792299
>
> Title:
> [2.4.3] API server unresponsive in HA env
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1792299/+subscriptions
>

Revision history for this message

Junien F (axino) wrote on 2018-10-11:

#14

Another datapoint (in JAAS) : https://pastebin.canonical.com/p/HqygwDk4jy/

Revision history for this message

Junien F (axino) wrote on 2018-10-11:

#15

Download full text (5.0 KiB)

machine log full of 404 again today :

2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found ...

machine log full of 404 again today :

2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: dial tcp 127.0.0.1:17070: i/o timeout
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: dial tcp 127.0.0.1:17070: i/o timeout
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: dial tcp 127.0.0.1:17070: i/o timeout
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: dial tcp 127.0.0.1:17070: i/o timeout
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)

Revision history for this message

Tim Penhey (thumper) wrote on 2018-10-11:

#16

This is very weird, and we are still looking into it :)

Revision history for this message

Tim Penhey (thumper) wrote on 2018-10-11:

#17

Actually I think I have found the source of the 404.

The httpserver worker starts up and opens the ports, creates the http server and starts listening for requests. This is all before the apiserver starts up and registers all the paths. Since the api-caller worker is trying to open a websocket to the api it is trying to hit an URL that hasn't been registered yet.

I believe the correct fix for this is to not start the http.Serve command until the apiserver has registered itself fully.

I'll add that to the work I'm doing.

Ian Booth (wallyworld) on 2018-10-16

Changed in juju:
assignee:	nobody → Christian Muirhead (2-xtian)

Revision history for this message

Christian Muirhead (2-xtian) wrote on 2018-10-19:

#18

PR to fix the 404s from the API (2.4 branch) - https://github.com/juju/juju/pull/9335

Revision history for this message

Christian Muirhead (2-xtian) wrote on 2018-10-19:

#19

PR for the same change for 2.5 - https://github.com/juju/juju/pull/9346

Richard Harding (rharding) on 2018-11-09

Changed in juju:
milestone:	2.5-beta1 → 2.5-beta2

Ian Booth (wallyworld) on 2018-11-14

Changed in juju:
status:	Triaged → Fix Committed
milestone:	2.5-beta2 → 2.5-beta1

Revision history for this message

Paul Gear (paulgear) wrote on 2019-02-19:

#20

Did this fix make it into 2.5.[01]?

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2019-02-19:

#21

Yes, the fix was committed to 2.5-b1 in November and, hence, would be in all Juju versions from then on, 2.5.0 and 2.5.1 included :D

Anastasia (anastasia-macmood) on 2019-03-22

Changed in juju:
status:	Fix Committed → Fix Released

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2019-05-15:

#23

From Junien, on 16-May-2019:

This is still a problem in 2.5.4, re-opening the bug.

I'm fairly sure the 404 happens at the websocket establishment phase.
The following should return HTTP code 101 :

$ bash -x /tmp/curl
+ curl -v -k --no-buffer --header 'Connection: Upgrade' --header 'Upgrade: websocket' --header 'Host: example.com:80' --header 'Origin: http://example.com:80' --header 'Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==' --header 'Sec-WebSocket-Version: 13' https://localhost:17070/api
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 17070 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 596 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification SKIPPED
* server certificate status verification SKIPPED
* common name: * (matched)
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: O=juju,CN=*
* start date: Wed, 08 May 2019 12:08:27 GMT
* expire date: Tue, 15 May 2029 12:08:27 GMT
* issuer: O=juju,CN=juju-generated CA for model \"juju-ca\",serialNumber=0118393d-aafc-404c-870c-506361d81a1a
* compression: NULL
* ALPN, server did not agree to a protocol
> GET /api HTTP/1.1
> Host: example.com:80
> User-Agent: curl/7.47.0
> Accept: */*
> Connection: Upgrade
> Upgrade: websocket
> Origin: http://example.com:80
> Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==
> Sec-WebSocket-Version: 13
>
< HTTP/1.1 404 Not Found
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Wed, 15 May 2019 14:21:19 GMT
< Content-Length: 19
<
404 page not found
* Connection #0 to host localhost left intact

Anastasia (anastasia-macmood) on 2019-05-15

Changed in juju:
milestone:	2.5-beta1 → none
status:	Fix Released → Confirmed

Revision history for this message

John A Meinel (jameinel) wrote on 2019-05-16:

#24

You're getting the 404 because the controller is trying to restart the API server, but is failing to do so. Which is a genuine case of "the server isn't happy", which we should investigate. But I don't think its this bug.

Changed in juju:
status:	Confirmed → Fix Released

Canonical Juju

[2.4.3] API server unresponsive in HA env

Bug Description

Duplicates of this bug

Other bug subscribers

Remote bug watches

	Status	Importance	Assigned to	Milestone
Canonical Juju	Fix Released	High	Christian Muirhead
2.4	Fix Released	High	Christian Muirhead	Canonical Juju 2.4.5
2.5	Fix Released	High	Christian Muirhead	Canonical Juju 2.5-beta1