[2.4.3] API server unresponsive in HA env

Bug #1792299 reported by Paul Gear
36
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Christian Muirhead
2.4
Fix Released
High
Christian Muirhead
2.5
Fix Released
High
Christian Muirhead

Bug Description

Today when checking connectivity for adding a new machine, I discovered one of our main production controller machines was refusing connections, and it and another were reporting as down in juju status. @thumper assisted and we found that:

- API servers on both machines were not listening on :17070
- juju_engine_report was unresponsive; strace showed it waiting on a mutex.
- juju_presence_report showed only one machine with active connections in the controller model.
- there were panics in the machine log of the remaining functional controller: https://pastebin.canonical.com/p/JZK5ssptt4/ https://pastebin.canonical.com/p/3XHZ2J8tcw/ https://pastebin.canonical.com/p/C5DXZ4gKph/

I performed a rolling restart; during initial reconnect of agents, a large number of 'too many open files' messages were seen on at least 2 of the controller machines. This quieted down after all the remote agents reconnected and the load settled.

Tags: canonical-is
Revision history for this message
Paul Gear (paulgear) wrote :

https://private-fileshare.canonical.com/~paulgear/lp1792299/ has controller logs which hopefully cover the period of non-functionality and initial startup after the rolling restart, along with an engine report for each controller after rolling restart.

tags: added: canonical-is
Revision history for this message
Anastasia (anastasia-macmood) wrote :

in machine 1 log, this is kind of interesting and different from the rest of it..

2018-09-13 01:24:02 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-09-13 01:24:02 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: dial tcp 127.0.0.1:17070: context deadline exceeded
2018-09-13 01:24:02 WARNING juju.worker.httpserver log.go:172 http: TLS handshake error from 127.0.0.1:45226: EOF

2018-09-13 01:24:03 WARNING juju.worker.httpserver log.go:172 http: TLS handshake error from 127.0.0.1:45220: EOF

Revision history for this message
Richard Harding (rharding) wrote :

With the "too many open files" and the EOF error were the systems out of file descriptors or the like? It seems very similar to when I've run out of kernel resources due to too many lxd containers to the like.

Changed in juju:
status: New → Triaged
status: Triaged → Incomplete
Revision history for this message
Paul Gear (paulgear) wrote :

There were no other apparent issues with kernel resources experienced when investigating this issue. MongoDB appeared to be operating normally. As far as @thumper and I were able to tell, this was purely a jujud problem.

Changed in juju:
status: Incomplete → Triaged
Revision history for this message
Junien F (axino) wrote :

This happened again today. 2 controllers out of 3 were constantly logging :

ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)

Please add some debug information to the error above so we can root cause the problem, thanks !

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Proxy updater should not panic. I'll have a look at it today.

I am pretty sure that this may not be the cause of the API connectivity issue but it looks like a starting point to un-wind a thread. Meanwhile, @Junien Fridrick (axino), are there any logs for today's incident you are describing?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Proxy-updater panic is coming from the dead tomb according to the https://github.com/go-tomb/tomb/blob/v2/tomb.go#L154 [1]

On our side, the line causes the panic is https://github.com/juju/juju/blob/2.4/apiserver/common/watch.go#L112 which the call to the tomb.

What I am not sure about is what caused the tomb to die here and why it happens sometimes but not always. There is some context missing and logs are necessary. If something was done on the environment by the operators, especially if it's related to proxy updates, it'll be useful to know too.

One of the speculations is that for some reason the worker may have restarted while the reference to the watcher was still being used?.. The logs would have more info for this mystery.

[1]
Note that this is the current implementation of the tomb. Hence, the difference in the line count. The error message and the switch statement is the same as the version of tomb dependency that is deployed/ is running with Juju 2.4.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

On closer examination, https://github.com/juju/juju/blob/develop/worker/proxyupdater/proxyupdater.go#L62, does not have a "gopkg.in/juju/worker.v1/catacomb.Catacomb" which is how we usually tie lifecycles of internal watchers that workers use to that of a lifecycle of the worker itself. In other words, when the worker is stopped, the watchers are stopped too.

Revision history for this message
Junien F (axino) wrote :

The 2 impacted controllers were just logging the same line (that I pasted) over and over again.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

After brainstorming the observation and the code, Ian and Tim came to the conclusion that this is a fallout from a subtle change in behavior between tomb.v1 and tomb.v2, so all code that calls tomb.Go within a loop needs to be reconsidered.

The fix in this case will be to change the call at https://github.com/juju/juju/blob/2.4/apiserver/common/watch.go#L112 to not go through the tomb but be a simple Go routine.

Other watchers maybe affected.

Changed in juju:
milestone: none → 2.5-beta1
importance: Undecided → High
Revision history for this message
Tim Penhey (thumper) wrote :

What we are proposing is a partial fix, at least dealing with the proxy updater.

https://github.com/juju/juju/pull/9285

Still working on a general fix for things coming up.

Revision history for this message
Tim Penhey (thumper) wrote :
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1792299] Re: [2.4.3] API server unresponsive in HA env

404 Page Not Found seems very suspicious. It looks a lot like there is some
other HTTP server giving responses rather than the controller responding.
Most of our responses are not 404 (we generally give a permission denied
when looking for something that doesn't seem to be there).

Tim, do you think its actually because of something else going on in the
backend that causes the HTTP layer to start just giving 404?

On Tue, Oct 9, 2018 at 10:01 AM Anastasia <email address hidden>
wrote:

> ** Changed in: juju/2.4
> Assignee: Ian Booth (wallyworld) => (unassigned)
>
> ** Changed in: juju/2.4
> Assignee: (unassigned) => Tim Penhey (thumper)
>
> ** Changed in: juju/2.4
> Status: Triaged => In Progress
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1792299
>
> Title:
> [2.4.3] API server unresponsive in HA env
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1792299/+subscriptions
>

Revision history for this message
Junien F (axino) wrote :

Another datapoint (in JAAS) : https://pastebin.canonical.com/p/HqygwDk4jy/

Revision history for this message
Junien F (axino) wrote :
Download full text (5.0 KiB)

machine log full of 404 again today :

2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found (Not Found)
2018-10-11 12:32:00 ERROR juju.worker.dependency engine.go:587 "api-caller" manifold worker returned unexpected error: cannot open api: 404 page not found ...

Read more...

Revision history for this message
Tim Penhey (thumper) wrote :

This is very weird, and we are still looking into it :)

Revision history for this message
Tim Penhey (thumper) wrote :

Actually I think I have found the source of the 404.

The httpserver worker starts up and opens the ports, creates the http server and starts listening for requests. This is all before the apiserver starts up and registers all the paths. Since the api-caller worker is trying to open a websocket to the api it is trying to hit an URL that hasn't been registered yet.

I believe the correct fix for this is to not start the http.Serve command until the apiserver has registered itself fully.

I'll add that to the work I'm doing.

Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Christian Muirhead (2-xtian)
Revision history for this message
Christian Muirhead (2-xtian) wrote :

PR to fix the 404s from the API (2.4 branch) - https://github.com/juju/juju/pull/9335

Revision history for this message
Christian Muirhead (2-xtian) wrote :

PR for the same change for 2.5 - https://github.com/juju/juju/pull/9346

Changed in juju:
milestone: 2.5-beta1 → 2.5-beta2
Ian Booth (wallyworld)
Changed in juju:
status: Triaged → Fix Committed
milestone: 2.5-beta2 → 2.5-beta1
Revision history for this message
Paul Gear (paulgear) wrote :

Did this fix make it into 2.5.[01]?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Yes, the fix was committed to 2.5-b1 in November and, hence, would be in all Juju versions from then on, 2.5.0 and 2.5.1 included :D

Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Anastasia (anastasia-macmood) wrote :

From Junien, on 16-May-2019:

This is still a problem in 2.5.4, re-opening the bug.

I'm fairly sure the 404 happens at the websocket establishment phase.
The following should return HTTP code 101 :

$ bash -x /tmp/curl
+ curl -v -k --no-buffer --header 'Connection: Upgrade' --header 'Upgrade: websocket' --header 'Host: example.com:80' --header 'Origin: http://example.com:80' --header 'Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==' --header 'Sec-WebSocket-Version: 13' https://localhost:17070/api
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 17070 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 596 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification SKIPPED
* server certificate status verification SKIPPED
* common name: * (matched)
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: O=juju,CN=*
* start date: Wed, 08 May 2019 12:08:27 GMT
* expire date: Tue, 15 May 2029 12:08:27 GMT
* issuer: O=juju,CN=juju-generated CA for model \"juju-ca\",serialNumber=0118393d-aafc-404c-870c-506361d81a1a
* compression: NULL
* ALPN, server did not agree to a protocol
> GET /api HTTP/1.1
> Host: example.com:80
> User-Agent: curl/7.47.0
> Accept: */*
> Connection: Upgrade
> Upgrade: websocket
> Origin: http://example.com:80
> Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==
> Sec-WebSocket-Version: 13
>
< HTTP/1.1 404 Not Found
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Wed, 15 May 2019 14:21:19 GMT
< Content-Length: 19
<
404 page not found
* Connection #0 to host localhost left intact

Changed in juju:
milestone: 2.5-beta1 → none
status: Fix Released → Confirmed
Revision history for this message
John A Meinel (jameinel) wrote :

You're getting the 404 because the controller is trying to restart the API server, but is failing to do so. Which is a genuine case of "the server isn't happy", which we should investigate. But I don't think its this bug.

Changed in juju:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.