Juju controller cannot connect to itself

Bug #1807735 reported by Alexander Balderson
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Tim Penhey
Milestone: 2.5.1

Bug Description

During bundle deployment, a juju controller became unable to connect to itself:

2018-12-07 14:10:26 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [cccf2a] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2018-12-07 14:10:30 WARNING juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: connect: connection refused
2018-12-07 14:10:30 WARNING juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.40.236:37017: connect: connection refused

See attached machine log.

Note: these errors/warnings don't necessarily indicate a problem; they are only a concern if they happen after a deploy has started:

01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [1b66e5] "machine-2" cannot open api: unable to
                connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 WARNING juju.mongo mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: connect: connection refused
01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 WARNING juju.mongo mongodb connection failed, will retry: dial tcp 10.244.40.239:37017: connect: connection refused
01:01 < jhobbs> thumper: the controller logs exactly match the bug description; timeout is a secondary result of some other failure and I expect the triager suspected those errors/warnings.
                Are those logs red herrings?
01:05 < thumper> jhobbs: yeah... those are expected
01:05 < thumper> those happened during enable-ha
01:05 < thumper> when you are adding mongo's to the replicaset, it drops all connections
01:05 < thumper> these would only be a problem if we see them after the deploy has started
01:06 < thumper> mongo can take up to two minutes to settle with a new replicaset
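
As a rough illustration of the triage rule thumper describes, a Go dial loop like the one below only treats a refused connection as a real failure once the replica set has had time to settle. This is a sketch, not Juju code; the address and the two-minute settle window are taken from the discussion above.

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    // waitForPort keeps retrying a TCP dial until the port accepts connections
    // or the settle window expires. Two minutes matches the settle time
    // mentioned above; the address is just an example.
    func waitForPort(addr string, settle time.Duration) error {
        deadline := time.Now().Add(settle)
        for {
            conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
            if err == nil {
                conn.Close()
                return nil
            }
            if time.Now().After(deadline) {
                return fmt.Errorf("still refused after %s: %w", settle, err)
            }
            time.Sleep(5 * time.Second)
        }
    }

    func main() {
        if err := waitForPort("127.0.0.1:37017", 2*time.Minute); err != nil {
            fmt.Println("treat as a real failure:", err)
            return
        }
        fmt.Println("mongod is accepting connections again")
    }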

Revision history for this message
Alexander Balderson (asbalderson) wrote :
Revision history for this message
Richard Harding (rharding) wrote :

It looks like the database wasn't reachable. Can we get the syslogs from that machine so we can see why mongo wasn't connectable? If mongo isn't up, the controller API won't come up.
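
For context on that dependency, here is a minimal sketch (not Juju's code; addresses are illustrative) of why a dead mongod shows up to clients as the API port refusing connections: the API listener is only opened once the database is reachable.

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        // Gate on the database first, as the controller effectively does.
        conn, err := net.DialTimeout("tcp", "127.0.0.1:37017", 3*time.Second)
        if err != nil {
            fmt.Println("mongo unreachable, API will not be started:", err)
            return
        }
        conn.Close()

        // Only now is the API port opened; until this point clients dialing
        // 17070 get "connection refused".
        ln, err := net.Listen("tcp", "127.0.0.1:17070")
        if err != nil {
            fmt.Println("cannot open API port:", err)
            return
        }
        defer ln.Close()
        fmt.Println("API listening on", ln.Addr())
    }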

Revision history for this message
Alexander Balderson (asbalderson) wrote :

Rick,

I've uploaded all the logs from the test run.

Revision history for this message
Chris Gregan (cgregan) wrote :
Revision history for this message
Chris Gregan (cgregan) wrote :
Revision history for this message
John George (jog) wrote :
Revision history for this message
John George (jog) wrote :

Attached the most recent juju controller logs (juju-crashdump-controller-2019-01-08-22.55.11.tar.gz) for this issue.

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.5.1
Revision history for this message
Alexander Balderson (asbalderson) wrote :

This bug has been moved to field-critical. It is blocking Solutions QA testing.

Revision history for this message
Richard Harding (rharding) wrote :

Do you have the bundle used for this? We recently worked with the field team on a similar issue where the size/number of containers in the bundle caused mongodb to thrash itself to death and time out. There were disk performance tweaks needed with bcache to get to a successful state, and there's work planned to throttle connections to help prevent it going forward. I'm curious whether this is indeed a related failure in the same vein.
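
As an aside on the throttling idea mentioned above, the general approach is to bound how many connections are serviced concurrently. The sketch below uses a buffered channel as a semaphore; it illustrates the idea only and is not the planned Juju work (the limit of 16 is arbitrary).

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        // A buffered channel acts as a semaphore: at most 16 connections are
        // handled at once, so a burst of agents cannot overwhelm the database.
        sem := make(chan struct{}, 16)
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            sem <- struct{}{} // blocks once 16 connections are in flight
            go func(id int) {
                defer wg.Done()
                defer func() { <-sem }()
                fmt.Println("handling connection", id)
            }(i)
        }
        wg.Wait()
    }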

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Ok, we will test with the bcache tuning from bug 1809029.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here is the bundle: http://paste.ubuntu.com/p/rfZFC2CnZm/

We never saw errors like this in 2.4; this is a new problem in 2.5. Maybe the bcache settings need to be tuned, but it seems like something in 2.5 really aggravated things.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We are still seeing this even after applying the bcache tuning settings from bug 1809029.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Stay tuned; we may not have had the bcache settings set properly.

Revision history for this message
Tim Penhey (thumper) wrote :

This smells like the same shutdown / restart issue we are seeing with prodstack.

The root cause is likely an I/O timeout propagating from the mongo session pinging. This causes the state worker to restart, which triggers a cascading restart across a bunch of other workers. Some of those workers are not restarting properly, so the api-server does not come back up, which causes the problems we are observing here.
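
To make the described cascade concrete, here is a heavily simplified sketch (not Juju's actual worker code) of a state worker whose mongo ping loop treats a single i/o timeout as fatal. In a dependency engine, that exit forces every dependent worker, including the api-server, to restart too.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    var errTimeout = errors.New("read tcp: i/o timeout")

    // ping stands in for the mongo session ping; it fails on the third call
    // purely to demonstrate the cascade.
    func ping(attempt int) error {
        if attempt == 3 {
            return errTimeout
        }
        return nil
    }

    func stateWorker(died chan<- error) {
        for attempt := 1; ; attempt++ {
            if err := ping(attempt); err != nil {
                // Treating the timeout as fatal is what kicks off the
                // cascading restarts described above.
                died <- err
                return
            }
            time.Sleep(10 * time.Millisecond)
        }
    }

    func main() {
        died := make(chan error)
        go stateWorker(died)
        fmt.Println("state worker died:", <-died)
        fmt.Println("dependency engine now restarts the api-server and every other dependent worker")
    }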

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Even with the bcache settings definitely being set correctly, we're still seeing this.

Changed in juju:
importance: High → Critical
assignee: nobody → Tim Penhey (thumper)
tags: added: sts
Tim Penhey (thumper)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Tim Penhey (thumper) wrote :

Added some extra retry logic that should stop the issue with bundle deploy.
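
The committed change is not quoted here; as a rough illustration of the kind of retry logic involved, a helper like the one below retries a transient failure a few times before surfacing the error. The names and parameters are illustrative, not Juju's.

    package main

    import (
        "fmt"
        "time"
    )

    // withRetry runs op up to attempts times, sleeping between tries, and only
    // returns the error if every attempt fails.
    func withRetry(attempts int, delay time.Duration, op func() error) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = op(); err == nil {
                return nil
            }
            time.Sleep(delay)
        }
        return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
    }

    func main() {
        calls := 0
        err := withRetry(5, 100*time.Millisecond, func() error {
            calls++
            if calls < 3 {
                return fmt.Errorf("dial tcp 127.0.0.1:17070: connect: connection refused")
            }
            return nil
        })
        fmt.Printf("result after %d calls: %v\n", calls, err)
    }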

Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Alexander Balderson (asbalderson) wrote :

Attached logs from a 2.5.1 run

Revision history for this message
Tim Penhey (thumper) wrote :

From those controller logs, the apiserver is fine, and the client has disconnected at the following times:

from machine-1

2019-02-06 12:34:48 ERROR juju.apiserver apiserver.go:807 error serving RPCs: codec.ReadHeader error: error receiving message: read tcp 10.244.40.212:17070->91.189.92.9:59728: read: connection reset by peer

from machine-2

2019-02-06 14:19:47 ERROR juju.apiserver apiserver.go:807 error serving RPCs: codec.ReadHeader error: error receiving message: read tcp 10.244.40.211:17070->91.189.92.9:34850: read: connection reset by peer
2019-02-06 15:39:02 ERROR juju.apiserver apiserver.go:807 error serving RPCs: codec.ReadHeader error: error receiving message: read tcp 10.244.40.211:17070->91.189.92.9:46692: read: connection reset by peer

This isn't a problem with the apiserver.
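
For anyone triaging similar logs, the distinction being drawn here is between "connection refused" (nothing listening on the port, the original symptom) and "connection reset by peer" (the remote end dropped an established connection), which is why these entries point at the client rather than the apiserver. The sketch below, which is illustrative rather than Juju code, produces the latter error by having a client close its socket abruptly.

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
            panic(err)
        }
        defer ln.Close()

        go func() {
            c, err := net.Dial("tcp", ln.Addr().String())
            if err != nil {
                return
            }
            // Closing with linger 0 sends a RST, mimicking an abrupt client drop.
            c.(*net.TCPConn).SetLinger(0)
            c.Close()
        }()

        conn, err := ln.Accept()
        if err != nil {
            panic(err)
        }
        buf := make([]byte, 1)
        _, err = conn.Read(buf)
        fmt.Println(err) // typically "read: connection reset by peer"
    }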

tags: added: cdo-qa
tags: added: foundations-engine
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

After talking with thumper: the errors described in this bug aren't really a problem unless they happen after the deploy has started:

01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [1b66e5] "machine-2" cannot open api: unable to
                connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 WARNING juju.mongo mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: connect: connection refused
01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 WARNING juju.mongo mongodb connection failed, will retry: dial tcp 10.244.40.239:37017: connect: connection refused
01:01 < jhobbs> thumper: the controller logs exactly match the bug description; timeout is a secondary result of some other failure and I expect the triager suspected those errors/warnings.
                Are those logs red herrings?
01:05 < thumper> jhobbs: yeah... those are expected
01:05 < thumper> those happened during enable-ha
01:05 < thumper> when you are adding mongo's to the replicaset, it drops all connections
01:05 < thumper> these would only be a problem if we see them after the deploy has started
01:06 < thumper> mongo can take up to two minutes to settle with a new replicaset

description: updated
Changed in juju:
status: Fix Committed → Fix Released