Juju controller cannot connect to itself

Bug #1807735 reported by Alexander Balderson
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Tim Penhey
Milestone: 2.5.1

Bug Description

During bundle deployment, a juju controller became unable to connect to itself:

2018-12-07 14:10:26 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [cccf2a] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2018-12-07 14:10:30 WARNING juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: connect: connection refused
2018-12-07 14:10:30 WARNING juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 10.244.40.236:37017: connect: connection refused

See attached machine log.

Note: these errors/warnings don't necessarily indicate a problem; they are only a concern if they happen after a deploy has started:

01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [1b66e5] "machine-2" cannot open api: unable to
                connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 WARNING juju.mongo mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: connect: connection refused
01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 WARNING juju.mongo mongodb connection failed, will retry: dial tcp 10.244.40.239:37017: connect: connection refused
01:01 < jhobbs> thumper: the controller logs exactly match the bug description; timeout is a secondary result of some other failure and I expect the triager suspected those errors/warnings.
                Are those logs red herrings?
01:05 < thumper> jhobbs: yeah... those are expected
01:05 < thumper> those happened during enable-ha
01:05 < thumper> when you are adding mongo's to the replicaset, it drops all connections
01:05 < thumper> these would only be a problem if we see them after the deploy has started
01:06 < thumper> mongo can take up to two minutes to settle with a new replicaset
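
As a rough illustration of the triage rule thumper describes, a Go dial loop like the one below only treats a refused connection as a real failure once the replica set has had time to settle. This is a sketch, not Juju code; the address and the two-minute settle window are taken from the discussion above.

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    // waitForPort keeps retrying a TCP dial until the port accepts connections
    // or the settle window expires. Two minutes matches the settle time
    // mentioned above; the address is just an example.
    func waitForPort(addr string, settle time.Duration) error {
        deadline := time.Now().Add(settle)
        for {
            conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
            if err == nil {
                conn.Close()
                return nil
            }
            if time.Now().After(deadline) {
                return fmt.Errorf("still refused after %s: %w", settle, err)
            }
            time.Sleep(5 * time.Second)
        }
    }

    func main() {
        if err := waitForPort("127.0.0.1:37017", 2*time.Minute); err != nil {
            fmt.Println("treat as a real failure:", err)
            return
        }
        fmt.Println("mongod is accepting connections again")
    }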

Revision history for this message
Alexander Balderson (asbalderson) wrote :
Revision history for this message
Richard Harding (rharding) wrote :

It looks like the database wasn't reachable. Can we get the syslogs from that machine so we can see why mongo wasn't connectable? If mongo isn't up, the controller API won't come up.
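
For context on that dependency, here is a minimal sketch (not Juju's code; addresses are illustrative) of why a dead mongod shows up to clients as the API port refusing connections: the API listener is only opened once the database is reachable.

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        // Gate on the database first, as the controller effectively does.
        conn, err := net.DialTimeout("tcp", "127.0.0.1:37017", 3*time.Second)
        if err != nil {
            fmt.Println("mongo unreachable, API will not be started:", err)
            return
        }
        conn.Close()

        // Only now is the API port opened; until this point clients dialing
        // 17070 get "connection refused".
        ln, err := net.Listen("tcp", "127.0.0.1:17070")
        if err != nil {
            fmt.Println("cannot open API port:", err)
            return
        }
        defer ln.Close()
        fmt.Println("API listening on", ln.Addr())
    }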

Revision history for this message
Alexander Balderson (asbalderson) wrote :

Rick,

I've uploaded all the logs from the test run.

Revision history for this message
Chris Gregan (cgregan) wrote :
Revision history for this message
Chris Gregan (cgregan) wrote :
Revision history for this message
John George (jog) wrote :
Revision history for this message
John George (jog) wrote :

Attached the most recent juju controller logs (juju-crashdump-controller-2019-01-08-22.55.11.tar.gz) for this issue.

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.5.1
Revision history for this message
Alexander Balderson (asbalderson) wrote :

This bug has been moved to field-critical. It is blocking Solutions QA testing.

Revision history for this message
Richard Harding (rharding) wrote :

Do you have the bundle used for this? We recently worked with the field team on a similar issue where the size/number of containers in the bundle caused mongodb to thrash itself to death and time out. There were disk performance tweaks needed with bcache to get to a successful state, and there's work planned to throttle connections to help prevent it going forward. I'm curious whether this is indeed a related failure in the same vein.
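
As an aside on the throttling idea mentioned above, the general approach is to bound how many connections are serviced concurrently. The sketch below uses a buffered channel as a semaphore; it illustrates the idea only and is not the planned Juju work (the limit of 16 is arbitrary).

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        // A buffered channel acts as a semaphore: at most 16 connections are
        // handled at once, so a burst of agents cannot overwhelm the database.
        sem := make(chan struct{}, 16)
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            sem <- struct{}{} // blocks once 16 connections are in flight
            go func(id int) {
                defer wg.Done()
                defer func() { <-sem }()
                fmt.Println("handling connection", id)
            }(i)
        }
        wg.Wait()
    }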

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Ok, we will test with the bcache tuning from bug 1809029.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here is the bundle: http://paste.ubuntu.com/p/rfZFC2CnZm/

We never saw errors like this in 2.4; this is a new problem in 2.5. Maybe the bcache settings need to be tuned, but it seems like something in 2.5 really aggravated things.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We are still seeing this even after applying the bcache tuning settings from bug 1809029.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Stay tuned; we may not have had the bcache settings set properly.

Revision history for this message
Tim Penhey (thumper) wrote :

This smells like the same shutdown / restart issue we are seeing with prodstack.

The root cause is likely an I/O timeout propagating from the mongo session pinging. This causes the state worker to restart, which triggers a cascading restart across a bunch of other workers. Some of those workers are not restarting properly, so the api-server does not come back up, which causes the problems we are observing here.
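
To make the described cascade concrete, here is a heavily simplified sketch (not Juju's actual worker code) of a state worker whose mongo ping loop treats a single i/o timeout as fatal. In a dependency engine, that exit forces every dependent worker, including the api-server, to restart too.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    var errTimeout = errors.New("read tcp: i/o timeout")

    // ping stands in for the mongo session ping; it fails on the third call
    // purely to demonstrate the cascade.
    func ping(attempt int) error {
        if attempt == 3 {
            return errTimeout
        }
        return nil
    }

    func stateWorker(died chan<- error) {
        for attempt := 1; ; attempt++ {
            if err := ping(attempt); err != nil {
                // Treating the timeout as fatal is what kicks off the
                // cascading restarts described above.
                died <- err
                return
            }
            time.Sleep(10 * time.Millisecond)
        }
    }

    func main() {
        died := make(chan error)
        go stateWorker(died)
        fmt.Println("state worker died:", <-died)
        fmt.Println("dependency engine now restarts the api-server and every other dependent worker")
    }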

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Even with the bcache settings definitely being set correctly, we're still seeing this.

Changed in juju:
importance: High → Critical
assignee: nobody → Tim Penhey (thumper)
tags: added: sts
Tim Penhey (thumper)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Tim Penhey (thumper) wrote :

Added some extra retry logic that should stop the issue with bundle deploy.
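
The committed change is not quoted here; as a rough illustration of the kind of retry logic involved, a helper like the one below retries a transient failure a few times before surfacing the error. The names and parameters are illustrative, not Juju's.

    package main

    import (
        "fmt"
        "time"
    )

    // withRetry runs op up to attempts times, sleeping between tries, and only
    // returns the error if every attempt fails.
    func withRetry(attempts int, delay time.Duration, op func() error) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = op(); err == nil {
                return nil
            }
            time.Sleep(delay)
        }
        return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
    }

    func main() {
        calls := 0
        err := withRetry(5, 100*time.Millisecond, func() error {
            calls++
            if calls < 3 {
                return fmt.Errorf("dial tcp 127.0.0.1:17070: connect: connection refused")
            }
            return nil
        })
        fmt.Printf("result after %d calls: %v\n", calls, err)
    }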

Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Alexander Balderson (asbalderson) wrote :

Attached logs from a 2.5.1 run

Revision history for this message
Tim Penhey (thumper) wrote :

From those controller logs, the apiserver is fine, and the client has disconnected at the following times:

from machine-1

2019-02-06 12:34:48 ERROR juju.apiserver apiserver.go:807 error serving RPCs: codec.ReadHeader error: error receiving message: read tcp 10.244.40.212:17070->91.189.92.9:59728: read: connection reset by peer

from machine-2

2019-02-06 14:19:47 ERROR juju.apiserver apiserver.go:807 error serving RPCs: codec.ReadHeader error: error receiving message: read tcp 10.244.40.211:17070->91.189.92.9:34850: read: connection reset by peer
2019-02-06 15:39:02 ERROR juju.apiserver apiserver.go:807 error serving RPCs: codec.ReadHeader error: error receiving message: read tcp 10.244.40.211:17070->91.189.92.9:46692: read: connection reset by peer

This isn't a problem with the apiserver.
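
For anyone triaging similar logs, the distinction being drawn here is between "connection refused" (nothing listening on the port, the original symptom) and "connection reset by peer" (the remote end dropped an established connection), which is why these entries point at the client rather than the apiserver. The sketch below, which is illustrative rather than Juju code, produces the latter error by having a client close its socket abruptly.

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
            panic(err)
        }
        defer ln.Close()

        go func() {
            c, err := net.Dial("tcp", ln.Addr().String())
            if err != nil {
                return
            }
            // Closing with linger 0 sends a RST, mimicking an abrupt client drop.
            c.(*net.TCPConn).SetLinger(0)
            c.Close()
        }()

        conn, err := ln.Accept()
        if err != nil {
            panic(err)
        }
        buf := make([]byte, 1)
        _, err = conn.Read(buf)
        fmt.Println(err) // typically "read: connection reset by peer"
    }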

tags: added: cdo-qa
tags: added: foundations-engine
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

After talking with thumper: the errors described in this bug aren't really a problem unless they happen after the deploy has started:

01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [1b66e5] "machine-2" cannot open api: unable to
                connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 WARNING juju.mongo mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: connect: connection refused
01:01 < jhobbs> debug_log.txt:machine-2: 20:27:46 WARNING juju.mongo mongodb connection failed, will retry: dial tcp 10.244.40.239:37017: connect: connection refused
01:01 < jhobbs> thumper: the controller logs exactly match the bug description; timeout is a secondary result of some other failure and I expect the triager suspected those errors/warnings.
                Are those logs red herrings?
01:05 < thumper> jhobbs: yeah... those are expected
01:05 < thumper> those happened during enable-ha
01:05 < thumper> when you are adding mongo's to the replicaset, it drops all connections
01:05 < thumper> these would only be a problem if we see them after the deploy has started
01:06 < thumper> mongo can take up to two minutes to settle with a new replicaset

description: updated
Changed in juju:
status: Fix Committed → Fix Released