new controller machines stay in "adding-vote" forever

Bug #1720251 reported by Jason Hobbs
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Tim Penhey

Bug Description

juju version: 2.2.4
maas version: 2.2.3-6114-g672ff26-0ubuntu1~16.04.1

After running enable-ha, my two new controller machines have been in 'adding-vote' state for over an hour. Their machine logs have this:

2017-09-28 21:59:47 ERROR juju.worker.dependency engine.go:546 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers

Here is the status:
http://paste.ubuntu.com/25635827/

I've attached a crashdump from the deployment.

This doesn't happen every time - we've had quite a few runs where it didn't happen, and this is the first where it did.

The longest I've let it stay in this state is 8 hours - it seems like it will be stuck forever.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
tags: added: intermittent-failure
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Related to mongo on these additional controller machines, I am seeing this in 1/baremetal/log/syslog (machine 1, for example):

Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] MongoDB starting : pid=11659 port=37017 dbpath=/var/lib/juju/db 64-bit host=juju-2
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] db version v3.2.12
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] git version: ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] allocator: tcmalloc
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] modules: none
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] build environment:
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] distarch: x86_64
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] target_arch: x86_64
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] options: { net: { ipv6: true, port: 37017, ssl: { PEMKeyFile: "/var/lib/juju/server.pem", PEMKeyPassword: "<password>", mode: "requireSSL" } }, replication: { oplogSizeMB: 1024, replSet: "juju" }, security: { authorization: "enabled", keyFile: "/var/lib/juju/shared-secret" }, storage: { dbPath: "/var/lib/juju/db", engine: "wiredTiger", journal: { enabled: true }, wiredTiger: { engineConfig: { cacheSizeGB: 1 } } }, systemLog: { destination: "syslog", quiet: true } }
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten]
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten]
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 20000 processes, 65000 files. Number of processes should be at least 32500 : 0.5 times number of files.
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] Did not find local voted for document at startup.
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] Did not find local replica set configuration document at startup; NoMatchingDocument: Did not find replica set configuration document in local.system.replset
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] Initializing full-time diagnostic data capture with directory '/var/lib/juju/db/diagnostic.data'
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] waiting for connections on port 37017 ssl
Sep 28 20:46:11 juju-2 mongod.37017[11659]: [conn1] note: no users configured in admin.system.users, allowing localhost access

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Apart from that, I do not see anything obvious... Juju reports that mongodb has started (although we cannot connect), as per machine-1.log:

2017-09-28 20:45:56 DEBUG juju.cmd.jujud machine.go:1649 mongodb service is installed
2017-09-28 20:46:09 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-1
2017-09-28 20:46:09 DEBUG juju.state open.go:130 dialing mongo
2017-09-28 20:46:09 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: getsockopt: connection refused

Revision history for this message
John A Meinel (jameinel) wrote :

Connection refused on 37017 would indicate that mongo isn't actually up and running.
It would be good to know if there is a service configured for mongo. Something like:
systemctl status juju-db

might be helpful, in case there is some reason why we are failing to start Mongo.

Some of the other comments seem to hint that maybe we failed to initialize Mongo with an appropriate admin user. (IIRC, we install mongo, start it in local-only mode without any users, then configure some users and restart it into replica-set mode, etc.)

I suppose it is possible that something went wrong and left us in a state that we didn't realize we needed to reinitialize.
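
For a quick check of that first point - whether mongod is even listening - a minimal probe along these lines works (a sketch in Go, not Juju's actual dial logic; the address and timeout are assumptions, and a successful dial only proves a listener is up, not that the replica set is healthy):

package main

import (
	"fmt"
	"net"
	"time"
)

// Probe the controller's mongod port: the moral equivalent of the
// telnet check mentioned later in this thread. "connection refused"
// means nothing is listening on the port at all.
func main() {
	addr := "127.0.0.1:37017" // assumed: Juju's local mongod address
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		fmt.Printf("mongod not reachable at %s: %v\n", addr, err)
		return
	}
	conn.Close()
	fmt.Printf("mongod is listening at %s\n", addr)
}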

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Mongo was up and running on the machines - I could telnet to 127.0.0.1:37017.

tags: added: cpe-onsite
Revision history for this message
José Pekkarinen (koalinux) wrote :

Hi,

We hit this with the slight difference that the redundant controllers never become operative. The Juju version is 2.2.6, as shown in the output:

$ juju controllers
Use --refresh flag with this command to see the latest information.

Controller Model User Access Cloud/Region Models Machines HA Version
my_controller* default admin superuser my_region 2 142 1/3 2.2.6

$ juju status -m controller
Model Controller Cloud/Region Version SLA
controller my_controller my_region 2.2.6 unsupported

App Version Status Scale Charm Store Rev OS Notes

Unit Workload Agent Machine Public address Ports Message

Machine State DNS Inst id Series AZ Message
0 started X.X.X.18 8dghdf xenial FC1 Deployed
42 started X.X.X.62 tyqpxr xenial FC2 Deployed
43 started X.X.X.29 d4dx38 xenial FC3 Deployed

$ juju ssh -m controller 42

Reading the machine log with less, I see:

...
2017-11-24 12:09:28 DEBUG juju.worker.dependency engine.go:504 "state" manifold worker stopped: cannot connect to mongodb: no reachable servers
2017-11-24 12:09:28 ERROR juju.worker.dependency engine.go:551 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers
2017-11-24 12:09:28 DEBUG juju.worker.dependency engine.go:553 stack trace:
no reachable servers
github.com/juju/juju/state/open.go:527: cannot connect to mongodb
github.com/juju/juju/state/open.go:103:
github.com/juju/juju/worker/state/manifold.go:64:
2017-11-24 12:09:31 DEBUG juju.cmd.jujud machine.go:1632 mongo is already initialized
2017-11-24 12:09:31 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-42
2017-11-24 12:09:31 DEBUG juju.state open.go:130 dialing mongo
2017-11-24 12:09:31 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
2017-11-24 12:09:32 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
2017-11-24 12:09:33 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
2017-11-24 12:09:34 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
2017-11-24 12:09:35 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
2017-11-24 12:09:36 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
...

$ sudo lsof -i | grep LISTEN
sshd 3657 root 3u IPv4 21999 0t0 TCP *:ssh (LISTEN)
mongod 15856 root 6u IPv4 22147 0t0 TCP *:37017 (LISTEN)

On the primary node's mongo I see the following:

juju:PRIMARY> rs.status()
{
        "set" : "juju",
        "date" : ISODate("2017-11-24T12:12:22.569Z"),
        "myState" : 1,
        "term" : NumberLong(42),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 1,
                        "name" : "127.0.0.1:37017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 2767413,
                        "optime" : {
                                "ts" : Timestamp(151...

Revision history for this message
John A Meinel (jameinel) wrote :

So the bug that we're seeing with José's model is that it keeps trying to use 127.0.0.1 as the IP address for each of the machines.
What is weird is that we have a valid "mongo-space-name": "oam" and "mongo-space-status": "valid".
So Juju should be trying to select an IP address from that space to start with.
I know that Andrew has made an explicit patch to refuse 127.0.0.1, and I believe ::1, as possible addresses to use in rs.config, which should help here, but we still need to understand why Juju would have thought 127.0.0.1 was the best address to use.
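
For illustration, refusing loopback addresses when choosing a replica-set address looks roughly like this (a sketch, not the actual peergrouper code; selectPeerAddr is a made-up helper, and the real selection also weighs spaces and scopes):

package main

import (
	"fmt"
	"net"
)

// selectPeerAddr picks the first candidate host:port whose host is not
// a loopback address (127.0.0.1, ::1), so that localhost is never
// published to the replica set.
func selectPeerAddr(candidates []string) (string, bool) {
	for _, hostPort := range candidates {
		host, _, err := net.SplitHostPort(hostPort)
		if err != nil {
			continue
		}
		if ip := net.ParseIP(host); ip != nil && ip.IsLoopback() {
			continue // skip 127.0.0.1 and [::1]
		}
		return hostPort, true
	}
	return "", false
}

func main() {
	// The same candidate shape that shows up in the peergrouper logs below.
	addrs := []string{"127.0.0.1:37017", "[::1]:37017", "10.245.208.204:37017"}
	if addr, ok := selectPeerAddr(addrs); ok {
		fmt.Println("selected:", addr) // selected: 10.245.208.204:37017
	}
}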

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

systemctl status dump http://paste.ubuntu.com/26133356/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Crashdump of the controller model from a juju 2.3-rc2 failure.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

After adding 'enable-ha' back to our test setup yesterday, we're hitting this issue more often than not with both juju 2.3~rc2 and juju 2.2.6.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This bug has entered the 'Field High' SLA process.

Revision history for this message
Tim Penhey (thumper) wrote :

@Jason, we'll have someone focus on this, probably starting next week (since it is EOW).

Changed in juju:
milestone: none → 2.3.1
importance: Medium → High
Changed in juju:
milestone: 2.3.1 → none
summary: - new controller machines stay in "adding-vote" for over an hour
+ new controller machines stay in "adding-vote" forever
description: updated
Tim Penhey (thumper)
Changed in juju:
milestone: none → 2.3.2
Revision history for this message
Chris Gregan (cgregan) wrote :

Bumped to field critical

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1720251] Re: new controller machines stay in "adding-vote" forever

Getting information about what rs.config says inside the Mongo controller, and what log output the peergrouper is producing, are both relevant for this.

Most likely this is a bug we've seen where the peergrouper decides that the best address to assign for all of the Mongo replicas is "127.0.0.1", which is critically the wrong one.
A fix has already gone into 2.3 so that localhost is never selected. It is possible that we would backport that fix.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is failing on 2.3, so I don't think that fix worked.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

"Getting information about what rs.config says inside the Mongo controller,
and what log output is happening for the peergrouper are both relevant for
this."

How do I do this?

Revision history for this message
John A Meinel (jameinel) wrote :

If you can ssh into the controller machine and connect to Mongo, you should be able to run:

> rs.config()
> rs.status()

on the Mongo console.

If you can at least ssh into the machine that has the primary mongo, this should get you access to the Mongo shell:

agent=$(cd /var/lib/juju/agents; echo machine-*)
pw=$(sudo grep statepassword /var/lib/juju/agents/${agent}/agent.conf | cut '-d ' -sf2)
/usr/lib/juju/mongo3.2/bin/mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju

At this point, I'd probably be more interested in 'juju debug-log --include-model juju.peergrouper --replay'.

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

That debug-log call just hangs for me. I'm going to email you instructions on how to log into the environment so you can inspect it.

Revision history for this message
John A Meinel (jameinel) wrote :

On a machine demonstrating this, we get:
> rs.status()
{
        "set" : "juju",
        "date" : ISODate("2017-12-13T16:35:05.020Z"),
        "myState" : 1,
        "term" : NumberLong(1),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 1,
                        "name" : "10.245.208.204:37017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 76881,
                        "optime" : {
                                "ts" : Timestamp(1513182904, 5),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2017-12-13T16:35:04Z"),
                        "electionTime" : Timestamp(1513106027, 1),
                        "electionDate" : ISODate("2017-12-12T19:13:47Z"),
                        "configVersion" : 1,
                        "self" : true
                }
        ],
        "ok" : 1
}

and
> rs.config()
{
        "_id" : "juju",
        "version" : 1,
        "protocolVersion" : NumberLong(1),
        "members" : [
                {
                        "_id" : 1,
                        "host" : "10.245.208.204:37017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {
                                "juju-machine-id" : "0"
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 2000,
                "heartbeatTimeoutSecs" : 10,
                "electionTimeoutMillis" : 10000,
                "getLastErrorModes" : {

                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                },
                "replicaSetId" : ObjectId("5a302a6ac77b4ca613f5d47d")
        }
}

and machine-0.log is saying:

2017-12-13 16:35:43 DEBUG juju.worker.peergrouper publish.go:43 API host ports have not changed
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:38 calculating desired peer group
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:43 members: ...
   &peergrouper.machine{id: "0", wantsVote: true, hostPorts: [10.245.208.204:37017 127.0.0.1:37017 [::1]:37017]}: rs_id=1, rs_addr=10.245.208.204:37017
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:44 extra: []replicaset.Member(nil)
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:45 maxId: 1
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:117 assessing possible peer group changes:
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:124 machine "0" is already voting
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:142 assessed
2017-12-13 16:35:43 DEBUG juju.mongo mongo.go:288 selecting mongo peer hostPort by scope from [10.245...

Tim Penhey (thumper)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

So, somehow, the state in the database appears to be corrupted. Specifically:
> db.controllers.find({"_id": "e"}).pretty()
{
...
        "machineids" : ["0"],
        "votingmachineids" : ["0","1","2"],
...
}

The code comments on those attributes say:

MachineIds holds the ids of all machines configured to run a controller. It includes all the machine ids in VotingMachineIds.

However, it is clearly not true that "machineids" is a superset of "votingmachineids".

Now, all of the machine records do have the Job set:
> db.machines.find({"model-uuid": "f27da689-56d0-4c2f-8e80-4485b7e9d074"}, {"_id": 1, "jobs": 1}).pretty()
{ "_id" : "f27da689-56d0-4c2f-8e80-4485b7e9d074:0", "jobs" : [ 2, 1 ] }
{ "_id" : "f27da689-56d0-4c2f-8e80-4485b7e9d074:1", "jobs" : [ 1, 2 ] }
{ "_id" : "f27da689-56d0-4c2f-8e80-4485b7e9d074:2", "jobs" : [ 1, 2 ] }

(they all have JobHostUnit and JobManageModel).

What I don't understand is that the transaction that adds new ids to "votingmachineids" also adds those ids to "machineids", so the latter should never be lagging behind.

There are also transactions for "promote" and "demote" that add to and remove from "votingmachineids", but we shouldn't be promoting machines that aren't already in "machineids".
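
For context, the enable-ha op described above would have roughly this shape (a sketch using mgo/txn conventions; the field names come from the document shown above, but the exact op layout is an assumption):

package main

import (
	"fmt"

	"gopkg.in/mgo.v2/bson"
	"gopkg.in/mgo.v2/txn"
)

func main() {
	ids := []string{"1", "2"}
	// A single op is expected to add the new ids to *both* fields,
	// which is why "machineids" should always be a superset of
	// "votingmachineids". Note the two separate top-level "$addToSet"
	// entries in the bson.D - the detail that turns out to matter.
	op := txn.Op{
		C:  "controllers",
		Id: "e",
		Update: bson.D{
			{"$addToSet", bson.D{{"machineids", bson.D{{"$each", ids}}}}},
			{"$addToSet", bson.D{{"votingmachineids", bson.D{{"$each", ids}}}}},
		},
	}
	fmt.Printf("%+v\n", op)
}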

Revision history for this message
John A Meinel (jameinel) wrote :

Just updating the machineids field with:
db.controllers.update({"_id": "e"}, {$set: {"machineids": ["0", "1", "2"]}})

was sufficient to cause everything to actually go into 'ha-enabled'.

I don't know what caused that setting to not get updated correctly in the first place, though.

Revision history for this message
John George (jog) wrote :

Tested with juju edge 2.4-beta1-xenial-amd64 and experienced no change.
https://solutions.qa.canonical.com/#/qa/testRun/b0320685-3abe-4e89-9ee8-2f7419f0ec6e

Revision history for this message
Alexander Litvinov (alitvinov) wrote :

I can confirm that John's solution of updating machineids works.
But apparently the machine agent needs to be restarted after the update as well, so:

service jujud-machine-0 restart

Revision history for this message
Chris Gregan (cgregan) wrote :

Editing the DB after deployment is not a usable workaround in Solutions QA CI or the field. Please provide details on how this issue can be fixed.

Tim Penhey (thumper)
Changed in juju:
assignee: John A Meinel (jameinel) → Tim Penhey (thumper)
Revision history for this message
Tim Penhey (thumper) wrote :

OK, I think we've found the culprit.

I provided a custom binary today with extra logging of the txn ops for any transaction that touched the controllers "e" document. And what we ended up with was somewhat eye-opening.

The txn.op said to add "1" and "2" to both "machineids" and "votingmachineids".
The rs.Config() in mongo showed only one machine.
The document in the database had the observed symptom of
{"machineids": ["0"], "votingmachineids": ["0", "1", "2"}]}.
The txns collection said that it had added "1" and "2" to "machineids" -- and only "machineids". This was clearly a lie.

A big clue here was that the written-out transaction showed a single top-level key of "$addToSet", while the txn.op had two of them.

It seemed that somehow during execution the two update commands got flattened into one.
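
In other words (a sketch, not the actual Juju patch), an update document carrying duplicate top-level "$addToSet" keys is fragile - bson.D preserves both, but any step that round-trips the document through a Go map keeps only one of them - whereas a single "$addToSet" naming both fields is unambiguous:

package main

import (
	"fmt"

	"gopkg.in/mgo.v2/bson"
)

func main() {
	ids := bson.D{{"$each", []string{"1", "2"}}}

	// Fragile: two top-level "$addToSet" keys, as seen in the txn.op.
	fragile := bson.D{
		{"$addToSet", bson.D{{"machineids", ids}}},
		{"$addToSet", bson.D{{"votingmachineids", ids}}},
	}

	// Safe: one "$addToSet" that names both fields.
	safe := bson.D{
		{"$addToSet", bson.D{
			{"machineids", ids},
			{"votingmachineids", ids},
		}},
	}

	// Round-tripping the fragile form through a map silently drops the
	// first key: only the "votingmachineids" update survives, which is
	// consistent with the symptom observed in the database.
	m := bson.M{}
	for _, elem := range fragile {
		m[elem.Name] = elem.Value
	}
	fmt.Printf("fragile after map round-trip: %v\n", m)
	fmt.Printf("safe form: %v\n", safe)
}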

Tim Penhey (thumper)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released