new controller machines stay in "adding-vote" forever

Bug #1720251 reported by Jason Hobbs
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Tim Penhey

Bug Description

juju version: 2.2.4
maas version: 2.2.3-6114-g672ff26-0ubuntu1~16.04.1

After running enable-ha, my two new controller machines have been in 'adding-vote' state for over an hour. Their machine logs have this:

2017-09-28 21:59:47 ERROR juju.worker.dependency engine.go:546 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers

Here is the status:
http://paste.ubuntu.com/25635827/

I've attached a crashdump from the deployment.

This doesn't happen every time - we've had quite a few runs where it didn't happen, and this is the first where it did.

The longest I've let it stay in this state is 8 hours - it seems like it will be stuck forever.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
tags: added: intermittent-failure
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Related to mongo on these additional controller machines, I am seeing this in 1/baremetal/log/syslog (machine 1, for example):

Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] MongoDB starting : pid=11659 port=37017 dbpath=/var/lib/juju/db 64-bit host=juju-2
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] db version v3.2.12
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] git version: ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] allocator: tcmalloc
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] modules: none
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] build environment:
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] distarch: x86_64
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] target_arch: x86_64
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] options: { net: { ipv6: true, port: 37017, ssl: { PEMKeyFile: "/var/lib/juju/server.pem", PEMKeyPassword: "<password>", mode: "requireSSL" } }, replication: { oplogSizeMB: 1024, replSet: "juju" }, security: { authorization: "enabled", keyFile: "/var/lib/juju/shared-secret" }, storage: { dbPath: "/var/lib/juju/db", engine: "wiredTiger", journal: { enabled: true }, wiredTiger: { engineConfig: { cacheSizeGB: 1 } } }, systemLog: { destination: "syslog", quiet: true } }
Sep 28 20:45:56 juju-2 mongod.37017[11659]: [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten]
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten]
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 20000 processes, 65000 files. Number of processes should be at least 32500 : 0.5 times number of files.
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] Did not find local voted for document at startup.
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] Did not find local replica set configuration document at startup; NoMatchingDocument: Did not find replica set configuration document in local.system.replset
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] Initializing full-time diagnostic data capture with directory '/var/lib/juju/db/diagnostic.data'
Sep 28 20:46:10 juju-2 mongod.37017[11659]: [initandlisten] waiting for connections on port 37017 ssl
Sep 28 20:46:11 juju-2 mongod.37017[11659]: [conn1] note: no users configured in admin.system.users, allowing localhost access

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Apart from that, I do not see anything obvious... Juju reports that mongodb has started (although we cannot connect), as per machine-1.log:

2017-09-28 20:45:56 DEBUG juju.cmd.jujud machine.go:1649 mongodb service is installed
2017-09-28 20:46:09 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-1
2017-09-28 20:46:09 DEBUG juju.state open.go:130 dialing mongo
2017-09-28 20:46:09 WARNING juju.mongo open.go:159 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: getsockopt: connection refused

Revision history for this message
John A Meinel (jameinel) wrote :

Connection refused on 37017 would indicate that mongo isn't actually up and running.
It would be good to know if there is a service configured for mongo. Something like:
systemctl status juju-db

might be helpful, in case there is some reason why we are failing to start Mongo.

Some of the other comments seem to hint that maybe we failed to initialize Mongo with an appropriate admin user. (IIRC, we install mongo, start it in local-only mode without any users, then configure some users and restart it into replica-set mode, etc.)

I suppose it is possible that something went wrong and left us in a state that we didn't realize we needed to reinitialize.
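
For a quick check of that first point - whether mongod is even listening - a minimal probe along these lines works (a sketch in Go, not Juju's actual dial logic; the address and timeout are assumptions, and a successful dial only proves a listener is up, not that the replica set is healthy):

package main

import (
	"fmt"
	"net"
	"time"
)

// Probe the controller's mongod port: the moral equivalent of the
// telnet check mentioned later in this thread. "connection refused"
// means nothing is listening on the port at all.
func main() {
	addr := "127.0.0.1:37017" // assumed: Juju's local mongod address
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		fmt.Printf("mongod not reachable at %s: %v\n", addr, err)
		return
	}
	conn.Close()
	fmt.Printf("mongod is listening at %s\n", addr)
}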

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Mongo was up and running on the machines - I could telnet to 127.0.0.1:37017.

tags: added: cpe-onsite
Revision history for this message
José Pekkarinen (koalinux) wrote :

Hi,

We hit this with the slight difference that the redundant controllers never become operative. The Juju version is 2.2.6, as shown in the output:

$ juju controllers
Use --refresh flag with this command to see the latest information.

Controller Model User Access Cloud/Region Models Machines HA Version
my_controller* default admin superuser my_region 2 142 1/3 2.2.6

$ juju status -m controller
Model Controller Cloud/Region Version SLA
controller my_controller my_region 2.2.6 unsupported

App Version Status Scale Charm Store Rev OS Notes

Unit Workload Agent Machine Public address Ports Message

Machine State DNS Inst id Series AZ Message
0 started X.X.X.18 8dghdf xenial FC1 Deployed
42 started X.X.X.62 tyqpxr xenial FC2 Deployed
43 started X.X.X.29 d4dx38 xenial FC3 Deployed

$ juju ssh -m controller 42

Reading the machine log with less, I see:

...
2017-11-24 12:09:28 DEBUG juju.worker.dependency engine.go:504 "state" manifold worker stopped: cannot connect to mongodb: no reachable servers
2017-11-24 12:09:28 ERROR juju.worker.dependency engine.go:551 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers
2017-11-24 12:09:28 DEBUG juju.worker.dependency engine.go:553 stack trace:
no reachable servers
github.com/juju/juju/state/open.go:527: cannot connect to mongodb
github.com/juju/juju/state/open.go:103:
github.com/juju/juju/worker/state/manifold.go:64:
2017-11-24 12:09:31 DEBUG juju.cmd.jujud machine.go:1632 mongo is already initialized
2017-11-24 12:09:31 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-42
2017-11-24 12:09:31 DEBUG juju.state open.go:130 dialing mongo
2017-11-24 12:09:31 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
2017-11-24 12:09:32 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
2017-11-24 12:09:33 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
2017-11-24 12:09:34 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
2017-11-24 12:09:35 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
2017-11-24 12:09:36 DEBUG juju.mongo open.go:173 dialled mongodb server at "127.0.0.1:37017"
...

$ sudo lsof -i | grep LISTEN
sshd 3657 root 3u IPv4 21999 0t0 TCP *:ssh (LISTEN)
mongod 15856 root 6u IPv4 22147 0t0 TCP *:37017 (LISTEN)

On the primary node's mongo I see the following:

juju:PRIMARY> rs.status()
{
        "set" : "juju",
        "date" : ISODate("2017-11-24T12:12:22.569Z"),
        "myState" : 1,
        "term" : NumberLong(42),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 1,
                        "name" : "127.0.0.1:37017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 2767413,
                        "optime" : {
                                "ts" : Timestamp(151...

Revision history for this message
John A Meinel (jameinel) wrote :

So the bug that we're seeing with José's model is that it keeps trying to use 127.0.0.1 as the IP address for each of the machines.
What is weird is that we have a valid "mongo-space-name": "oam" and "mongo-space-status": "valid".
So Juju should be trying to select an IP address from that space to start with.
I know that Andrew has made an explicit patch to refuse 127.0.0.1, and I believe ::1, as possible addresses to use in rs.config, which should help here, but we still need to understand why Juju would have thought 127.0.0.1 was the best address to use.
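
For illustration, refusing loopback addresses when choosing a replica-set address looks roughly like this (a sketch, not the actual peergrouper code; selectPeerAddr is a made-up helper, and the real selection also weighs spaces and scopes):

package main

import (
	"fmt"
	"net"
)

// selectPeerAddr picks the first candidate host:port whose host is not
// a loopback address (127.0.0.1, ::1), so that localhost is never
// published to the replica set.
func selectPeerAddr(candidates []string) (string, bool) {
	for _, hostPort := range candidates {
		host, _, err := net.SplitHostPort(hostPort)
		if err != nil {
			continue
		}
		if ip := net.ParseIP(host); ip != nil && ip.IsLoopback() {
			continue // skip 127.0.0.1 and [::1]
		}
		return hostPort, true
	}
	return "", false
}

func main() {
	// The same candidate shape that shows up in the peergrouper logs below.
	addrs := []string{"127.0.0.1:37017", "[::1]:37017", "10.245.208.204:37017"}
	if addr, ok := selectPeerAddr(addrs); ok {
		fmt.Println("selected:", addr) // selected: 10.245.208.204:37017
	}
}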

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

systemctl status dump http://paste.ubuntu.com/26133356/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Crashdump of the controller model from a juju 2.3-rc2 failure.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

After adding 'enable-ha' back to our test setup yesterday, we're hitting this issue more often than not with both juju 2.3~rc2 and juju 2.2.6.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This bug has entered the 'Field High' SLA process.

Revision history for this message
Tim Penhey (thumper) wrote :

@Jason, we'll have someone focus on this, probably starting next week (since it is EOW).

Changed in juju:
milestone: none → 2.3.1
importance: Medium → High
Changed in juju:
milestone: 2.3.1 → none
summary: - new controller machines stay in "adding-vote" for over an hour
+ new controller machines stay in "adding-vote" forever
description: updated
Tim Penhey (thumper)
Changed in juju:
milestone: none → 2.3.2
Revision history for this message
Chris Gregan (cgregan) wrote :

Bumped to field critical

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1720251] Re: new controller machines stay in "adding-vote" forever

Getting information about what rs.config says inside the Mongo controller, and what log output the peergrouper is producing, are both relevant for this.

Most likely this is a bug we've seen where the peergrouper decides that the best address to assign for all of the Mongo replicas is "127.0.0.1", which is critically the wrong one.
A fix has already gone into 2.3 so that localhost is never selected. It is possible that we would backport that fix.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is failing on 2.3, so I don't think that fix worked.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

"Getting information about what rs.config says inside the Mongo controller,
and what log output is happening for the peergrouper are both relevant for
this."

How do I do this?

Revision history for this message
John A Meinel (jameinel) wrote :

If you can ssh into the controller machine and connect to Mongo, you should be able to run:

> rs.config()
> rs.status()

on the Mongo console.

If you can at least ssh into the machine that has the primary mongo, this should get you access to the Mongo shell:

agent=$(cd /var/lib/juju/agents; echo machine-*)
pw=$(sudo grep statepassword /var/lib/juju/agents/${agent}/agent.conf | cut '-d ' -sf2)
/usr/lib/juju/mongo3.2/bin/mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju

At this point, I'd probably be more interested in 'juju debug-log --include-model juju.peergrouper --replay'.

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

That debug-log call just hangs for me. I'm going to email you instructions on how to log into the environment so you can inspect it.

Revision history for this message
John A Meinel (jameinel) wrote :

On a machine demonstrating this, we get:
> rs.status()
{
        "set" : "juju",
        "date" : ISODate("2017-12-13T16:35:05.020Z"),
        "myState" : 1,
        "term" : NumberLong(1),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 1,
                        "name" : "10.245.208.204:37017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 76881,
                        "optime" : {
                                "ts" : Timestamp(1513182904, 5),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2017-12-13T16:35:04Z"),
                        "electionTime" : Timestamp(1513106027, 1),
                        "electionDate" : ISODate("2017-12-12T19:13:47Z"),
                        "configVersion" : 1,
                        "self" : true
                }
        ],
        "ok" : 1
}

and
> rs.config()
{
        "_id" : "juju",
        "version" : 1,
        "protocolVersion" : NumberLong(1),
        "members" : [
                {
                        "_id" : 1,
                        "host" : "10.245.208.204:37017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {
                                "juju-machine-id" : "0"
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 2000,
                "heartbeatTimeoutSecs" : 10,
                "electionTimeoutMillis" : 10000,
                "getLastErrorModes" : {

                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                },
                "replicaSetId" : ObjectId("5a302a6ac77b4ca613f5d47d")
        }
}

and machine-0.log is saying:

2017-12-13 16:35:43 DEBUG juju.worker.peergrouper publish.go:43 API host ports have not changed
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:38 calculating desired peer group
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:43 members: ...
   &peergrouper.machine{id: "0", wantsVote: true, hostPorts: [10.245.208.204:37017 127.0.0.1:37017 [::1]:37017]}: rs_id=1, rs_addr=10.245.208.204:37017
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:44 extra: []replicaset.Member(nil)
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:45 maxId: 1
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:117 assessing possible peer group changes:
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:124 machine "0" is already voting
2017-12-13 16:35:43 DEBUG juju.worker.peergrouper desired.go:142 assessed
2017-12-13 16:35:43 DEBUG juju.mongo mongo.go:288 selecting mongo peer hostPort by scope from [10.245...

Tim Penhey (thumper)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

So, somehow, the state in the database appears to be corrupted. Specifically:
> db.controllers.find({"_id": "e"}).pretty()
{
...
        "machineids" : ["0"],
        "votingmachineids" : ["0","1","2"],
...
}

The code comments on those attributes say:

MachineIds holds the ids of all machines configured to run a controller. It includes all the machine ids in VotingMachineIds.

However, it is clearly not true that "machineids" is a superset of "votingmachineids".

Now, all of the machine records do have the Job set:
> db.machines.find({"model-uuid": "f27da689-56d0-4c2f-8e80-4485b7e9d074"}, {"_id": 1, "jobs": 1}).pretty()
{ "_id" : "f27da689-56d0-4c2f-8e80-4485b7e9d074:0", "jobs" : [ 2, 1 ] }
{ "_id" : "f27da689-56d0-4c2f-8e80-4485b7e9d074:1", "jobs" : [ 1, 2 ] }
{ "_id" : "f27da689-56d0-4c2f-8e80-4485b7e9d074:2", "jobs" : [ 1, 2 ] }

(they all have JobHostUnit and JobManageModel).

What I don't understand is that the transaction that adds new ids to "votingmachineids" also adds those ids to "machineids", so the latter should never be lagging behind.

There are also transactions for "promote" and "demote" that add to and remove from "votingmachineids", but we shouldn't be promoting machines that aren't already in "machineids".
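
For context, the enable-ha op described above would have roughly this shape (a sketch using mgo/txn conventions; the field names come from the document shown above, but the exact op layout is an assumption):

package main

import (
	"fmt"

	"gopkg.in/mgo.v2/bson"
	"gopkg.in/mgo.v2/txn"
)

func main() {
	ids := []string{"1", "2"}
	// A single op is expected to add the new ids to *both* fields,
	// which is why "machineids" should always be a superset of
	// "votingmachineids". Note the two separate top-level "$addToSet"
	// entries in the bson.D - the detail that turns out to matter.
	op := txn.Op{
		C:  "controllers",
		Id: "e",
		Update: bson.D{
			{"$addToSet", bson.D{{"machineids", bson.D{{"$each", ids}}}}},
			{"$addToSet", bson.D{{"votingmachineids", bson.D{{"$each", ids}}}}},
		},
	}
	fmt.Printf("%+v\n", op)
}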

Revision history for this message
John A Meinel (jameinel) wrote :

Just updating the machineids field with:
db.controllers.update({"_id": "e"}, {$set: {"machineids": ["0", "1", "2"]}})

was sufficient to cause everything to actually go into 'ha-enabled'.

I don't know what caused that setting to not get updated correctly in the first place, though.

Revision history for this message
John George (jog) wrote :

Tested with juju edge 2.4-beta1-xenial-amd64 and experienced no change.
https://solutions.qa.canonical.com/#/qa/testRun/b0320685-3abe-4e89-9ee8-2f7419f0ec6e

Revision history for this message
Alexander Litvinov (alitvinov) wrote :

I can confirm that John's solution of updating machineids works.
But apparently the machine agent needs to be restarted after the update as well, so:

service jujud-machine-0 restart

Revision history for this message
Chris Gregan (cgregan) wrote :

Editing the DB after deployment is not a usable workaround in Solutions QA CI or the field. Please provide details on how this issue can be fixed.

Tim Penhey (thumper)
Changed in juju:
assignee: John A Meinel (jameinel) → Tim Penhey (thumper)
Revision history for this message
Tim Penhey (thumper) wrote :

OK, I think we've found the culprit.

I provided a custom binary today with extra logging of the txn ops for any transaction that touched the controllers "e" document. And what we ended up with was somewhat eye-opening.

The txn.op said to add "1" and "2" to both "machineids" and "votingmachineids".
The rs.Config() in mongo showed only one machine.
The document in the database had the observed symptom of
{"machineids": ["0"], "votingmachineids": ["0", "1", "2"}]}.
The txns collection said that it had added "1" and "2" to "machineids" -- and only "machineids". This was clearly a lie.

A big clue here was that the written-out transaction showed a single top-level key of "$addToSet", while the txn.op had two of them.

It seemed that somehow during execution the two update commands got flattened into one.
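
In other words (a sketch, not the actual Juju patch), an update document carrying duplicate top-level "$addToSet" keys is fragile - bson.D preserves both, but any step that round-trips the document through a Go map keeps only one of them - whereas a single "$addToSet" naming both fields is unambiguous:

package main

import (
	"fmt"

	"gopkg.in/mgo.v2/bson"
)

func main() {
	ids := bson.D{{"$each", []string{"1", "2"}}}

	// Fragile: two top-level "$addToSet" keys, as seen in the txn.op.
	fragile := bson.D{
		{"$addToSet", bson.D{{"machineids", ids}}},
		{"$addToSet", bson.D{{"votingmachineids", ids}}},
	}

	// Safe: one "$addToSet" that names both fields.
	safe := bson.D{
		{"$addToSet", bson.D{
			{"machineids", ids},
			{"votingmachineids", ids},
		}},
	}

	// Round-tripping the fragile form through a map silently drops the
	// first key: only the "votingmachineids" update survives, which is
	// consistent with the symptom observed in the database.
	m := bson.M{}
	for _, elem := range fragile {
		m[elem.Name] = elem.Value
	}
	fmt.Printf("fragile after map round-trip: %v\n", m)
	fmt.Printf("safe form: %v\n", safe)
}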

Tim Penhey (thumper)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released