[2.2.1] juju-agent loses connection to controller and doesn't retry to connect

Bug #1701275 reported by Thomas Zylla
Affects          Status    Importance  Assigned to     Milestone
Canonical Juju   Triaged   Critical    John A Meinel
2.2              Triaged   Critical    John A Meinel

Bug Description

After the deployment completed and was working successfully, about one day later nearly all agents lost their connection to the controller.
The juju agents do not retry to connect to the controllers.

Log output:
https://paste.ubuntu.com/24981508/

init@maas:~$ juju show-controller
XXX:
  details:
    uuid: c8910be4-4c24-40f3-83dc-d81f2523f7ca
    api-endpoints: ['10.20.30.69:17070', '10.20.30.67:17070', '10.20.30.68:17070']
    ca-cert: |
      -----BEGIN CERTIFICATE-----
      XXX
      -----END CERTIFICATE-----
    cloud: XXX
    agent-version: 2.2.1
  controller-machines:
    "0":
      instance-id: 7p3wy7
      ha-status: ha-enabled
    "1":
      instance-id: ypm4xs
      ha-status: ha-enabled
    "2":
      instance-id: cqgktg
      ha-status: ha-enabled
  models:
    controller:
      uuid: def665b3-ad2d-459a-8db4-0f471d8f6172
      machine-count: 3
      core-count: 6
    default:
      uuid: 2a0acada-13a9-43e4-8e34-9a6b24f58b16
    openstack:
      uuid: b3b6050e-7cde-45e4-8b3a-96c58ebfe938
      machine-count: 64
      core-count: 317
  current-model: admin/controller
  account:
    user: admin
    access: superuser

Revision history for this message
Thomas Zylla (thomas.zylla) wrote :

connectivity between unit and controller works:
root@controller01:~# telnet 10.20.30.67 17070
Trying 10.20.30.67...
Connected to 10.20.30.67.
Escape character is '^]'.

tags: added: cpe cpe-sa
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Thomas,

Could you also provide controller logs to find out what happened at about the same time that these 'connection reset' errors happened?

2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "upgrader" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker runner.go:381 fatal "9-container-watcher": worker "9-container-watcher" exited: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "ssh-authkeys-updater" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker runner.go:381 fatal "lxd-provisioner": codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker runner.go:381 fatal "stateconverter": codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "api-address-updater" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "proxy-config-updater" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "reboot-executor" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "migration-inactive-flag" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "api-caller" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer

Revision history for this message
Thomas Zylla (thomas.zylla) wrote :

juju status for the controller:
https://paste.ubuntu.com/24981939/

juju controller-log from the same time:
https://paste.ubuntu.com/24981936/

looks like the HA-failover didn't notify the juju-agents since the primary controller changed at the same time:
https://paste.ubuntu.com/24982031/

Revision history for this message
Thomas Zylla (thomas.zylla) wrote :

output from all juju agents with "connection reset by peer" grep:
https://paste.ubuntu.com/24982141/

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

It seems that the issue happens during failover to a different controller/mongo judging by the time stamps provided by Thomas:

Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] New replica set config in use: { _id: "juju", version: 4, protocolVersion: 1, members: [ { _id: 2, host: "10.20.30.68:37017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { juju-machine-id: "1" }, slaveDelay: 0, votes: 1 }, { _id: 3, host: "10.20.30.69:37017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { juju-machine-id: "2" }, slaveDelay: 0, votes: 1 }, { _id: 1, host: "10.20.30.67:37017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { juju-machine-id: "0" }, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('5943ef0e14d7709b0f6d704a') } }
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] This node is 10.20.30.67:37017 in the config
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] transition to STARTUP2
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] Starting replication applier threads
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] transition to RECOVERING
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfaceASIO-Replication-0] Connecting to 10.20.30.68:37017
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfaceASIO-Replication-0] Connecting to 10.20.30.69:37017
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] transition to SECONDARY
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfaceASIO-Replication-0] Successfully connected to 10.20.30.68:37017
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] Member 10.20.30.68:37017 is now in state SECONDARY
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfaceASIO-Replication-0] Successfully connected to 10.20.30.69:37017
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] Member 10.20.30.69:37017 is now in state PRIMARY

https://paste.ubuntu.com/24982141/

----

If so, this is a regression from 2.1.2.

----

Reproduced this independently in a lab but with slightly different symptoms.

Looking at mongod logs in my environment, I cannot correlate the agent loss with failovers - there was only one transition to PRIMARY on a single controller:

ubuntu@maas:~$ juju run -m controller --all 'journalctl -u juju-db | grep transition'
- MachineId: "1"
  Stdout: |
    Jun 29 22:33:18 controller-2 mongod.37017[5026]: [ReplicationExecutor] transition to STARTUP2
    Jun 29 22:33:43 controller-2 mongod.37017[5026]: [ReplicationExecutor] transition to RECOVERING
    Jun 29 22:33:43 controller-2 mongod.37017[5026]: [ReplicationExecutor] transition to SECONDARY
- MachineId: "2"
  Stdout: |
    Jun 29 22:33:18 controller-1 mongod.37017[5046]: [ReplicationExecutor] transition to S...


summary: - juju-agent loses connection to controller and doesn't retry to connect
+ [2.2.1] juju-agent loses connection to controller and doesn't retry to
+ connect
Tim Penhey (thumper)
Changed in juju:
status: New → Triaged
importance: Undecided → Critical
Tim Penhey (thumper)
Changed in juju:
milestone: none → 2.2.2
tags: added: 4010
Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

Not sure if it's the same issue, but it seems related: in our customer environment, after enabling HA, the two new controllers were deployed correctly but their ha-status is still pending.

production-controller:
  details:
    uuid: ccf049d7-c886-4c9b-81b2-e579f65016dd
    api-endpoints: ['100.107.2.43:17070', '100.107.2.218:17070', '100.107.2.219:17070']
    ca-cert: |
      -----BEGIN CERTIFICATE-----
      -----END CERTIFICATE-----
    cloud: maas
    agent-version: 2.2.1.1
  controller-machines:
    "0":
      instance-id: skxxw7
      ha-status: ha-enabled
    "1":
      instance-id: 43gbdf
      ha-status: ha-pending
    "2":
      instance-id: s4mhpb
      ha-status: ha-pending
  models:
    controller:
      uuid: 66978ec6-9ba7-45ed-87a7-6609437b4bf1
      machine-count: 3
      core-count: 48
    default:
      uuid: 80bea1dd-4401-4cf6-8bb5-d0769b1d8d80
    production:
      uuid: 1eb3f40f-7842-4f62-8ebe-b11c0f622b29
      machine-count: 134
      core-count: 1616
  current-model: admin/production
  account:
    user: admin
    access: superuser

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

Machines 1 and 2 in the controller model are logging the following to /var/log/juju/machine-[1-2].log:

2017-07-03 10:03:49 ERROR juju.worker.dependency engine.go:546 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers
2017-07-03 10:03:52 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-2
2017-07-03 10:04:22 ERROR juju.worker.dependency engine.go:546 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers
2017-07-03 10:04:25 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-2

Revision history for this message
Tim Penhey (thumper) wrote :

@Paolo, can you look in /var/log/syslog for mongod entries please?

Also, take a look in /var/lib/juju/db to see if there are actually database files on machines 1 and 2.
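
For example, something along these lines on machines 1 and 2:

```
grep mongod /var/log/syslog
ls -la /var/lib/juju/db
```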

Revision history for this message
Tim Penhey (thumper) wrote :

I have been having difficulty trying to reproduce. Are you able to reproduce with the following logging config? Please set this logging config for both the controller model, and the deployment model.

juju model-config -m controller logging-config="juju=INFO;unit=WARN;juju.worker.dependency=TRACE"
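
For the deployment model, assuming it is the one named "openstack" in the show-controller output above, the equivalent command would be something like:

```
juju model-config -m openstack logging-config="juju=INFO;unit=WARN;juju.worker.dependency=TRACE"
```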

If this generates a reproduction, then I'd like the machine-x.log files from all the controller machines, and a log file (or two) from the agents that are not reconnecting.

Thanks.

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

@Tim, the database files are there on both controllers, but there are no mongo entries in syslog. After increasing the log level, the error I got before is a bit more verbose but otherwise the same. Basically, jujud is not able to connect to mongo on localhost, even though the mongo process is running and netstat says the port is open.

All the agents are connected to the first controller that is working.

### Mongo process
17967 ? Ssl 42:56 /usr/lib/juju/mongo3.2/bin/mongod --dbpath /var/lib/juju/db --sslPEMKeyFile /var/lib/juju/server.pem --sslPEMKeyPassword=xxxxxxx --port 37017 --syslog --journal --replSet juju --quiet --oplogSize 1024 --auth --keyFile /var/lib/juju/shared-secret --sslMode requireSSL --storageEngine wiredTiger --wiredTigerCacheSizeGB 1

### Netstat
tcp 0 0 0.0.0.0:37017 0.0.0.0:* LISTEN 17967/mongod

### Log
2017-07-04 06:44:59 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-1
2017-07-04 06:45:30 TRACE juju.worker.dependency engine.go:442 failed to start "state" manifold worker: cannot connect to mongodb: no reachable servers
2017-07-04 06:45:30 DEBUG juju.worker.dependency engine.go:499 "state" manifold worker stopped: cannot connect to mongodb: no reachable servers
2017-07-04 06:45:30 ERROR juju.worker.dependency engine.go:546 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers
2017-07-04 06:45:30 TRACE juju.worker.dependency engine.go:421 starting "state" manifold worker in 3.015s...
2017-07-04 06:45:33 TRACE juju.worker.dependency engine.go:430 starting "state" manifold worker
2017-07-04 06:45:33 TRACE juju.worker.dependency context.go:43 "state" manifold requested "agent" resource
2017-07-04 06:45:33 TRACE juju.worker.dependency context.go:43 "state" manifold requested "state-config-watcher" resource
2017-07-04 06:45:33 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-1

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

I have restarted mongo and some log output now appears in syslog, but the result is the same: jujud is not able to connect to mongo locally.

Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] Shutting down full-time diagnostic data capture
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] now exiting
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] shutdown: going to close listening sockets...
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] closing listening socket: 6
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] closing listening socket: 7
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] removing socket file: /tmp/mongodb-37017.sock
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] shutdown: going to flush diaglog...
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] shutdown: going to close sockets...
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] WiredTigerKVEngine shutting down
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] shutdown: removing fs lock...
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] dbexit: rc: 0
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod[20333]: 2017-07-04T06:51:18.681+0000 W CONTROL [main] No SSL certificate validation can be performed since no CA file has been provided; please specify an sslCAFile parameter
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] MongoDB starting : pid=20333 port=37017 dbpath=/var/lib/juju/db 64-bit host=sitename-mgm231juju2-hr
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] db version v3.2.12
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] git version: ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] allocator: tcmalloc
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] modules: none
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] build environment:
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] distarch: x86_64
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] target_arch: x86_64
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] options: { net: { port: 37017, ssl: { PEMKeyFile: "/var/lib/juju/server.pem", PEMKeyPassword: "<password>", mode: "requireSSL" } }, replication: { oplogSizeMB: 1024, replSet: "juju" }, security: { authorization: "enabled", keyFile: "/var/lib/juju/shared-secret" }, storage: { dbPath: "/var/lib/juju/db", engine: "wiredTig...


Revision history for this message
Tim Penhey (thumper) wrote :

As far as not being able to connect to the local mongo, I see only two possibilities.

One is that the agent has an invalid password, but I'm pretty sure that that would give a different error message.

The other is that a proxy setting is getting in the way.

What are the various model proxy values set for the controller model?
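
(Something like the following should list them for the controller model:)

```
juju model-config -m controller | grep -i proxy
```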

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Paolo: Are you able to connect to MongoDB using the mongo shell?

If so, can you please check the replicaset status using the "rs.status()" command?

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

There are no proxies set.

Attribute From Value
agent-metadata-url default ""
agent-stream default released
agent-version model 2.2.1.1
apt-ftp-proxy default ""
apt-http-proxy default ""
apt-https-proxy default ""
apt-mirror default ""
apt-no-proxy default ""
automatically-retry-hooks default true
default-series default xenial
development default false
disable-network-management default false
enable-os-refresh-update default true
enable-os-upgrade default true
firewall-mode default instance
ftp-proxy default ""
http-proxy default ""
https-proxy default ""
ignore-machine-addresses default false
image-metadata-url default ""
image-stream default released
logforward-enabled default false
logging-config model juju=INFO;unit=WARN;juju.worker.dependency=TRACE
max-status-history-age default 336h
max-status-history-size default 5G
net-bond-reconfigure-delay default 17
no-proxy default 127.0.0.1,localhost,::1
provisioner-harvest-mode default destroyed
proxy-ssh default false
resource-tags model {}
ssl-hostname-verification default true
storage-default-block-source model maas
test-mode default false
transmit-vendor-metrics default true
update-status-hook-interval default 5m

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

Connecting with the mongo CLI, it seems that the password is wrong. I'm still using "oldpassword" to connect; is that right?

mongo --ssl -u admin -p $(grep oldpassword /var/lib/juju/agents/machine-1/agent.conf | awk -e '{print $2}') localhost:37017/admin

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1701275] Re: [2.2.1] juju-agent loses connection to controller and doesn't retry to connect

oldpassword and admin are only relevant on the machine that bootstrapped.
You want to use "statepassword" and the agent name (e.g. machine-2) on all
other machines.

John
=:->

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

same result.

root@site-mgm231juju2-hr:~# mongo --ssl -u admin -p $(grep statepassword /var/lib/juju/agents/machine-1/agent.conf | awk -e '{print $2}') localhost:37017/machine-1
MongoDB shell version: 2.6.10
connecting to: localhost:37017/machine-1
2017-07-05T15:22:24.234+0000 Error: 18 { ok: 0.0, errmsg: "auth failed", code: 18 } at src/mongo/shell/db.js:1287
exception: login failed

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

mongo --ssl -u machine-1 -p $(grep statepassword /var/lib/juju/agents/machine-1/agent.conf | awk -e '{print $2}') localhost:37017/machine-1
MongoDB shell version: 2.6.10
connecting to: localhost:37017/machine-1
2017-07-05T15:24:27.386+0000 Error: 18 { ok: 0.0, errmsg: "auth failed", code: 18 } at src/mongo/shell/db.js:1287
exception: login failed

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I think you need to provide the "--authenticationDatabase" arg. You're also passing the username where the database name goes (but that's probably not the problem). Try something like this:

mongopass=`sudo grep statepassword /var/lib/juju/agents/machine-1/agent.conf | cut -d' ' -f2`

mongo 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "machine-1" --password "$mongopass"

Revision history for this message
Tim Penhey (thumper) wrote :

I have used the following to get access to mongo on any API server:

```
user=$(basename `ls -d /var/lib/juju/agents/machine-*`)
password=`sudo grep statepassword /var/lib/juju/agents/machine-*/agent.conf | cut -d' ' -f2`
/usr/lib/juju/mongo*/bin/mongo 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "$user" --password "$password"
```

Which is now also defined in https://github.com/juju/juju/wiki/Diagnosing-MongoDB-Performance

Is it possible to get a developer to work with you to try and work this out?

Revision history for this message
Alvaro Uria (aluria) wrote :

With regard to the same environment reported by Paolo in previous comments, we can't log in to the "secondary-to-be" mongo services.

0) Controllers are Xenial running Juju 2.2.1.1 and Mongo 3.2
1) "JUJU_MODEL=controller juju status --format yaml": http://pastebin.ubuntu.com/25031101/
2) rs.status() on PRIMARY: http://pastebin.ubuntu.com/25031086/
3) when trying to log in on "secondary-to-be" controllers: http://pastebin.ubuntu.com/25031117/
3.1) syslog shows:
"""
Jul 6 09:50:07 controller2 mongod.37017[20333]: [conn168826] SCRAM-SHA-1 authentication failed for machine-1 on admin from client 127.0.0.1 ; UserNotFound: Could not find user machine-1@admin
"""

It looks like mongo replicas are not able to properly bootstrap and reach the primary server.

Please let me know if I can help you further or need more information of any kind.

Kind regards,
-Alvaro.

Revision history for this message
John A Meinel (jameinel) wrote :

Given that, as far as we can see, the other mongos haven't joined the
replica set, and the local controllers aren't actually able to log into
mongo, I wonder if it's time to take the 'nuclear' option. If you stopped
jujud, then ran "sudo apt remove juju-mongodb3.2" and "sudo rm -rf
/var/lib/juju/db", I wonder if that would be enough for the jujud agents
to reinstall mongo and re-initialize a local mongo replica.

Note that I haven't tried doing any of those steps myself, so I can't
guarantee what would work, but *something* like that may work for those
agents. They should want to create a local mongo replica and make it sync
on startup.
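
Roughly, on an affected controller machine (machine 1 here purely as an example; untested):

```
sudo systemctl stop jujud-machine-1
sudo systemctl stop juju-db
sudo apt remove juju-mongodb3.2
sudo rm -rf /var/lib/juju/db
# starting the machine agent again should reinstall mongo and re-create the local replica
sudo systemctl start jujud-machine-1
```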

You could probably test this sort of thing on an 'lxd' provider just by
doing "juju bootstrap lxd; juju enable-ha -n3", and then stopping and
deleting mongo on one of the machines.
If you wanted to take it a step further, you could potentially firewall
port 37017 before running 'enable-ha' so that the other mongo daemons
wouldn't be able to join the replica set.
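
Something like the following, again untested and just a sketch:

```
juju bootstrap lxd
# optionally, on the existing controller (machine 0), block incoming mongo
# traffic first so the new mongods can't join the replica set, e.g.:
#   sudo iptables -A INPUT -p tcp --dport 37017 -j DROP
juju enable-ha -n3
```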

John
=:->

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The issue is reproducible after a restart of all agents - not only a single time after deployment:

1) {unit,machine} agent loss reported
2) juju run --all 'sudo systemctl restart jujud*'
3) wait for a day
4) {unit,machine} agent loss reported

There were no failovers after the initial deployment, despite two occurrences of agent loss:

ubuntu@maas:~$ juju run -m controller --all 'journalctl -u juju-db | grep PRIMARY'
- MachineId: "1"
  Stdout: |
    Jul 05 23:31:05 controller-2 mongod.37017[4569]: [ReplicationExecutor] Member 10.232.7.241:37017 is now in state PRIMARY
- MachineId: "2"
  Stdout: |
    Jul 05 23:31:19 controller-1 mongod.37017[4557]: [ReplicationExecutor] Member 10.232.7.241:37017 is now in state PRIMARY
- MachineId: "0"
  Stdout: |
    Jul 05 23:24:46 controller-juju mongod.37017[5295]: [ReplicationExecutor] transition to PRIMARY
    Jul 08 12:36:15 controller-juju mongod.37017[5295]: [conn17] command juju.actions command: insert { insert: "actions", documents: [ { _id: "95fd14d9-518f-4dbb-872d-060f8ca79886:0492f660-6ee0-4137-8c0c-ad9d930cb3c2", model-uuid: "95fd14d9-518f-4dbb-872d-060f8ca79886", receiver: "0/lxd/2", name: "juju-run", parameters: { command: "journalctl -u juju-db | grep PRIMARY", timeout: 300000000000 }, enqueued: new Date(1499517375000), started: new Date(-62135596800000), completed: new Date(-62135596800000), status: "pending", message: "", results: {}, txn-revno: 2, txn-queue: [ "5960d1be86d3b11563e97d6e_f3e2ce96" ] } ], writeConcern: { getLastError: 1, j: true }, ordered: true } ninserted:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:100 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_query 155ms
    Jul 08 12:36:24 controller-juju mongod.37017[5295]: [conn17] command juju.txns command: insert { insert: "txns", documents: [ { _id: ObjectId('5960d1c886d3b11563e97da6'), s: 1, o: [ { c: "machines", d: "95fd14d9-518f-4dbb-872d-060f8ca79886:3/lxd/4", a: { life: { $ne: 2 } } }, { c: "actions", d: "95fd14d9-518f-4dbb-872d-060f8ca79886:2d8f6c6b-75db-4799-85c5-e57c90fa3cae", a: "d-", i: { _id: "95fd14d9-518f-4dbb-872d-060f8ca79886:2d8f6c6b-75db-4799-85c5-e57c90fa3cae", model-uuid: "95fd14d9-518f-4dbb-872d-060f8ca79886", receiver: "3/lxd/4", name: "juju-run", parameters: { command: "journalctl -u juju-db | grep PRIMARY", timeout: 300000000000 }, enqueued: new Date(1499517385000), started: new Date(-62135596800000), completed: new Date(-62135596800000), status: "pending", message: "", results: {} } }, { c: "actionnotifications", d: "95fd14d9-518f-4dbb-872d-060f8ca79886:3/lxd/4_a_2d8f6c6b-75db-4799-85c5-e57c90fa3cae", a: "d-", i: { _id: "95fd14d9-518f-4dbb-872d-060f8ca79886:3/lxd/4_a_2d8f6c6b-75db-4799-85c5-e57c90fa3cae", model-uuid: "95fd14d9-518f-4dbb-872d-060f8ca79886", receiver: "3/lxd/4", actionid: "2d8f6c6b-75db-4799-85c5-e57c90fa3cae" } } ] } ], writeConcern: { getLastError: 1, j: true }, ordered: true } ninserted:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:100 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Co...


Chris Gregan (cgregan)
tags: added: cdo-qa-blocker
Revision history for this message
Alvaro Uria (aluria) wrote :

In my case, I was not able to reproduce the issue in a staging environment (using an openstack endpoint as backend). Prod env uses MAAS as endpoint.

1)
Prod: juju bootstrap prod-cloud --constraints tags=juju
Staging: juju bootstrap canonistack --constraints "cores=2 mem=4G"

2) juju enable-ha

3)
Prod: node#0 has-vote; node#1,2 "adding-vote". PRIMARY mongo does not show replicasets, but jujud-machine-{1,2} are started and mongo installed (although no machine-{1,2} users are created in juju-db).
Staging:
3.1) iptables --dport 37017 -j DROP
PRIMARY shows replicasets created, but in STARTUP state (no SECONDARY).
Stopping jujud-machine-{1,2}, running "apt remove juju-mongodb3.2" and restarting jujud-machine-{1,2} causes juju-mongodb3.2 to be reinstalled.

3.2) iptables --dport 17070 -j DROP (no other DROPs)
PRIMARY does not show new replicasets, only the primary mongo.
juju-mongodb3.2 is installed and started, but no machine-{1,2} user exists.
"juju status" shows nodes#1,#2 as "pending" and "juju ssh 1" (or "2") does not work because pub key has not been imported (no access to jujud-machine-0 to gather data).

At any time, flushing iptables rules recovers the environment (juju-db might need a restart on node#1,#2 to complete replication).

TL;DR:
The divergence between Prod and Staging is that Prod's jujud-machine-{1,2} connect to jujud-machine-0, but the replica set members are not created in the PRIMARY juju-db (a step needed for the other mongos to be able to connect to the primary instance).

John A Meinel (jameinel)
Changed in juju:
milestone: 2.2.2 → 2.3-alpha1
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
John A Meinel (jameinel) wrote :

So I have a theory that this *might* be related to the Presence issues in 2.2.1.
(bug #1703526).

I'm investigating now, but the theory is that if
$ juju bootstrap
# wait for presence to break from bug #1703526
$ juju enable-ha -n3

At this point, the machines come up, but because Presence is broken, they don't get detected as really being available, and thus don't get added to the replica set.
The code I'm curious about is:
// Don't add the machine unless it is "Started"
machineStatus, err := stm.Status()
if err != nil {
    return false, errors.Annotatef(err, "cannot get status for machine %q", id)
}
if machineStatus.Status == status.Started {
    logger.Debugf("machine %q has started, adding it to peergrouper list", id)
    tracker, err := newMachineTracker(stm, w.machineChanges)
    if err != nil {
        return false, errors.Trace(err)
    }
    if err := w.catacomb.Add(tracker); err != nil {
        return false, errors.Trace(err)
    }
    w.machineTrackers[id] = tracker
    changed = true
} else {
    logger.Debugf("machine %q not ready: %v", id, machineStatus.Status)
}

So the question is: if Presence is broken, can the machine still go into "status.Started"?

Revision history for this message
John A Meinel (jameinel) wrote :

My theory appears to be incorrect. I bootstrapped lxd, and broke the ping batcher. I could see that "juju status" reported all agents as lost.
I then ran "juju enable-ha -n3", and it actually demoted machine 0 because it was seen as dead, and promoted 3 new machines (1,2,3).
Those machines all came up successfully, and were added to the replica set, which now reports machines 1 2 and 3 as in the replica set. 'rs.Status()' also reports 1,2,3 as being in the set.

Revision history for this message
John A Meinel (jameinel) wrote :

Something else appears to be going wrong. All of the new machines that were started are happily participating in the new replicaset, but they are all flagged as "ha-pending" instead of "ha-enabled" and machine 0 is refusing to step down (stuck in removing-vote).

Revision history for this message
John A Meinel (jameinel) wrote :

I'm pretty sure the initial issue is actually just bug #1703526

Looking at the original paste, I think the issue is that we have:
2017-06-28 10:30:10 DEBUG juju.worker.logger logger.go:63 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
2017-06-28 10:32:42 WARNING juju.provisioner broker.go:100 incomplete DNS config found, discovering host's DNS config

By default log level is WARNING. So you see errors when we get disconnected, but you *don't* see any messages pertaining to reconnecting, because those are at best INFO level.
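
(To actually see those reconnect messages, the logging config on the affected model would need to be raised to at least INFO for the juju loggers, e.g. something like:)

```
juju model-config logging-config="<root>=INFO;unit=DEBUG"
```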

And so it appears as though the agent isn't trying to reconnect, which is true, but only because it is already happily connected. bug #1703526 is because one of the workers on the controller side could get restarted, and then we would stop tracking agent presence correctly.

In the next "controller logs" I think we have these lines:
no kvm containers possible
2017-06-29 09:39:17 ERROR juju.worker.dependency engine.go:546 "api-caller" manifold worker returned unexpected error: cannot open api: cannot log in: context deadline exceeded
2017-06-29 09:39:21 ERROR juju.worker.dependency engine.go:546 "api-caller" manifold worker returned unexpected error: cannot open api: cannot log in: context deadline exceeded

^- I think this is bug #1701438

and
2017-06-29 13:43:39 WARNING juju.apiserver log.go:168 http: TLS handshake error from 10.20.30.1:46188: tls: oversized record received with length 64774
^- I'm guessing this is testing Telnet.

As for https://paste.ubuntu.com/24982031/,
that looks like it is simply that when you ran 'enable-ha' we created a replica set, and once that was active, mongo disconnected everyone and triggered a vote, which may change the primary.
That should be what was going on around:
Jun 29 09:39:15

For the next item around "ha-pending", I think we should track that as a different bug, as this bug has since grown a bit too hairy to really tell which thing is which.
I have a system which shows the same symptom (all the new agents are in the replicaset and happy, but the old controller is still listed as "removing-vote" and the new ones are "adding-vote" even though they all *have* the right votes.)

Note that my case seems to differ from Paolo's because I do see the machines showing up as part of the replica set, but they are still listed as "adding-vote", while he never saw them get added to Mongo's configuration.

Looking at the debug-log under my condition I see:
machine-1: 14:10:08 ERROR juju.worker.dependency "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers
machine-0: 14:10:11 DEBUG juju.worker.peergrouper API host ports have not changed
machine-0: 14:10:11 DEBUG juju.worker.peergrouper calculating desired peer group
machine-0: 14:10:11 DEBUG juju.worker.peergrouper members: ...
   &peergrouper.machine{id: "3", wantsVote: true, hostPorts: [10.67.99.26:37017 127.0.0.1:37017 [::1]:37017]}: rs_id=3, rs_addr=10.67.99.26:37017
   &peergrouper.machine{id: "0", wantsVote: false, hostPorts: [10.67.99.160:37017 127.0.0.1:37017 [::1]:37017]}: rs_id=1, rs_addr=10.67.99.160:37017
   &peergrouper.machine{id: "2", wantsVote: true, hostPorts: [10.67.99.184:3701...


Revision history for this message
Alvaro Uria (aluria) wrote :

Initial status (prod env): rs.status() on PRIMARY does not show any replicaset.

Following the suggestion in #22,
1) I stopped jujud-machine-1 and its juju-db (I was not able to connect to it, as the machine-1 user does not exist)
2) I also stopped jujud-machine-2 and its juju-db (same as step 1) - I also backed up /var/lib/juju/db
3) On machine-1, I removed the juju-mongodb3.2 package and started jujud-machine-1 (log attached, from startup onwards)

On machine-0, I checked rs.status() again and it remains the same (no replicasets, just the PRIMARY). When I tested this in staging, replicaset was configured on primary mongo.

Revision history for this message
Alvaro Uria (aluria) wrote :

BTW, this is the same environment reported by Paolo.

Some more info, in addition to #29: machine-0.log shows:
2017-07-12 14:22:12 ERROR juju.worker.peergrouper worker.go:211 cannot set replicaset: Found two member configurations with same _id field, members.0._id == members.1._id == 1

juju:PRIMARY> rs.status()
{
        "set" : "juju",
        "date" : ISODate("2017-07-12T14:24:14.847Z"),
        "myState" : 1,
        "term" : NumberLong(1),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 1,
                        "name" : "127.0.0.1:37017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 1661145,
                        "optime" : {
                                "ts" : Timestamp(1499869454, 100),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2017-07-12T14:24:14Z"),
                        "electionTime" : Timestamp(1498208312, 2),
                        "electionDate" : ISODate("2017-06-23T08:58:32Z"),
                        "configVersion" : 2,
                        "self" : true
                }
        ],
        "ok" : 1
}

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The "Found two member configurations with same _id" error is a new one for me. It would be helpful to see what "rs.conf()" says.

Revision history for this message
John A Meinel (jameinel) wrote :

I think the original 'agent loses connection' bug is just a dupe of the existing bug, and the work on fixing the replica set should be split out into a separate bug.

Revision history for this message
John A Meinel (jameinel) wrote :

The Mongo issues have been split out into bug #1704072.

Revision history for this message
Alvaro Uria (aluria) wrote :

Menno, I've filed bug 1704075 (re: "Found two member configurations with same _id").
