[2.2.1] juju-agent loses connection to controller and doesn't retry to connect

Bug #1701275 reported by Thomas Zylla
Affects          Status    Importance  Assigned to     Milestone
Canonical Juju   Triaged   Critical    John A Meinel
2.2              Triaged   Critical    John A Meinel

Bug Description

After the deployment completed and was working successfully, about one day later nearly all agents lost their connection to the controller.
The juju agents do not retry to connect to the controllers.

Log output:
https://paste.ubuntu.com/24981508/

init@maas:~$ juju show-controller
XXX:
  details:
    uuid: c8910be4-4c24-40f3-83dc-d81f2523f7ca
    api-endpoints: ['10.20.30.69:17070', '10.20.30.67:17070', '10.20.30.68:17070']
    ca-cert: |
      -----BEGIN CERTIFICATE-----
      XXX
      -----END CERTIFICATE-----
    cloud: XXX
    agent-version: 2.2.1
  controller-machines:
    "0":
      instance-id: 7p3wy7
      ha-status: ha-enabled
    "1":
      instance-id: ypm4xs
      ha-status: ha-enabled
    "2":
      instance-id: cqgktg
      ha-status: ha-enabled
  models:
    controller:
      uuid: def665b3-ad2d-459a-8db4-0f471d8f6172
      machine-count: 3
      core-count: 6
    default:
      uuid: 2a0acada-13a9-43e4-8e34-9a6b24f58b16
    openstack:
      uuid: b3b6050e-7cde-45e4-8b3a-96c58ebfe938
      machine-count: 64
      core-count: 317
  current-model: admin/controller
  account:
    user: admin
    access: superuser

Revision history for this message
Thomas Zylla (thomas.zylla) wrote :

connectivity between unit and controller works:
root@controller01:~# telnet 10.20.30.67 17070
Trying 10.20.30.67...
Connected to 10.20.30.67.
Escape character is '^]'.

tags: added: cpe cpe-sa
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Thomas,

Could you also provide controller logs to find out what happened at about the same time that these 'connection reset' errors happened?

2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "upgrader" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker runner.go:381 fatal "9-container-watcher": worker "9-container-watcher" exited: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "ssh-authkeys-updater" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker runner.go:381 fatal "lxd-provisioner": codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker runner.go:381 fatal "stateconverter": codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "api-address-updater" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "proxy-config-updater" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "reboot-executor" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "migration-inactive-flag" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer
2017-06-29 09:39:32 ERROR juju.worker.dependency engine.go:546 "api-caller" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.20.37.52:39874->10.20.30.67:17070: read: connection reset by peer

Revision history for this message
Thomas Zylla (thomas.zylla) wrote :

juju status for the controller:
https://paste.ubuntu.com/24981939/

juju controller-log from the same time:
https://paste.ubuntu.com/24981936/

looks like the HA-failover didn't notify the juju-agents since the primary controller changed at the same time:
https://paste.ubuntu.com/24982031/

Revision history for this message
Thomas Zylla (thomas.zylla) wrote :

output from all juju agents with "connection reset by peer" grep:
https://paste.ubuntu.com/24982141/

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

It seems that the issue happens during failover to a different controller/mongo judging by the time stamps provided by Thomas:

Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] New replica set config in use: { _id: "juju", version: 4, protocolVersion: 1, members: [ { _id: 2, host: "10.20.30.68:37017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { juju-machine-id: "1" }, slaveDelay: 0, votes: 1 }, { _id: 3, host: "10.20.30.69:37017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { juju-machine-id: "2" }, slaveDelay: 0, votes: 1 }, { _id: 1, host: "10.20.30.67:37017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: { juju-machine-id: "0" }, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('5943ef0e14d7709b0f6d704a') } }
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] This node is 10.20.30.67:37017 in the config
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] transition to STARTUP2
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] Starting replication applier threads
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] transition to RECOVERING
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfaceASIO-Replication-0] Connecting to 10.20.30.68:37017
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfaceASIO-Replication-0] Connecting to 10.20.30.69:37017
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] transition to SECONDARY
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfaceASIO-Replication-0] Successfully connected to 10.20.30.68:37017
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] Member 10.20.30.68:37017 is now in state SECONDARY
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfaceASIO-Replication-0] Successfully connected to 10.20.30.69:37017
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExecutor] Member 10.20.30.69:37017 is now in state PRIMARY

https://paste.ubuntu.com/24982141/

----

If so, this is a regression from 2.1.2.

----

Reproduced this independently in a lab but with slightly different symptoms.

Looking at mongod logs in my environment, I cannot correlate the agent loss with failovers - there was only one transition to PRIMARY on a single controller:

ubuntu@maas:~$ juju run -m controller --all 'journalctl -u juju-db | grep transition'
- MachineId: "1"
  Stdout: |
    Jun 29 22:33:18 controller-2 mongod.37017[5026]: [ReplicationExecutor] transition to STARTUP2
    Jun 29 22:33:43 controller-2 mongod.37017[5026]: [ReplicationExecutor] transition to RECOVERING
    Jun 29 22:33:43 controller-2 mongod.37017[5026]: [ReplicationExecutor] transition to SECONDARY
- MachineId: "2"
  Stdout: |
    Jun 29 22:33:18 controller-1 mongod.37017[5046]: [ReplicationExecutor] transition to S...


summary: - juju-agent loses connection to controller and doesn't retry to connect
+ [2.2.1] juju-agent loses connection to controller and doesn't retry to
+ connect
Tim Penhey (thumper)
Changed in juju:
status: New → Triaged
importance: Undecided → Critical
Tim Penhey (thumper)
Changed in juju:
milestone: none → 2.2.2
tags: added: 4010
Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

Not sure if it's the same issue, but it seems related: in our customer environment, after enabling HA, the two new controllers were deployed correctly but their ha-status is still pending.

production-controller:
  details:
    uuid: ccf049d7-c886-4c9b-81b2-e579f65016dd
    api-endpoints: ['100.107.2.43:17070', '100.107.2.218:17070', '100.107.2.219:17070']
    ca-cert: |
      -----BEGIN CERTIFICATE-----
      -----END CERTIFICATE-----
    cloud: maas
    agent-version: 2.2.1.1
  controller-machines:
    "0":
      instance-id: skxxw7
      ha-status: ha-enabled
    "1":
      instance-id: 43gbdf
      ha-status: ha-pending
    "2":
      instance-id: s4mhpb
      ha-status: ha-pending
  models:
    controller:
      uuid: 66978ec6-9ba7-45ed-87a7-6609437b4bf1
      machine-count: 3
      core-count: 48
    default:
      uuid: 80bea1dd-4401-4cf6-8bb5-d0769b1d8d80
    production:
      uuid: 1eb3f40f-7842-4f62-8ebe-b11c0f622b29
      machine-count: 134
      core-count: 1616
  current-model: admin/production
  account:
    user: admin
    access: superuser

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

Machines 1 and 2 in the controller model are logging the following to /var/log/juju/machine-[1-2].log:

2017-07-03 10:03:49 ERROR juju.worker.dependency engine.go:546 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers
2017-07-03 10:03:52 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-2
2017-07-03 10:04:22 ERROR juju.worker.dependency engine.go:546 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers
2017-07-03 10:04:25 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-2

Revision history for this message
Tim Penhey (thumper) wrote :

@Paolo, can you look in /var/log/syslog for mongod entries please?

Also, take a look in /var/lib/juju/db to see if there are actually database files on machines 1 and 2.
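
For example, something along these lines on machines 1 and 2:

```
grep mongod /var/log/syslog
ls -la /var/lib/juju/db
```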

Revision history for this message
Tim Penhey (thumper) wrote :

I have been having difficulty trying to reproduce. Are you able to reproduce with the following logging config? Please set this logging config for both the controller model, and the deployment model.

juju model-config -m controller logging-config="juju=INFO;unit=WARN;juju.worker.dependency=TRACE"
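
For the deployment model, assuming it is the one named "openstack" in the show-controller output above, the equivalent command would be something like:

```
juju model-config -m openstack logging-config="juju=INFO;unit=WARN;juju.worker.dependency=TRACE"
```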

If this generates a reproduction, then I'd like the machine-x.log files from all the controller machines, and a log file (or two) from the agents that are not reconnecting.

Thanks.

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

@Tim, the database files are there on both controllers, but there are no mongo entries in syslog. After increasing the log level, the error I got before is a bit more verbose but otherwise the same. Basically, jujud is not able to connect to mongo on localhost, even though the mongo process is running and netstat says the port is open.

All the agents are connected to the first controller that is working.

### Mongo process
17967 ? Ssl 42:56 /usr/lib/juju/mongo3.2/bin/mongod --dbpath /var/lib/juju/db --sslPEMKeyFile /var/lib/juju/server.pem --sslPEMKeyPassword=xxxxxxx --port 37017 --syslog --journal --replSet juju --quiet --oplogSize 1024 --auth --keyFile /var/lib/juju/shared-secret --sslMode requireSSL --storageEngine wiredTiger --wiredTigerCacheSizeGB 1

### Netstat
tcp 0 0 0.0.0.0:37017 0.0.0.0:* LISTEN 17967/mongod

### Log
2017-07-04 06:44:59 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-1
2017-07-04 06:45:30 TRACE juju.worker.dependency engine.go:442 failed to start "state" manifold worker: cannot connect to mongodb: no reachable servers
2017-07-04 06:45:30 DEBUG juju.worker.dependency engine.go:499 "state" manifold worker stopped: cannot connect to mongodb: no reachable servers
2017-07-04 06:45:30 ERROR juju.worker.dependency engine.go:546 "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers
2017-07-04 06:45:30 TRACE juju.worker.dependency engine.go:421 starting "state" manifold worker in 3.015s...
2017-07-04 06:45:33 TRACE juju.worker.dependency engine.go:430 starting "state" manifold worker
2017-07-04 06:45:33 TRACE juju.worker.dependency context.go:43 "state" manifold requested "agent" resource
2017-07-04 06:45:33 TRACE juju.worker.dependency context.go:43 "state" manifold requested "state-config-watcher" resource
2017-07-04 06:45:33 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:37017"]; entity machine-1

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

I have restarted mongo and some log output now appears in syslog, but the result is the same: jujud is not able to connect to mongo locally.

Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] Shutting down full-time diagnostic data capture
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] now exiting
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] shutdown: going to close listening sockets...
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] closing listening socket: 6
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] closing listening socket: 7
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] removing socket file: /tmp/mongodb-37017.sock
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] shutdown: going to flush diaglog...
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] shutdown: going to close sockets...
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] WiredTigerKVEngine shutting down
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] shutdown: removing fs lock...
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[17967]: [signalProcessingThread] dbexit: rc: 0
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod[20333]: 2017-07-04T06:51:18.681+0000 W CONTROL [main] No SSL certificate validation can be performed since no CA file has been provided; please specify an sslCAFile parameter
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] MongoDB starting : pid=20333 port=37017 dbpath=/var/lib/juju/db 64-bit host=sitename-mgm231juju2-hr
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] db version v3.2.12
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] git version: ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] allocator: tcmalloc
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] modules: none
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] build environment:
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] distarch: x86_64
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] target_arch: x86_64
Jul 4 06:51:18 sitename-mgm231juju2-hr mongod.37017[20333]: [initandlisten] options: { net: { port: 37017, ssl: { PEMKeyFile: "/var/lib/juju/server.pem", PEMKeyPassword: "<password>", mode: "requireSSL" } }, replication: { oplogSizeMB: 1024, replSet: "juju" }, security: { authorization: "enabled", keyFile: "/var/lib/juju/shared-secret" }, storage: { dbPath: "/var/lib/juju/db", engine: "wiredTig...


Revision history for this message
Tim Penhey (thumper) wrote :

As far as not being able to connect to the local mongo, I see only two possibilities.

One is that the agent has an invalid password, but I'm pretty sure that that would give a different error message.

The other is that a proxy setting is getting in the way.

What are the various model proxy values set for the controller model?
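
(Something like the following should list them for the controller model:)

```
juju model-config -m controller | grep -i proxy
```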

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Paolo: Are you able to connect to MongoDB using the mongo shell?

If so, can you please check the replicaset status using the "rs.status()" command?

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

There are no proxies set.

Attribute From Value
agent-metadata-url default ""
agent-stream default released
agent-version model 2.2.1.1
apt-ftp-proxy default ""
apt-http-proxy default ""
apt-https-proxy default ""
apt-mirror default ""
apt-no-proxy default ""
automatically-retry-hooks default true
default-series default xenial
development default false
disable-network-management default false
enable-os-refresh-update default true
enable-os-upgrade default true
firewall-mode default instance
ftp-proxy default ""
http-proxy default ""
https-proxy default ""
ignore-machine-addresses default false
image-metadata-url default ""
image-stream default released
logforward-enabled default false
logging-config model juju=INFO;unit=WARN;juju.worker.dependency=TRACE
max-status-history-age default 336h
max-status-history-size default 5G
net-bond-reconfigure-delay default 17
no-proxy default 127.0.0.1,localhost,::1
provisioner-harvest-mode default destroyed
proxy-ssh default false
resource-tags model {}
ssl-hostname-verification default true
storage-default-block-source model maas
test-mode default false
transmit-vendor-metrics default true
update-status-hook-interval default 5m

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

Connecting with the mongo CLI, it seems that the password is wrong. I'm still using "oldpassword" to connect; is that right?

mongo --ssl -u admin -p $(grep oldpassword /var/lib/juju/agents/machine-1/agent.conf | awk -e '{print $2}') localhost:37017/admin

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1701275] Re: [2.2.1] juju-agent loses connection to controller and doesn't retry to connect

oldpassword and admin are only relevant on the machine that bootstrapped.
You want to use "statepassword" and the agent name (e.g. machine-2) on all
other machines.

John
=:->

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

same result.

root@site-mgm231juju2-hr:~# mongo --ssl -u admin -p $(grep statepassword /var/lib/juju/agents/machine-1/agent.conf | awk -e '{print $2}') localhost:37017/machine-1
MongoDB shell version: 2.6.10
connecting to: localhost:37017/machine-1
2017-07-05T15:22:24.234+0000 Error: 18 { ok: 0.0, errmsg: "auth failed", code: 18 } at src/mongo/shell/db.js:1287
exception: login failed

Revision history for this message
Paolo de Rosa (paolo-de-rosa) wrote :

mongo --ssl -u machine-1 -p $(grep statepassword /var/lib/juju/agents/machine-1/agent.conf | awk -e '{print $2}') localhost:37017/machine-1
MongoDB shell version: 2.6.10
connecting to: localhost:37017/machine-1
2017-07-05T15:24:27.386+0000 Error: 18 { ok: 0.0, errmsg: "auth failed", code: 18 } at src/mongo/shell/db.js:1287
exception: login failed

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I think you need to provide the "--authenticationDatabase" arg. You're also passing the username where the database name goes (but that's probably not the problem). Try something like this:

mongopass=`sudo grep statepassword /var/lib/juju/agents/machine-1/agent.conf | cut -d' ' -f2`

mongo 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "machine-1" --password "$mongopass"

Revision history for this message
Tim Penhey (thumper) wrote :

I have used the following to get access to mongo on any API server:

```
user=$(basename `ls -d /var/lib/juju/agents/machine-*`)
password=`sudo grep statepassword /var/lib/juju/agents/machine-*/agent.conf | cut -d' ' -f2`
/usr/lib/juju/mongo*/bin/mongo 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "$user" --password "$password"
```

Which is now also defined in https://github.com/juju/juju/wiki/Diagnosing-MongoDB-Performance

Is it possible to get a developer to work with you to try and work this out?

Revision history for this message
Alvaro Uria (aluria) wrote :

With regard to the same environment reported by Paolo in previous comments, we can't log in to the "secondary-to-be" mongo services.

0) Controllers are Xenial running Juju 2.2.1.1 and Mongo 3.2
1) "JUJU_MODEL=controller juju status --format yaml": http://pastebin.ubuntu.com/25031101/
2) rs.status() on PRIMARY: http://pastebin.ubuntu.com/25031086/
3) when trying to log in on "secondary-to-be" controllers: http://pastebin.ubuntu.com/25031117/
3.1) syslog shows:
"""
Jul 6 09:50:07 controller2 mongod.37017[20333]: [conn168826] SCRAM-SHA-1 authentication failed for machine-1 on admin from client 127.0.0.1 ; UserNotFound: Could not find user machine-1@admin
"""

It looks like mongo replicas are not able to properly bootstrap and reach the primary server.

Please let me know if I can help you further or need more information of any kind.

Kind regards,
-Alvaro.

Revision history for this message
John A Meinel (jameinel) wrote :

Given that, as far as we can see, the other mongos haven't joined the
replica set, and the local controllers aren't actually able to log into
mongo, I wonder if it's time to take the 'nuclear' option. If you stopped
jujud, then ran "sudo apt remove juju-mongodb3.2" and "sudo rm -rf
/var/lib/juju/db", I wonder if that would be enough for the jujud agents
to reinstall mongo and re-initialize a local mongo replica.

Note that I haven't tried doing any of those steps myself, so I can't
guarantee what would work, but *something* like that may work for those
agents. They should want to create a local mongo replica and make it sync
on startup.
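
Roughly, on an affected controller machine (machine 1 here purely as an example; untested):

```
sudo systemctl stop jujud-machine-1
sudo systemctl stop juju-db
sudo apt remove juju-mongodb3.2
sudo rm -rf /var/lib/juju/db
# starting the machine agent again should reinstall mongo and re-create the local replica
sudo systemctl start jujud-machine-1
```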

You could probably test this sort of thing on an 'lxd' provider just by
doing "juju bootstrap lxd; juju enable-ha -n3", and then stopping and
deleting mongo on one of the machines.
If you wanted to take it a step further, you could potentially firewall
port 37017 before running 'enable-ha' so that the other mongo daemons
wouldn't be able to join the replica set.
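
Something like the following, again untested and just a sketch:

```
juju bootstrap lxd
# optionally, on the existing controller (machine 0), block incoming mongo
# traffic first so the new mongods can't join the replica set, e.g.:
#   sudo iptables -A INPUT -p tcp --dport 37017 -j DROP
juju enable-ha -n3
```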

John
=:->

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The issue is reproducible after a restart of all agents - not only a single time after deployment:

1) {unit,machine} agent loss reported
2) juju run --all 'sudo systemctl restart jujud*'
3) wait for a day
4) {unit,machine} agent loss reported

There were no failovers after the initial deployment, despite two occurrences of agent loss:

ubuntu@maas:~$ juju run -m controller --all 'journalctl -u juju-db | grep PRIMARY'
- MachineId: "1"
  Stdout: |
    Jul 05 23:31:05 controller-2 mongod.37017[4569]: [ReplicationExecutor] Member 10.232.7.241:37017 is now in state PRIMARY
- MachineId: "2"
  Stdout: |
    Jul 05 23:31:19 controller-1 mongod.37017[4557]: [ReplicationExecutor] Member 10.232.7.241:37017 is now in state PRIMARY
- MachineId: "0"
  Stdout: |
    Jul 05 23:24:46 controller-juju mongod.37017[5295]: [ReplicationExecutor] transition to PRIMARY
    Jul 08 12:36:15 controller-juju mongod.37017[5295]: [conn17] command juju.actions command: insert { insert: "actions", documents: [ { _id: "95fd14d9-518f-4dbb-872d-060f8ca79886:0492f660-6ee0-4137-8c0c-ad9d930cb3c2", model-uuid: "95fd14d9-518f-4dbb-872d-060f8ca79886", receiver: "0/lxd/2", name: "juju-run", parameters: { command: "journalctl -u juju-db | grep PRIMARY", timeout: 300000000000 }, enqueued: new Date(1499517375000), started: new Date(-62135596800000), completed: new Date(-62135596800000), status: "pending", message: "", results: {}, txn-revno: 2, txn-queue: [ "5960d1be86d3b11563e97d6e_f3e2ce96" ] } ], writeConcern: { getLastError: 1, j: true }, ordered: true } ninserted:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:100 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_query 155ms
    Jul 08 12:36:24 controller-juju mongod.37017[5295]: [conn17] command juju.txns command: insert { insert: "txns", documents: [ { _id: ObjectId('5960d1c886d3b11563e97da6'), s: 1, o: [ { c: "machines", d: "95fd14d9-518f-4dbb-872d-060f8ca79886:3/lxd/4", a: { life: { $ne: 2 } } }, { c: "actions", d: "95fd14d9-518f-4dbb-872d-060f8ca79886:2d8f6c6b-75db-4799-85c5-e57c90fa3cae", a: "d-", i: { _id: "95fd14d9-518f-4dbb-872d-060f8ca79886:2d8f6c6b-75db-4799-85c5-e57c90fa3cae", model-uuid: "95fd14d9-518f-4dbb-872d-060f8ca79886", receiver: "3/lxd/4", name: "juju-run", parameters: { command: "journalctl -u juju-db | grep PRIMARY", timeout: 300000000000 }, enqueued: new Date(1499517385000), started: new Date(-62135596800000), completed: new Date(-62135596800000), status: "pending", message: "", results: {} } }, { c: "actionnotifications", d: "95fd14d9-518f-4dbb-872d-060f8ca79886:3/lxd/4_a_2d8f6c6b-75db-4799-85c5-e57c90fa3cae", a: "d-", i: { _id: "95fd14d9-518f-4dbb-872d-060f8ca79886:3/lxd/4_a_2d8f6c6b-75db-4799-85c5-e57c90fa3cae", model-uuid: "95fd14d9-518f-4dbb-872d-060f8ca79886", receiver: "3/lxd/4", actionid: "2d8f6c6b-75db-4799-85c5-e57c90fa3cae" } } ] } ], writeConcern: { getLastError: 1, j: true }, ordered: true } ninserted:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:100 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Co...


Chris Gregan (cgregan)
tags: added: cdo-qa-blocker
Revision history for this message
Alvaro Uria (aluria) wrote :

In my case, I was not able to reproduce the issue in a staging environment (using an openstack endpoint as backend). Prod env uses MAAS as endpoint.

1)
Prod: juju bootstrap prod-cloud --constraints tags=juju
Staging: juju bootstrap canonistack --constraints "cores=2 mem=4G"

2) juju enable-ha

3)
Prod: node#0 has-vote; node#1,2 "adding-vote". PRIMARY mongo does not show replicasets, but jujud-machine-{1,2} are started and mongo installed (although no machine-{1,2} users are created in juju-db).
Staging:
3.1) iptables --dport 37017 -j DROP
PRIMARY shows replicasets created, but in STARTUP state (no SECONDARY).
Stopping jujud-machine-{1,2}, running "apt remove juju-mongodb3.2" and restarting jujud-machine-{1,2} causes juju-mongodb3.2 to be reinstalled.

3.2) iptables --dport 17070 -j DROP (no other DROPs)
PRIMARY does not show new replicasets, only the primary mongo.
juju-mongodb3.2 is installed and started, but no machine-{1,2} user exists.
"juju status" shows nodes#1,#2 as "pending" and "juju ssh 1" (or "2") does not work because pub key has not been imported (no access to jujud-machine-0 to gather data).

At any time, flushing iptables rules recovers the environment (juju-db might need a restart on node#1,#2 to complete replication).

TL;DR:
The divergence between Prod and Staging is that Prod's jujud-machine-{1,2} connect to jujud-machine-0, but the replica set members are not created in the PRIMARY juju-db (a step needed for the other mongos to be able to connect to the primary instance).

John A Meinel (jameinel)
Changed in juju:
milestone: 2.2.2 → 2.3-alpha1
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
John A Meinel (jameinel) wrote :

So I have a theory that this *might* be related to the Presence issues in 2.2.1.
(bug #1703526).

I'm investigating now, but the theory is that if
$ juju bootstrap
# wait for presence to break from bug #1703526
$ juju enable-ha -n3

At this point, the machines come up, but because Presence is broken, they don't get detected as really being available, and thus don't get added to the replica set.
The code I'm curious about is:
// Don't add the machine unless it is "Started"
machineStatus, err := stm.Status()
if err != nil {
    return false, errors.Annotatef(err, "cannot get status for machine %q", id)
}
if machineStatus.Status == status.Started {
    logger.Debugf("machine %q has started, adding it to peergrouper list", id)
    tracker, err := newMachineTracker(stm, w.machineChanges)
    if err != nil {
        return false, errors.Trace(err)
    }
    if err := w.catacomb.Add(tracker); err != nil {
        return false, errors.Trace(err)
    }
    w.machineTrackers[id] = tracker
    changed = true
} else {
    logger.Debugf("machine %q not ready: %v", id, machineStatus.Status)
}

So the question is: if Presence is broken, can the machine still go into "status.Started"?

Revision history for this message
John A Meinel (jameinel) wrote :

My theory appears to be incorrect. I bootstrapped lxd, and broke the ping batcher. I could see that "juju status" reported all agents as lost.
I then ran "juju enable-ha -n3", and it actually demoted machine 0 because it was seen as dead, and promoted 3 new machines (1,2,3).
Those machines all came up successfully, and were added to the replica set, which now reports machines 1 2 and 3 as in the replica set. 'rs.Status()' also reports 1,2,3 as being in the set.

Revision history for this message
John A Meinel (jameinel) wrote :

Something else appears to be going wrong. All of the new machines that were started are happily participating in the new replicaset, but they are all flagged as "ha-pending" instead of "ha-enabled" and machine 0 is refusing to step down (stuck in removing-vote).

Revision history for this message
John A Meinel (jameinel) wrote :

I'm pretty sure the initial issue is actually just bug #1703526

Looking at the original paste, I think the issue is that we have:
2017-06-28 10:30:10 DEBUG juju.worker.logger logger.go:63 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
2017-06-28 10:32:42 WARNING juju.provisioner broker.go:100 incomplete DNS config found, discovering host's DNS config

By default log level is WARNING. So you see errors when we get disconnected, but you *don't* see any messages pertaining to reconnecting, because those are at best INFO level.
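
(To actually see those reconnect messages, the logging config on the affected model would need to be raised to at least INFO for the juju loggers, e.g. something like:)

```
juju model-config logging-config="<root>=INFO;unit=DEBUG"
```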

And so it appears as though the agent isn't trying to reconnect, which is true, but only because it is already happily connected. bug #1703526 is because one of the workers on the controller side could get restarted, and then we would stop tracking agent presence correctly.

In the next "controller logs" I think we have these lines:
no kvm containers possible
2017-06-29 09:39:17 ERROR juju.worker.dependency engine.go:546 "api-caller" manifold worker returned unexpected error: cannot open api: cannot log in: context deadline exceeded
2017-06-29 09:39:21 ERROR juju.worker.dependency engine.go:546 "api-caller" manifold worker returned unexpected error: cannot open api: cannot log in: context deadline exceeded

^- I think this is bug #1701438

and
2017-06-29 13:43:39 WARNING juju.apiserver log.go:168 http: TLS handshake error from 10.20.30.1:46188: tls: oversized record received with length 64774
^- I'm guessing this is testing Telnet.

As for https://paste.ubuntu.com/24982031/,
that looks like it is simply that when you ran 'enable-ha' we created a replica set, and once that was active, mongo disconnected everyone and triggered a vote, which may change the primary.
That should be what was going on around:
Jun 29 09:39:15

For the next item around "ha-pending", I think we should track that as a different bug, as this bug has since grown a bit too hairy to really tell which thing is which.
I have a system which shows the same symptom (all the new agents are in the replicaset and happy, but the old controller is still listed as "removing-vote" and the new ones are "adding-vote" even though they all *have* the right votes.)

Note that my case seems to differ from Paolo's because I do see the machines showing up as part of the replica set, but they are still listed as "adding-vote", while he never saw them get added to Mongo's configuration.

Looking at the debug-log under my condition I see:
machine-1: 14:10:08 ERROR juju.worker.dependency "state" manifold worker returned unexpected error: cannot connect to mongodb: no reachable servers
machine-0: 14:10:11 DEBUG juju.worker.peergrouper API host ports have not changed
machine-0: 14:10:11 DEBUG juju.worker.peergrouper calculating desired peer group
machine-0: 14:10:11 DEBUG juju.worker.peergrouper members: ...
   &peergrouper.machine{id: "3", wantsVote: true, hostPorts: [10.67.99.26:37017 127.0.0.1:37017 [::1]:37017]}: rs_id=3, rs_addr=10.67.99.26:37017
   &peergrouper.machine{id: "0", wantsVote: false, hostPorts: [10.67.99.160:37017 127.0.0.1:37017 [::1]:37017]}: rs_id=1, rs_addr=10.67.99.160:37017
   &peergrouper.machine{id: "2", wantsVote: true, hostPorts: [10.67.99.184:3701...


Revision history for this message
Alvaro Uria (aluria) wrote :

Initial status (prod env): rs.status() on PRIMARY does not show any replicaset.

Following the suggestion in #22,
1) I stopped jujud-machine-1 and its juju-db (I was not able to connect to it, as the machine-1 user does not exist)
2) I also stopped jujud-machine-2 and its juju-db (same as step 1) - I also backed up /var/lib/juju/db
3) On machine-1, I removed the juju-mongodb3.2 package and started jujud-machine-1 (log attached, from startup onwards)

On machine-0, I checked rs.status() again and it remains the same (no replicasets, just the PRIMARY). When I tested this in staging, replicaset was configured on primary mongo.

Revision history for this message
Alvaro Uria (aluria) wrote :

BTW, this is the same environment reported by Paolo.

Some more info, in addition to #29: machine-0.log shows:
2017-07-12 14:22:12 ERROR juju.worker.peergrouper worker.go:211 cannot set replicaset: Found two member configurations with same _id field, members.0._id == members.1._id == 1

juju:PRIMARY> rs.status()
{
        "set" : "juju",
        "date" : ISODate("2017-07-12T14:24:14.847Z"),
        "myState" : 1,
        "term" : NumberLong(1),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 1,
                        "name" : "127.0.0.1:37017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 1661145,
                        "optime" : {
                                "ts" : Timestamp(1499869454, 100),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2017-07-12T14:24:14Z"),
                        "electionTime" : Timestamp(1498208312, 2),
                        "electionDate" : ISODate("2017-06-23T08:58:32Z"),
                        "configVersion" : 2,
                        "self" : true
                }
        ],
        "ok" : 1
}

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The "Found two member configurations with same _id" error is a new one for me. It would be helpful to see what "rs.conf()" says.

Revision history for this message
John A Meinel (jameinel) wrote :

I think the original 'agent loses connection' bug is just a dupe of the existing bug, and the work on fixing the replica set should be split out into a separate bug.

Revision history for this message
John A Meinel (jameinel) wrote :

The Mongo issues have been split out into bug #1704072.

Revision history for this message
Alvaro Uria (aluria) wrote :

Menno, I've filed bug 1704075 (re: "Found two member configurations with same _id").
