[2.2.1] juju-agent loses connection to controller and doesn't retry to connect
- Series 2.2
- Bug #1701275
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Triaged | Critical | John A Meinel |
2.2 | Triaged | Critical | John A Meinel |
Bug Description
After the deployment completes and runs successfully, nearly all agents lose their connection to the controller after about one day.
The juju agents do not retry connecting to the controllers.
Log output:
https:/
init@maas:~$ juju show-controller
XXX:
details:
uuid: c8910be4-
api-endpoints: ['10.20.
ca-cert: |
-----BEGIN CERTIFICATE-----
XXX
-----END CERTIFICATE-----
cloud: XXX
agent-version: 2.2.1
controller-machines:
"0":
instance-id: 7p3wy7
ha-status: ha-enabled
"1":
instance-id: ypm4xs
ha-status: ha-enabled
"2":
instance-id: cqgktg
ha-status: ha-enabled
models:
controller:
uuid: def665b3-
machine-
core-count: 6
default:
uuid: 2a0acada-
openstack:
uuid: b3b6050e-
machine-
core-count: 317
current-model: admin/controller
account:
user: admin
access: superuser
Thomas Zylla (thomas.zylla) wrote : | #1 |
tags: | added: cpe cpe-sa |
Dmitrii Shcherbakov (dmitriis) wrote : | #2 |
Thomas,
Could you also provide controller logs to find out what happened at about the same time that these 'connection reset' errors happened?
2017-06-29 09:39:32 ERROR juju.worker.
2017-06-29 09:39:32 ERROR juju.worker runner.go:381 fatal "9-container-
2017-06-29 09:39:32 ERROR juju.worker.
2017-06-29 09:39:32 ERROR juju.worker runner.go:381 fatal "lxd-provisioner": codec.ReadHeader error: error receiving message: read tcp 10.20.37.
2017-06-29 09:39:32 ERROR juju.worker runner.go:381 fatal "stateconverter": codec.ReadHeader error: error receiving message: read tcp 10.20.37.
2017-06-29 09:39:32 ERROR juju.worker.
2017-06-29 09:39:32 ERROR juju.worker.
2017-06-29 09:39:32 ERROR juju.worker.
2017-06-29 09:39:32 ERROR juju.worker.
2017-06-29 09:39:32 ERROR juju.worker.
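One way to collect the corresponding controller-side entries for the same window, assuming the default agent log location under /var/log/juju, is something like:
```
# Pull matching entries from every controller machine's agent log.
juju run -m controller --all 'grep "2017-06-29 09:39" /var/log/juju/machine-*.log'
```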
Thomas Zylla (thomas.zylla) wrote : | #3 |
juju status for the controller:
https:/
juju controller-log from the same time:
https:/
It looks like the HA failover didn't notify the juju agents, since the primary controller changed at the same time:
https:/
Thomas Zylla (thomas.zylla) wrote : | #4 |
output from all juju agents with "connection reset by peer" grep:
https:/
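A sketch of how output like that can be gathered in one pass, assuming the default agent log directory:
```
# Grep every machine's agent logs for the reset errors.
juju run --all 'grep -h "connection reset by peer" /var/log/juju/*.log | tail -n 5'
```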
Dmitrii Shcherbakov (dmitriis) wrote : | #5 |
It seems that the issue happens during a failover to a different controller/mongo, judging by the timestamps provided by Thomas:
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExe
lay: 0, votes: 1 }, { _id: 3, host: "10.20.
ne-id: "0" }, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatInterv
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExe
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExe
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExe
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExe
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfa
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfa
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExe
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfa
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExe
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [NetworkInterfa
Jun 29 09:39:15 juju-controller01 mongod.37017[1237]: [ReplicationExe
https:/
----
If so, this is a regression from 2.1.2.
----
Reproduced this independently in a lab but with slightly different symptoms.
Looking at mongod logs in my environment, I cannot correlate the agent loss with failovers - there was only one transition to PRIMARY on a single controller:
ubuntu@maas:~$ juju run -m controller --all 'journalctl -u juju-db | grep transition'
- MachineId: "1"
Stdout: |
Jun 29 22:33:18 controller-2 mongod.37017[5026]: [ReplicationExe
Jun 29 22:33:43 controller-2 mongod.37017[5026]: [ReplicationExe
Jun 29 22:33:43 controller-2 mongod.37017[5026]: [ReplicationExe
- MachineId: "2"
Stdout: |
Jun 29 22:33:18 controller-1 mongod.37017[5046]: [ReplicationExe
summary: | juju-agent loses connection to controller and doesn't retry to connect → [2.2.1] juju-agent loses connection to controller and doesn't retry to connect |
Changed in juju: | |
status: | New → Triaged |
importance: | Undecided → Critical |
Changed in juju: | |
milestone: | none → 2.2.2 |
tags: | added: 4010 |
Paolo de Rosa (paolo-de-rosa) wrote : | #6 |
Not sure if it's the same, but it seems related: in our customer environment, after enabling HA, the two new controllers were deployed correctly but their HA status is still pending.
production-
details:
uuid: ccf049d7-
api-endpoints: ['100.107.
ca-cert: |
-----BEGIN CERTIFICATE-----
-----END CERTIFICATE-----
cloud: maas
agent-version: 2.2.1.1
controller-machines:
"0":
instance-id: skxxw7
ha-status: ha-enabled
"1":
instance-id: 43gbdf
ha-status: ha-pending
"2":
instance-id: s4mhpb
ha-status: ha-pending
models:
controller:
uuid: 66978ec6-
machine-
core-count: 48
default:
uuid: 80bea1dd-
production:
uuid: 1eb3f40f-
machine-
core-count: 1616
current-model: admin/production
account:
user: admin
access: superuser
Paolo de Rosa (paolo-de-rosa) wrote : | #7 |
Machines 1 and 2 in the controller model are logging the following in /var/log/
2017-07-03 10:03:49 ERROR juju.worker.
2017-07-03 10:03:52 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:
2017-07-03 10:04:22 ERROR juju.worker.
2017-07-03 10:04:25 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:
Tim Penhey (thumper) wrote : | #8 |
@Paolo, can you look in /var/log/syslog for mongod entries please?
Also, take a look in /var/lib/juju/db to see if there are actually database files on machines 1 and 2.
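In command form, that check would look something like this (machine numbers taken from the show-controller output above):
```
# Look for mongod entries in syslog and confirm database files exist on
# the two controllers whose ha-status is still pending.
for m in 1 2; do
    juju ssh -m controller $m 'grep mongod /var/log/syslog | tail -n 20; sudo ls -la /var/lib/juju/db'
done
```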
Tim Penhey (thumper) wrote : | #9 |
I have been having difficulty reproducing this. Are you able to reproduce with the following logging config? Please set this logging config for both the controller model and the deployment model.
juju model-config -m controller logging-
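The command above is cut off in this paste; the exact logging spec Tim requested is not recoverable, but a plausible form, reusing the "<root>=DEBUG" spec that appears in comment #28, would be:
```
# Hypothetical reconstruction of the truncated command above; "default" is
# the deployment model named in the show-controller output.
juju model-config -m controller logging-config="<root>=DEBUG"
juju model-config -m default logging-config="<root>=DEBUG"
```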
If this generates a reproduction, then I'd like the machine-x.log files from all the controller machines, and a log file (or two) from the agents that are not reconnecting.
Thanks.
Paolo de Rosa (paolo-de-rosa) wrote : | #10 |
@Tim the database files are there on both controllers, but there are no mongo entries in syslog. With the log level increased, the error I got before is a bit more verbose but otherwise the same. Basically, jujud is not able to connect to mongo on localhost, even though the mongo process is running and netstat shows the port is open.
All the agents are connected to the first controller that is working.
### Mongo process
17967 ? Ssl 42:56 /usr/lib/
### Netstat
tcp 0 0 0.0.0.0:37017 0.0.0.0:* LISTEN 17967/mongod
### Log
2017-07-04 06:44:59 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:
2017-07-04 06:45:30 TRACE juju.worker.
2017-07-04 06:45:30 DEBUG juju.worker.
2017-07-04 06:45:30 ERROR juju.worker.
2017-07-04 06:45:30 TRACE juju.worker.
2017-07-04 06:45:33 TRACE juju.worker.
2017-07-04 06:45:33 TRACE juju.worker.
2017-07-04 06:45:33 TRACE juju.worker.
2017-07-04 06:45:33 INFO juju.state open.go:129 opening state, mongo addresses: ["localhost:
Paolo de Rosa (paolo-de-rosa) wrote : | #11 |
I have restarted mongo, and some logging now appears in syslog, but the result is the same: jujud is not able to connect to mongo locally.
Jul 4 06:51:18 sitename-
Tim Penhey (thumper) wrote : | #12 |
As far as not being able to connect to the local mongo, I see only two possibilities.
One is that the agent has an invalid password, but I'm pretty sure that that would give a different error message.
The other is that a proxy setting is getting in the way.
What are the various model proxy values set for the controller model?
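In command form, something like:
```
# List any proxy-related settings on the controller model.
juju model-config -m controller | grep -i proxy
```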
Menno Finlay-Smits (menno.smits) wrote : | #13 |
Paolo: Are you able to connect to MongoDB using the mongo shell?
If so, can you please check the replicaset status using the "rs.status()" command?
Paolo de Rosa (paolo-de-rosa) wrote : | #14 |
There are no proxies set.
Attribute From Value
agent-metadata-url default ""
agent-stream default released
agent-version model 2.2.1.1
apt-ftp-proxy default ""
apt-http-proxy default ""
apt-https-proxy default ""
apt-mirror default ""
apt-no-proxy default ""
automatically-
default-series default xenial
development default false
disable-
enable-
enable-os-upgrade default true
firewall-mode default instance
ftp-proxy default ""
http-proxy default ""
https-proxy default ""
ignore-
image-metadata-url default ""
image-stream default released
logforward-enabled default false
logging-config model juju=INFO;
max-status-
max-status-
net-bond-
no-proxy default 127.0.0.
provisioner-
proxy-ssh default false
resource-tags model {}
ssl-hostname-
storage-
test-mode default false
transmit-
update-
Paolo de Rosa (paolo-de-rosa) wrote : | #15 |
Connecting with the mongo CLI, it seems that the password is wrong. I'm still using "oldpassword" to connect; is that right?
mongo --ssl -u admin -p $(grep oldpassword /var/lib/
John A Meinel (jameinel) wrote : Re: [Bug 1701275] Re: [2.2.1] juju-agent loses connection to controller and doesn't retry to connect | #16 |
oldpassword and the admin user are only relevant on the machine that bootstrapped. You want to use "statepassword" and the agent name (e.g. machine-2) on all other machines.
John
=:->
Paolo de Rosa (paolo-de-rosa) wrote : | #17 |
same result.
root@site-
MongoDB shell version: 2.6.10
connecting to: localhost:
2017-07-
exception: login failed
Paolo de Rosa (paolo-de-rosa) wrote : | #18 |
mongo --ssl -u machine-1 -p $(grep statepassword /var/lib/
MongoDB shell version: 2.6.10
connecting to: localhost:
2017-07-
exception: login failed
Menno Finlay-Smits (menno.smits) wrote : | #19 |
I think you need to provide the "--authenticati
mongopass=`sudo grep statepassword /var/lib/
mongo 127.0.0.
Tim Penhey (thumper) wrote : | #20 |
I have used the following to get access to mongo on any API server:
```
user=$(basename `ls -d /var/lib/
password=`sudo grep statepassword /var/lib/
/usr/lib/
```
Which is now also defined in https:/
Is it possible to get a developer to work with you to try and work this out?
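Tim's snippet is truncated in this paste; a plausible reconstruction of the same approach, with paths assumed from a default xenial install of the juju-mongodb3.2 package, is:
```
# Best-effort reconstruction; the agent directory and mongo binary paths
# are assumptions, not confirmed by this thread.
user=$(basename `ls -d /var/lib/juju/agents/machine-*`)
password=`sudo grep statepassword /var/lib/juju/agents/$user/agent.conf | cut -d ' ' -sf2`
/usr/lib/juju/mongo3.2/bin/mongo 127.0.0.1:37017/juju --authenticationDatabase admin \
    --ssl --sslAllowInvalidCertificates -u "$user" -p "$password"
```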
Alvaro Uria (aluria) wrote : | #21 |
With regard to the same environment reported by Paolo in previous comments: we can't log in to the "secondary-to-be" mongo services.
0) Controllers are Xenial running Juju 2.2.1.1 and Mongo 3.2
1) "JUJU_MODEL=
2) rs.status() on PRIMARY: http://
3) when trying to log in on "secondary-to-be" controllers: http://
3.1) syslog shows:
"""
Jul 6 09:50:07 controller2 mongod.
"""
It looks like the mongo replicas are not able to properly bootstrap and reach the primary server.
Please let me know if I can help further or if you need more information of any kind.
Kind regards,
-Alvaro.
John A Meinel (jameinel) wrote : | #22 |
Given that, as far as we can see, the other mongos haven't joined the replica set, and the local controllers aren't actually able to log into mongo, I wonder if it's time to take the 'nuclear' option: stop jujud, then sudo apt remove juju-mongodb3.2 and sudo rm -rf /var/lib/juju/db.
I wonder if that would be enough for the jujud agents to reinstall mongo and re-initialize a local mongo replica.
Note that I haven't tried doing any of those steps myself, so I can't
guarantee what would work, but *something* like that may work for those
agents. They should want to create a local mongo replica and make it sync
on startup.
You could probably test this sort of thing on an 'lxd' provider just by
doing "juju bootstrap lxd; juju enable-ha -n3", and then stopping and
deleting mongo on one of the machines.
If you wanted to take it a step further, you could potentially firewall
port 37017 before running 'enable-ha' so that the other mongo daemons
wouldn't be able to join the replica set.
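A sketch of that experiment (untested, per the caveat above; the machine number is illustrative):
```
# Throwaway HA controller on the lxd provider.
juju bootstrap lxd
juju enable-ha -n3

# Stop and wipe mongo on one controller machine, then watch whether jujud
# reinstalls and re-initializes it on restart.
juju ssh -m controller 1 'sudo systemctl stop jujud-machine-1 juju-db'
juju ssh -m controller 1 'sudo apt remove -y juju-mongodb3.2 && sudo rm -rf /var/lib/juju/db'
juju ssh -m controller 1 'sudo systemctl start jujud-machine-1'
```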
John
=:->
Dmitrii Shcherbakov (dmitriis) wrote : | #23 |
The issue is reproducible after a restart of all agents - not only a single time after deployment:
1) {unit,machine} agent loss reported
2) juju run --all 'sudo systemctl restart jujud*'
3) wait for a day
4) {unit,machine} agent loss reported
There were no failovers between the initial deployment and the two cases of agent loss:
ubuntu@maas:~$ juju run -m controller --all 'journalctl -u juju-db | grep PRIMARY'
- MachineId: "1"
Stdout: |
Jul 05 23:31:05 controller-2 mongod.37017[4569]: [ReplicationExe
- MachineId: "2"
Stdout: |
Jul 05 23:31:19 controller-1 mongod.37017[4557]: [ReplicationExe
- MachineId: "0"
Stdout: |
Jul 05 23:24:46 controller-juju mongod.37017[5295]: [ReplicationExe
Jul 08 12:36:15 controller-juju mongod.37017[5295]: [conn17] command juju.actions command: insert { insert: "actions", documents: [ { _id: "95fd14d9-
Jul 08 12:36:24 controller-juju mongod.37017[5295]: [conn17] command juju.txns command: insert { insert: "txns", documents: [ { _id: ObjectId(
tags: | added: cdo-qa-blocker |
Alvaro Uria (aluria) wrote : | #24 |
In my case, I was not able to reproduce the issue in a staging environment (using an OpenStack endpoint as backend). The prod env uses MAAS as endpoint.
1)
Prod: juju bootstrap prod-cloud --constraints tags=juju
Staging: juju bootstrap canonistack --constraints "cores=2 mem=4G"
2) juju enable-ha
3)
Prod: node#0 has-vote; nodes#1,2 "adding-vote". The PRIMARY mongo does not show replica-set members, but jujud-machine-{1,2} are started and mongo is installed (although no machine-{1,2} users are created in juju-db).
Staging:
3.1) iptables --dport 37017 -j DROP (written out in full below)
PRIMARY shows replicasets created, but in STARTUP state (no SECONDARY).
Stopping jujud-machine-
3.2) iptables --dport 17070 -j DROP (no other DROPs)
PRIMARY does not show new replicasets, only the primary mongo.
juju-mongodb3.2 is installed and started, but no machine-{1,2} user exists.
"juju status" shows nodes#1,#2 as "pending" and "juju ssh 1" (or "2") does not work because pub key has not been imported (no access to jujud-machine-0 to gather data).
At any time, flushing iptables rules recovers the environment (juju-db might need a restart on node#1,#2 to complete replication).
TL;DR:
The divergence between Prod and Staging is that in Prod, jujud-machine-{1,2} connect to jujud-machine-0, but replica-set members are never created in the PRIMARY juju-db (a necessary step for the other mongos to be able to connect to the primary instance).
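For reference, the abbreviated rules in 3.1 and 3.2 above, written out in full (assuming the INPUT chain on the controller node), would be something like:
```
# 3.1: block mongo replication traffic.
sudo iptables -A INPUT -p tcp --dport 37017 -j DROP
# 3.2: block the Juju API port instead.
sudo iptables -A INPUT -p tcp --dport 17070 -j DROP
# Recover by flushing the rules (juju-db may then need a restart).
sudo iptables -F
```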
Changed in juju: | |
milestone: | 2.2.2 → 2.3-alpha1 |
assignee: | nobody → John A Meinel (jameinel) |
John A Meinel (jameinel) wrote : | #25 |
So I have a theory that this *might* be related to the Presence issues in 2.2.1 (bug #1703526).
I'm investigating now, but the theory is that if
$ juju bootstrap
# wait for presence to break from bug #1703526
$ juju enable-ha -n3
At this point, the machines come up, but because Presence is broken, they don't get detected as really being available, and thus don't get added to the replica set.
The code I'm curious about is:
// Don't add the machine unless it is "Started".
// (Identifiers truncated in the original paste; the completions below are a
// best-effort reading of the 2.2-era peergrouper worker, not verbatim source.)
machineStatus, err := stm.Status()
if err != nil {
	return false, errors.Trace(err)
}
if machineStatus.Status == status.Started {
	logger.Debugf("machine %q has started, adding it to peergrouper list", stm.Id())
	tracker, err := newMachineTracker(stm)
	if err != nil {
		return false, errors.Trace(err)
	}
	if err := w.catacomb.Add(tracker); err != nil {
		return false, errors.Trace(err)
	}
	w.machineTrackers[stm.Id()] = tracker
	changed = true
} else {
	logger.Debugf("machine %q not ready to join peergrouper: %v", stm.Id(), machineStatus.Status)
}
So if Presence is broken, can the machine ever go into "status.Started"?
John A Meinel (jameinel) wrote : | #26 |
My theory appears to be incorrect. I bootstrapped lxd and broke the ping batcher. I could see that "juju status" reported all agents as lost.
I then ran "juju enable-ha -n3", and it actually demoted machine 0 because it was seen as dead, and promoted 3 new machines (1, 2, 3).
Those machines all came up successfully and were added to the replica set, which now reports machines 1, 2 and 3 as members; 'rs.status()' also reports 1, 2, 3 as being in the set.
John A Meinel (jameinel) wrote : | #27 |
Something else appears to be going wrong. All of the new machines that were started are happily participating in the new replicaset, but they are all flagged as "ha-pending" instead of "ha-enabled" and machine 0 is refusing to step down (stuck in removing-vote).
John A Meinel (jameinel) wrote : | #28 |
I'm pretty sure the initial issue is actually just bug #1703526
Looking at the original paste, I think the issue is that we have:
2017-06-28 10:30:10 DEBUG juju.worker.logger logger.go:63 reconfiguring logging from "<root>=DEBUG" to "<root>
2017-06-28 10:32:42 WARNING juju.provisioner broker.go:100 incomplete DNS config found, discovering host's DNS config
By default the log level is WARNING. So you see errors when we get disconnected, but you *don't* see any messages about reconnecting, because those are at best INFO level.
So it appears as though the agent isn't trying to reconnect (which is true), but that is because it is already happily connected. Bug #1703526 is because one of the workers on the controller side could get restarted, after which we would stop tracking agent presence correctly.
In the next "controller logs" I think we have these lines:
no kvm containers possible
2017-06-29 09:39:17 ERROR juju.worker.
2017-06-29 09:39:21 ERROR juju.worker.
^- I think this is bug #1701438
and
2017-06-29 13:43:39 WARNING juju.apiserver log.go:168 http: TLS handshake error from 10.20.30.1:46188: tls: oversized record received with length 64774
^- I'm guessing this is testing Telnet.
for https:/
that looks like it is just that when you ran 'enable-ha' we created a replica set; once that became active, mongo disconnected everyone and triggered a vote, which may change the primary.
Which should be what was going on around:
Jun 29 09:39:15
For the next item around "ha-pending", I think we should track that as a different bug, as this bug has since grown a bit too hairy to really tell which thing is which.
I have a system which shows the same symptom (all the new agents are in the replicaset and happy, but the old controller is still listed as "removing-vote" and the new ones are "adding-vote" even though they all *have* the right votes.)
Note that my case seems to differ from Paolo's: I do see the machines showing up as part of the replica set, yet they are still listed as "adding-vote", while he never saw them get added to Mongo's configuration at all.
Looking at the debug-log under my condition I see:
machine-1: 14:10:08 ERROR juju.worker.
machine-0: 14:10:11 DEBUG juju.worker.
machine-0: 14:10:11 DEBUG juju.worker.
machine-0: 14:10:11 DEBUG juju.worker.
&peergrouper
&peergrouper
&peergrouper
Alvaro Uria (aluria) wrote : | #29 |
- Attachment: machine-1.log (29.0 KiB, text/plain)
Initial status (prod env): rs.status() on PRIMARY does not show any replica-set members.
Following the suggestion in #22:
1) I stopped jujud-machine-1 and its juju-db (not able to connect to it, as the machine-1 user does not exist)
2) I also stopped jujud-machine-2 and its juju-db (same as step 1). I also backed up /var/lib/juju/db
3) On machine-1, I removed the juju-mongodb3.2 package and started jujud-machine-1 (log attached, covering the period since startup)
On machine-0, I checked rs.status() again and it remains the same (no replica-set members, just the PRIMARY). When I tested this in staging, the replica set was configured on the primary mongo.
Alvaro Uria (aluria) wrote : | #30 |
BTW, this is the same environment reported by Paolo.
More info, in addition to #29: machine-0.log shows:
2017-07-12 14:22:12 ERROR juju.worker.
juju:PRIMARY> rs.status()
{
"set" : "juju",
"date" : ISODate(
"myState" : 1,
"term" : NumberLong(1),
"members" : [
{
}
],
"ok" : 1
}
Menno Finlay-Smits (menno.smits) wrote : | #31 |
The "Found two member configurations with same _id" error is a new one for me. It would be helpful to see what "rs.conf()" says.
John A Meinel (jameinel) wrote : | #32 |
I think the original 'agent loses connection' bug is just a dupe of the existing bug, and the work on fixing the replica set should be split out into a separate bug.
John A Meinel (jameinel) wrote : | #33 |
The split-out Mongo issues have moved to bug #1704072.
Alvaro Uria (aluria) wrote : | #34 |
Menno, I've filed bug 1704075 (re: "Found two member configurations with same _id").
Connectivity between unit and controller works:
root@controller01:~# telnet 10.20.30.67 17070
Trying 10.20.30.67...
Connected to 10.20.30.67.
Escape character is '^]'.