jujud on machine 0 stops listening on port 17070/tcp (WSS API)

Bug #1287718 reported by Eduardo Damato
This bug affects 12 people
Affects: juju-core
Status: Invalid
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

Encountered a problem where jujud stops listening on 17070/tcp while the process is still running on machine 0. All juju clients stop working because they cannot reach the WSS API on machine 0.

Juju agent version: 1.17.4

1- We can see the following processes, so mongod is running fine and the machine agent is running:

root@juju-cs2-machine-0:~# ps auxww | grep juju
root 5439 2.4 4.7 483664 48564 ? Ssl Mar03 41:48 mongod --auth --dbpath=/var/lib/juju/db --sslOnNormalPorts --sslPEMKeyFile /var/lib/juju/server.pem --sslPEMKeyPassword xxxxxxx --bind_ip 0.0.0.0 --port 37017 --noprealloc --syslog --smallfiles
root 5465 9.6 11.4 1167408 116616 ? Ssl Mar03 162:46 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
root 12661 0.3 1.8 640832 19180 ? Ssl Mar03 4:52 /var/lib/juju/tools/unit-ci-juju-gui-0/jujud unit --data-dir /var/lib/juju --unit-name ci-juju-gui/0 --debug
root 13040 0.0 2.2 153120 23188 ? Ss Mar03 0:00 /usr/bin/python /usr/local/bin/runserver.py --logging=info --guiroot=/var/lib/juju-gui/juju-gui/build-prod --sslpath=/etc/ssl/juju-gui --charmworldurl=https://manage.jujucharms.com/ --apiurl=wss://10.55.32.46:17070 --apiversion=go

2- However, port 17070/tcp is not open by the jujud machine-0 agent:

root@juju-cs2-machine-0:~# netstat -anp | grep 17070

3- In the logs we see hundreds of lines like:

2014-03-03 12:18:28 ERROR juju.cmd.jujud agent.go:241 closeWorker: close error: error receiving message: write tcp 127.0.0.1:17070: connection reset by peer
2014-03-03 12:18:28 ERROR juju runner.go:220 worker: exited "api": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-03-03 12:18:31 ERROR juju apiclient.go:118 state/api: websocket.Dial wss://localhost:17070/: dial tcp 127.0.0.1:17070: connection refused
2014-03-03 12:18:31 ERROR juju runner.go:220 worker: exited "api": websocket.Dial wss://localhost:17070/: dial tcp 127.0.0.1:17070: connection refused

4- Upon restart of the jujud machine-0 agent, all processes started working again and juju status succeeded. I could see that jujud was now listening on port 17070:

tcp6 0 0 :::17070 :::* LISTEN 1634/jujud

The listening process is PID 1634, which matches the ps output below:

root@juju-cs2-machine-0:~# ps auxwww | grep juju
root 1634 1.6 5.0 957760 51716 ? Ssl 13:54 0:23 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
root 5439 2.4 5.0 549200 51584 ? Ssl Mar03 42:25 mongod --auth --dbpath=/var/lib/juju/db --sslOnNormalPorts --sslPEMKeyFile /var/lib/juju/server.pem --sslPEMKeyPassword xxxxxxx --bind_ip 0.0.0.0 --port 37017 --noprealloc --syslog --smallfiles
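
For reference, a minimal restart-and-verify sequence for step 4 might look like the following. This is only a sketch; it assumes the standard upstart job name jujud-machine-0 used by juju 1.x machine agents (adjust the job name if it differs):

sudo service jujud-machine-0 restart       # restart the machine-0 agent
sudo netstat -anp | grep 17070             # confirm jujud is listening on the API port again
ps auxww | grep '[j]ujud machine'          # cross-check the PID reported by netstat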

Full log from machine-0 agent attached.

thanks,
Eduardo.

Revision history for this message
Eduardo Damato (edamato) wrote :
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: mongodb state-server
Changed in juju-core:
milestone: none → 1.18.0
tags: added: cts-cloud-review
Revision history for this message
Kapil Thangavelu (hazmat) wrote :

fwiw this is roughly similar to the issue behind bug #1284183 (disconnect all clients on juju), except that there the API server normally restarts.

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

it would be helpful to attach the mongo logs for this bug as well
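
Since the mongod in the ps output above runs with --syslog, its log lines go to the system log rather than a dedicated file. A rough way to collect them for attachment (a sketch; the syslog path and message tag are assumed for a default Ubuntu install and may vary):

grep -i mongo /var/log/syslog > /tmp/mongo.log       # pull the mongod entries out of syslog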

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

mongo.log from the reported environment attached

Revision history for this message
Para Siva (psivaa) wrote :

I've been having this issue where I am unable to deploy services after bootstrap in the cts cloud. I've tried restarting the instance but to no avail. /var/log/juju/machine-0.log is attached herewith. Restarting the jujud-machine-0 agent does not help either.

Revision history for this message
Para Siva (psivaa) wrote :

Full log containing info after the juju agent restart.

Changed in juju-core:
milestone: 1.20.0 → next-stable
Revision history for this message
Alex Chiang (achiang) wrote :

I left a comment here, which may be related:

https://bugs.launchpad.net/juju-core/+bug/1307434/comments/11

tags: added: cts
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.21 → 1.22
Changed in juju-core:
milestone: 1.22-alpha1 → 1.23
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.23 → none
importance: High → Medium
Revision history for this message
Samantha Jian-Pielak (samantha-jian) wrote :

Hi, one of our server partners hit this issue after a power outage on the juju bootstrap node and has tried all the workarounds (reboot, restart all juju services), but there is still no response from the juju state server. This is not a production environment, so we can build a new environment, but is there anything we can do to recover as much as we can? For example, can we export the MongoDB from the old environment and import it into a new one?
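
In case it helps with this kind of recovery, a rough sketch of dumping juju's MongoDB is below. This is not an officially supported migration path: --ssl requires a mongodump built with SSL support, the admin username is an assumption, and the password placeholder would need to be replaced with the state-server credentials from the machine agent's configuration (typically under /var/lib/juju/agents/machine-0/agent.conf):

mongodump --ssl --host 127.0.0.1 --port 37017 \
    --username admin --password '<state password>' --authenticationDatabase admin \
    --out /tmp/juju-db-dump

mongorestore would be the counterpart on a new state server, although whether the restored data yields a working environment is a separate question.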

Revision history for this message
Junien F (axino) wrote :

Hi,

We're having the same issue on one of our juju environments after an upgrade from 1.18.3.1 to 1.20.11. The log of a manual launch is attached. mongo seems to be running fine.
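
For anyone else capturing a manual launch like this, the agent can be stopped and run in the foreground with the same arguments visible in the ps output from the bug description. A sketch, assuming the upstart job name and tools path of a default juju 1.x install:

sudo service jujud-machine-0 stop
sudo /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug 2>&1 | tee /tmp/jujud-manual-launch.log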

It looks like jujud is not even trying to start the apiserver, for reasons unknown to me.

Could this be because of bad data in the mongo database?

Let me know if I can provide any additional details, thanks.

Revision history for this message
David Lawson (deej) wrote :

Interesting but anecdotal: both environments we see this with had problems installing mongo during the upgrade process. Is it possible that something which should have been written to mongo never was?

Changed in juju-core:
importance: Medium → High
milestone: none → 1.24-alpha1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0
Revision history for this message
John A Meinel (jameinel) wrote :

So all those lines about:
2015-04-06 15:29:41 DEBUG juju.state open.go:48 dialing mongo
2015-04-06 15:29:41 INFO juju.mongo open.go:104 dialled mongo successfully
2015-04-06 15:29:42 INFO juju.mongo open.go:104 dialled mongo successfully
2015-04-06 15:29:43 INFO juju.mongo open.go:104 dialled mongo successfully

the fact that it is repeated over and over tends to mean that it successfully accessed the port, but was unable to log in (we unfortunately aren't logging that information correctly). And if we are unable to connect to Mongo, then we won't start the API server (connecting to the DB is a prerequisite for the API server to start). There are also bits of the process that connect back to the API port on the same machine, which is why you see errors about the API server not being available.

Also, a bit confusingly, it is this line:
2015-04-06 15:29:41 INFO juju.worker runner.go:260 start "state"

which says that we are starting the "state" worker, which is responsible for connecting to Mongo and then starting the API Server.
The job here:
2015-04-06 15:29:44 INFO juju.worker runner.go:260 start "api"
is for workers that *use* the API, not ones that *are* the API.
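
A quick way to see that distinction in a machine-0.log is to pull out the worker start lines together with the mongo dial lines (a sketch, using the log path mentioned elsewhere in this bug):

grep -E 'start "state"|start "api"|dialled mongo' /var/log/juju/machine-0.log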

I'm not entirely sure what could happen in upgrading from 1.18 to 1.20 that might cause this, and I also find it odd, given Alex Chiang's note, that restarting jujud fixed it.
Looking at his log we have:
 ERROR juju runner.go:209 worker: fatal "upgrader": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout

That is, mongo is just not responding on its socket. I believe I've seen that when Mongo got overloaded. Now, I think we had a problem in the past where we could overload a given Session and then fail to call Session.Refresh() to let our mgo driver know that it should reconnect and try again.
I do believe we have fixed the Session.Refresh() bug in a different patch, quite a while ago.

Now the "dialed mongo successfully" repeated failures could be a bad password, or possibly a bad certificate.

If the original source of the bug was indeed jujud getting into a state where it had a failed connection to Mongo and wasn't refreshing the connection, then I believe that bug has already been fixed.

I haven't seen anything in this bug that indicates it affects Juju versions >= 1.20, other than 1.20 immediately after an upgrade from 1.18.
I'd be willing to debug live with someone on an actual instance where it is failing, and I'm happy to spend some time making sure we have better logging for cases where connecting to mongo is failing (if we are unable to connect because of a login failure, say so, or if the cert doesn't match, etc.)

Are we able to reliably reproduce this somewhere?

Revision history for this message
John A Meinel (jameinel) wrote :

Looking at the code in 1.24, I see informational lines while connecting to Mongo that have more informative messages like "connection failed, will retry: " with the error that caused us to fail.
However, these are being emitted at DEBUG level. It does have the INFO level line "dialled mongo successfully on address: %q", which should only be logged after we've successfully dialed and verified the server's certificate.

It appears that if we fail to Login, we don't log an error immediately, but we should generate an error internally that bubbles up to stop the "state" worker, which should then log the reason for the failure.

Note that a lot of these changes wouldn't have been in juju-1.20 (I believe), but should certainly be in 1.24-alpha1
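
As a side note, on an environment where the API is still reachable, the relevant modules can be raised to DEBUG via the logging-config setting so those retry messages become visible. A sketch for juju 1.x, using the logger names visible in the quoted log lines (this obviously does not help while the state server itself is down):

juju set-env logging-config='juju.state=DEBUG;juju.mongo=DEBUG'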

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Alexis Bruemmer (alexis-bruemmer) wrote :

Given John's last comment, can this bug be reproduced with 1.24?

Curtis Hovey (sinzui)
no longer affects: juju-core/1.24
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.0 → 1.24.0
milestone: 1.24.0 → 1.25.0
Revision history for this message
Firl (celpa-firl) wrote :

I just had this problem as well. I ended up needing to run sudo apt-get update to fix it.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.0 → 1.25.1
tags: added: sts
removed: cts
Changed in juju-core:
milestone: 1.25.1 → 1.26.0
Changed in juju-core:
milestone: 1.26.0 → none
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju-core because there has been no activity for 60 days.]

Changed in juju-core:
status: Incomplete → Expired
Revision history for this message
Jacek Nykis (jacekn) wrote :

Alexis,

To answer your question in comment #13: yes, I saw this in juju 1.24.7.

Changed in juju-core:
status: Expired → Confirmed
tags: removed: sts
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marking this as incomplete as the last confirmed version of juju to experience this failure was 1.24.7.
If this is still present in a current version of juju - either 1.25.X or 2.0.betaX - feel free to re-open.

Changed in juju-core:
status: Confirmed → Incomplete
Changed in juju-core:
status: Incomplete → Invalid
Revision history for this message
Asad Saeeduddin (masaeedu) wrote :

@Anastasia: Seeing this issue in 2.1-beta4-xenial-amd64. I've reported this in bug #1592179 for 2.0.0-xenial-amd64; not sure what the proper venue is.

I've attached a dump of /var/log/juju/machine-0.log from the controller immediately after bootstrap against the local LXD cloud. The ERROR messages are:

2017-02-01 02:34:16 ERROR juju.worker.dependency engine.go:547 "machiner" manifold worker returned unexpected error: setting machine addresses: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.worker.logger logger.go:47 connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.worker.dependency engine.go:547 "tools-version-checker" manifold worker returned unexpected error: cannot update tools information: connection is shut down
2017-02-01 02:34:16 ERROR juju.worker.dependency engine.go:547 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: connection is shut down
2017-02-01 02:34:16 ERROR juju.worker.dependency engine.go:547 "ssh-authkeys-updater" manifold worker returned unexpected error: starting key updater worker: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.1:17070->127.0.0.1:38130: write: broken pipe
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.1:17070->127.0.0.1:38130: write: broken pipe
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.1:17070->127.0.0.1:38130: write: broken pipe
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.1:17070->127.0.0.1:38130: write: broken pipe
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.1:17070->127.0.0.1:38130: write: broken pipe
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.1:17070->127.0.0.1:38130: write: broken pipe
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing resp...

