jujud on machine 0 stops listening to port 17070/tcp WSS api
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | Invalid | High | Unassigned | |
Bug Description
Encountered a problem where jujud stops listening on 17070/tcp while still running on machine 0. All juju clients stop working, as they cannot communicate with the WSS API on machine 0.
Juju agent version: 1.17.4
1- We can see the following processes. So mongod is running fine, and the machine agent is running:
root@juju-
root 5439 2.4 4.7 483664 48564 ? Ssl Mar03 41:48 mongod --auth --dbpath=
root 5465 9.6 11.4 1167408 116616 ? Ssl Mar03 162:46 /var/lib/
root 12661 0.3 1.8 640832 19180 ? Ssl Mar03 4:52 /var/lib/
root 13040 0.0 2.2 153120 23188 ? Ss Mar03 0:00 /usr/bin/python /usr/local/
2- However, port 17070/tcp is not open by the jujud machine-0 agent:
root@juju-
3- In the logs we have hundreds of:
2014-03-03 12:18:28 ERROR juju.cmd.jujud agent.go:241 closeWorker: close error: error receiving message: write tcp 127.0.0.1:17070: connection reset by peer
2014-03-03 12:18:28 ERROR juju runner.go:220 worker: exited "api": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-03-03 12:18:31 ERROR juju apiclient.go:118 state/api: websocket.Dial wss://localhost
2014-03-03 12:18:31 ERROR juju runner.go:220 worker: exited "api": websocket.Dial wss://localhost
4- Upon restarting the jujud machine-0 agent, all processes started to work fine and juju status worked. I could see that jujud was now listening on port 17070:
tcp6 0 0 :::17070 :::* LISTEN 1634/jujud
The listening process is PID 1634, which is consistent with the process listing below:
root@juju-
root 1634 1.6 5.0 957760 51716 ? Ssl 13:54 0:23 /var/lib/
root 5439 2.4 5.0 549200 51584 ? Ssl Mar03 42:25 mongod --auth --dbpath=
Full log from machine-0 agent attached.
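For reference, the client-side failure in step 3 is a WebSocket (WSS) dial against port 17070. A minimal probe along these lines reproduces what the juju client does at the transport level; this is only a sketch assuming the golang.org/x/net/websocket package, and the URL, origin and InsecureSkipVerify setting are illustrative rather than how the real client validates the server certificate:

```go
// Hedged sketch: dial the state server's WSS API endpoint the same way the
// websocket.Dial errors in step 3 suggest the clients are failing to.
package main

import (
	"crypto/tls"
	"fmt"
	"os"

	"golang.org/x/net/websocket"
)

func main() {
	// Illustrative endpoint; a real client uses the API address and CA
	// certificate from its environment configuration.
	url := "wss://localhost:17070/"
	cfg, err := websocket.NewConfig(url, "http://localhost/")
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad config:", err)
		os.Exit(1)
	}
	// Skipping verification only because this probe has no CA cert to hand.
	cfg.TlsConfig = &tls.Config{InsecureSkipVerify: true}

	ws, err := websocket.DialConfig(cfg)
	if err != nil {
		// With the bug present this fails, matching the
		// "websocket.Dial wss://localhost..." errors above.
		fmt.Fprintln(os.Stderr, "dial failed:", err)
		os.Exit(1)
	}
	defer ws.Close()
	fmt.Println("17070/tcp is accepting WSS connections")
}
```

With the bug present the dial fails just like the apiclient.go errors above; after restarting the machine-0 agent it succeeds.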
thanks,
Eduardo.
| Eduardo Damato (edamato) wrote : | #1 |
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → High |
| tags: | added: mongodb state-server |
| Changed in juju-core: | |
| milestone: | none → 1.18.0 |
| tags: | added: cts-cloud-review |
| Kapil Thangavelu (hazmat) wrote : | #2 |
| Kapil Thangavelu (hazmat) wrote : | #3 |
it would be helpful to attach the mongo logs as well for this bug
| Kapil Thangavelu (hazmat) wrote : | #4 |
mongo.log from reported environment attached
| Para Siva (psivaa) wrote : | #5 |
I've been having this issue where I am unable to deploy services after bootstrap in the cts cloud. I've tried restarting the instance but to no avail. /var/log/
| Para Siva (psivaa) wrote : | #6 |
Full log containing info after the juju agent restart.
| Changed in juju-core: | |
| milestone: | 1.20.0 → next-stable |
| Alex Chiang (achiang) wrote : | #7 |
I left a comment here, which may be related:
https:/
| tags: | added: cts |
| Changed in juju-core: | |
| milestone: | 1.21 → 1.22 |
| Changed in juju-core: | |
| milestone: | 1.22-alpha1 → 1.23 |
| Changed in juju-core: | |
| milestone: | 1.23 → none |
| importance: | High → Medium |
Hi, one of our server partners hit this issue after a power outage on the juju bootstrap node. They have tried all the workarounds (reboot, restart all juju services) and still get no response from the juju state server. This is not a production environment, so we can build a new one, but is there anything we can do to recover as much as possible? For example, can we export the MongoDB data from the old environment and import it into a new one?
| Junien Fridrick (axino) wrote : | #9 |
Hi,
We're having the same issue on one of our juju environments after an upgrade from 1.18.3.1 to 1.20.11. The log of a manual launch is attached; mongo seems to be running fine.
It looks like jujud is not even trying to start the apiserver, for reasons unknown to me.
Could this be because of bad data in the mongo database?
Let me know if I can provide any additional details, thanks.
| David Lawson (deej) wrote : | #10 |
Interesting but anecdotal: both environments we see this with had problems installing mongo during the upgrade process. Is it possible something that should have been written to mongo never was?
| Changed in juju-core: | |
| importance: | Medium → High |
| milestone: | none → 1.24-alpha1 |
| Changed in juju-core: | |
| milestone: | 1.24-alpha1 → 1.25.0 |
| John A Meinel (jameinel) wrote : | #11 |
So all those lines about:
2015-04-06 15:29:41 DEBUG juju.state open.go:48 dialing mongo
2015-04-06 15:29:41 INFO juju.mongo open.go:104 dialled mongo successfully
2015-04-06 15:29:42 INFO juju.mongo open.go:104 dialled mongo successfully
2015-04-06 15:29:43 INFO juju.mongo open.go:104 dialled mongo successfully
The fact that this is repeated over and over tends to mean that jujud successfully reached the port but was unable to log in (unfortunately we aren't logging that information correctly). And if we are unable to connect to Mongo, then we won't start the API server, since connecting to the DB is a prerequisite for the API server to start. Parts of the process also connect back to the agent on its own API port, which is why you can see the API server not being available.
Also, somewhat confusingly, it is this line
2015-04-06 15:29:41 INFO juju.worker runner.go:260 start "state"
which says that we are starting the "state" worker, which is responsible for connecting to Mongo and then starting the API Server.
The job here:
2015-04-06 15:29:44 INFO juju.worker runner.go:260 start "api"
is for workers that *use* the API, not ones that *are* the API.
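To make the ordering concrete, the dependency described above can be sketched roughly as follows. This is illustrative only and not juju's actual worker code: the mgo dial and the plain TCP listener stand in for the real "state" and API server workers, and the credentials are hypothetical (the real agent reads them from agent.conf).

```go
// Sketch: the API server is only started once the "state" worker has a
// working, authenticated Mongo session. If the dial or login fails, the
// listener on 17070 never comes up, which is the symptom in this bug.
package main

import (
	"log"
	"net"
	"time"

	"gopkg.in/mgo.v2"
)

func startStateWorker() (*mgo.Session, error) {
	// Hypothetical credentials for illustration only.
	info := &mgo.DialInfo{
		Addrs:    []string{"127.0.0.1:37017"},
		Timeout:  10 * time.Second,
		Database: "admin",
		Username: "machine-0",
		Password: "secret",
	}
	// DialWithInfo both connects and logs in; a bad password fails here
	// even though the TCP dial itself succeeded.
	return mgo.DialWithInfo(info)
}

func main() {
	session, err := startStateWorker()
	if err != nil {
		// No Mongo session -> no API server, so nothing listens on 17070.
		log.Fatalf("state worker failed: %v", err)
	}
	defer session.Close()

	// Only now does the API server start listening.
	l, err := net.Listen("tcp", ":17070")
	if err != nil {
		log.Fatalf("apiserver listen failed: %v", err)
	}
	log.Printf("API server listening on %s", l.Addr())
	// ... serve the WSS API on l here ...
}
```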
I'm not entirely sure what in upgrading from 1.18 to 1.20 might cause this; I also find it odd, given Alex Chiang's note, that restarting jujud fixed it.
Looking at his log we have:
ERROR juju runner.go:209 worker: fatal "upgrader": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
This shows that mongo is simply not responding on its socket; I believe I've seen that when Mongo got overloaded. I think we had a problem in the past where we could overload a given Session and then fail to call Session.Refresh() to let our mgo driver know that it should reconnect and try again.
I do believe we fixed that Session.Refresh() bug in a different patch quite a while ago.
Now the "dialed mongo successfully" repeated failures could be a bad password, or possibly a bad certificate.
If the original source of the bug was indeed Jujud getting into a state where it had a failed connection to Mongo and it wasn't refreshing the connection, than I believe that bug has already been fixed.
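The fix alluded to amounts to refreshing the mgo session once a socket has gone bad instead of reusing it forever. A rough sketch of that pattern (illustrative only; isNetworkError is a hypothetical helper, and the real juju code is structured quite differently):

```go
// Sketch: retry a Mongo operation, calling Session.Refresh() after a
// network error so mgo discards the dead socket and reconnects, rather
// than failing with "read tcp 127.0.0.1:37017: i/o timeout" forever.
package main

import (
	"log"
	"net"
	"time"

	"gopkg.in/mgo.v2"
)

// isNetworkError is a hypothetical helper for this sketch.
func isNetworkError(err error) bool {
	_, ok := err.(net.Error)
	return ok
}

func pingWithRefresh(session *mgo.Session, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = session.Ping(); err == nil {
			return nil
		}
		if isNetworkError(err) {
			// Discard cached dead sockets so mgo reconnects on the next call.
			session.Refresh()
		}
		time.Sleep(time.Second)
	}
	return err
}

func main() {
	session, err := mgo.DialWithInfo(&mgo.DialInfo{
		Addrs:   []string{"127.0.0.1:37017"},
		Timeout: 10 * time.Second,
	})
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer session.Close()

	if err := pingWithRefresh(session, 5); err != nil {
		log.Fatalf("mongo still unresponsive: %v", err)
	}
	log.Println("mongo responding again")
}
```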
I haven't seen anything in this bug indicating that it affects Juju versions >= 1.20, other than 1.20 itself immediately after an upgrade from 1.18.
I'd be willing to debug live with someone on an actual instance where it is failing, and I'm happy to spend some time making sure we have better logging for the cases where connecting to mongo fails (for example, say so explicitly if we can't connect because of a login failure or a certificate mismatch).
Are we able to reliably reproduce this somewhere?
| John A Meinel (jameinel) wrote : | #12 |
Looking at the code in 1.24, I see informational lines while connecting to Mongo that have more informative messages like "connection failed, will retry: " with the error that caused us to fail.
However, these are being emitted at DEBUG level. It does have the INFO level line "dialled mongo successfully on address: %q", which should only be logged after we've successfully dialed and verified the server's certificate.
It appears that if we fail to log in, we don't log an error immediately, but we should generate an error internally, which should bubble up, stop the "state" worker, and log the reason for the failure.
Note that a lot of these changes wouldn't have been in juju-1.20 (I believe), but should certainly be in 1.24-alpha1
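For comparison, the 1.24-era behaviour described above corresponds roughly to a dial loop like the following. This is a sketch using github.com/juju/loggo to mirror the log levels; the message strings echo the ones quoted in this bug, but the surrounding code is illustrative rather than juju's actual mongo-dialling code.

```go
// Sketch: retry dialling Mongo, logging transient failures at DEBUG and
// the eventual success at INFO, so the agent log shows why the "state"
// worker is stuck instead of just repeating "dialled mongo successfully".
package main

import (
	"time"

	"github.com/juju/loggo"
	"gopkg.in/mgo.v2"
)

var logger = loggo.GetLogger("juju.mongo")

func dialWithRetry(addr string, attempts int) (*mgo.Session, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		session, err := mgo.DialWithInfo(&mgo.DialInfo{
			Addrs:   []string{addr},
			Timeout: 10 * time.Second,
		})
		if err == nil {
			logger.Infof("dialled mongo successfully on address %q", addr)
			return session, nil
		}
		// At DEBUG this line is invisible in a default agent log.
		logger.Debugf("connection failed, will retry: %v", err)
		lastErr = err
		time.Sleep(time.Second)
	}
	return nil, lastErr
}

func main() {
	session, err := dialWithRetry("127.0.0.1:37017", 5)
	if err != nil {
		logger.Errorf("cannot connect to mongo: %v", err)
		return
	}
	defer session.Close()
}
```

Because the retry message is at DEBUG, a default agent log only shows the INFO success lines, which is why the repeated "dialled mongo successfully" entries in this bug give so little to go on.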
| Changed in juju-core: | |
| status: | Triaged → Incomplete |
| Alexis Bruemmer (alexis-bruemmer) wrote : | #13 |
Given John's last comment, can this bug be reproduced with 1.24?
| no longer affects: | juju-core/1.24 |
| Changed in juju-core: | |
| milestone: | 1.25.0 → 1.24.0 |
| milestone: | 1.24.0 → 1.25.0 |
| Firl (celpa-firl) wrote : | #14 |
I just had this problem as well. I ended up needing to run sudo apt-get update to fix it.
| Changed in juju-core: | |
| milestone: | 1.25.0 → 1.25.1 |
| tags: | added: sts removed: cts |
| Changed in juju-core: | |
| milestone: | 1.25.1 → 1.26.0 |
| Changed in juju-core: | |
| milestone: | 1.26.0 → none |
| Launchpad Janitor (janitor) wrote : | #15 |
[Expired for juju-core because there has been no activity for 60 days.]
| Changed in juju-core: | |
| status: | Incomplete → Expired |
| Jacek Nykis (jacekn) wrote : | #16 |
Alexis,
To answer your question in comment #13: yes, I saw this in juju 1.24.7.
| Changed in juju-core: | |
| status: | Expired → Confirmed |
| tags: | removed: sts |
| Anastasia (anastasia-macmood) wrote : | #17 |
Marking this as incomplete as the last confirmed version of juju to experience this failure was 1.24.7.
If this is still present in a current version of juju - either 1.25.X or 2.0.betaX - feel free to re-open.
| Changed in juju-core: | |
| status: | Confirmed → Incomplete |
| Changed in juju-core: | |
| status: | Incomplete → Invalid |
| Asad Saeeduddin (masaeedu) wrote : | #18 |
@Anastasia Seeing this issue in 2.1-beta4-
I've attached a dump of /var/log/
2017-02-01 02:34:16 ERROR juju.worker.
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.worker.logger logger.go:47 connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.worker.
2017-02-01 02:34:16 ERROR juju.worker.
2017-02-01 02:34:16 ERROR juju.worker.
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing response: write tcp 127.0.0.
2017-02-01 02:34:16 ERROR juju.rpc server.go:510 error writing resp...


FWIW, this is roughly similar to the issue behind bug #1284183 (disconnect all clients on juju), except that there the API server normally restarts.