stuck rsyslogd causes mongodb to block

Bug #1616832 reported by Merlijn Sebrechts
Affects          Status     Importance  Assigned to  Milestone
Canonical Juju   Expired    Medium      Unassigned
juju-ci-tools    Expired    Wishlist    Unassigned
juju-core        Won't Fix  Undecided   Unassigned
juju-core 1.25   Won't Fix  Undecided   Unassigned

Bug Description

This is a production environment using the manual provider with one physical machine, running Juju 1.x. I'm not sure how to get the exact Juju version since all commands fail.

PROBLEM: All juju commands fail with "i/o timeout". Juju-db and jujud-machine-0 are both running.

WHAT I ATTEMPTED: I restarted Juju-db and jujud-machine-0, but the issue remains.

EXAMPLE:

juju status --format tabular
WARNING discarding API open error: read tcp 127.0.0.1:37017: i/o timeout
ERROR Unable to connect to environment "cityot".
Please check your credentials or use 'juju bootstrap' to create a new environment.
Error details:
read tcp 10.8.0.1:37017: i/o timeout

machine-0.log:

2016-08-25 09:37:27 ERROR juju.worker runner.go:223 exited "api": read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:37:39 ERROR juju.worker runner.go:223 exited "peergrouper": cannot read settings: read tcp 10.8.0.1:37017: i/o timeout
2016-08-25 09:37:43 ERROR juju.worker runner.go:223 exited "certupdater": retrieving initial server addesses: read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:38:03 ERROR juju.worker runner.go:223 exited "peergrouper": cannot read settings: read tcp 10.8.0.1:37017: i/o timeout
2016-08-25 09:38:27 ERROR juju.worker runner.go:223 exited "peergrouper": cannot read settings: read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:38:51 ERROR juju.worker runner.go:223 exited "peergrouper": cannot read settings: read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:39:05 ERROR juju.worker runner.go:223 exited "api": read tcp 10.8.0.1:37017: i/o timeout
2016-08-25 09:39:11 ERROR juju.worker runner.go:223 exited "certupdater": read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:39:15 ERROR juju.worker runner.go:223 exited "peergrouper": cannot read settings: read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:39:21 ERROR juju.worker runner.go:223 exited "528e26f1-5281-4307-83cf-65e2165b3356": failed to open state for environment 528e26f1-5281-4307-83cf-65e2165b3356: cannot create lease client: read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:40:04 ERROR juju.worker runner.go:223 exited "certupdater": read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:40:09 ERROR juju.worker runner.go:223 exited "peergrouper": cannot read settings: read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:40:13 ERROR juju.worker runner.go:223 exited "528e26f1-5281-4307-83cf-65e2165b3356": failed to open state for environment 528e26f1-5281-4307-83cf-65e2165b3356: cannot log in to admin database as "machine-0": read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:40:28 ERROR juju.worker runner.go:223 exited "certupdater": retrieving initial server addesses: read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:40:31 ERROR juju.worker runner.go:223 exited "api": read tcp 10.8.0.1:37017: i/o timeout
2016-08-25 09:40:43 ERROR juju.worker runner.go:223 exited "528e26f1-5281-4307-83cf-65e2165b3356": failed to open state for environment 528e26f1-5281-4307-83cf-65e2165b3356: cannot log in to admin database as "machine-0": read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:40:47 ERROR juju.worker.peergrouper worker.go:138 peergrouper loop terminated: tomb: dying
2016-08-25 09:40:47 ERROR juju.worker runner.go:223 exited "peergrouper": read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:40:52 ERROR juju.worker runner.go:223 exited "certupdater": retrieving initial server addesses: read tcp 127.0.0.1:37017: i/o timeout
2016-08-25 09:41:11 ERROR juju.worker runner.go:223 exited "peergrouper": cannot read settings: read tcp 127.0.0.1:37017: i/o timeout

Mongotop output:

mongotop --host localhost:37017 --ssl
connected to: localhost:37017
error: { ok: 0.0, errmsg: "unauthorized" }
error: { ok: 0.0, errmsg: "unauthorized" }

description: updated
Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

machine-0.log

no longer affects: juju-core
Curtis Hovey (sinzui)
tags: added: manual-provider
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Those i/o timeout issues indicate that the Juju controller agent is unable to connect to its MongoDB instance. It appears that MongoDB is unhappy for some reason.

Can you please provide the MongoDB logs in /var/log/syslog?

Changed in juju:
status: New → Incomplete
Changed in juju-core:
status: New → Incomplete
Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

The latest syslog is completely empty. Attached is syslog.1, grepped for 'mongo'.

/var/log/upstart/juju-db.log:

note: noprealloc may hurt performance in many applications
using syslog ident: mongod.37017
note: noprealloc may hurt performance in many applications
using syslog ident: mongod.37017

Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

As you can see, juju-db is running, but it doesn't seem to be outputting anything.

initctl list | grep juju
...
juju-db start/running, process 18519
jujud-machine-0 start/running, process 16974

Changed in juju:
status: Incomplete → Triaged
importance: Undecided → High
milestone: none → 2.0-beta18
Changed in juju-core:
status: Incomplete → Won't Fix
Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

This is a production cluster, so I'd like to get it up and running again. Could you help me find a workaround?

Revision history for this message
William Reade (fwereade) wrote :

Can you log in to the database from the machine with these [0] instructions, and see what's in the admin database? If that doesn't work we might have to run MongoDB without auth to look inside...

[0] https://github.com/juju/juju/wiki/Login-into-MongoDB

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The evidence so far points to a broken MongoDB. It's quite strange that there are no recent logs. The syslog output provided doesn't give any clues (everything looks normal there).

As Will suggests, the best course of action now is to try and manually connect to MongoDB using the shell as per the instructions he linked to. The output of the following would be useful:

  use juju
  rs.status()
  db.serverStatus()
  db.stats()

Changed in juju:
status: Triaged → Incomplete
milestone: 2.0-beta18 → none
Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

I tried to connect to Mongo as explained in [0]. The connection just times out, so it seems MongoDB has a problem starting properly. That might explain why there are no logs.

Any idea how to debug this?

[0] https://github.com/juju/juju/wiki/Login-into-MongoDB

Changed in juju:
status: Incomplete → Triaged
status: Triaged → Incomplete
Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

Output of command:

mongo 127.0.0.1:37017/juju--authenticationDatabase admin --ssl --username "admin" --password "xxxxxxxxxxxxxxxxxxxxxx"
MongoDB shell version: 2.4.9
connecting to: 127.0.0.1:37017/juju--authenticationDatabase
Fri Sep 2 09:25:08.196 Error: 18 { code: 18, ok: 0.0, errmsg: "auth fails" } at src/mongo/shell/db.js:228
exception: login failed

Possibly relevant: executing 'sudo' on that host takes a very long time (~10 seconds). No idea why.

Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

Wow! Found it! Restarting rsyslog fixed the problem, so it seems a crashed rsyslog held up mongodb (and sudo).

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Well done \o/
We'll take a closer look at our rsyslog. Just to confirm: what version of Juju are you using, Juju 1.25?

Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote : Re: [Bug 1616832] Re: manual environment juju-db timeout

I'm using Juju 1.25.6.

I have encountered this bug a few times before. Those environments weren't important, so I didn't look into it further. However, this means I'll probably run into it again in the future, so it would be good to find the root cause. Let me know what information you need to get to the bottom of this.

tags: added: eda
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote : Re: manual environment juju-db timeout

I've done some digging into this. It seems that if rsyslogd is stuck for some reason then anything logging to it will soon get stuck too. Local logging to rsyslogd/syslogd is done via the /dev/log unix domain socket and if that isn't being read from, the socket buffer soon fills up and processes attempting to send logs to it get blocked.

I'm not sure why rsyslog was blocked. Is there anything in earlier rotated versions of /var/log/syslog?

In all versions of Juju (including 2.0) we have MongoDB logging to syslog, so MongoDB was blocked, which in turn caused problems for Juju.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

To replicate the problem, run this in one terminal:

https://gist.github.com/mjs/e8c3750d9ef16d1a008b42b15969cb12

and then in another:

$ pkill -STOP rsyslogd

You'll notice the timestamps emitted in the first terminal quickly stop.
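
The gist itself isn't reproduced here, but a minimal sketch of the same idea in Go (assuming a Linux host where rsyslogd reads /dev/log; the tag, message text and interval are illustrative) would be:

  // Keep writing to the local syslog socket while printing timestamps to
  // stdout. Once rsyslogd stops draining /dev/log, the socket buffer fills
  // and the Info call blocks, so the stdout timestamps stop advancing.
  package main

  import (
      "fmt"
      "log/syslog"
      "time"
  )

  func main() {
      w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_DAEMON, "stuck-rsyslog-demo")
      if err != nil {
          panic(err)
      }
      defer w.Close()

      for i := 0; ; i++ {
          fmt.Println(time.Now().Format(time.RFC3339), "sending message", i)
          w.Info(fmt.Sprintf("message %d", i)) // blocks once /dev/log backs up
          time.Sleep(100 * time.Millisecond)
      }
  }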

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Possible workarounds/solutions:

1. Figure out what's causing rsyslogd to hang.

2. Set something up to detect when rsyslogd is stuck and restart it. A simple process which attempts to log regularly in a separate thread/goroutine and notices when that gets stuck would do it (see the sketch after this list).

3. Have Juju configure mongod to log directly to a file instead of to syslog. Log file rotation will need to be sorted out.
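
As an illustration of option 2 (this is only a sketch, not Juju code; the service name, intervals and staleness threshold are assumptions), a small watchdog could look like:

  // A goroutine writes a heartbeat line to syslog; if the write path to
  // /dev/log blocks because rsyslogd has stopped draining it, the heartbeat
  // stops advancing and the supervisor loop restarts rsyslog.
  package main

  import (
      "log"
      "log/syslog"
      "os/exec"
      "sync/atomic"
      "time"
  )

  func main() {
      w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_DAEMON, "rsyslog-watchdog")
      if err != nil {
          log.Fatal(err)
      }

      var lastOK atomic.Int64
      lastOK.Store(time.Now().UnixNano())

      // Heartbeat goroutine: blocks inside w.Info() if /dev/log backs up.
      go func() {
          for {
              if w.Info("heartbeat") == nil {
                  lastOK.Store(time.Now().UnixNano())
              }
              time.Sleep(5 * time.Second)
          }
      }()

      // Supervisor loop: restart rsyslog when the heartbeat goes stale.
      for {
          time.Sleep(30 * time.Second)
          stale := time.Since(time.Unix(0, lastOK.Load()))
          if stale > time.Minute {
              log.Printf("syslog heartbeat stale for %s; restarting rsyslog", stale)
              // "service rsyslog restart" is assumed (Upstart host, as in this report).
              _ = exec.Command("service", "rsyslog", "restart").Run()
          }
      }
  }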

summary: - manual environment juju-db timeout
+ stuck rsyslogd causes mongodb to block
Changed in juju:
status: Incomplete → Triaged
milestone: none → 2.0-rc1
Revision history for this message
Merlijn Sebrechts (merlijn-sebrechts) wrote :

I don't see anything relevant in syslog. Possibly relevant, from dmesg:

[15693894.264172] Out of memory: Kill process 12876 (jujud) score 566 or sacrifice child
[15693894.264298] Killed process 12876 (jujud) total-vm:16240012kB, anon-rss:9890920kB, file-rss:0kB
[15693896.946435] init: jujud-machine-0 main process (12876) killed by KILL signal
[15693896.946459] init: jujud-machine-0 main process ended, respawning
[15701537.396581] init: rsyslog main process (17623) killed by ABRT signal
[15701537.396601] init: rsyslog main process ended, respawning
[15703610.772118] init: rsyslog main process (26743) killed by ABRT signal
[15703610.772139] init: rsyslog main process ended, respawning
[15703699.390319] init: rsyslog main process (28635) killed by ABRT signal
[15703699.390338] init: rsyslog main process ended, respawning
[15704652.447818] init: rsyslog main process (28754) killed by ABRT signal
[15704652.447837] init: rsyslog main process ended, respawning
[15708580.939157] init: rsyslog main process (29697) killed by SEGV signal
[15708580.939177] init: rsyslog main process ended, respawning
[15711112.370742] init: rsyslog main process (1035) killed by SEGV signal
[15711112.370762] init: rsyslog main process ended, respawning
[15717140.790974] init: rsyslog main process (3278) killed by SEGV signal
[15717140.790995] init: rsyslog main process ended, respawning
[15734205.449190] init: rsyslog main process (8526) killed by SEGV signal
[15734205.449211] init: rsyslog main process ended, respawning
[15739140.085834] init: rsyslog main process (23710) killed by ABRT signal
[15739140.085854] init: rsyslog main process ended, respawning
[15741501.865336] init: rsyslog main process (27951) killed by SEGV signal
[15741501.865356] init: rsyslog main process ended, respawning
[15745203.240097] init: rsyslog main process (30101) killed by SEGV signal
[15745203.240118] init: rsyslog main process ended, respawning
[15764428.080724] init: rsyslog main process (1259) killed by SEGV signal
[15764428.080746] init: rsyslog main process ended, respawning
[15764874.113360] init: rsyslog main process (17816) killed by SEGV signal
[15764874.113381] init: rsyslog main process ended, respawning
[15767003.283012] init: rsyslog main process (18235) killed by SEGV signal
[15767003.283032] init: rsyslog main process ended, respawning
[15770521.471332] init: rsyslog main process (20468) killed by SEGV signal
[15770521.471352] init: rsyslog main process ended, respawning
[15771313.026194] init: rsyslog main process (23571) killed by SEGV signal
[15771313.026216] init: rsyslog main process ended, respawning
[15776637.709000] init: rsyslog main process (24248) killed by ABRT signal
[15776637.709020] init: rsyslog main process ended, respawning
[15781051.656705] init: rsyslog main process (28754) killed by SEGV signal
[15781051.656724] init: rsyslog main process ended, respawning
[15783756.481548] init: rsyslog main process (3864) killed by ABRT signal
[15783756.481567] init: rsyslog main process ended, respawning
[15784878.817742] init: rsyslog main process (6151) killed by ABRT signal
[15784878.817761] init: rsyslog main process ended, respawning
[15786999.080045] init: rsyslog ...

Changed in juju:
milestone: 2.0-rc1 → 2.0-rc2
Changed in juju:
milestone: 2.0-rc2 → 2.0.1
Changed in juju-core:
status: Won't Fix → Triaged
importance: Undecided → Critical
tags: added: sts
Curtis Hovey (sinzui)
tags: added: needs-ci-test
tags: removed: manual-provider
Changed in juju:
milestone: 2.0.1 → 2.1.0
Changed in juju-ci-tools:
importance: Undecided → Wishlist
Curtis Hovey (sinzui)
Changed in juju-ci-tools:
status: New → Triaged
Changed in juju-core:
milestone: none → 1.25.11
Revision history for this message
Anastasia (anastasia-macmood) wrote :

We are not likely to address this in 2.1.x. Re-targeting in "juju".

Changed in juju:
milestone: 2.1.0 → 2.2.0
Revision history for this message
John A Meinel (jameinel) wrote :

I'm guessing this isn't as major an issue in 2.x because upgrading from Mongo 2.4 to 3.2 causes Mongo to write far less to syslog. So while it is probably still true that a hung rsyslog could cause Mongo to hang, it is less likely to actually trigger in practice.

For 1.25, I don't think we are actually going to take the time to develop the feature, as that release has moved out of its active development phase. And while I'm fine with backporting fixes to 1.25 that are reasonably aligned with our goals for 2.x, if this isn't really important for 2.x we don't really have the time to dedicate to a 1.25 fix. (Far better to spend that time getting an upgrade-to-2.x working and having people get all the other goodness that we have spent our time on.)

Changed in juju:
importance: High → Medium
milestone: 2.2-beta1 → none
Changed in juju-core:
status: Triaged → Won't Fix
importance: Critical → Undecided
milestone: 1.25.11 → none
Revision history for this message
Anastasia (anastasia-macmood) wrote :

We need confirmation that an upgrade to Mongo 3.2 and higher has essentially resolved the issue as less is written to syslog.

Meanwhile I will mark this report as Incomplete. If the problem persists, feel free to re-open this report or file a new one against a concrete Juju version.

Changed in juju:
status: Triaged → Incomplete
Changed in juju-ci-tools:
status: Triaged → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju-ci-tools because there has been no activity for 60 days.]

Changed in juju-ci-tools:
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired