i/o timeout from mongodb

Bug #1556961 reported by Andreas Hasenack
Affects: juju-core
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

A Landscape-driven cloud deployment failed, and we noticed this in our juju client logs:

Mar 14 04:55:55 juju-sync-1 INFO Handling failure RequestError: read tcp 10.96.15.100:37017: i/o timeout (code: '')

We didn't retry that, and filed bug #1556937 about it.

10.96.15.100 is the state server, and 37017 is mongo's port. We don't talk to mongo directly, so that was an internal juju connection.
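
For context on the error text itself: "i/o timeout" is the wording Go's net package uses when a read deadline expires before the peer responds, which is presumably what the state connection saw from a stalled or overloaded mongod on 37017. Below is a minimal, self-contained sketch, not juju or mgo code; the local listener that accepts but never replies just stands in for an unresponsive server:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Listener that accepts connections but never writes back,
	// standing in for a mongod that has stopped responding.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go func() {
		c, err := ln.Accept()
		if err == nil {
			defer c.Close()
			time.Sleep(2 * time.Second) // hold the connection open without replying
		}
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// With a read deadline set, a read that gets no data before the deadline
	// fails with a net error whose text is "... i/o timeout", the same wording
	// seen in the machine-0 and client logs above.
	conn.SetReadDeadline(time.Now().Add(500 * time.Millisecond))
	buf := make([]byte, 1)
	_, err = conn.Read(buf)
	fmt.Println(err) // e.g. "read tcp 127.0.0.1:...: i/o timeout"
}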

machine-0.log ends with these three lines:
2016-03-14 04:19:51 ERROR juju.worker.firewaller firewaller.go:439 failed to lookup "machine-3-lxc-5", skipping port change
2016-03-14 04:55:55 ERROR juju.state status.go:216 failed to write status history: read tcp 10.96.15.100:37017: i/o timeout
2016-03-14 04:55:56 ERROR juju.state.leadership manager.go:72 stopping leadership manager with error: read tcp 10.96.15.100:37017: i/o timeout

After that, all other units log warnings like this one:
unit-neutron-gateway-0[10502]: 2016-03-14 04:56:03 WARNING juju.worker.dependency engine.go:304 failed to start "uniter" manifold worker: dependency not available

Of note is that all-machines.log didn't get logs from all units, just one (!). I also spotted an rsyslog restart in /var/log/syslog:
Mar 14 04:55:59 albany rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="660130" x-info="http://www.rsyslog.com"] start

/var/log/syslog got quite big (over 300MB).

I'm attaching the relevant log files from the bootstrap node. This is from a CI job, so the environment is no longer up, but I do have logs from all units if you want them (https://ci.lscape.net/job/landscape-system-tests/1362/ for our reference).

Tags: landscape
tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Revision history for this message
Cheryl Jennings (cherylj) wrote :

I think this is caused by bug #1539656 (which was fixed in 1.25.4). I could verify by checking the unit logs, but I don't have access to the CI job. Can you add me?

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

That tarball has the juju logs from the units. The remaining logs in the CI job are files outside of /var/log/juju.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Going through the unit files, I cannot be 100% certain that bug #1539656 is the only issue happening here, although it is certainly one of them.

I'm going to mark this as Incomplete, pending a recreate on 1.25.4+. I'll also add some additional logging in 1.25.5 that should help with debugging this type of problem.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Restarting jujud on the state server *should* help in this case.

Changed in juju-core:
status: New → Incomplete
milestone: none → 1.25.5
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.5 → 1.25.6
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.6 → 1.25.7
Changed in juju-core:
milestone: 1.25.7 → none
Revision history for this message
Anastasia (anastasia-macmood) wrote :

I believe the root cause of the i/o timeout is fixed by https://bugs.launchpad.net/juju-core/+bug/1597601
