talking to mongo can fail with "TCP i/o timeout"

Bug #1307434 reported by John A Meinel
This bug affects 12 people
Affects         Status        Importance  Assigned to    Milestone
juju-core       Fix Released  High        Michael Foord
juju-core 1.20  Fix Released  High        Ian Booth

Bug Description

While doing scale testing, I ended up getting:

ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 52/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 52/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 52/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 52/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot assign unit "ubuntu/8589" to machine 14: write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 50/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 52/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 55/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 55/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout

It appears that if you overload Mongo enough that it can't respond to requests as fast as you are making them, you fail with an i/o timeout rather than getting a nicer error.

This might be considered a bug in 'mgo', in that it doesn't notice that Mongo is overloaded and tell you to back off. At the least, juju needs to be able to recover gracefully from this state.
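
A minimal sketch of the kind of graceful recovery meant here, assuming the mgo driver (withRetry and the retry-once policy are illustrative, not juju's actual code): detect the network timeout and refresh the session, so the retry gets a fresh socket instead of reusing the dead one.

    package main

    import (
        "log"
        "net"

        "gopkg.in/mgo.v2"
    )

    // withRetry runs op once and, if it fails with a network timeout
    // like the "i/o timeout" above, refreshes the session (discarding
    // any dead pooled sockets) and retries once.
    func withRetry(session *mgo.Session, op func(*mgo.Session) error) error {
        err := op(session)
        if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
            session.Refresh()
            err = op(session)
        }
        return err
    }

    func main() {
        session, err := mgo.Dial("127.0.0.1:37017")
        if err != nil {
            log.Fatalf("cannot dial mongo: %v", err)
        }
        defer session.Close()

        err = withRetry(session, func(s *mgo.Session) error {
            return s.Ping()
        })
        if err != nil {
            log.Printf("still failing after refresh: %v", err)
        }
    }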

Revision history for this message
Adam Stokes (adam-stokes) wrote :

We've actually seen this exact problem when deploying 4 charms on trusty using juju-core 1.18.x.

The charms deployed were:

nova-cloud-controller
glance
keystone
mysql

All in the local provider, with an LXC container for each.

tags: added: cloud-installer
Revision history for this message
Adam Collard (adam-collard) wrote :

We hit this problem when deploying 64 nodes. Restarting the state server worked around it.

Revision history for this message
Chris J Arges (arges) wrote :

Me too:

2014-05-08 18:51:49 INFO install Traceback (most recent call last):
2014-05-08 18:51:49 INFO install File "/var/lib/juju/agents/unit-glance-0/charm/hooks/install", line 385, in <module>
2014-05-08 18:51:49 INFO install hooks.execute(sys.argv)
2014-05-08 18:51:49 INFO install File "/var/lib/juju/agents/unit-glance-0/charm/hooks/charmhelpers/core/hookenv.py", line 381, in execute
2014-05-08 18:51:49 INFO install self._hooks[hook_name]()
2014-05-08 18:51:49 INFO install File "/var/lib/juju/agents/unit-glance-0/charm/hooks/install", line 71, in install_hook
2014-05-08 18:51:49 INFO install src = config('openstack-origin')
2014-05-08 18:51:49 INFO install File "/var/lib/juju/agents/unit-glance-0/charm/hooks/charmhelpers/core/hookenv.py", line 44, in wrapper
2014-05-08 18:51:49 INFO install res = func(*args, **kwargs)
2014-05-08 18:51:49 INFO install File "/var/lib/juju/agents/unit-glance-0/charm/hooks/charmhelpers/core/hookenv.py", line 166, in config
2014-05-08 18:51:49 INFO install return json.loads(subprocess.check_output(config_cmd_line))
2014-05-08 18:51:49 INFO install File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
2014-05-08 18:51:49 INFO install raise CalledProcessError(retcode, cmd, output=output)
2014-05-08 18:51:49 INFO install subprocess.CalledProcessError: Command '['config-get', 'openstack-origin', '--format=json']' returned non-zero exit status 1
2014-05-08 18:51:49 ERROR juju.worker.uniter uniter.go:486 hook failed: exit status 1
2014-05-08 18:51:49 ERROR juju runner.go:209 worker: fatal "uniter": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-08 18:51:49 ERROR juju runner.go:220 worker: exited "api": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju.worker.uniter.filter filter.go:116 watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-08 18:53:03 ERROR juju runner.go:209 worker: fatal "upgrader": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-08 18:53:03 ERROR juju runner.go:209 worker: fatal "logger": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-08 18:53:03 ERROR juju runner.go:209 worker: fatal "uniter": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-0...

Nate Finch (natefinch)
Changed in juju-core:
importance: Low → High
Revision history for this message
Kapil Thangavelu (hazmat) wrote :

We've seen this quite a bit. We also found some panics in jujud on the state server; bouncing the machine-0 jujud resolved it.

Revision history for this message
Nate Finch (natefinch) wrote :

This is happening with small numbers of units (Adam Stokes' case above was just one unit each of four services) and apparently requires a state server restart to resolve. Increased importance to High because it seems this is much more likely than previously thought.

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

Also of note: this was seen on systems with only minor load and just two machines.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → next-stable
Revision history for this message
Nate Finch (natefinch) wrote :

I hit this recently just by running bootstrap and then ensure availability, with no services deployed at all.

Revision history for this message
Christian Reis (kiko) wrote :

This is happening to us at the Async deployment as well, with a 10-container installation. Nothing very interesting in the mongo log.

tags: added: reliability
tags: added: landscape
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: next-stable → 1.19.3
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.19.3 → 1.19.4
Revision history for this message
Adam Stokes (adam-stokes) wrote :

Hi,

Why is this only targeted for 1.19.x when the problem is seen in the stable 1.18.x release? Can we get this scheduled for inclusion in the stable release series?

Thanks
Adam

Revision history for this message
Mike McCracken (mikemc) wrote :

Here's a unit log for another instance of this issue, from deploying glance inside an LXC container on a host in a MAAS cluster:

http://paste.ubuntu.com/7589184/

Here's the corresponding juju status:
http://paste.ubuntu.com/7589198/

I've hit this same error more than half of the times I've tried to deploy this environment.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.19.4 → next-stable
Revision history for this message
Alex Chiang (achiang) wrote :

Here is my contribution to this bug.

/var/log/juju/all-machines.log: http://pastebin.ubuntu.com/7691311/

This occurs with:

- a bootstrap node
- a single mongodb unit
- my own service/unit that I am developing.

(deployed on an OpenStack instance)

SSHing to the bootstrap node and issuing:

$ sudo service jujud-machine-0 restart
jujud-machine-0 stop/waiting
jujud-machine-0 start/running, process 6442

seems to recover the state server and allows `juju status` to work again.

Ian Booth (wallyworld)
Changed in juju-core:
milestone: next-stable → 1.20.2
Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Michael Foord (mfoord)
status: Triaged → In Progress
Revision history for this message
Michael Foord (mfoord) wrote :

This looks like several different bugs, or more specifically the same issue surfacing in several different places. We believe the underlying cause is that juju holds onto a single session to mongo, so if this connection dies, everything is irredeemably broken. The best fix is to open a new session whenever we need one and rely on the underlying socket pooling provided by the mgo driver. Unfortunately our watcher architecture (and the state itself) holds onto an indirect reference to this session via mgo collections, so it needs a bit of "unpicking" to resolve.

This (short) email thread is a useful reference on the topic: https://groups.google.com/forum/#!topic/mgo-users/iZeiKgonGDU
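
A minimal sketch of the session-per-operation pattern described above, again assuming the mgo driver (rootSession and addUnitDoc are illustrative names, not juju's real code): dial once, then Copy() the root session for each operation, so every use draws a socket from mgo's pool and a dead connection doesn't poison everything else.

    package main

    import (
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    // rootSession is dialed once at startup; everything else copies it.
    var rootSession *mgo.Session

    // addUnitDoc is an illustrative stand-in for a state operation.
    func addUnitDoc(name string) error {
        // Copy hands this operation its own session backed by mgo's
        // socket pool; a timeout kills only this copy, not the root.
        session := rootSession.Copy()
        defer session.Close()

        units := session.DB("juju").C("units")
        return units.Insert(bson.M{"_id": name})
    }

    func main() {
        var err error
        rootSession, err = mgo.Dial("127.0.0.1:37017")
        if err != nil {
            log.Fatalf("cannot dial mongo: %v", err)
        }
        defer rootSession.Close()

        if err := addUnitDoc("ubuntu/0"); err != nil {
            log.Printf("add unit: %v", err)
        }
    }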

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.20.2 → 1.21-alpha1
Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released