talking to mongo can fail with "TCP i/o timeout"

Bug #1307434 reported by John A Meinel
This bug affects 12 people
Affects         Status        Importance  Assigned to    Milestone
juju-core       Fix Released  High        Michael Foord
juju-core 1.20  Fix Released  High        Ian Booth

Bug Description

While doing scale testing, I ended up getting:

ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 52/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 52/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 52/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 52/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot assign unit "ubuntu/8589" to machine 14: write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 50/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 52/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 53/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 55/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout
ERROR cannot add unit 55/100 to service "ubuntu": cannot add unit to service "ubuntu": write tcp 127.0.0.1:37017: i/o timeout

It appears that if you overload Mongo enough that it can't respond to requests as fast as you are making them, you fail with an i/o timeout rather than getting a nicer error.

This might be considered a bug in 'mgo', in that it doesn't notice that Mongo is overloaded and tell you to back off. At the least, juju needs to be able to recover gracefully from this state.
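
A minimal sketch of the kind of graceful recovery meant here, assuming the mgo driver (withRetry and the retry-once policy are illustrative, not juju's actual code): detect the network timeout and refresh the session, so the retry gets a fresh socket instead of reusing the dead one.

    package main

    import (
        "log"
        "net"

        "gopkg.in/mgo.v2"
    )

    // withRetry runs op once and, if it fails with a network timeout
    // like the "i/o timeout" above, refreshes the session (discarding
    // any dead pooled sockets) and retries once.
    func withRetry(session *mgo.Session, op func(*mgo.Session) error) error {
        err := op(session)
        if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
            session.Refresh()
            err = op(session)
        }
        return err
    }

    func main() {
        session, err := mgo.Dial("127.0.0.1:37017")
        if err != nil {
            log.Fatalf("cannot dial mongo: %v", err)
        }
        defer session.Close()

        err = withRetry(session, func(s *mgo.Session) error {
            return s.Ping()
        })
        if err != nil {
            log.Printf("still failing after refresh: %v", err)
        }
    }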

Revision history for this message
Adam Stokes (adam-stokes) wrote :

We've actually seen this exact problem when deploying 4 charms on trusty using juju-core 1.18.x.

The charms deployed were:

nova-cloud-controller
glance
keystone
mysql

All in the local provider, with an LXC container for each.

tags: added: cloud-installer
Revision history for this message
Adam Collard (adam-collard) wrote :

We hit this problem when deploying 64 nodes. Restarting the state server worked around it.

Revision history for this message
Chris J Arges (arges) wrote :

Me too:

2014-05-08 18:51:49 INFO install Traceback (most recent call last):
2014-05-08 18:51:49 INFO install File "/var/lib/juju/agents/unit-glance-0/charm/hooks/install", line 385, in <module>
2014-05-08 18:51:49 INFO install hooks.execute(sys.argv)
2014-05-08 18:51:49 INFO install File "/var/lib/juju/agents/unit-glance-0/charm/hooks/charmhelpers/core/hookenv.py", line 381, in execute
2014-05-08 18:51:49 INFO install self._hooks[hook_name]()
2014-05-08 18:51:49 INFO install File "/var/lib/juju/agents/unit-glance-0/charm/hooks/install", line 71, in install_hook
2014-05-08 18:51:49 INFO install src = config('openstack-origin')
2014-05-08 18:51:49 INFO install File "/var/lib/juju/agents/unit-glance-0/charm/hooks/charmhelpers/core/hookenv.py", line 44, in wrapper
2014-05-08 18:51:49 INFO install res = func(*args, **kwargs)
2014-05-08 18:51:49 INFO install File "/var/lib/juju/agents/unit-glance-0/charm/hooks/charmhelpers/core/hookenv.py", line 166, in config
2014-05-08 18:51:49 INFO install return json.loads(subprocess.check_output(config_cmd_line))
2014-05-08 18:51:49 INFO install File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
2014-05-08 18:51:49 INFO install raise CalledProcessError(retcode, cmd, output=output)
2014-05-08 18:51:49 INFO install subprocess.CalledProcessError: Command '['config-get', 'openstack-origin', '--format=json']' returned non-zero exit status 1
2014-05-08 18:51:49 ERROR juju.worker.uniter uniter.go:486 hook failed: exit status 1
2014-05-08 18:51:49 ERROR juju runner.go:209 worker: fatal "uniter": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-08 18:51:49 ERROR juju runner.go:220 worker: exited "api": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju watcher.go:66 state/api: error trying to stop watcher connection is shut down
2014-05-08 18:53:03 ERROR juju.worker.uniter.filter filter.go:116 watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-08 18:53:03 ERROR juju runner.go:209 worker: fatal "upgrader": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-08 18:53:03 ERROR juju runner.go:209 worker: fatal "logger": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-08 18:53:03 ERROR juju runner.go:209 worker: fatal "uniter": watcher iteration error: read tcp 127.0.0.1:37017: i/o timeout
2014-05-0...

Nate Finch (natefinch)
Changed in juju-core:
importance: Low → High
Revision history for this message
Kapil Thangavelu (hazmat) wrote :

We've seen this quite a bit. We also found some panics in jujud on the state server; bouncing the machine-0 jujud resolved it.

Revision history for this message
Nate Finch (natefinch) wrote :

This is happening with small numbers of units (Adam Stokes' case above was just one unit each of four services) and apparently requires a state server restart to resolve. Increased importance to High because it seems this is much more likely than previously thought.

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

Also of note: this was seen on systems with only minor load and just two machines.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → next-stable
Revision history for this message
Nate Finch (natefinch) wrote :

I hit this recently just by running bootstrap and then ensure availability, with no services deployed at all.

Revision history for this message
Christian Reis (kiko) wrote :

This is happening to us at the Async deployment as well, with a 10-container installation. Nothing very interesting in the mongo log.

tags: added: reliability
tags: added: landscape
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: next-stable → 1.19.3
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.19.3 → 1.19.4
Revision history for this message
Adam Stokes (adam-stokes) wrote :

Hi,

Why is this only targeted for 1.19.x when the problem is seen in the stable 1.18.x release? Can we get this scheduled for inclusion in the stable release series?

Thanks
Adam

Revision history for this message
Mike McCracken (mikemc) wrote :

Here's a unit log for another instance of this issue, from deploying glance inside an LXC container on a host in a MAAS cluster:

http://paste.ubuntu.com/7589184/

Here's the corresponding juju status:
http://paste.ubuntu.com/7589198/

I've hit this same error more than half of the times I've tried to deploy this environment.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.19.4 → next-stable
Revision history for this message
Alex Chiang (achiang) wrote :

Here is my contribution to this bug.

/var/log/juju/all-machines.log: http://pastebin.ubuntu.com/7691311/

This occurs with:

- a bootstrap node
- a single mongodb unit
- my own service/unit that I am developing.

(deployed on an OpenStack instance)

SSHing to the bootstrap node and issuing:

$ sudo service jujud-machine-0 restart
jujud-machine-0 stop/waiting
jujud-machine-0 start/running, process 6442

seems to recover the state server and allows `juju status` to work again.

Ian Booth (wallyworld)
Changed in juju-core:
milestone: next-stable → 1.20.2
Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Michael Foord (mfoord)
status: Triaged → In Progress
Revision history for this message
Michael Foord (mfoord) wrote :

This looks like several different bugs, or more specifically the same issue surfacing in several different places. We believe the underlying cause is that juju holds onto a single session to mongo, so if this connection dies, everything is irredeemably broken. The best fix is to open a new session whenever we need one and rely on the underlying socket pooling provided by the mgo driver. Unfortunately our watcher architecture (and the state itself) holds onto an indirect reference to this session via mgo collections, so it needs a bit of "unpicking" to resolve.

This (short) email thread is a useful reference on the topic: https://groups.google.com/forum/#!topic/mgo-users/iZeiKgonGDU
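
A minimal sketch of the session-per-operation pattern described above, again assuming the mgo driver (rootSession and addUnitDoc are illustrative names, not juju's real code): dial once, then Copy() the root session for each operation, so every use draws a socket from mgo's pool and a dead connection doesn't poison everything else.

    package main

    import (
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    // rootSession is dialed once at startup; everything else copies it.
    var rootSession *mgo.Session

    // addUnitDoc is an illustrative stand-in for a state operation.
    func addUnitDoc(name string) error {
        // Copy hands this operation its own session backed by mgo's
        // socket pool; a timeout kills only this copy, not the root.
        session := rootSession.Copy()
        defer session.Close()

        units := session.DB("juju").C("units")
        return units.Insert(bson.M{"_id": name})
    }

    func main() {
        var err error
        rootSession, err = mgo.Dial("127.0.0.1:37017")
        if err != nil {
            log.Fatalf("cannot dial mongo: %v", err)
        }
        defer rootSession.Close()

        if err := addUnitDoc("ubuntu/0"); err != nil {
            log.Printf("add unit: %v", err)
        }
    }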

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.20.2 → 1.21-alpha1
Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released