Comment 22 for bug 1491688

Cheryl Jennings (cherylj) wrote :

From Menno:

I've had a good look around, but because the machine-0 log level was at the default of ERROR, it's hard to know what happened.

One unusual thing is the various EOF errors that caused a bunch of workers to die at 19:55:36. These were caused by the MongoDB node briefly dropping from PRIMARY to SECONDARY and back again. MongoDB drops all its connections when the replicaset status changes. I'm not sure exactly why or how a node can become SECONDARY when there's only one host in the replicaset, but I've seen MongoDB do this before.

Juju's workers are designed to be able to cope with MongoDB dropping connections (it happens whenever the MongoDB master changes), but it's possible that something didn't recover properly when this happened. This is further evidenced by the fact that ca-cert.pem was being regularly updated. The rsyslog worker in machine-0 writes out ca-cert.pem when it starts up, which indicates that it was continually failing and restarting. What's curious is that there was no evidence of this in machine-0.log. There should have been at least one ERROR line for each failed start attempt.
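
One way to confirm the "ca-cert.pem keeps getting rewritten" observation without relying on the log is to sample the file's modification time for a while and count how often it changes. The following is only a rough sketch: the function names are made up for illustration, and the path to ca-cert.pem has to be supplied by the caller since it isn't given here.

```python
import os
import time


def sample_mtimes(path, samples=60, interval=1.0):
    """Return `samples` modification times (in ns) of `path`, taken
    `interval` seconds apart."""
    mtimes = []
    for i in range(samples):
        mtimes.append(os.stat(path).st_mtime_ns)
        if i < samples - 1:
            time.sleep(interval)
    return mtimes


def count_changes(mtimes):
    """Count how many times consecutive samples differ.

    Each change means the file was written at least once between two
    samples, so this is a lower bound on the number of rewrites.
    """
    return sum(1 for prev, cur in zip(mtimes, mtimes[1:]) if prev != cur)
```

If the rsyslog worker really is restarting in a loop, `count_changes(sample_mtimes(path))` should come back well above zero over a minute or so; a healthy worker should leave the file alone.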

What's also weird is that the MongoDB blip didn't seem to upset as many workers as I would have expected. It only seems to have affected workers that tried to do something with the DB at the time of the blip. Specifically I would have expected to see the "state" worker restart.

I restarted jujud on machine-0 and everything recovered. The Juju agents that weren't able to talk to rsyslog were now able to, presumably because the rsyslog worker on machine-0 was no longer broken and was able to correctly configure rsyslogd.

So my best theory is that the MongoDB replicaset blip somehow broke the rsyslog worker (and possibly others too) in such a way that they couldn't recover. Without more detailed logs, however, that's just a guess.

It might be worth seeing if we can trigger this situation in a dev environment by manually forcing MongoDB to flip to SECONDARY mode at just the right time. There are MongoDB commands for manipulating the replicaset like this.
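
As a starting point for that experiment, here is a minimal sketch of forcing a step-down from a script. It uses MongoDB's `replSetStepDown` admin command; the helper names are hypothetical, and the client is assumed to be anything exposing `admin.command(...)` the way a pymongo `MongoClient` does. Whether a single-member replicaset actually honours the request (even with `force`) is something that would need checking on the MongoDB version Juju ships.

```python
def build_step_down_command(seconds=10, force=True):
    """Build the replSetStepDown command document.

    `seconds` is how long the node should refuse to become PRIMARY again.
    The command name must be the first key, which dict ordering preserves.
    """
    return {"replSetStepDown": seconds, "force": force}


def step_down(client, seconds=10, force=True):
    """Send replSetStepDown through `client.admin.command`.

    Returns the server's reply, or None if the call raised. An exception
    here is not necessarily a failure: the server may close the connection
    as part of stepping down, which is the same disconnect Juju's workers
    would see.
    """
    command = build_step_down_command(seconds, force)
    try:
        return client.admin.command(command)
    except Exception as exc:  # deliberately broad for a throwaway script
        print("step-down call raised:", exc)
        return None


# Example use with pymongo (third-party). The URI is a placeholder; a real
# Juju state server needs its own port, credentials and TLS settings:
#   from pymongo import MongoClient
#   step_down(MongoClient("mongodb://localhost:27017/"))
```

Running this while watching machine-0.log at a more verbose log level should show which workers notice the dropped connections and whether each one comes back.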