Comment 9 for bug 1733708

Tim Penhey (thumper) wrote:

I watched the removal of an application today with wgrant and t0mb0.

Before we started, we checked the database. The txns collection had just been cleaned up from the previous night and was at a reasonable size. There were no pending cleanups for the model, and mongotop looked sane.

The command was just to remove a single application that had two units. Each of these principal units would have had four subordinates.

Just after the removal, reads on the juju.txns.log collection went through the roof, along with a lot of write load into juju.txns. The read load on juju.txns.log stayed high for a number of minutes, causing 'juju status' on other models to spike from 0.5s to 20s.

There was nothing untoward going on as far as I could tell, but the statetracker report on each of the controllers showed that every statepool had 184 models (and a few extra besides). With 184 models per controller across the three controllers, that works out to roughly 552 state instances, each polling txns.log.

I think what we are hitting here is a situation where the cascading changes due to a deletion cause load on the txns collection. That causes a spike in txns.log reads due to the number of tailers. This introduces significant I/O load on mongo, which in turn can cause other commands to fail, which causes retries, which triggers additional transactions, which just adds to the overall load. This can degrade into a death spiral that is only recoverable by restarting the application servers.
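
To make the scaling problem concrete, here is a very rough sketch of the per-model pattern, using the mgo driver for illustration. This is not Juju's actual watcher code; pollTxnsLog, the 5s interval, and the stop channel are all made up. The point is just that every open model effectively runs its own loop re-reading the shared txns.log, so one burst of writes to juju.txns fans out into hundreds of concurrent read streams.

    package sketch

    import (
        "time"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    // pollTxnsLog approximates what every open model's watcher does against
    // the single shared juju.txns.log collection: periodically read every
    // entry newer than the last one it saw. With ~550 models open on a
    // controller, one burst of writes to juju.txns becomes ~550 concurrent
    // read streams over txns.log.
    func pollTxnsLog(session *mgo.Session, modelUUID string, stop <-chan struct{}) {
        logColl := session.DB("juju").C("txns.log")
        var lastID bson.ObjectId
        for {
            query := bson.M{}
            if lastID.Valid() {
                query["_id"] = bson.M{"$gt": lastID}
            }
            var entries []bson.M
            if err := logColl.Find(query).Sort("_id").All(&entries); err == nil {
                for _, entry := range entries {
                    lastID = entry["_id"].(bson.ObjectId)
                    // ...filter for documents belonging to modelUUID and
                    // notify that model's watchers...
                }
            }
            select {
            case <-stop:
                return
            case <-time.After(5 * time.Second):
            }
        }
    }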

To solve this, I think we really need to investigate a way to provide a central txns.log tailer for each controller, rather than one per model.
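
Very roughly, I'm imagining something like the sketch below. The names (CentralTailer, Subscribe, Run) are made up and this is not a real design, just the shape of the idea: one worker per controller does the txns.log polling and fans entries out to per-model subscribers over channels, so read load on txns.log no longer scales with the number of open models.

    package sketch

    import (
        "sync"
        "time"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    // CentralTailer (hypothetical) reads juju.txns.log once per controller
    // and fans entries out to per-model subscribers.
    type CentralTailer struct {
        session *mgo.Session
        mu      sync.Mutex
        subs    map[string][]chan bson.M // model UUID -> subscriber channels
    }

    func NewCentralTailer(session *mgo.Session) *CentralTailer {
        return &CentralTailer{
            session: session,
            subs:    make(map[string][]chan bson.M),
        }
    }

    // Subscribe registers interest in txns.log entries affecting one model.
    func (t *CentralTailer) Subscribe(modelUUID string) <-chan bson.M {
        ch := make(chan bson.M, 16)
        t.mu.Lock()
        t.subs[modelUUID] = append(t.subs[modelUUID], ch)
        t.mu.Unlock()
        return ch
    }

    // Run is the single polling loop for the whole controller.
    func (t *CentralTailer) Run(stop <-chan struct{}) {
        logColl := t.session.DB("juju").C("txns.log")
        var lastID bson.ObjectId
        for {
            query := bson.M{}
            if lastID.Valid() {
                query["_id"] = bson.M{"$gt": lastID}
            }
            var entries []bson.M
            if err := logColl.Find(query).Sort("_id").All(&entries); err == nil {
                for _, entry := range entries {
                    lastID = entry["_id"].(bson.ObjectId)
                    // In reality the model would be derived from the document
                    // ids recorded in the entry; a placeholder is used here.
                    t.dispatch("model-uuid-from-entry", entry)
                }
            }
            select {
            case <-stop:
                return
            case <-time.After(5 * time.Second):
            }
        }
    }

    func (t *CentralTailer) dispatch(modelUUID string, entry bson.M) {
        t.mu.Lock()
        defer t.mu.Unlock()
        for _, ch := range t.subs[modelUUID] {
            select {
            case ch <- entry:
            default: // never block the shared tailer on a slow subscriber
            }
        }
    }

The important property is that the expensive part (reads against mongo) happens once per controller, and slow subscribers get dropped or marked stale rather than blocking the shared loop.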