Comment 24 for bug 1733708

William Grant (wgrant) wrote:

After a couple of deploys yesterday reinforced the statuseshistory pruning hypothesis, we preserved some candidate machines and ran controlled demolitions today, which all but proved that the statuseshistory cleanup is the problem.

For context, the two models (on the same 3-way HA OpenStack controller running Juju 2.2.8) that we can reproduce this on have frequently created and deleted applications with two (staging) and four (production) units, each with four and six subordinates respectively. Staging is usually deployed to a couple of times a day, and production 1-3 times a week. The relevant model settings are update-status-hook-interval=5m, max-status-history-age=336h and max-status-history-size=5G. The principal and a couple of the subordinates set workload status on every update-status hook.
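
For reference, those settings can be checked and adjusted with juju model-config; the sketch below just restates the configuration above, with "production" standing in for the model name:

    juju model-config -m production | grep -E 'update-status-hook-interval|max-status-history'
    juju model-config -m production update-status-hook-interval=5m max-status-history-age=336h max-status-history-size=5G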

Deployments consist of deploying a new application, verifying it, adding relations, then remove-application and remove-machine --force on the old one (the remove-machine --force is mostly a leftover from the 2.1 subordinate bugs). In almost every production deployment, the remove phase tickles this bug. In staging deployments it is very rare, but has happened on at least two occasions. The main differences between the models are that production has twice as many machines and 50% more subordinates, and its units tend to live for several days rather than no more than one or two. It was on this basis that we speculated in November that statuseshistory might be to blame.
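
As a rough sketch of that procedure (the charm path, application names and machine numbers below are placeholders, not our real ones):

    juju deploy ./builds/ourcharm ourcharm-b123   # new application alongside the old one
    # ...verify the new units, then wire them in...
    juju add-relation ourcharm-b123 haproxy
    # retire the old application; this is the phase that tickles the bug
    juju remove-application ourcharm-b122
    juju remove-machine --force 41 42 43 44       # leftover from the 2.1 subordinate bugs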

During the week of December 18 we were attempting to tickle the bug in controlled, monitored circumstances, so we performed production deployments closer together than usual. The second production deployment within 24 hours did not tickle the bug (the first time in months that a deployment had proceeded without incident), so we were unable to gather particularly useful data.

Both production and staging went without deployments for 12 days over the EOY break. The first staging deployment of 2018 tickled the bug -- unusual for staging. Controller logs showed "cannot delete history for unit" errors with IO timeouts for some of the units, some repeating over the next 20 minutes. The saslStart storm mentioned in comment #15 was again evident. A controller restart resolved things.
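
For anyone hunting the same failure mode, the errors can be pulled out of the controller model's logs with something like this (the exact message text may differ between Juju versions):

    juju debug-log -m controller --replay --no-tail | grep 'cannot delete history for unit'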

I preserved the remaining production and staging machines from before the break for more detailed experiments. We manually purged the statuseshistories for the units on one machine of the other 12-day-old staging application, and a remove-machine --force worked fine; the machine vanished in about 5s and no added "juju status" latency was observed. Without removing statuseshistories, remove-machine --force of the other machine also worked okay, but it took more than 30s, load on one controller spiked above 90, and status was delayed by several seconds during the removal. Each machine's units had about 50000 statuseshistories in total (on both production and staging there were ~7700 statuseshistory documents for each unit agent, and the same for the charms that update their status often).
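
For the record, the purge just deletes each unit's documents from the controller's statuseshistory collection. A minimal sketch of the per-unit deletion follows, assuming the controller's MongoDB on port 37017, a database named juju, and globalkeys of the form u#<unit> / u#<unit>#charm -- these are assumptions about Juju 2.2 internals, so verify them against your own controller before running anything like this:

    # run on a controller machine; $USER/$PASS come from the machine agent's agent.conf
    mongo --ssl --sslAllowInvalidCertificates \
        --authenticationDatabase admin -u "$USER" -p "$PASS" \
        localhost:37017/juju --eval '
      print(db.statuseshistory.count({globalkey: "u#ourcharm-b122/0"}));
      db.statuseshistory.remove({globalkey: {$in: ["u#ourcharm-b122/0", "u#ourcharm-b122/0#charm"]}});
    '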

Since that seemed like a reasonably successful test, we tried it on a 13-day-old production application inside a normal deployment that in >95% of cases breaks the controller. We purged all ~290000 statuseshistories sequentially, which took about 17 minutes. The deployment then ran smoothly, the machines died quickly, controller load did not exceed 15, and status remained snappy throughout the endeavour.

So I'm now highly confident that statuseshistory is the problem. But deletions are so incredibly slow that the approach in the PR may not end up being sufficient.