Soft restart of juju 2.2.6 causes high mongo load resulting in eventual unavailability
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | Tim Penhey |
2.2 | Fix Released | High | Tim Penhey |
Bug Description
We had a brief outage on the IS shared juju2 controllers for Prodstack 4.5 today. After some investigation, we've traced it to what appears to be a soft/partial restart of one of the state servers. That restart appears to have provoked load in mongo that never resolved. When one of the OLS teams eventually attempted a switch deploy, it pushed the combined load of mongo and the state server past the point where they could still respond to clients. Client connections were then rejected, which allowed mongo to recover.
- 2017-12-07 16:05:03 state server starts up; there is no log message indicating it had stopped beforehand
- 2017-12-07 16:15:21 state server appears to finish some start up tasks
- 2017-12-07 16:16:02 the number of log messages from mongo climbs rapidly
- 2017-12-07 16:16:30 load on ubuntu/2 starts to climb rapidly
- 2017-12-07 16:16:32 mongo starts performing significant numbers of COLLSCAN operations
- From this point load on ubuntu/2 never drops
- 2017-12-08 16:40:09 OLS starts a switch deploy of the scasnap service. The juju status listing and verification take some time (they had taken 11 minutes about an hour earlier), so we theorize that the new services were added roughly fifteen minutes after the deploy started.
- 2017-12-08 16:56:14 Juju begins having problems talking to mongo
- 2017-12-08 19:23:45 Juju totally fails to talk to mongo
- 2017-12-08 19:23:46 Juju state server fails to talk to API server
- At this point we theorize that, since mongo/the API server were no longer receiving requests, they worked off the backlog precipitated by the soft restart the day before and dropped down to a more reasonable load.
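The timeline notes a jump in COLLSCAN operations shortly after the restart. One way to quantify such a jump is to bucket matching mongod log lines by minute. This is a minimal sketch, assuming each log line starts with an ISO-8601 timestamp and that the scans show up in the log with the string `COLLSCAN`. The function name `collscan_per_minute` is made up for illustration and is not part of juju or mongo tooling.

```python
import re
from collections import Counter

# Assumption: each line starts with a timestamp like 2017-12-07T16:16:32.123+0000.
# Capture everything up to the minute so counts are bucketed per minute.
TIMESTAMP_MINUTE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")


def collscan_per_minute(lines):
    """Count log lines that mention COLLSCAN, grouped by minute."""
    counts = Counter()
    for line in lines:
        if "COLLSCAN" not in line:
            continue
        match = TIMESTAMP_MINUTE.match(line)
        if match:  # skip lines without the assumed timestamp prefix
            counts[match.group(1)] += 1
    return counts
```

Passing an open file handle for the mongod log (its location depends on the deployment) and printing the sorted items should make a sudden rise, like the one reported at 16:16:32, easy to spot.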
Some supporting documentation follows. Here's where the state server on ubuntu/2 appears to restart:
https:/
Load climbing on ubuntu/2 on 12-7:
https:/
Load climbing on ubuntu/2 on 12-8:
https:/
Load falling on ubuntu/2 on 12-8:
https:/
juju log snippets from 12-8:
https:/
https:/
https:/
Changed in juju:
milestone: none → 2.3.2
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
I think Tim's recent changes around the txns.log watcher will address at least some of this.