Soft restart of juju 2.2.6 causes high mongo load resulting in eventual unavailability

Bug #1737255 reported by David Lawson
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Tim Penhey
2.2             Fix Released  High        Tim Penhey

Bug Description

We had a brief outage on the IS shared juju2 controllers for Prodstack 4.5 today that, after some investigation, we've traced to what appears to be a soft/partial restart of one of the state servers. That restart appears to have provoked load in mongo that never resolved. When one of the OLS teams eventually attempted a switch deploy, it pushed the combined load of mongo and the state server past the point where they could respond to clients; client connections were rejected, and with the request traffic gone mongo was eventually able to recover.

 - 2017-12-07 16:05:03 state server starts up; there is no log message indicating it ever stopped prior
 - 2017-12-07 16:15:21 state server appears to finish some start-up tasks
 - 2017-12-07 16:16:02 the number of log messages from mongo climbs rapidly
 - 2017-12-07 16:16:30 load on ubuntu/2 starts to climb rapidly
 - 2017-12-07 16:16:32 mongo starts performing significant numbers of COLLSCAN operations (see the diagnostic sketch after this timeline)
 - From this point load on ubuntu/2 never drops
 - 2017-12-08 16:40:09 OLS starts a switch deploy of the scasnap service; the juju status listing and verification steps take some time (about 11 minutes when run an hour earlier), so we theorize the new services were added roughly fifteen minutes after the deploy started
 - 2017-12-08 16:56:14 Juju begins having problems talking to mongo
 - 2017-12-08 19:23:45 Juju totally fails to talk to mongo
 - 2017-12-08 19:23:46 Juju state server fails to talk to the API server
 - At this point we theorize that, since mongo and the API server were no longer receiving requests, they worked off the backlog precipitated by the soft restart the day before and dropped back to a more reasonable load.
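
One way to confirm the COLLSCAN behaviour called out in the timeline is to enable mongo's slow-operation profiler and count profiled operations whose plan summary is COLLSCAN. The following is only a minimal sketch of that idea using the mgo driver; the localhost:37017 address, the lack of authentication/TLS, and the 100ms slowms threshold are illustrative assumptions, not how the production controllers are configured.

    package main

    import (
        "fmt"
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func main() {
        // Illustrative, unauthenticated local connection; the real
        // controllers require credentials and TLS.
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatalf("cannot connect to mongo: %v", err)
        }
        defer session.Close()

        db := session.DB("juju")

        // Profile operations slower than 100ms (level 1) so that
        // collection scans show up in system.profile.
        var res bson.M
        if err := db.Run(bson.D{{"profile", 1}, {"slowms", 100}}, &res); err != nil {
            log.Fatalf("cannot enable profiling: %v", err)
        }

        // Count profiled operations that fell back to a full collection scan.
        n, err := db.C("system.profile").Find(bson.M{"planSummary": "COLLSCAN"}).Count()
        if err != nil {
            log.Fatalf("cannot query system.profile: %v", err)
        }
        fmt.Printf("profiled COLLSCAN operations: %d\n", n)
    }

With profiling on, the offending operations themselves can also be read back out of system.profile, which would narrow down which collection is being scanned.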

Some supporting documentation. Here is where the state server on ubuntu/2 appears to restart:
https://pastebin.canonical.com/205110/

Load climbing on ubuntu/2 on 12-7:
https://grafana.admin.canonical.com/dashboard/db/juju2-controllers?panelId=34&fullscreen&orgId=1&var-controller=prodstack-is&var-host=ubuntu-2&var-node=All&from=1512662400000&to=1512664200000

Load climbing on ubuntu/2 on 12-8:
https://grafana.admin.canonical.com/dashboard/db/juju2-controllers?panelId=34&fullscreen&orgId=1&var-controller=prodstack-is&var-host=ubuntu-2&var-node=All&from=1512746168538&to=1512763322267

Load falling on ubuntu/2 on 12-8:
https://grafana.admin.canonical.com/dashboard/db/juju2-controllers?panelId=34&fullscreen&orgId=1&var-controller=prodstack-is&var-host=ubuntu-2&var-node=All&from=1512756204208&to=1512769760419

juju log snippets from 12-8:
https://pastebin.canonical.com/205113/
https://pastebin.canonical.com/205114/
https://pastebin.canonical.com/205115/

Revision history for this message
John A Meinel (jameinel) wrote:

I think Tim's recent changes around the txns.log watcher will address at least some of this.
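
For context on why a txns.log watcher can generate this kind of load: a poll-based log watcher is commonly implemented as a loop that walks the capped collection in reverse natural order until it reaches the last entry it has already processed, and mongod reports such a walk as a COLLSCAN on every poll. The sketch below is a simplified illustration of that access pattern only, not Juju's actual watcher code; the ObjectId _id assumption, the five-second interval, and the unauthenticated localhost connection are all assumptions made for the sake of a self-contained example.

    package main

    import (
        "log"
        "time"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    // pollTxnsLog is an illustrative poll loop over a capped transaction
    // log. It is not Juju's watcher implementation; it only demonstrates
    // the access pattern: every tick walks the collection in reverse
    // natural order until it finds the last entry it already handled,
    // which mongod executes as a collection scan.
    func pollTxnsLog(session *mgo.Session, every time.Duration) {
        logColl := session.DB("juju").C("txns.log")
        var lastSeen bson.ObjectId // assumes ObjectId _ids, purely for illustration
        for {
            var newest bson.ObjectId
            iter := logColl.Find(nil).Sort("-$natural").Iter()
            var entry bson.M
            for iter.Next(&entry) {
                id, ok := entry["_id"].(bson.ObjectId)
                if !ok {
                    continue
                }
                if newest == "" {
                    newest = id // first document in reverse order is the most recent
                }
                if id == lastSeen {
                    break // caught up with the previous poll
                }
                // A real watcher would decode the entry here and notify
                // whichever workers are watching the affected documents.
            }
            if err := iter.Close(); err != nil {
                log.Printf("txns.log poll failed: %v", err)
            }
            if newest != "" {
                lastSeen = newest
            }
            time.Sleep(every)
        }
    }

    func main() {
        // Illustrative, unauthenticated local connection.
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatalf("cannot connect to mongo: %v", err)
        }
        defer session.Close()
        pollTxnsLog(session, 5*time.Second)
    }

The point of the sketch is just that the cost of each poll grows with how far behind the watcher is, which lines up with the sustained scan load observed after the soft restart.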

Changed in juju:
assignee: nobody → Tim Penhey (thumper)
importance: Undecided → High
status: New → In Progress
Tim Penhey (thumper)
Changed in juju:
milestone: none → 2.3.2
Tim Penhey (thumper)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released