Soft restart of juju 2.2.6 causes high mongo load resulting in eventual unavailability

Bug #1737255 reported by David Lawson
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Tim Penhey
2.2             Fix Released  High        Tim Penhey

Bug Description

We had a brief outage on the IS shared juju2 controllers for Prodstack 4.5 today that, after some investigation, we've traced to what appears to be a soft/partial restart of one of the state servers. That restart appears to have provoked load in mongo that never resolved. When one of the OLS teams eventually attempted a switch deploy, it pushed the combined load of mongo and the state server past the point where they could respond to clients; client connections were rejected, and with the request traffic gone mongo was eventually able to recover.

 - 2017-12-07 16:05:03 state server starts up; there is no log message indicating it ever stopped prior
 - 2017-12-07 16:15:21 state server appears to finish some start-up tasks
 - 2017-12-07 16:16:02 the number of log messages from mongo climbs rapidly
 - 2017-12-07 16:16:30 load on ubuntu/2 starts to climb rapidly
 - 2017-12-07 16:16:32 mongo starts performing significant numbers of COLLSCAN operations (see the diagnostic sketch after this timeline)
 - From this point load on ubuntu/2 never drops
 - 2017-12-08 16:40:09 OLS starts a switch deploy of the scasnap service; the juju status listing and verification steps take some time (about 11 minutes when run an hour earlier), so we theorize the new services were added roughly fifteen minutes after the deploy started
 - 2017-12-08 16:56:14 Juju begins having problems talking to mongo
 - 2017-12-08 19:23:45 Juju totally fails to talk to mongo
 - 2017-12-08 19:23:46 Juju state server fails to talk to the API server
 - At this point we theorize that, since mongo and the API server were no longer receiving requests, they worked off the backlog precipitated by the soft restart the day before and dropped back to a more reasonable load.
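
One way to confirm the COLLSCAN behaviour called out in the timeline is to enable mongo's slow-operation profiler and count profiled operations whose plan summary is COLLSCAN. The following is only a minimal sketch of that idea using the mgo driver; the localhost:37017 address, the lack of authentication/TLS, and the 100ms slowms threshold are illustrative assumptions, not how the production controllers are configured.

    package main

    import (
        "fmt"
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func main() {
        // Illustrative, unauthenticated local connection; the real
        // controllers require credentials and TLS.
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatalf("cannot connect to mongo: %v", err)
        }
        defer session.Close()

        db := session.DB("juju")

        // Profile operations slower than 100ms (level 1) so that
        // collection scans show up in system.profile.
        var res bson.M
        if err := db.Run(bson.D{{"profile", 1}, {"slowms", 100}}, &res); err != nil {
            log.Fatalf("cannot enable profiling: %v", err)
        }

        // Count profiled operations that fell back to a full collection scan.
        n, err := db.C("system.profile").Find(bson.M{"planSummary": "COLLSCAN"}).Count()
        if err != nil {
            log.Fatalf("cannot query system.profile: %v", err)
        }
        fmt.Printf("profiled COLLSCAN operations: %d\n", n)
    }

With profiling on, the offending operations themselves can also be read back out of system.profile, which would narrow down which collection is being scanned.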

Some supporting documentation. Here is where the state server on ubuntu/2 appears to restart:
https://pastebin.canonical.com/205110/

Load climbing on ubuntu/2 on 12-7:
https://grafana.admin.canonical.com/dashboard/db/juju2-controllers?panelId=34&fullscreen&orgId=1&var-controller=prodstack-is&var-host=ubuntu-2&var-node=All&from=1512662400000&to=1512664200000

Load climbing on ubuntu/2 on 12-8:
https://grafana.admin.canonical.com/dashboard/db/juju2-controllers?panelId=34&fullscreen&orgId=1&var-controller=prodstack-is&var-host=ubuntu-2&var-node=All&from=1512746168538&to=1512763322267

Load falling on ubuntu/2 on 12-8:
https://grafana.admin.canonical.com/dashboard/db/juju2-controllers?panelId=34&fullscreen&orgId=1&var-controller=prodstack-is&var-host=ubuntu-2&var-node=All&from=1512756204208&to=1512769760419

juju log snippets from 12-8:
https://pastebin.canonical.com/205113/
https://pastebin.canonical.com/205114/
https://pastebin.canonical.com/205115/

Revision history for this message
John A Meinel (jameinel) wrote:

I think Tim's recent changes around the txns.log watcher will address at least some of this.
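
For context on why a txns.log watcher can generate this kind of load: a poll-based log watcher is commonly implemented as a loop that walks the capped collection in reverse natural order until it reaches the last entry it has already processed, and mongod reports such a walk as a COLLSCAN on every poll. The sketch below is a simplified illustration of that access pattern only, not Juju's actual watcher code; the ObjectId _id assumption, the five-second interval, and the unauthenticated localhost connection are all assumptions made for the sake of a self-contained example.

    package main

    import (
        "log"
        "time"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    // pollTxnsLog is an illustrative poll loop over a capped transaction
    // log. It is not Juju's watcher implementation; it only demonstrates
    // the access pattern: every tick walks the collection in reverse
    // natural order until it finds the last entry it already handled,
    // which mongod executes as a collection scan.
    func pollTxnsLog(session *mgo.Session, every time.Duration) {
        logColl := session.DB("juju").C("txns.log")
        var lastSeen bson.ObjectId // assumes ObjectId _ids, purely for illustration
        for {
            var newest bson.ObjectId
            iter := logColl.Find(nil).Sort("-$natural").Iter()
            var entry bson.M
            for iter.Next(&entry) {
                id, ok := entry["_id"].(bson.ObjectId)
                if !ok {
                    continue
                }
                if newest == "" {
                    newest = id // first document in reverse order is the most recent
                }
                if id == lastSeen {
                    break // caught up with the previous poll
                }
                // A real watcher would decode the entry here and notify
                // whichever workers are watching the affected documents.
            }
            if err := iter.Close(); err != nil {
                log.Printf("txns.log poll failed: %v", err)
            }
            if newest != "" {
                lastSeen = newest
            }
            time.Sleep(every)
        }
    }

    func main() {
        // Illustrative, unauthenticated local connection.
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            log.Fatalf("cannot connect to mongo: %v", err)
        }
        defer session.Close()
        pollTxnsLog(session, 5*time.Second)
    }

The point of the sketch is just that the cost of each poll grows with how far behind the watcher is, which lines up with the sustained scan load observed after the soft restart.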

Changed in juju:
assignee: nobody → Tim Penhey (thumper)
importance: Undecided → High
status: New → In Progress
Tim Penhey (thumper)
Changed in juju:
milestone: none → 2.3.2
Tim Penhey (thumper)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released