Busy controllers seem to become unstable at intervals, requiring an mgopurge run to recover

Bug #1896739 reported by Barry Price
This bug affects 2 people
Affects: Canonical Juju
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

At intervals (in our case, roughly every two months), we're seeing degraded service on an HA three-node controller cluster.

Internal API services are seen to shut down on the controller machine agents; they generally come back after a manual restart, but then fail again later.

Taking the controllers offline to run a full `mgopurge` restores stable service.

This bug is to track down exactly what mgopurge is removing or repairing in the MongoDB store when this happens, so that we can hopefully stop it from occurring in the first place.
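
For reference, a minimal sketch of the sort of inspection this would involve. It assumes Juju's standard mgo/txn layout (a juju.txns collection whose state field "s" treats 5/aborted and 6/applied as terminal) and a mongod reachable with suitable credentials; the connection details, database/collection names and state values here are assumptions, and this is not the exact procedure mgopurge follows.

// incomplete_txns.go - list transactions that never reached a terminal
// state; these are the ones mgopurge typically has to resolve.
// Connection details, database/collection names and state values are
// assumptions based on Juju's use of mgo/txn.
package main

import (
        "fmt"
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
)

func main() {
        // Juju's mongod usually listens on 37017 and requires auth/TLS;
        // those details are left out here (adjust Dial accordingly).
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
                log.Fatal(err)
        }
        defer session.Close()

        // Assumption: Juju keeps its mgo/txn log in juju.txns, and states
        // 5 (aborted) and 6 (applied) are the terminal ones.
        txns := session.DB("juju").C("txns")

        var txn struct {
                Id  bson.ObjectId `bson:"_id"`
                S   int           `bson:"s"`
                Ops []struct {
                        C string      `bson:"c"`
                        D interface{} `bson:"d"`
                } `bson:"o"`
        }
        iter := txns.Find(bson.M{"s": bson.M{"$lt": 5}}).Iter()
        for iter.Next(&txn) {
                fmt.Printf("incomplete txn %s state=%d ops=%d\n", txn.Id.Hex(), txn.S, len(txn.Ops))
        }
        if err := iter.Close(); err != nil {
                log.Fatal(err)
        }
}

Whatever shows up persistently in a listing like this is what we'd want captured before the next mgopurge run.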

Feel free to mark this Incomplete for now.

The most recent such incident was a month ago, and we don't have backups to compare the state of mongo before and after. But we'll add them to this bug and reopen if/when it happens again.

Tags: canonical-is
Pen Gale (pengale) wrote:

Marking incomplete, as per description. Definitely interested to see the logs when we get them!

Changed in juju:
status: New → Incomplete
Haw Loeung (hloeung) wrote:

Per MM, and again here: output from mgopurge - https://pastebin.canonical.com/p/SRZqpNjjrh/

Is this enough to help identify any issues? If not, what else should we gather next time?

Ran into https://github.com/juju/mgopurge/issues/28, which Ian helped fix with some DB surgery.

Changed in juju:
status: Incomplete → Confirmed
John A Meinel (jameinel) wrote:

Clearly you do have a bad transaction: 5fbaeb215f5ce80312317f65, which references several documents that are otherwise missing/deleted.

The first part of it references postgresql-odm-kb in model b6be743c-3db6-4edb-83c2-ba125fb20dba, which seems odd, given that it has a very long queue but is in a different model than the other transactions.

{machines 844969a0-e800-4047-887e-70119d1a0b82:41800}

That would implicate a different model (and 41,800 would indicate it is the model where you are continually deploying and destroying 'ubuntu', since otherwise I don't think you're running any models with 42k machines in them).
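
A rough sketch of how one might double-check that, assuming mgo/txn's layout (each entry in a transaction's "o" array names a collection "c" and a document id "d", in the "juju" database); the connection details are illustrative, and this is not what mgopurge itself does:

// check_txn_refs.go - for one suspect transaction, report which of the
// documents it references no longer exist. Field names follow mgo/txn's
// assumed layout ("o" = ops, "c" = collection, "d" = doc id); this is a
// sketch, not the procedure mgopurge uses.
package main

import (
        "fmt"
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
)

func main() {
        session, err := mgo.Dial("localhost:37017") // auth/TLS details left out (assumption)
        if err != nil {
                log.Fatal(err)
        }
        defer session.Close()
        db := session.DB("juju")

        var txn struct {
                Ops []struct {
                        C string      `bson:"c"`
                        D interface{} `bson:"d"`
                } `bson:"o"`
        }
        id := bson.ObjectIdHex("5fbaeb215f5ce80312317f65")
        if err := db.C("txns").FindId(id).One(&txn); err != nil {
                log.Fatalf("loading txn: %v", err)
        }

        // Report every document the transaction touches that no longer exists.
        for _, op := range txn.Ops {
                n, err := db.C(op.C).FindId(op.D).Count()
                if err != nil {
                        log.Fatal(err)
                }
                if n == 0 {
                        fmt.Printf("missing: %s %v\n", op.C, op.D)
                }
        }
}

Any op reported missing here is the kind of dangling reference described above.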

Laurent Sesquès (sajoupa) wrote:

844969a0-e800-4047-887e-70119d1a0b82 is indeed the model where we continuously deploy and delete cs:ubuntu (it's currently at machine 45420).

Haw Loeung (hloeung) wrote:

That model is used for E2E checking, continuously deploying and destroying, basically to catch Juju as well as OpenStack issues before users notice and report them.
