Busy controllers seem to become unstable at intervals, requiring an mgopurge run to recover

Bug #1896739 reported by Barry Price
This bug affects 2 people
Affects: Canonical Juju
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

At intervals (in our case, roughly every two months), we're seeing degraded service on an HA three-node controller cluster.

Internal API services are seen to shut down on the controller machine agents; they generally come back after a manual restart, but then fail again later.

Taking the controllers offline to run a full `mgopurge` restores stable service.

This bug is to track down exactly what mgopurge is removing or repairing in the MongoDB store when this happens, so that we can hopefully stop it from occurring in the first place.
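
For reference, a minimal sketch of the sort of inspection this would involve. It assumes Juju's standard mgo/txn layout (a juju.txns collection whose state field "s" treats 5/aborted and 6/applied as terminal) and a mongod reachable with suitable credentials; the connection details, database/collection names and state values here are assumptions, and this is not the exact procedure mgopurge follows.

// incomplete_txns.go - list transactions that never reached a terminal
// state; these are the ones mgopurge typically has to resolve.
// Connection details, database/collection names and state values are
// assumptions based on Juju's use of mgo/txn.
package main

import (
        "fmt"
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
)

func main() {
        // Juju's mongod usually listens on 37017 and requires auth/TLS;
        // those details are left out here (adjust Dial accordingly).
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
                log.Fatal(err)
        }
        defer session.Close()

        // Assumption: Juju keeps its mgo/txn log in juju.txns, and states
        // 5 (aborted) and 6 (applied) are the terminal ones.
        txns := session.DB("juju").C("txns")

        var txn struct {
                Id  bson.ObjectId `bson:"_id"`
                S   int           `bson:"s"`
                Ops []struct {
                        C string      `bson:"c"`
                        D interface{} `bson:"d"`
                } `bson:"o"`
        }
        iter := txns.Find(bson.M{"s": bson.M{"$lt": 5}}).Iter()
        for iter.Next(&txn) {
                fmt.Printf("incomplete txn %s state=%d ops=%d\n", txn.Id.Hex(), txn.S, len(txn.Ops))
        }
        if err := iter.Close(); err != nil {
                log.Fatal(err)
        }
}

Whatever shows up persistently in a listing like this is what we'd want captured before the next mgopurge run.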

Feel free to mark this Incomplete for now.

The most recent such incident was a month ago, and we don't have backups to compare the state of mongo before and after. But we'll add them to this bug and reopen if/when it happens again.

Tags: canonical-is
Pen Gale (pengale) wrote:

Marking incomplete, as per description. Definitely interested to see the logs when we get them!

Changed in juju:
status: New → Incomplete
Haw Loeung (hloeung) wrote:

Per MM, and again here: output from mgopurge - https://pastebin.canonical.com/p/SRZqpNjjrh/

Is this enough to help identify any issues? If not, what else should we gather next time?

Ran into https://github.com/juju/mgopurge/issues/28, which Ian helped fix with some DB surgery.

Changed in juju:
status: Incomplete → Confirmed
John A Meinel (jameinel) wrote:

Clearly you do have a bad transaction: 5fbaeb215f5ce80312317f65, which references several documents that are otherwise missing/deleted.

The first part of it references postgresql-odm-kb in model b6be743c-3db6-4edb-83c2-ba125fb20dba, which seems odd, given that it has a very long queue but is in a different model than the other transactions.

{machines 844969a0-e800-4047-887e-70119d1a0b82:41800}

That would implicate a different model (and 41,800 would indicate it is the model where you are continually deploying and destroying 'ubuntu', since otherwise I don't think you're running any models with 42k machines in them).
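
A rough sketch of how one might double-check that, assuming mgo/txn's layout (each entry in a transaction's "o" array names a collection "c" and a document id "d", in the "juju" database); the connection details are illustrative, and this is not what mgopurge itself does:

// check_txn_refs.go - for one suspect transaction, report which of the
// documents it references no longer exist. Field names follow mgo/txn's
// assumed layout ("o" = ops, "c" = collection, "d" = doc id); this is a
// sketch, not the procedure mgopurge uses.
package main

import (
        "fmt"
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
)

func main() {
        session, err := mgo.Dial("localhost:37017") // auth/TLS details left out (assumption)
        if err != nil {
                log.Fatal(err)
        }
        defer session.Close()
        db := session.DB("juju")

        var txn struct {
                Ops []struct {
                        C string      `bson:"c"`
                        D interface{} `bson:"d"`
                } `bson:"o"`
        }
        id := bson.ObjectIdHex("5fbaeb215f5ce80312317f65")
        if err := db.C("txns").FindId(id).One(&txn); err != nil {
                log.Fatalf("loading txn: %v", err)
        }

        // Report every document the transaction touches that no longer exists.
        for _, op := range txn.Ops {
                n, err := db.C(op.C).FindId(op.D).Count()
                if err != nil {
                        log.Fatal(err)
                }
                if n == 0 {
                        fmt.Printf("missing: %s %v\n", op.C, op.D)
                }
        }
}

Any op reported missing here is the kind of dangling reference described above.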

Laurent Sesquès (sajoupa) wrote:

844969a0-e800-4047-887e-70119d1a0b82 is indeed the model where we continuously deploy and delete cs:ubuntu (it's currently at machine 45420).

Haw Loeung (hloeung) wrote:

That model is used for E2E checking, continuously deploying and destroying, basically to catch Juju as well as OpenStack issues before users notice and report them.
