upgrade step for 2.8.1 ReplaceNeverSetWithUnset fails if statuses collection is large

Bug #1907685 reported by John A Meinel
Affects          Status        Importance  Assigned to    Milestone
Canonical Juju   Fix Released  High        John A Meinel
2.8              Fix Released  High        John A Meinel

Bug Description

While trying to upgrade prodstack, we ran into an i/o timeout again:

2020-12-10 14:58:32 ERROR juju.upgrade upgrade.go:138 upgrade step "update status documents to remove neverset" failed: model UUID "844969a0-e800-4047-887e-70119d1a0b82": read tcp 10.25.2.109:38226->10.25.2.109:37017: i/o timeout

Digging into the code, it was doing:

                err := col.Find(nil).All(&docs)

This meant it had to load every document in the statuses collection, not even filtered to the ones it knows it wants to process. We should turn this into an iterator, and filter the query to only the status documents we want to touch.

I think this query does it:

diff --git a/state/upgrades.go b/state/upgrades.go
index 0e89812c46..2c33b66b0d 100644
--- a/state/upgrades.go
+++ b/state/upgrades.go
@@ -2904,14 +2904,11 @@ func ReplaceNeverSetWithUnset(pool *StatePool) (err error) {
 		col, closer := st.db().GetCollection(statusesC)
 		defer closer()
 
-		var docs []bson.M
-		err := col.Find(nil).All(&docs)
-		if err != nil {
-			return errors.Trace(err)
-		}
+		iter := col.Find(bson.M{"neverset": bson.M{"$exists": 1}}).Iter()
 
 		var ops []txn.Op
-		for _, oldDoc := range docs {
+		var oldDoc bson.M
+		for iter.Next(&oldDoc) {
 			// For docs where "neverset" is true, we update the
 			// Status and StatusInfo. For all others, we just remove
 			// the "neverset attribute".
@@ -2942,6 +2939,9 @@ func ReplaceNeverSetWithUnset(pool *StatePool) (err error) {
 				Update: update,
 			})
 		}
+		if err := iter.Close(); err != nil {
+			return errors.Trace(err)
+		}
 
 		return errors.Trace(st.db().RunTransaction(ops))
 	}))
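
For reference, the same filter can be checked from the mongo shell to see how many documents the step will actually touch; this is just a sanity check against live data, not part of the patch:

// Count the status documents that still carry the "neverset" field;
// this mirrors the $exists filter used in the Go query above.
db.statuses.count({"neverset": {"$exists": true}})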

John A Meinel (jameinel)
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
John A Meinel (jameinel) wrote:

On Prodstack, we made the mistaken assumption that we could remove old entries from statuses (we had confused statuseshistory with statuses).
So we removed everything older than 1 week with:
https://pastebin.canonical.com/p/fMzZwDm2Wx/

We then got the upgrade to complete, but saw errors in the all watcher because it was trying to load statuses for machines/instances that had been deleted.

We restored all statuses from the backup file and then ran:
db.statuses.updateMany({"neverset": false}, {"$unset": {"neverset": ""}})
db.statuses.updateMany({"neverset": true}, {"$unset": {"neverset": ""}, "$set": {"status": "unset", "statusinfo", ""}})

This should be equivalent to running the upgrade step.
That resulted in:
https://pastebin.canonical.com/p/4bKyR37Z3Q/

(134k documents had neverset, but it was always false.)

So the unfortunate workaround is to move the statuses collection aside, let the upgrade proceed, stop the agents again, manually apply the upgrade in the alternate collection, and then move that back into place (see the sketch below).
There should be a better fix for this in 2.8.8 and 2.9.
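
A rough mongo shell sketch of that collection-swap workaround (the statuses_saved name is made up for illustration, and the agents must be stopped around the manual steps):

// Move statuses aside so the upgrade step has nothing to scan.
db.statuses.renameCollection("statuses_saved")
// ...let the upgrade complete, stop the agents again, then apply the
// equivalent of the upgrade step by hand to the saved copy:
db.statuses_saved.updateMany({"neverset": false}, {"$unset": {"neverset": ""}})
db.statuses_saved.updateMany({"neverset": true}, {"$unset": {"neverset": ""}, "$set": {"status": "unset", "statusinfo": ""}})
// ...and move it back into place, dropping whatever the upgrade step
// created in the meantime (dropTarget=true):
db.statuses_saved.renameCollection("statuses", true)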

Revision history for this message
John A Meinel (jameinel) wrote:

https://github.com/juju/juju/pull/12447 is the start of a fix for 2.8

Revision history for this message
John A Meinel (jameinel) wrote:

In debugging prodstack, it seems that it keeps crashing every hour. My current belief is that the TXN pruner is failing, because the old upgrade step was trying to apply a single transaction to 140,000 documents.
We now have 12 transactions with 140k "d" references each. They are all marked "s": 6 (applied), but we likely can't make progress because so many docs are being touched simultaneously.
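
A minimal sketch for spotting those oversized transactions from the mongo shell, relying on mgo/txn's document layout ("o" is the array of per-document ops, each holding a "d" reference; "s": 6 is the applied state):

// Find applied txns whose op array has more than 100,000 entries;
// $exists on an array index is a cheap way to test array length.
db.txns.find({"s": 6, "o.100000": {"$exists": true}}).forEach(function(t) {
    print(t._id + " touches " + t.o.length + " documents");
})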

Changed in juju:
milestone: 2.9-rc3 → 2.9-rc4
Revision history for this message
John A Meinel (jameinel) wrote:

The fix was released in 2.9-rc3.

Changed in juju:
status: In Progress → Fix Released