upgrade step for 2.8.1 ReplaceNeverSetWithUnset fails if statuses collection is large
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | John A Meinel |
2.8 | Fix Released | High | John A Meinel |
Bug Description
While trying to upgrade prodstack, we ran into an io timeout again:
2020-12-10 14:58:32 ERROR juju.upgrade upgrade.go:138 upgrade step "update status documents to remove neverset" failed: model UUID "844969a0-
Digging into the code, it was doing:
err := col.Find(
That means it has to load every document in the statuses collection at once, without even filtering to the ones it knows it wants to process. We should turn this into an iterator, and filter the query so it only finds the status documents that we want to touch.
I think this query does it:
diff --git a/state/upgrades.go b/state/upgrades.go
index 0e89812c46.
--- a/state/upgrades.go
+++ b/state/upgrades.go
@@ -2904,14 +2904,11 @@ func ReplaceNeverSet
- var docs []bson.M
- err := col.Find(
- if err != nil {
- return errors.Trace(err)
- }
+ iter := col.Find(
var ops []txn.Op
- for _, oldDoc := range docs {
+ var oldDoc bson.M
+ for iter.Next(&oldDoc) {
@@ -2942,6 +2939,9 @@ func ReplaceNeverSet
}
+ if err := iter.Close(); err != nil {
+ return errors.Trace(err)
+ }
}))
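The pattern in the diff above (filter first, then walk the results one document at a time and check the error on close) can be sketched in isolation. This is only an illustration, not the Juju code: `statusDoc`, `docIter`, `findNeverSet` and `collectIDs` are hypothetical names, the iterator is an in-memory stand-in for the database cursor, and filtering on the presence of a `neverset` field is an assumption, since the real query is truncated in this report.

```go
package main

import "fmt"

// statusDoc is a hypothetical, simplified stand-in for a document in the
// statuses collection; only fields mentioned in this report are used.
type statusDoc map[string]interface{}

// docIter is a hypothetical in-memory iterator that mimics the Next/Close
// shape seen in the diff. It is not the database driver's type.
type docIter struct {
	docs []statusDoc
	pos  int
}

// Next copies the next document into out and reports whether one was available.
func (it *docIter) Next(out *statusDoc) bool {
	if it.pos >= len(it.docs) {
		return false
	}
	*out = it.docs[it.pos]
	it.pos++
	return true
}

// Close reports any error seen while iterating; this in-memory version has none.
func (it *docIter) Close() error { return nil }

// findNeverSet returns an iterator over only the documents that carry a
// "neverset" field. With a real database the filter would run server-side;
// here it is applied in memory purely for illustration.
func findNeverSet(all []statusDoc) *docIter {
	var matched []statusDoc
	for _, d := range all {
		if _, ok := d["neverset"]; ok {
			matched = append(matched, d)
		}
	}
	return &docIter{docs: matched}
}

// collectIDs walks the iterator one document at a time, records the id of
// each visited document, and checks Close for a deferred error.
func collectIDs(it *docIter) ([]string, error) {
	var ids []string
	var doc statusDoc
	for it.Next(&doc) {
		id, _ := doc["_id"].(string)
		ids = append(ids, id)
	}
	if err := it.Close(); err != nil {
		return nil, err
	}
	return ids, nil
}

func main() {
	// Sample documents with made-up ids.
	all := []statusDoc{
		{"_id": "a", "status": "started"},
		{"_id": "b", "status": "", "neverset": true},
		{"_id": "c", "status": "active", "neverset": false},
	}
	ids, err := collectIDs(findNeverSet(all))
	fmt.Println(ids, err)
}
```

The point of the shape is that only one decoded document is held at a time, and documents without the field never reach the loop at all.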
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
Changed in juju:
milestone: 2.9-rc3 → 2.9-rc4
On Prodstack, we made the mistaken assumption that we could remove entries from statuses that were old (we confused statuseshistory with statuses).
So we removed everything older than 1 week with:
https://pastebin.canonical.com/p/fMzZwDm2Wx/
We then got the upgrade to complete, but saw errors in the all watcher because it was trying to load statuses for machines/instances whose status documents had been deleted.
We restored all statuses from the backup file and then ran:
db.statuses.updateMany({"neverset": false}, {"$unset": {"neverset": ""}})
db.statuses.updateMany({"neverset": true}, {"$unset": {"neverset": ""}, "$set": {"status": "unset", "statusinfo": ""}})
Which should be the equivalent of running the upgrade step.
That resulted in:
https://pastebin.canonical.com/p/4bKyR37Z3Q/
(134k documents had neverset, but it was always false.)
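To make the effect of those two updateMany calls concrete, here is a minimal Go sketch of the same per-document rule applied to an in-memory map. `applyNeverSetFix` is a hypothetical helper written for this illustration, not a function from the Juju codebase, and the sample documents are made up.

```go
package main

import "fmt"

// applyNeverSetFix mirrors the two updateMany calls on a single in-memory
// document. A boolean "neverset" field is always removed; when it was true,
// status becomes "unset" and statusinfo becomes empty. Documents without a
// boolean "neverset" field are left alone. It reports whether doc changed.
func applyNeverSetFix(doc map[string]interface{}) bool {
	neverSet, ok := doc["neverset"].(bool)
	if !ok {
		return false
	}
	delete(doc, "neverset")
	if neverSet {
		doc["status"] = "unset"
		doc["statusinfo"] = ""
	}
	return true
}

func main() {
	// Sample documents with made-up ids.
	docs := []map[string]interface{}{
		{"_id": "a", "status": "", "neverset": true},
		{"_id": "b", "status": "active", "neverset": false},
		{"_id": "c", "status": "started"},
	}
	for _, d := range docs {
		changed := applyNeverSetFix(d)
		fmt.Println(d["_id"], changed, d["status"])
	}
}
```

For a collection like the one described here, where every `neverset` value was false, the rule reduces to dropping the field and leaving `status` untouched.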
So the unfortunate workaround is to move the statuses collection to the side, let the upgrade proceed, stop the agents again, manually apply the upgrade step to the set-aside collection, and then move that collection back into place.
There should be a better fix for this in 2.8.8 and 2.9.