unit.Destroy doesn't cleanup statuseshistory correctly

Bug #1696491 reported by John A Meinel
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
juju
Critical
Christian Muirhead
2.1
Critical
Christian Muirhead

Bug Description

We're clearly missing a test.
func (u *Unit) Destroy() (err error) {

ultimately calls:
 if err = unit.st.run(buildTxn); err == nil {
  if historyErr := unit.eraseHistory(); historyErr != nil {
   logger.Errorf("cannot delete history for unit %q: %v", unit.globalKey(), historyErr)
  }

which has:
 if _, err := historyW.RemoveAll(bson.D{{"statusid", u.globalKey()}}); err != nil {
  return err
 }
 if _, err := historyW.RemoveAll(bson.D{{"statusid", u.globalAgentKey()}}); err != nil {
  return err
 }

However, looking at a *live* deployment I see:
> db.statuseshistory.find().limit(1).pretty()
{
        "_id" : ObjectId("5925b3b0cc4149d87b5bb25e"),
        "model-uuid" : "6bf26857-5f04-4a53-8970-b57969b3b2f8",
        "globalkey" : "u#nova-cloud-controller/1",
        "status" : "idle",
        "statusinfo" : "",
        "statusdata" : {

        },
        "updated" : NumberLong("1495643056822375849")
}

There is *no* 'statusid' in any of the documents:

I thought maybe it was missing an index. Its just not *doing* anything.
And all of these calls have to do a full table scan of an *enormous* table, to do that nothing. Because there isn't an index on 'statusid' because it doesn't exist.
> db.statuseshistory.find({"statusid": {$exists: 1}}).count()
0

These are the indexes that we do have on statuseshistory:
> db.statuseshistory.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "juju.statuseshistory"
        },
        {
                "v" : 1,
                "key" : {
                        "model-uuid" : 1,
                        "globalkey" : 1,
                        "updated" : 1
                },
                "name" : "model-uuid_1_globalkey_1_updated_1",
                "ns" : "juju.statuseshistory"
        },
        {
                "v" : 1,
                "key" : {
                        "model-uuid" : 1,
                        "updated" : -1
                },
                "name" : "model-uuid_1_updated_-1",
                "ns" : "juju.statuseshistory"
        },
        {
                "v" : 1,
                "key" : {
                        "updated" : -1
                },
                "name" : "updated_-1",
                "ns" : "juju.statuseshistory"
        }
]

Probably we could use "model-uuid", "globalkey", "updated". Though we'd have to be careful to include 'updated' into the query somehow to make sure it uses that index.
Creating a new index would not be cheap:
        "totalIndexSize" : 1 127 370 752,
        "indexSizes" : {
                "_id_" : 201379840,
                "model-uuid_1_globalkey_1_updated_1" : 379 416 576,
                "model-uuid_1_updated_-1" : 274169856,
                "updated_-1" : 272404480
        },
        "ok" : 1

thats 379MB for the existing index. (And 1.1GB for all the indexing on this table.)

Revision history for this message
John A Meinel (jameinel) wrote :

That code is unchanged from the initial commit from Horacio that "introduce status-history command" that did have it called 'status-id':
git show 676c38c925d730aa937d1de5387de8a6838ff43d state/status.go
+type historicalStatusDoc struct {
+ EnvUUID string `bson:"env-uuid"`
+ Status Status
+ StatusInfo string
+ StatusData map[string]interface{}
+ Updated *time.Time
+ StatusId string
+}
+

The current doc is different:
type historicalStatusDoc struct {
 ModelUUID string `bson:"model-uuid"`
 GlobalKey string `bson:"globalkey"`
 Status status.Status `bson:"status"`
 StatusInfo string `bson:"statusinfo"`
 StatusData map[string]interface{} `bson:"statusdata"`

 // Updated might not be present on statuses copied by old
 // versions of juju from yet older versions of juju.
 Updated int64 `bson:"updated"`
}

This has always been broken, we've just never noticed.

tags: added: eda
Revision history for this message
John A Meinel (jameinel) wrote :

Good news, at the very least:

db.statuseshistory.find({model-uuid: ..., globalkey: yyyy}).limit(1).explain()
shows that it will use the model-uuid,globalkey,updated index.
Which means that we could just delete on that, and have it use the index.

Revision history for this message
John A Meinel (jameinel) wrote :

We are definitely leaking the statushistory documents:

2017-06-07 16:28:31 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/86#charm": <nil>

> db.statuseshistory.find({"model-uuid": "6bf26857-5f04-4a53-8970-b57969b3b2f8", "globalkey": "u#ntp/86#charm"}).count()
2780

Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (5.4 KiB)

Well, they may not be 'leaked' as, we keep trying:
2017-06-06 21:16:52 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:18:35 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:26:54 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:33:07 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:34:27 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:35:29 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:36:51 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:37:54 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:39:11 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:40:26 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:42:11 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:43:14 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:44:24 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:46:01 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:49:32 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:50:46 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:52:21 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:54:19 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:55:39 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:57:03 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:58:23 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 21:59:49 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 22:01:27 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 22:03:16 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 22:04:33 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 22:05:48 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-06 22:07:12 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-07 14:48:01 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-07 14:57:15 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>
2017-06-07 14:59:27 ERROR juju.state unit.go:339 cannot delete history for unit "u#ntp/186#charm": <nil>...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

If we are going to do a custom binary for 2.1 that has the pruning stuff backported, then we would definitely want to backport the fix for this bug.

Revision history for this message
John A Meinel (jameinel) wrote :

This was bringing the large service to its knees. We worked around it (for now) by just completely dropping the juju.statuseshistory collection.

Changed in juju:
importance: High → Critical
Changed in juju:
assignee: nobody → Christian Muirhead (2-xtian)
status: Triaged → In Progress
Revision history for this message
Christian Muirhead (2-xtian) wrote :
Revision history for this message
Christian Muirhead (2-xtian) wrote :
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Christian Muirhead (2-xtian) wrote :

Also merged into develop.

Christian Reis (kiko)
tags: added: adrastea
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers