SEGV when jujud restarts after upgrade-controller 2.8.1 -> 2.8.2

Bug #1895954 reported by Jake Hill
This bug affects 3 people
Affects         Status        Importance  Assigned to    Milestone
Canonical Juju  Fix Released  Critical    John A Meinel
2.8             Fix Released  Critical    John A Meinel

Bug Description

Upgrading an HA controller cluster, the jujud processes fail to restart, causing complete loss of the controller. This is the machine log on one of the controllers at the time of the upgrade:

ERROR must restart: an agent upgrade is available
2020-09-16 12:54:43 INFO juju.cmd supercommand.go:54 running jujud [2.8.2 0 a44e6eb38430da695737f5e9f37819478b9587c3 gc go1.14.9]
2020-09-16 12:54:43 DEBUG juju.cmd supercommand.go:55 args: []string{"/var/lib/juju/tools/machine-4/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "4", "--debug"}
2020-09-16 12:54:43 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 2
2020-09-16 12:54:43 DEBUG juju.agent agent.go:583 read agent config, format "2.0"
2020-09-16 12:54:43 INFO juju.cmd.jujud agent.go:138 setting logging config to "<root>=WARNING;unit=DEBUG"
2020-09-16 12:54:44 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [b5fb8e] "machine-4" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-09-16 12:54:48 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [b5fb8e] "machine-4" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2020-09-16 12:54:50 WARNING juju.state txns.go:89 Running no-op transaction - called by /workspace/_build/src/github.com/juju/juju/state/database.go:399 /workspace/_build/src/github.com/juju/juju/state/upgrades.go:2994 /workspace/_build/src/github.com/juju/juju/state/upgrades.go:59 /workspace/_build/src/github.com/juju/juju/state/upgrades.go:2961 /workspace/_build/src/github.com/juju/juju/upgrades/backend.go:354 /workspace/_build/src/github.com/juju/juju/upgrades/steps_282.go:21 /workspace/_build/src/github.com/juju/juju/upgrades/upgrade.go:184 /workspace/_build/src/github.com/juju/juju/upgrades/upgrade.go:137 /workspace/_build/src/github.com/juju/juju/upgrades/upgrade.go:113 /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/manifold.go:80 /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/worker.go:240 /workspace/_build/src/github.com/juju/juju/cmd/jujud/agent/agent.go:112 /workspace/_build/src/github.com/juju/juju/cmd/jujud/agent/machine.go:659 /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/worker.go:219 /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/worker.go:188 /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/tomb.v2/tomb.go:163 /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/tomb.v2/tomb.go:159
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x1d031ca]

goroutine 460 [running]:
github.com/juju/juju/state.ResetDefaultRelationLimitInCharmMetadata.func1(0xc00072c7e0, 0x0, 0x0)
        /workspace/_build/src/github.com/juju/juju/state/upgrades.go:2973 +0x64a
github.com/juju/juju/state.runForAllModelStates(0xc0008989c0, 0x4a01118, 0x0, 0x0)
        /workspace/_build/src/github.com/juju/juju/state/upgrades.go:59 +0x308
github.com/juju/juju/state.ResetDefaultRelationLimitInCharmMetadata(0xc0008989c0, 0xc00008ca80, 0x7ff9c67037d0)
        /workspace/_build/src/github.com/juju/juju/state/upgrades.go:2961 +0x37
github.com/juju/juju/upgrades.stateBackend.ResetDefaultRelationLimitInCharmMetadata(0xc0008989c0, 0x5266160, 0xc0008989c0)
        /workspace/_build/src/github.com/juju/juju/upgrades/backend.go:354 +0x2b
github.com/juju/juju/upgrades.stateStepsFor282.func1(0x51cfd60, 0xc0000baf90, 0xc00097f901, 0xc00097f9b0)
        /workspace/_build/src/github.com/juju/juju/upgrades/steps_282.go:21 +0x48
github.com/juju/juju/upgrades.(*upgradeStep).Run(0xc0000baf60, 0x51cfd60, 0xc0000baf90, 0x489c4c6, 0x18)
        /workspace/_build/src/github.com/juju/juju/upgrades/upgrade.go:184 +0x3e
github.com/juju/juju/upgrades.runUpgradeSteps(0xc000aa1e00, 0xc00097f520, 0x1, 0x1, 0x51cfd60, 0xc0000baf90, 0xc000aa1e00, 0xc00012f200)
        /workspace/_build/src/github.com/juju/juju/upgrades/upgrade.go:137 +0x21d
github.com/juju/juju/upgrades.PerformStateUpgrade(0x2, 0x8, 0x0, 0x0, 0x1, 0x0, 0xc00097f520, 0x1, 0x1, 0x51cfd60, ...)
        /workspace/_build/src/github.com/juju/juju/upgrades/upgrade.go:113 +0xb0
github.com/juju/juju/worker/upgradedatabase.Manifold.func1.2(0x2, 0x8, 0x0, 0x0, 0x1, 0x0, 0xc00097f520, 0x1, 0x1, 0xc0006a8320, ...)
        /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/manifold.go:80 +0x90
github.com/juju/juju/worker/upgradedatabase.(*upgradeDB).runUpgradeSteps(0xc0001b0a00, 0x7ff99fa85e50, 0xc00012f200, 0x0, 0x7ff99fa85e50)
        /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/worker.go:240 +0x1d8
github.com/juju/juju/cmd/jujud/agent.(*agentConf).ChangeConfig(0xc0004fbcb0, 0xc00097f510, 0x0, 0x0)
        /workspace/_build/src/github.com/juju/juju/cmd/jujud/agent/agent.go:112 +0xab
github.com/juju/juju/cmd/jujud/agent.(*MachineAgent).ChangeConfig(0xc0005ffe60, 0xc00097f510, 0xc0006c2701, 0xc00097f510)
        /workspace/_build/src/github.com/juju/juju/cmd/jujud/agent/machine.go:659 +0x41
github.com/juju/juju/worker/upgradedatabase.(*upgradeDB).runUpgrade(0xc0001b0a00)
        /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/worker.go:219 +0x11d
github.com/juju/juju/worker/upgradedatabase.(*upgradeDB).run(0xc0001b0a00, 0x0, 0x0)
        /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/worker.go:188 +0x237
gopkg.in/tomb%2ev2.(*Tomb).run(0xc0001b0a00, 0xc00083de70)
        /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/tomb.v2/tomb.go:163 +0x38
created by gopkg.in/tomb%2ev2.(*Tomb).Go
        /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/tomb.v2/tomb.go:159 +0xba

Revision history for this message
John A Meinel (jameinel) wrote :

I just did an upgrade of an HA 2.8.1 controller on LXD, running a single application, to 2.8.2 using the official published agents.
So simple upgrades are not affected by this.

However, this is a very critical issue, so we'll keep digging to make sure it is fixed.

Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
importance: Undecided → Critical
status: New → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

...
        for _, charmDoc := range docs {
-->             for epName, rel := range charmDoc.Meta.Requires {
                        rel.Limit = 0
                        charmDoc.Meta.Requires[epName] = rel
                }
...

My guess is that either charmDoc or charmDoc.Meta is nil; most likely charmDoc.Meta.
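
A minimal sketch of that failure mode, with simplified, hypothetical stand-ins for the real types (ranging over a nil map is a no-op in Go, so the panic has to come from dereferencing a nil Meta pointer rather than from a missing Requires map):

package main

// Simplified stand-ins for the real charm document types.
type Relation struct {
        Limit int
}

type Meta struct {
        Requires map[string]Relation
}

type charmDoc struct {
        Meta *Meta // nil for a broken charm document
}

func main() {
        doc := charmDoc{} // Meta is nil
        // Reading Requires through the nil pointer panics with
        // "invalid memory address or nil pointer dereference",
        // matching the SIGSEGV at state/upgrades.go:2973.
        for epName, rel := range doc.Meta.Requires {
                rel.Limit = 0
                doc.Meta.Requires[epName] = rel
        }
}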

Revision history for this message
John A Meinel (jameinel) wrote :

Note that charmDoc.Meta is generally a 'must-have' field for a charm document. We can certainly be extra defensive during upgrade (since a failing upgrade step means your system cannot progress), but a nil Meta is a sign of broken data in the database.

It would be good to get a dump of the charm docs to see where Meta is missing and to try to figure out how the data could be that way.

Recovering from this is going to require us to work with someone directly. Please ping us on irc.freenode.net and we will try to respond quickly with some queries and ways to fix things to move forward.

Revision history for this message
John A Meinel (jameinel) wrote :

The fix could be something as simple as:
diff --git a/state/upgrades.go b/state/upgrades.go
index da2713c539..7ccd1a4868 100644
--- a/state/upgrades.go
+++ b/state/upgrades.go
@@ -2970,6 +2970,10 @@ func ResetDefaultRelationLimitInCharmMetadata(pool *StatePool) (err error) {

                var ops []txn.Op
                for _, charmDoc := range docs {
+                       if charmDoc.Meta == nil {
+                               logger.Warningf("charmDoc has nil Meta (invalid charm): %v", charmDoc)
+                               continue
+                       }
                        for epName, rel := range charmDoc.Meta.Requires {
                                rel.Limit = 0
                                charmDoc.Meta.Requires[epName] = rel

But actually getting that fix onto a system that is currently experiencing an upgrade failure is difficult, because the agent won't be looking for another upgrade while this one is failing to progress.
As such, we can build a binary that works around this, or fix the database to allow the upgrade to progress, but we'd need to do live support for it (irc.freenode.net, channel #juju).

Revision history for this message
Alexandros Soumplis (soumplis) wrote :

Similar problem here with upgrade from 2.8.1 to 2.8.2.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x1d031ca]

goroutine 1456 [running]:
github.com/juju/juju/state.ResetDefaultRelationLimitInCharmMetadata.func1(0xc000593d40, 0x0, 0x0)
 /workspace/_build/src/github.com/juju/juju/state/upgrades.go:2973 +0x64a
github.com/juju/juju/state.runForAllModelStates(0xc0003e9880, 0x4a01118, 0x0, 0x0)
 /workspace/_build/src/github.com/juju/juju/state/upgrades.go:59 +0x308
github.com/juju/juju/state.ResetDefaultRelationLimitInCharmMetadata(0xc0003e9880, 0xc00066b180, 0x7f7fe22047d0)
 /workspace/_build/src/github.com/juju/juju/state/upgrades.go:2961 +0x37
github.com/juju/juju/upgrades.stateBackend.ResetDefaultRelationLimitInCharmMetadata(0xc0003e9880, 0x5266160, 0xc0003e9880)
 /workspace/_build/src/github.com/juju/juju/upgrades/backend.go:354 +0x2b
github.com/juju/juju/upgrades.stateStepsFor282.func1(0x51cfd60, 0xc000a24f30, 0xc000aeb201, 0xc000aeb290)
 /workspace/_build/src/github.com/juju/juju/upgrades/steps_282.go:21 +0x48
github.com/juju/juju/upgrades.(*upgradeStep).Run(0xc000a24f00, 0x51cfd60, 0xc000a24f30, 0x489c4c6, 0x18)
 /workspace/_build/src/github.com/juju/juju/upgrades/upgrade.go:184 +0x3e
github.com/juju/juju/upgrades.runUpgradeSteps(0xc0007c8400, 0xc000aeae00, 0x1, 0x1, 0x51cfd60, 0xc000a24f30, 0xc0007c8400, 0xc0002aaa80)
 /workspace/_build/src/github.com/juju/juju/upgrades/upgrade.go:137 +0x21d
github.com/juju/juju/upgrades.PerformStateUpgrade(0x2, 0x8, 0x0, 0x0, 0x1, 0x0, 0xc000aeae00, 0x1, 0x1, 0x51cfd60, ...)
 /workspace/_build/src/github.com/juju/juju/upgrades/upgrade.go:113 +0xb0
github.com/juju/juju/worker/upgradedatabase.Manifold.func1.2(0x2, 0x8, 0x0, 0x0, 0x1, 0x0, 0xc000aeae00, 0x1, 0x1, 0xc00082be80, ...)
 /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/manifold.go:80 +0x90
github.com/juju/juju/worker/upgradedatabase.(*upgradeDB).runUpgradeSteps(0xc000b02500, 0x7f7fbb54b258, 0xc0002aaa80, 0x0, 0x7f7fbb54b258)
 /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/worker.go:240 +0x1d8
github.com/juju/juju/cmd/jujud/agent.(*agentConf).ChangeConfig(0xc000141260, 0xc000aeadf0, 0x0, 0x0)
 /workspace/_build/src/github.com/juju/juju/cmd/jujud/agent/agent.go:112 +0xab
github.com/juju/juju/cmd/jujud/agent.(*MachineAgent).ChangeConfig(0xc0002e7b00, 0xc000aeadf0, 0xc000a1d601, 0xc000aeadf0)
 /workspace/_build/src/github.com/juju/juju/cmd/jujud/agent/machine.go:659 +0x41
github.com/juju/juju/worker/upgradedatabase.(*upgradeDB).runUpgrade(0xc000b02500)
 /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/worker.go:219 +0x11d
github.com/juju/juju/worker/upgradedatabase.(*upgradeDB).run(0xc000b02500, 0x0, 0x0)
 /workspace/_build/src/github.com/juju/juju/worker/upgradedatabase/worker.go:188 +0x237
gopkg.in/tomb%2ev2.(*Tomb).run(0xc000b02500, 0xc0005d1fd0)
 /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/tomb.v2/tomb.go:163 +0x38
created by gopkg.in/tomb%2ev2.(*Tomb).Go
 /workspace/_build/src/github.com/juju/juju/vendor/gopkg.in/tomb.v2/tomb.go:159 +0xba
2020-09-17 12:36:33 INFO juju.cmd supercommand.go:...


John A Meinel (jameinel)
Changed in juju:
milestone: none → 2.8.3
assignee: Joseph Phillips (manadart) → John A Meinel (jameinel)
John A Meinel (jameinel)
Changed in juju:
milestone: 2.8.4 → 2.9-beta1
Revision history for this message
John A Meinel (jameinel) wrote :

We have a potential workaround for people impacted by this bug.

SSH to the controller machine and get access to Mongo:

agent=$(cd /var/lib/juju/agents; echo machine-*)
pw=$(sudo grep statepassword /var/lib/juju/agents/${agent}/agent.conf | cut '-d ' -sf2)

mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju

If you are running an HA controller, you will want to determine which machine is the Mongo primary. The shell prompt will read:

juju:PRIMARY>

if you are connected to the primary, and

juju:SECONDARY>

if you are not.

You can run

rs.status()

and look in the "members" array for the entry whose "stateStr" is "PRIMARY", e.g.:

                        "name" : "10.5.24.54:37017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",

From there you can run:

db.charms.find({meta: null}).count()

to see how many records are affected. You can exit that shell and run:

mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju --eval 'db.charms.find({}).pretty()' > all_records.txt

to get a complete list of all charm records, and

mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju --eval 'db.charms.find({"meta": null}).pretty()' > null_records.txt

to get just the null records.

Then run:

mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju --eval 'db.charms.update({meta: null}, { $set: {"meta": {}} }, false, true)'

which will replace each null meta with an empty meta document, avoiding the nil pointer dereference. (The trailing false, true arguments are upsert=false and multi=true, so every matching document is updated.)
You should see a line like:

WriteResult({ "nMatched" : 0, "nUpserted" : 0, "nModified" : 0 })

where the nModified matches the count() from earlier.
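
As a sketch of why the empty document is enough (my reading of the decoding behaviour, with a hypothetical simplified type; the thread doesn't spell this out): a meta of null decodes to a nil *Meta pointer, while an empty {} subdocument decodes to a non-nil pointer whose Requires map is nil, and ranging over a nil map is a harmless no-op:

package main

import "fmt"

// Hypothetical simplified stand-in for the charm metadata type.
type Meta struct {
        Requires map[string]struct{ Limit int }
}

func main() {
        meta := &Meta{} // what an empty {} subdocument decodes to
        // meta.Requires is a nil map; this loop runs zero times.
        for epName := range meta.Requires {
                fmt.Println(epName)
        }
        fmt.Println("no panic with an empty, non-nil Meta")
}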

Revision history for this message
John A Meinel (jameinel) wrote :

We should have started by saying: stop the Juju controllers with

systemctl stop jujud-machine-X

for all controller machines, and once you have made the database changes, run

systemctl start jujud-machine-X

on all controller machines.

Revision history for this message
John A Meinel (jameinel) wrote :

For a nicer formatted version of the steps to work around this bug, see:
https://discourse.juju.is/t/controllers-missing-after-upgrade-controller/3560/7?u=jameinel

John A Meinel (jameinel)
Changed in juju:
status: In Progress → Fix Committed
status: Fix Committed → In Progress
John A Meinel (jameinel)
Changed in juju:
status: In Progress → Fix Released