1.20.x cannot upgrade to 1.26-alpha1

Bug #1495542 reported by Curtis Hovey
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju-core
Fix Released
Critical
Andrew Wilkins

Bug Description

As seen at
    http://reports.vapour.ws/releases/3051/job/aws-upgrade-20-trusty-amd64/attempt/15

Juju 1.20.11 (former trusty-updates) cannot upgrade to 1.26-alpha1

Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Eric Snow (ericsnowcurrently)
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Ultimately the following commands are run:

1. (1.20.11) juju --show-log bootstrap -e aws-upgrade-20-trusty-amd64 --constraints mem=2G
2. juju --show-log status -e aws-upgrade-20-trusty-amd64

3. juju --show-log deploy -e aws-upgrade-20-trusty-amd64 local:trusty/dummy-source
4. juju --show-log set -e aws-upgrade-20-trusty-amd64 dummy-source token=XXXXX
5. juju --show-log deploy -e aws-upgrade-20-trusty-amd64 local:trusty/dummy-sink
6. juju --show-log add-relation -e aws-upgrade-20-trusty-amd64 dummy-source dummy-sink
7. juju --show-log expose -e aws-upgrade-20-trusty-amd64 dummy-sink
8. juju --show-log status -e aws-upgrade-20-trusty-amd64

9. juju --show-log upgrade-juju -e aws-upgrade-20-trusty-amd64 --version 1.26-alpha1

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

10. juju --show-log set -e aws-upgrade-20-trusty-amd64 dummy-source token=HRV13HFTBHGS6QMTA538
11. juju --show-log status -e aws-upgrade-20-trusty-amd64

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :
Download full text (4.1 KiB)

Key test output from the time "upgrade-juju" is run to the subsequent error:

juju --show-log upgrade-juju -e aws-upgrade-20-trusty-amd64 --version 1.26-alpha1
2015-09-14 11:55:30 INFO juju.cmd supercommand.go:37 running juju [1.26-alpha1-precise-amd64 gc]
2015-09-14 11:55:30 INFO juju.api api.go:266 connecting to API addresses: [snip]
[snip]
2015-09-14 11:55:31 INFO juju.api apiclient.go:194 connection established to [snip]
available tools:
    1.26-alpha1-precise-amd64
    1.26-alpha1-trusty-amd64
    1.26-alpha1-trusty-i386
    1.26-alpha1-trusty-ppc64el
    1.26-alpha1-vivid-amd64
best version:
    1.26-alpha1
2015-09-14 11:55:35 INFO juju.cmd.juju upgradejuju.go:215 started upgrade to 1.26-alpha1
2015-09-14 11:55:35 INFO juju.cmd supercommand.go:435 command finished
1.20.11: 1, 0, 2, dummy-sink/0, dummy-source/0 ..!!!
2015-09-14 11:55:41 INFO juju.cmd supercommand.go:37 running juju [1.26-alpha1-precise-amd64 gc]
2015-09-14 11:55:41 INFO juju.api api.go:266 connecting to API addresses: [snip]
[snip]
2015-09-14 11:55:42 INFO juju.api apiclient.go:194 connection established to [snip]
2015-09-14 11:55:43 WARNING juju.environs.config config.go:1417 unknown config field "tools-url"
2015-09-14 11:55:44 ERROR juju.cmd supercommand.go:429 connection is shut down

1.20.11: 1, 2
1.20.11: 1, 2, dummy-sink/0, dummy-source/0!!!
2015-09-14 11:55:56 INFO juju.cmd supercommand.go:37 running juju [1.26-alpha1-precise-amd64 gc]
2015-09-14 11:55:56 INFO juju.api api.go:266 connecting to API addresses: [snip]
[snip]
2015-09-14 11:55:57 INFO juju.api apiclient.go:194 connection established to [snip]
2015-09-14 11:55:57 INFO juju.rpc server.go:328 error closing codec: EOF
2015-09-14 11:55:57 ERROR juju.cmd supercommand.go:429 connection is shut down

 ................
1.20.11: 1, dummy-sink/0, dummy-source/0 .
1.20.11: dummy-sink/0, dummy-source/0 ..
1.20.11: dummy-sink/0 .
2015-09-14 11:56:32 INFO juju run succeeded on machines: ['1', '2']
juju --show-log set -e aws-upgrade-20-trusty-amd64 dummy-source token=HRV13HFTBHGS6QMTA538
2015-09-14 11:56:32 INFO juju.cmd supercommand.go:37 running juju [1.26-alpha1-precise-amd64 gc]
2015-09-14 11:56:32 INFO juju.api api.go:266 connecting to API addresses: [snip]
2015-09-14 11:56:32 INFO juju.api apiclient.go:194 connection established to [snip]
2015-09-14 11:56:33 INFO juju.cmd supercommand.go:435 command finished
2015-09-14 11:56:33 INFO Retrieving token.
2015-09-14 11:56:38 INFO Found token '6OS656GE9ZX7OKTOOOFQ' expected 'HRV13HFTBHGS6QMTA538'
2015-09-14 11:56:47 INFO Found token '6OS656GE9ZX7OKTOOOFQ' expected 'HRV13HFTBHGS6QMTA538'
2015-09-14 11:56:56 INFO Found token '6OS656GE9ZX7OKTOOOFQ' expected 'HRV13HFTBHGS6QMTA538'
2015-09-14 11:57:06 INFO Found token '6OS656GE9ZX7OKTOOOFQ' expected 'HRV13HFTBHGS6QMTA538'
2015-09-14 11:57:15 INFO Found token '6OS656GE9ZX7OKTOOOFQ' expected 'HRV13HFTBHGS6QMTA538'
2015-09-14 11:57:24 INFO Found token '6OS656GE9ZX7OKTOOOFQ' expected 'HRV13HFTBHGS6QMTA538'
2015-09-14 11:57:34 INFO Found token '6OS656GE9ZX7OKTOOOFQ' expected 'HRV13HFTBHGS6QMTA538'
2015-09-14 11:57:43 INFO Found token '6OS656GE9ZX7OKTOOOFQ' expected 'HRV13HFTBHGS6QMTA538'
2015-09-14 11:57:52 INFO Found toke...

Read more...

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Per machine-0.log, the following key things happened:

http://pastebin.ubuntu.com/12411600/

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

Particularly note the mongo/storage-related panic at http://pastebin.ubuntu.com/12411600/.

Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

The panic happens because the mgo.Session was already explicitly closed at the point someone tried to copy the session. In this case it's the blobstore in state/storage/storage.go, as used by environmentStorageDataSource in cmd/jujud/agent/simplestreams.go.

It isn't immediately clear yet why the session was already closed.

Ian Booth (wallyworld)
Changed in juju-core:
assignee: Eric Snow (ericsnowcurrently) → Anastasia (anastasia-macmood)
status: In Progress → Triaged
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

In cmd/jujud/agent/machine.go, environmentStorageDataSource is created from st.MongoSession() (for the state worker).

In cmd/jujud/agent/upgrade.go, environmentStorageDataSource is created from c.st.MongoSession() (for the upgrade worker).

Based on that alone, either one could be source of the mongo session that is getting closed pre-maturely.

For both workers, the state.State connection was opened for the duration of the worker's life. The question is, how well are the currently registered data sources handled for restarts-by-the-runner; and for the-worker-finished-but-the-agent-lives-on?

Ian Booth (wallyworld)
Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Image metadata worker gets data from all registered image metadata data sources. This logic caused problem here as one of these sources uses either state or session directly.

The solution is to move all logic that deals with data sources out to api layer and deal with data sources there. Thanks Ian :D

I have a fix that is almost ready to go - running all tests now.

Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Curtis Hovey (sinzui) wrote :

This fix is incomplete. We still fail to see hooks updating after the upgrade
   http://reports.vapour.ws/releases/3055/job/aws-upgrade-20-trusty-amd64/attempt/24

Changed in juju-core:
status: Fix Committed → In Progress
Revision history for this message
Casey Marshall (cmars) wrote :

Sorry, wrong tab :(

Changed in juju-core:
status: In Progress → Fix Committed
status: Fix Committed → In Progress
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Curtis Hovey (sinzui) wrote :
Changed in juju-core:
status: Fix Committed → In Progress
Revision history for this message
Andrew Wilkins (axwalk) wrote :

There's a few errors happening in the uniter. This one was mentioned in IRC:
    2015-09-16 02:40:53 ERROR juju.worker runner.go:218 exited "uniter": ModeAbide: cannot set invalid status "started"
but I think this is a red herring. This will not prevent the uniter from upgrading, so it'll error, restart, upgrade, then start setting status in the correct way.

I *think* this is the issue:
    2015-09-16 05:23:45 DEBUG juju.worker.metrics.collect manifold.go:159 [{github.com/juju/juju/worker/charmdir/manifold.go:102: charmdir not available}]

I'm getting this in my repro of the bug, and it's in the logs at http://data.vapour.ws/juju-ci/products/version-3059/aws-upgrade-20-trusty-amd64/build-29/unit-dummy-sink-0.log. It's interesting that, in those logs, the token *is* eventually set to the expected value... I think it's just taking longer than it should, due to this bug causing the uniter to bounce up and down.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Okay, now I'm not sure about that. I upgraded again after adding some logging, and that issue went away. But the token is still not propagating. Still investigating.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Now I'm baffled. Somehow the persistent local relation state (/var/lib/juju/agents/unit-dummy-sink-0/state/relations/0/dummy-source-0) had a completely wrong value in it. It said "5", but the most recent value seen remotely was "3" at the time. Sure enough, when I did a few more "juju set"s, the relation settings changed enough times to bump the value to 6 and propagate the token.

Now to figure out how that value got out of whack...

Revision history for this message
Andrew Wilkins (axwalk) wrote :

It appears that the issue is due to the env-uuid migration of relation units docs. They are removed and recreated (with new IDs), which resets the txn-revno. We use txn-revno for the relation units settings version, which is why it went backwards.

The big question is why this hasn't been a problem until now. Perhaps we were comparing != before and now >? Either way, we need to stop using txn-revno, and add another (earlier) upgrade step that adds a settings field based on txn-revno. Or something like that.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Confirmed that this is what's happening, and previously we used != instead of >.
    https://github.com/juju/juju/blob/1.25/worker/uniter/relation/livesource.go#L223
so failures were possible before, but rarer.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

I have a fix in progress at https://github.com/axw/juju/tree/state-settings-version. I'm signing off now, will pick it back up in the morning. The idea is that the settings doc will grow a new version field, and existing settings attributes will move to a new subdoc. Most of the remaining work is to migrate existing settings docs.

Changed in juju-core:
assignee: Anastasia (anastasia-macmood) → Andrew Wilkins (axwalk)
Revision history for this message
Andrew Wilkins (axwalk) wrote :

I'm going to temporarily revert the > back to != to unblock master, and split off a separate bug to fix this properly.

Revision history for this message
Andrew Wilkins (axwalk) wrote :
Andrew Wilkins (axwalk)
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote : Fix Released in juju-core master

Juju-CI verified that this issue is Fix Released in juju-core master:
    http://reports.vapour.ws/releases/3066

Changed in juju-core:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.