upgrade tests fail on multiple substrates with revision 24c1b80d

Bug #1403738 reported by John George
Affects     Status    Importance    Assigned to           Milestone
juju-core             Critical      Menno Finlay-Smits

Bug Description

Upgrade tests across multiple substrates are failing while checking 'juju status' after an upgrade with revision 24c1b80d. Affected jobs:

maas-devel-upgrade-trusty-amd64
kvm-upgrade-trusty-amd64
azure-upgrade-precise-amd64
hp-upgrade-trusty-amd64

John George (jog) wrote :
description: updated
summary: - upgrade timout on multiple substrates with revision 24c1b80d
+ upgrade tests fail on multiple substrates with revision 24c1b80d
Changed in juju-core:
importance: Undecided → Critical
status: New → Triaged
Dimiter Naydenov (dimitern) wrote :

Looking at the logs I can say:
1. We need to run upgrade jobs with logging-config: <root>=DEBUG, because the current setting (<root>=INFO;unit=DEBUG) is not detailed enough to follow upgrade steps as they are executed. Since these jobs exist specifically to test upgrades, we need the extra context.
2. Some time after machine-0 restarted with the 1.22-alpha1 tools, I can see the following in the log:

2014-12-18 03:51:42 DEBUG juju.apiserver apiserver.go:160 <- [3] machine-0 {"RequestId":7,"Type":"Upgrader","Request":"SetTools","Params":{"AgentTools":[{"Tag":"machine-0","Tools":{"Version":"1.22-alpha1-trusty-amd64"}}]}}
...
2014-12-18 03:51:42 DEBUG juju.apiserver apiserver.go:167 -> [3] machine-0 161.55919ms {"RequestId":7,"Response":{"Results":[{"Error":{"Message":"cannot set agent version for machine 0: not found or dead","Code":"dead"}}]}}

Because machine 0 is obviously up and running (alive), this might be caused by the machine-0 document not having the env-uuid added as needed. This could mean either that the upgrade step AddEnvUUIDToMachines did not run, or that something was omitted from the recent change https://github.com/juju/juju/pull/1291 that automates adding env-uuid to state operations.
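For anyone scanning these logs by script, the error code can be pulled out of a logged response like the one above. This is only a sketch: the apiResponse type and the errorCodes helper are local illustrations covering just the fields visible in the log line, not juju's own API structs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiResponse mirrors only the fields visible in the logged response.
// It is a local illustration type, not one of juju's API structs.
type apiResponse struct {
	RequestId int `json:"RequestId"`
	Response  struct {
		Results []struct {
			Error *struct {
				Message string `json:"Message"`
				Code    string `json:"Code"`
			} `json:"Error"`
		} `json:"Results"`
	} `json:"Response"`
}

// errorCodes returns the error code of every failed result in a response.
func errorCodes(raw string) ([]string, error) {
	var resp apiResponse
	if err := json.Unmarshal([]byte(raw), &resp); err != nil {
		return nil, err
	}
	var codes []string
	for _, r := range resp.Response.Results {
		if r.Error != nil {
			codes = append(codes, r.Error.Code)
		}
	}
	return codes, nil
}

func main() {
	// The JSON payload copied from the 03:51:42 response line in the log.
	const logged = `{"RequestId":7,"Response":{"Results":[{"Error":{"Message":"cannot set agent version for machine 0: not found or dead","Code":"dead"}}]}}`
	codes, err := errorCodes(logged)
	fmt.Println(codes, err)
}
```

Run against the payload above, this reports the single code "dead" for the SetTools call.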

John George (jog) wrote :
Dimiter Naydenov (dimitern) wrote :

I think I found the issue. The upgrader worker calls SetTools for machine-0 with the new tools (1.22-alpha1.1), which in turn calls state.Machine.SetAgentVersion. That call fails because runTransaction in 1.22-alpha1.1 uses the automatic multi-environment transaction runner, which effectively prepends the env-uuid to the machine id.

The transaction then fails with ErrAborted: the upgrade steps have not yet run, so no document has had an env-uuid added yet.

It might seem surprising that specifying any Assert on a txn.Op along with a nonexistent Id returns ErrAborted instead of mgo.ErrNotFound, but this is an mgo bug, which I'll file separately.
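The failure mode can be reproduced in miniature without mgo. The sketch below is a simulation only: collection, op, multiEnvRun and errAborted are stand-ins invented for illustration, the ":" separator between env-uuid and machine id is an assumption, and the version strings and UUID are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// errAborted stands in for the txn package's ErrAborted in this simulation.
var errAborted = errors.New("transaction aborted")

// op is a simplified stand-in for txn.Op: an update guarded by an
// assertion that the target document exists.
type op struct {
	ID      string
	Version string
}

// collection simulates the machines collection as a map from document id
// to the stored agent version.
type collection map[string]string

// run applies the op, aborting when the asserted document is missing.
func (c collection) run(o op) error {
	if _, ok := c[o.ID]; !ok {
		return errAborted
	}
	c[o.ID] = o.Version
	return nil
}

// multiEnvRun mimics a runner that always prefixes the document id with
// the environment UUID before running the op (separator is an assumption).
func multiEnvRun(c collection, envUUID string, o op) error {
	o.ID = envUUID + ":" + o.ID
	return c.run(o)
}

func main() {
	const envUUID = "hypothetical-env-uuid"
	// Before the upgrade steps run, documents are keyed by bare machine id.
	preMigration := collection{"0": "old-version"}
	// After the upgrade steps, ids carry the env-uuid prefix.
	postMigration := collection{envUUID + ":0": "old-version"}

	set := op{ID: "0", Version: "1.22-alpha1"}
	fmt.Println("pre-migration:", multiEnvRun(preMigration, envUUID, set))
	fmt.Println("post-migration:", multiEnvRun(postMigration, envUUID, set))
}
```

The pre-migration call aborts because the prefixed id matches no document, while the same op succeeds once the ids have been migrated.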

A simple experiment confirms this: in the txn.Op in machine.SetAgentVersion, replace Id: m.doc.DocID with Id: m.doc.DocID + "a", then run TestMachineRefresh from state/machine_test.go. The result is:

machine_test.go:955:
    c.Assert(err, jc.ErrorIsNil)
... value *errors.Err = &errors.Err{message:"cannot set agent version for machine 2", cause:(*errors.errorString)(0xc210080960), previous:(*errors.Err)(0xc2100d3280), file:"github.com/juju/juju/state/machine.go", line:332} ("cannot set agent version for machine 2: not found or dead")
... error stack:
 not found or dead
 github.com/juju/juju/state/state.go:449:
 github.com/juju/juju/state/machine.go:332: cannot set agent version for machine 2

START: action_test.go:1: MachineSuite.TearDownTest

So this is indeed a regression. It can be fixed if the multi-env transaction runner handles machine.SetAgentVersion specially (not auto-prepending env-uuid if the upgrade steps have not yet run).
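As a rough illustration of that suggestion (not the fix that was eventually merged), the id resolution could be made conditional on whether the upgrade steps have run. resolveDocID and its migrated flag are hypothetical names, and the ":" separator is an assumption.

```go
package main

import "fmt"

// resolveDocID returns the document id an operation should target.
// migrated is a hypothetical flag reporting whether the upgrade steps that
// add env-uuids to documents have already run.
func resolveDocID(envUUID, localID string, migrated bool) string {
	if !migrated {
		// Documents are still keyed by the bare local id.
		return localID
	}
	return envUUID + ":" + localID
}

func main() {
	fmt.Println(resolveDocID("hypothetical-env-uuid", "0", false))
	fmt.Println(resolveDocID("hypothetical-env-uuid", "0", true))
}
```

The awkward part of this approach is that the runner needs a reliable way to know the migration state, which is why special-casing a single call such as SetAgentVersion may be simpler.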

Changed in juju-core:
status: Triaged → Confirmed
Dimiter Naydenov (dimitern) wrote :

Related mgo bug #1403846

Curtis Hovey (sinzui)
tags: added: ci regression upgrade-juju
Changed in juju-core:
status: Confirmed → Triaged
milestone: none → 1.22
Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Menno Smits (menno.smits)
Menno Finlay-Smits (menno.smits) wrote :

There were 2 problems.

The first was that SetAgentVersion may be called before DB migrations have been run.

The second was that the machine agent container setup code was running during upgrades. It made calls to state that shouldn't be made until after upgrades have completed.
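One generic way to keep such setup code from running early is to gate it on an "upgrade complete" signal. The sketch below shows the pattern with a channel; startAfterUpgrade is a hypothetical helper, and this is not necessarily how the actual fix was implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// startAfterUpgrade runs setup only after upgradeDone has been closed.
// The returned channel is closed once setup has finished.
func startAfterUpgrade(upgradeDone <-chan struct{}, setup func()) <-chan struct{} {
	finished := make(chan struct{})
	go func() {
		defer close(finished)
		<-upgradeDone // block until the upgrade steps have completed
		setup()
	}()
	return finished
}

func main() {
	upgradeDone := make(chan struct{})

	var mu sync.Mutex
	var events []string
	record := func(e string) {
		mu.Lock()
		defer mu.Unlock()
		events = append(events, e)
	}

	finished := startAfterUpgrade(upgradeDone, func() {
		record("container setup")
	})

	record("upgrade steps")
	close(upgradeDone) // signal that state is safe to use
	<-finished

	fmt.Println(events)
}
```

Because the setup goroutine blocks on the channel, "container setup" is always recorded after "upgrade steps", regardless of scheduling.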

I have fixes for these issues merging now.

Menno Finlay-Smits (menno.smits) wrote :

Fixes in eda5573a9c401f1b26411d31a4f350a172bc9c0e and f07c7312e0aaa80da2976557c534028240f09241

Changed in juju-core:
status: In Progress → Fix Committed
Menno Finlay-Smits (menno.smits) wrote :

With this fix in place, upgrade tests for master in CI are passing. Marking this as Fix Released.

Changed in juju-core:
status: Fix Committed → Fix Released