windows services cannot upgrade to 1.25.6

Bug #1577949 reported by Curtis Hovey
This bug affects 1 person
Affects    Status        Importance  Assigned to      Milestone
juju-core  Invalid       High        Unassigned
1.25       Fix Released  Critical    Cheryl Jennings

Bug Description

In this example, one of many:
    http://reports.vapour.ws/releases/3942/job/maas-1_9-upgrade-win2012hvr2-amd64/attempt/315

1.25.5 can be deployed to Windows. All the machines can be upgraded to 1.25.6, but the services never update. Though the deployment of Windows machines is slow, upgrades are actually fast. This was seen on MAAS 1.9, but other upgrade tests for Ubuntu and CentOS are fine. Only Windows fails.

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Incomplete
tags: added: blocker ci regression upgrade-juju windows
James Tunnicliffe (dooferlad) wrote :

Taking a look. Just installing MAAS 1.8.

/me hopes the MAAS 1.9 installer script will work with 1.8

James Tunnicliffe (dooferlad) wrote :

Sorry, commented on the wrong bug :-(

Horacio Durán (hduran-8) wrote :

To reproduce this, a proper test environment needs to be set up, consisting of a MAAS that can deploy Windows images with cloudbase-init support.

Anastasia (anastasia-macmood) wrote :

I am seeing this in the most recent job log...

2016-06-07 03:30:28 WARNING winrm cat failed <Response code 1, out "", err "The system cannot fi">

Anastasia (anastasia-macmood) wrote :

Also, did login change between MAAS2 and MAAS1?

This is also present in the logs:

Expected application/json, got: text/html; charset=utf-8
2016-06-07 03:30:25 INFO Could not login with MAAS 2.0 API, trying 1.0

Is this job meant to use MAAS2? Does the CI test need to be updated?

Anastasia (anastasia-macmood) wrote :

Considering that this is a job for MAAS 1.9, why did it even try to use the MAAS2 API? :D

Curtis Hovey (sinzui) wrote :

Hi Anastasia.

The winrm failure means CI tried to retrieve logs, but failed. Log retrieval will fail if one of the expected log paths is missing on the host or the host is unreachable.

CI/Juju doesn't have a clean way to know which kind of MAAS it is using. CI tries to log in using MAAS 2 rules, then falls back to MAAS 1.
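
A rough sketch of that try-2.0-then-1.0 login order, for readers unfamiliar with it. This is not the CI code; `login_with_fallback`, `UnexpectedContentType`, `login_v2` and `login_v1` are hypothetical names, and the only detail taken from the job log is the "Expected application/json, got: text/html" message.

```python
class UnexpectedContentType(Exception):
    """Hypothetical error: the server did not answer with JSON."""


def login_with_fallback(login_funcs):
    """Try each (api_version, login) pair in order; return the first success.

    Both the pairs and the exception type are placeholders for whatever
    the real CI scripts use.
    """
    last_error = None
    for api_version, login in login_funcs:
        try:
            session = login()
        except UnexpectedContentType as err:
            # Remember the failure and move on to the next API version.
            last_error = err
            continue
        return api_version, session
    if last_error is None:
        raise ValueError("no login functions given")
    raise last_error


def login_v2():
    # Simulates the response recorded in the job log for the 2.0 attempt.
    raise UnexpectedContentType(
        "Expected application/json, got: text/html; charset=utf-8")


def login_v1():
    # Stand-in for a successful 1.0 login.
    return {"api": "1.0"}


version, session = login_with_fallback([("2.0", login_v2), ("1.0", login_v1)])
print(version)
```

With this ordering, a failed 2.0 attempt against a MAAS 1.9 server is expected noise in the log and not an error in itself.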

Cheryl Jennings (cherylj) wrote :

In the latest run, I see that the machine agents were upgraded successfully, but the unit agents never come back. Unfortunately, there are no unit logs in the last run. Is there any way they can be collected?

Curtis Hovey (sinzui) wrote :

I reran the job *after* restoring the log collection rules. We disabled Juju log collection for Windows a few weeks back when we needed to collect the cloud-init logs.

    http://reports.vapour.ws/releases/4033/job/maas-1-9-upgrade-win2012hvr2-amd64/attempt/414

^ Note that the job was renamed earlier today. It doesn't show up on the reports site for build 4033, but it is there.

Anastasia (anastasia-macmood) wrote :

I'll have a look at the units' logs, but I am wondering whether we are giving the units enough time to upgrade. How long do we wait?

Anastasia (anastasia-macmood) wrote :

PR against 1.25: https://github.com/juju/juju/pull/5569

We provide a fallback to use old password to authenticate. This fallback should also cater for bad credentials error.
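
To illustrate the idea only (the PR itself is not reproduced here): a minimal sketch in which a connection attempt with the new password is retried with the old one when either an unauthorized or a bad-credentials error comes back. `UnauthorizedError`, `BadCredentialsError`, `connect_with_fallback` and `fake_connect` are hypothetical names, not Juju's API.

```python
class UnauthorizedError(Exception):
    """Hypothetical: the server rejected the login as unauthorized."""


class BadCredentialsError(Exception):
    """Hypothetical: the server reported bad credentials."""


# Assumed set of errors that should trigger a retry with the old password.
RETRY_WITH_OLD_PASSWORD = (UnauthorizedError, BadCredentialsError)


def connect_with_fallback(connect, new_password, old_password):
    """Try the new password first; fall back to the old one on auth errors."""
    try:
        return connect(new_password)
    except RETRY_WITH_OLD_PASSWORD:
        return connect(old_password)


def fake_connect(password):
    # Stand-in server that only knows the old password and reports
    # a bad-credentials error for anything else.
    if password != "old-secret":
        raise BadCredentialsError("bad credentials")
    return "connected"


result = connect_with_fallback(fake_connect, "new-secret", "old-secret")
print(result)
```

If the fallback only caught the unauthorized error, the bad-credentials case in this sketch would propagate and the agent would never try its old password.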

I would like to see a CI run once this fix lands before marking this bug as Fix Committed.

Anastasia (anastasia-macmood) wrote :
Changed in juju-core:
status: Incomplete → In Progress
importance: Undecided → Critical
assignee: nobody → Anastasia (anastasia-macmood)
milestone: none → 2.0-beta9
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta9 → 2.0-beta10
Changed in juju-core:
importance: Critical → High
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta10 → 2.0-beta11
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta11 → 2.0-beta12
Changed in juju-core:
milestone: 2.0-beta12 → 2.0-beta13
Cheryl Jennings (cherylj) wrote :

I was able to find that the machine agent, after the upgrade, thinks that the unit is not deployed and tries to redeploy it (which includes resetting the unit agent's password).

The determination of whether or not the unit is deployed is done by querying services that are running on the machine. Inspecting running services after the failed upgrade showed that the unit agent was running.

I added some extra logging, and this time the upgrade test passed in CI.

The list of services that are running on the machine is populated when the deployer worker starts, and that list is queried when we get our initial event from the watcher. I suspect that this is just a timing issue where the unit agent isn't started yet when we do the initial query.

We can narrow the gap by re-querying the list of running services in the handler right before we attempt to deploy new services.
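
A toy model of the suspected timing issue and of the re-query, assuming the timing theory above is right. It is not the deployer worker's code: `Deployer`, `list_running_services` and the `requery` flag are hypothetical, and the unit name is made up.

```python
class Deployer:
    """Toy deployer: decides whether a unit still needs to be deployed."""

    def __init__(self, list_running_services, deploy):
        self._list_running_services = list_running_services
        self._deploy = deploy
        # Snapshot taken once, when the worker starts.
        self._deployed = set(list_running_services())

    def handle(self, unit_name, requery=True):
        """Deploy unit_name unless it is already running.

        Returns True if a (re)deploy was triggered, False otherwise.
        """
        if requery:
            # Refresh the snapshot right before deciding to deploy.
            self._deployed = set(self._list_running_services())
        if unit_name in self._deployed:
            return False
        self._deploy(unit_name)
        self._deployed.add(unit_name)
        return True


running = []      # services currently running on the machine
redeployed = []   # units the deployer decided to deploy again

stale = Deployer(lambda: list(running), redeployed.append)
fresh = Deployer(lambda: list(running), redeployed.append)

# The unit agent comes up only after both workers took their snapshot.
running.append("unit-mysql-0")

stale_result = stale.handle("unit-mysql-0", requery=False)
fresh_result = fresh.handle("unit-mysql-0")
print(stale_result, fresh_result, redeployed)
```

The re-query narrows the window but does not close it: a unit agent that starts between the refresh and the deploy call would still be redeployed.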

Long term (aka 2.0), there may be a better way to determine if we've already deployed a particular service.

Cheryl Jennings (cherylj) wrote :

I made a test patch which checks whether the service is installed immediately before resetting the password for the unit agent. The CI run on the branch showed that the problem was recreated, and that this solution worked around it.

I'm working on writing the tests for the patch now.

For a fix in master, we should try to determine why the list of Juju services returned from service.ListServices is empty on Windows.

Cheryl Jennings (cherylj) wrote :
Changed in juju-core:
assignee: Anastasia (anastasia-macmood) → nobody
assignee: nobody → Anastasia (anastasia-macmood)
assignee: Anastasia (anastasia-macmood) → nobody
Changed in juju-core:
status: In Progress → Triaged
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta13 → 2.0-beta14
Curtis Hovey (sinzui)
tags: removed: blocker
Curtis Hovey (sinzui)
Changed in juju-core:
status: Triaged → Invalid
milestone: 2.0-beta14 → none