cannot migration nor upgrade without manual intervention for a machine after a container is removed.

Bug #1960235 reported by Heather Lanigan
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Heather Lanigan
Milestone: (none listed)

Bug Description

You can get a machine into a state where it cannot be migrated. The unconverted-api-workers worker gets stuck in the stopping state:

  unconverted-api-workers:
    inputs:
    - api-caller
    - migration-fortress
    - migration-inactive-flag
    report:
      workers:
        1-container-watcher:
          started: "2022-02-04 21:25:13"
          state: stopped
    start-count: 1
    started: "2022-02-04 21:25:13"
    state: stopping

Seen in the field (2.9.21) and reproduced locally (tip of 2.9). While attempting to debug, I also noticed that a machine agent in this state hangs and cannot restart the new agent. If you attempt a migration, unconverted-api-workers fails to restart after the migration times out.

To upgrade the machine, you must restart the jujud process. All agents are marked as lost on the machine until the process is restarted.

summary: - unconverted-api-workers gets to status: stopping and never returns
+ cannot migration or upgrade without manual intervention for a machine
+ after a container is removed.
Heather Lanigan (hmlanigan) wrote :

Introduced in juju 2.9.17.

summary: - cannot migration or upgrade without manual intervention for a machine
+ cannot migration nor upgrade without manual intervention for a machine
after a container is removed.
John A Meinel (jameinel)
Changed in juju:
status: Triaged → In Progress
Heather Lanigan (hmlanigan) wrote :

https://github.com/juju/juju/blob/79a364e0bda7e8e0573b3ba8bc01257a53029f52/worker/provisioner/container_initialisation.go#L146

Changing the nil to the abort channel resolves the issue. The runner code needs more investigation to understand why StopWorker and StopAndRemoveWorker behave differently here, and why this change works.

Heather Lanigan (hmlanigan) wrote :
Changed in juju:
status: In Progress → Fix Committed
Heather Lanigan (hmlanigan) wrote :

The upgrade will show a lost agent on machines impacted by this bug. To resolve the problem, kill the jujud process on that machine; it will be restarted automatically and the upgrade will continue.

Changed in juju:
status: Fix Committed → Fix Released