Offer permissions are not migrated

Bug #1957745 reported by Paul Goins
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Ian Booth
Milestone: 2.9.25

Bug Description

Hello,

I was trying to migrate a model from one Juju 2.9.21 controller to another of the same version, just on a different series (Xenial -> Focal, both MAAS-based).

Unfortunately, during this process one agent appeared to keep getting hung up for some reason:

2022-01-12 22:30:16 INFO juju.worker.migrationmaster.2b11a1 worker.go:295 quiescing, waiting for agents to report back
2022-01-12 22:30:19 INFO juju.worker.migrationmaster.2b11a1 worker.go:729 waiting for agents to report back for migration phase QUIESCE (will wait up to 15m0s)
2022-01-12 22:30:49 INFO juju.worker.migrationmaster.2b11a1 worker.go:295 quiescing, waiting for agents to report back: 967 succeeded, 1 still to report
[...repeats...]

I was able to then find this message on one of the logsink.log files on the source controller:

<UUID>: machine-2 2022-01-12 22:45:16 ERROR juju.worker.migrationmaster.2b11a1 worker.go:749 1 agents failed to report in time for "quiescing" phase (including machines: 125)

Reviewing the units on that machine, I saw that all of the "units" had pairs of log messages like this:

2022-01-12 23:24:08 INFO juju.worker.migrationminion worker.go:140 migration phase is now: QUIESCE
2022-01-12 23:24:08 DEBUG juju.worker.migrationminion worker.go:257 reporting back for phase QUIESCE: true
[...repeats...]

While the machine-125 log has this, and nothing after:

2022-01-12 23:09:44 INFO juju.worker.migrationminion worker.go:140 migration phase is now: QUIESCE

...And that remains true hours later, as I write this bug.

As this was encountered on a managed cloud, I may not be able to upload all information publicly. Please let me know what you need to help investigate this and I will post here if I reasonably can, or through internal channels otherwise.

Best Regards,
Paul Goins

Paul Goins (vultaire) wrote :

I have provided sosreports through internal channels.

Ian Booth (wallyworld) wrote :

The logs show the migration is failing because the model import to the target controller exits with an error. See the target model logs:

2022-01-12 23:51:39 ERROR juju.worker.migrationmaster.a5c742 worker.go:295 model data transfer failed, failed to import model into target controller: applications: application 0: application offer 0: application offer v2 schema check failed: acl: expected map, got nothing

Any exported application offers are required to have an "acl" attribute which is a map of who can access the offer and their permission. If that's missing, the import validation fails.
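
The check described above can be approximated offline before attempting a migration. The following is a minimal sketch, not Juju's actual validation code: it assumes the output of "juju dump-model" has already been parsed into nested dicts and lists, and that offers sit under applications -> applications -> offers -> offers as in the snippets in this report. The function name offers_missing_acl is hypothetical.

```python
from typing import Any, Dict, Iterator, Tuple


def offers_missing_acl(model: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield (application name, offer name) for each offer whose "acl" is not a map.

    Assumed layout (taken from the dump-model snippets in this report):
    model["applications"]["applications"] is a list of applications, each
    optionally holding {"offers": {"offers": [...], "version": 2}}.
    """
    apps = (model.get("applications") or {}).get("applications") or []
    for app in apps:
        offers = (app.get("offers") or {}).get("offers") or []
        for offer in offers:
            # The import error reads "acl: expected map, got nothing",
            # so a missing or non-map value is what gets flagged here.
            if not isinstance(offer.get("acl"), dict):
                yield app.get("name", "?"), offer.get("offer-name", "?")


# Hand-written sample data standing in for a parsed dump-model document.
model = {
    "applications": {
        "applications": [
            {
                "name": "mariadb",
                "offers": {
                    "version": 2,
                    "offers": [
                        {
                            "offer-name": "mariadb",
                            "acl": {"admin": "admin", "everyone@external": "read"},
                        }
                    ],
                },
            },
            {
                "name": "grafana",
                "offers": {"version": 2, "offers": [{"offer-name": "grafana"}]},
            },
        ]
    }
}

print(list(offers_missing_acl(model)))  # [('grafana', 'grafana')]
```

Any pair it reports is an offer that would be expected to trip the "acl: expected map, got nothing" import error.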

TL;DR: when a migration fails, the operation is aborted, and the model is restored on the source controller, the offer permissions appear to get messed up, which causes any future migration attempt to fail (more detail below).

The original cause of the very first migration failure could have been anything, such as a busy agent not quiescing or a unit hook error; the point is that the system gets wedged if that happens.

The reason the restore after a failure goes wrong is that the offer permissions do not appear to get migrated correctly in the first place.

Here's a snippet of a valid model export with an offer:

$ export JUJU_DEV_FEATURE_FLAGS=developer-mode
$ juju dump-model
...
applications:
  applications:
  - charm-url: ch:amd64/trusty/mariadb-7
    name: mariadb
    offers:
      offers:
      - acl:
          admin: admin
          everyone@external: read
        application-description: |
          MariaDB is an open source database server.
        application-name: mariadb
        endpoints:
          db: db
        offer-name: mariadb
        offer-uuid: d6ba3c46-4b01-4f10-838f-74c678ada51a
      version: 2
...

You can see that there are 2 grants that are done out of the box.

I migrated a test model with an offer and noticed some weird behaviour.
After the migration, doing a dump-model shows that there is no longer any "acl" block in the yaml.

And looking at the permissions collection shows that the offer permissions are indeed missing.

The result of the above is that attempting to migrate the model again fails with the observed error, as can be seen from this status snippet:

$ juju status
Model Controller Cloud/Region Version SLA Timestamp Notes
foo test aws/ap-southeast-2 2.9.24.1 unsupported 12:43:59+10:00 migrating: aborted, removing model from target controller: model data transfer failed, failed to import model into target controller: applications: application 0: application offer 0: application offer v2 schema check failed: acl: expected map, got nothing

So there's definitely a bug to fix here in migrating offers.

Changed in juju:
milestone: none → 2.9.24
status: New → Triaged
importance: Undecided → Critical
summary: - Unable to unblock model migration; 1 agent refusing to report back
+ Offer permissions are not migrated
Ian Booth (wallyworld) wrote :

I've renamed the bug to reflect the core issue which is causing the migration to fail, as reflected in the logs.

Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Ian Booth (wallyworld) wrote :

Another issue is that when a model is migrated, the offer permissions are not removed from the source model.

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9.24 → 2.9.25
Changed in juju:
status: Fix Committed → Fix Released
Paul Goins (vultaire) wrote :

Excellent to see this is repaired in newer versions.

I've hit another cloud where it seems I'm having this exact issue again. Other engineers have attempted migrations, hit failures, etc. And it may have been on a pre-2.9.25 version of Juju. We're now on 2.9.42, but we have a model which is lacking the ACL fields.

Appropriate snippet of juju dump-model:

```
    name: grafana
    offers:
      offers:
      - application-description: |
          Grafana is the leading graph and dashboard builder for visualizing
          time series metrics.
        application-name: grafana
        endpoints:
          dashboards: dashboards
        offer-name: grafana
        offer-uuid: <REDACTED>
      version: 2
```

All other offers in this model have an acl entry; just this one does not. And I'm hitting the same error on attempting to migrate this model to a new controller: "model data transfer failed, failed to import model into target controller: applications: application 3: application offer 0: application offer v2 schema check failed: acl: expected map, got nothing"

Is there a suggested workaround for environments where this issue has already occurred? Should the CMR be broken/removed and the offer recreated from scratch, for example?

Paul Goins (vultaire) wrote :

Based upon discussion with jameinel and testing on the environment in question, it looks like the ACL can be manually restored with the juju grant command, e.g.:

juju grant admin admin admin/lma.grafana
juju grant everyone@external read admin/lma.grafana

After that change, the ACL does show up in "juju dump-model", and the migration does seem to get further. I'm still hitting migration issues here, but I'm going to consider it a different issue at this point; I'll leave the above comment for future travelers in case it helps.
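
For environments with several affected offers, the grant lines above could be generated rather than typed by hand. This is only a sketch under assumptions: the helper name grant_commands is hypothetical, the ACL shown is the out-of-the-box pair from the valid export earlier in this report, and the offer URL must be replaced with the real one. It prints the commands for review and does not run them.

```python
import shlex
from typing import Dict, List


def grant_commands(offer_url: str, acl: Dict[str, str]) -> List[str]:
    """Build one `juju grant <user> <access> <offer-url>` line per ACL entry."""
    return [
        " ".join(
            shlex.quote(part)
            for part in ("juju", "grant", user, access, offer_url)
        )
        for user, access in acl.items()
    ]


# Default grants as seen in the valid export above; adjust to the ACL
# the offer is actually supposed to have.
default_acl = {"admin": "admin", "everyone@external": "read"}

for line in grant_commands("admin/lma.grafana", default_acl):
    print(line)
```

Review the printed lines before running them, and confirm the result with "juju dump-model" as described above.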
