Migration from 2.9 to 3.3 caused unit agent to stop working

Bug #2057928 reported by Adrian Wennström

Affects: Canonical Juju
Status: Incomplete
Importance: High
Assigned to: Unassigned

Bug Description

Juju Version: migration from 2.9.46 to 3.3.3
Cloud Type: LXD
Reproduction steps: Unclear, may happen during migration and subsequent model upgrade.

Description:

During migration from a 2.9.46 controller to a 3.3.3 controller, one of the models that migrated without errors has a unit which reports "agent lost". On inspection, the container is still running and the agent binaries have been upgraded to the correct 3.3.3 version, but:

1. There are no links for hooks in /var/lib/juju/tools/3.3.3-ubuntu-amd64 (see the check sketched after this list).
2. The unit logs report that the model does not exist.
3. The unit logs stopped several days ago with a complaint that the unit was unauthorized to access the controller.
4. Originally the agent.conf file for the unit agent listed the wrong controller IP address, but it has since been updated to the correct one.
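For reference, on a healthy unit the versioned tools directory normally contains the jujud binary plus a symlink for each hook tool pointing back at jujud, so a quick check is something like the following (a sketch; the expected listing is an assumption based on the default layout):

ls -l /var/lib/juju/tools/3.3.3-ubuntu-amd64/
# expect jujud plus hook-tool symlinks such as juju-log, relation-get, status-set -> jujud
# on the affected unit, those hook-tool links are missing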

Revision history for this message
Ian Booth (wallyworld) wrote :

From the unit logs

2024-03-13 09:55:15 INFO juju unit_agent.go:289 Starting unit workers for "polkadot/0"
2024-03-13 09:55:15 ERROR juju.worker.apicaller connect.go:209 Failed to connect to controller: invalid entity name or password (unauthorized access)

We have seen this randomly in various scenarios and have not yet been able to isolate the cause. There would also be corresponding logs on the controller side with more info - not sure if those are available or not.

Ideally, we'd like to be able to reproduce this so we can diagnose further, but have not had much luck.

Can you provide some more detail on your setup:

- how was the controller configured; was it LXD on MAAS? Can we get a juju status --format yaml?
- what was deployed to the model being migrated; can you share the bundle that was used?
- for large models, we recommend tweaking the controller API request throttle parameters; was this done here? (a sketch of how these can be set follows this list)
- how often does the "agent lost" issue occur? One unit or several per model?
- what do the source and target controller logs show for the relevant period? Is it possible to set logging config on the source and target controllers to include "#migration=DEBUG" so we can get extra logs if it fails?
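For reference, a minimal sketch of how the extra migration logging and a throttle parameter pair could be set (whether agent-ratelimit-max/agent-ratelimit-rate are the throttle keys meant here is an assumption, and the values are illustrative, not tuned recommendations):

# extra migration logging on the source and target controllers
juju model-config -m controller logging-config="<root>=INFO;#migration=DEBUG"

# controller connection throttle parameters (illustrative values)
juju controller-config agent-ratelimit-max=100 agent-ratelimit-rate=100ms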

If we can reproduce we can look at the logs ourselves etc.

Changed in juju:
status: New → Incomplete
importance: Undecided → High
Revision history for this message
Ian Booth (wallyworld) wrote :

I'll mark as Incomplete pending further info....

Revision history for this message
Adrian Wennström (awnns) wrote :

To answer some of the questions:

1. (Controller Configuration) The controllers are bootstrapped on manually provisioned lxd instances. They manage a number of lxd clouds. No MAAS involved.
2. (Migrated Model) I'll enclose the yaml.
3. (Large Models) None of the models are particularly large. For various reasons we have a large number of very small models with 2 or 3 units. No tweaking of anything was done.
4. (Prevalence) It has only happened to one unit per model to date. Out of the migrated models, we have found 9 such models (out of around 360).
5. (Controller logs) I'm having some trouble knowing exactly what you are looking for. I've enclosed the machine-0.log entries from the controller, grepped for the date and hour in question (a sketch of that kind of filter follows this list). The model UUID is 8b9d2a87-6f57-43b8-843d-cf9eda140e3a. I'll try to enable the "#migration=DEBUG" logging for the one remaining controller we have to migrate, and we'll see what happens.
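A sketch of that kind of filter (the date, path and model UUID come from this comment; the short UUID prefix matches the migrationmaster worker names seen in the controller log, and the exact expression is illustrative):

grep '2024-03-13 09' /var/log/juju/machine-0.log | grep -e '8b9d2a87-6f57-43b8-843d-cf9eda140e3a' -e 'migrationmaster.8b9d2a'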

Revision history for this message
Ian Booth (wallyworld) wrote :

Thanks for the info. There's a logsink.log file in /var/log/juju, which is the firehose for that controller. For HA, there will be one on each controller machine.

The other way of getting what's needed is to run: juju debug-log -m controller --replay > controller.log

Both of the above, if possible, would be good.

Revision history for this message
Ian Booth (wallyworld) wrote :

There's a lot to unpack here, and several questions.

TL;DR: there's a fatal error with one model - deleted user issue.
There's also a missing charm / offer issue.

These would affect those particular models; the other models should have migrated OK. I don't see anything that explains the unit connection issue.

Looking at source controller logs, it seems many, many model migrations are being initiated all at once.
Lots of these type of logs:

2024-03-13 09:29:47 INFO juju.worker.migrationmaster.e3a33a worker.go:300 performing source prechecks
2024-03-13 09:29:47 INFO juju.worker.migrationmaster.e3a33a worker.go:300 performing target prechecks
2024-03-13 09:29:47 INFO juju.worker.migrationmaster.e3a33a worker.go:264 setting migration phase to IMPORT
2024-03-13 09:29:47 INFO juju.worker.migrationmaster.e3a33a worker.go:300 exporting model
2024-03-13 09:29:47 INFO juju.worker.migrationmaster.e3a33a worker.go:300 importing model into target controller
2024-03-13 09:29:47 INFO juju.api apiclient.go:686 connection established to "wss://192.168.208.49:17070/api"

We see one of them failed:

2024-03-13 09:29:36 INFO juju.worker.migrationmaster.4fb3a1 worker.go:264 setting migration phase to IMPORT
2024-03-13 09:29:36 INFO juju.worker.migrationmaster.4fb3a1 worker.go:300 exporting model
2024-03-13 09:29:36 INFO juju.worker.migrationmaster.4fb3a1 worker.go:300 importing model into target controller
2024-03-13 09:29:36 INFO juju.api apiclient.go:686 connection established to "wss://192.168.208.49:17070/api"
2024-03-13 09:29:36 ERROR juju.worker.migrationmaster.4fb3a1 worker.go:300 model data transfer failed, failed to import model into target controller: granting admin permission to the owner: user "ankan14" is permanently deleted
2024-03-13 09:29:36 INFO juju.worker.migrationmaster.4fb3a1 worker.go:264 setting migration phase to ABORT
2024-03-13 09:29:36 INFO juju.worker.migrationmaster.4fb3a1 worker.go:300 aborted, removing model from target controller: model data transfer failed, failed to import model into target controller: granting admin permission to the owner: user "ankan14" is permanently deleted

This should have also appeared in status on the model after the failure.

So issue #1 - Can you confirm that the "ankan14" user was deleted? Juju should have checked this.

Issue #2 is this error

2024-03-13 09:36:16 ERROR juju.worker.dependency engine.go:695 "environ-tracker" manifold worker returned unexpected error: model "api-starknet-mainnet-20" (ffdd553c-967b-4186-8fc8-1db85f0af57c): reading model config: model "ffdd553c-967b-4186-8fc8-1db85f0af57c": settings not found (not found)
2024-03-13 09:36:16 ERROR juju.worker.dependency engine.go:695 "log-forwarder" manifold worker returned unexpected error: model "ffdd553c-967b-4186-8fc8-1db85f0af57c": settings not found (not found)
2024-03-13 09:36:16 INFO juju.worker.migrationmaster.ffdd55 worker.go:264 setting migration phase to DONE
2024-03-13 09:36:16 ERROR juju.worker.dependency engine.go:695 "migration-master" manifold worker returned unexpected error: failed to set phase: could not get migration: model "ffdd553c-967b-4186-8fc8-1db85f0af57c" not found (not found)
2024-03-13 09:36:16 INFO juju.wo...


Revision history for this message
Ian Booth (wallyworld) wrote :

Also, we'd need logs from the affected unit agent - the one that couldn't connect. Plus the logsink firehose.
This will be like looking for a needle in a haystack, so we might not be able to get to a root cause very easily.
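For reference, the usual locations for those logs (paths assume the default layout; the unit log filename is inferred from the unit name in the log excerpt above):

/var/log/juju/unit-polkadot-0.log   (on the machine hosting the affected unit)
/var/log/juju/logsink.log           (on each controller machine; the firehose mentioned earlier)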
