Migration from 2.9 to 3.3 caused unit agent to stop working
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Incomplete | High | Unassigned |
Bug Description
Juju Version: migration from 2.9.46 to 3.3.3
Cloud Type: LXD
Reproduction steps: Unclear; this may happen during migration and the subsequent model upgrade.
Description:
During migration from a 2.9.46 controller to a 3.3.3 controller, one of the models that migrated without errors has a unit which reports its agent as lost. On inspection, the container is still running and the agent binaries have been upgraded to the correct 3.3.3 versions, but:
1. There are no links for hooks in /var/lib/
2. The unit logs report that the model does not exist
3. The unit logs stopped several days ago with a complaint that the unit was unauthorized to access the controller
4. Originally, the agent.conf file for the unit agent reported the wrong IP address for the controller, but it has since changed to the correct IP (see the check sketched below).
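A quick way to confirm which controller addresses the unit agent is currently using (a sketch only; the path assumes the default Juju agent directory and the polkadot/0 unit named in the logs below):

    # run inside the affected container; agent.conf is YAML and apiaddresses is a list
    grep -A 3 'apiaddresses' /var/lib/juju/agents/unit-polkadot-0/agent.conf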
From the unit logs:
2024-03-13 09:55:15 INFO juju unit_agent.go:289 Starting unit workers for "polkadot/0"
2024-03-13 09:55:15 ERROR juju.worker.apicaller connect.go:209 Failed to connect to controller: invalid entity name or password (unauthorized access)
We have seen this randomly in various scenarios and have not yet been able to isolate the cause. There would also be corresponding logs on the controller side with more info - not sure if those are available or not.
Ideally, we'd like to be able to reproduce this so we can diagnose further, but have not had much luck.
Can you provide some more detail on your setup:
- How was the controller configured; was it LXD on MAAS? Can we get a juju status --format yaml?
- What was deployed to the model being migrated; can you share the bundle that was used?
- For large models, we recommend tweaking the controller API request throttle parameters (example sketched after this list); was this done here?
- How often does the agent-lost issue happen? One or several units per model?
- What do the source and target controller logs show for the relevant period? Is it possible to set the logging config on the source and target controllers to include "#migration=DEBUG" so we can get extra logs if it fails? (See the sketch after this list.)
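For reference, a rough sketch of the suggestions above. This assumes the agent-ratelimit-* controller keys are the API request throttle parameters meant here and that the controller model is reachable as "controller"; the values shown are illustrative, not recommendations:

    # capture the requested status output
    juju status --format yaml

    # bump the agent connection throttle on the target controller (illustrative values)
    juju controller-config agent-ratelimit-max=50 agent-ratelimit-rate=25ms

    # enable extra migration logging on both the source and target controllers
    juju model-config -m controller logging-config="<root>=INFO;#migration=DEBUG"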
If we can reproduce this, we can look at the logs ourselves, etc.