Migration from 2.9 to 3.3 caused unit agent to stop working

Bug #2057928 reported by Adrian Wennström

Affects: Canonical Juju
Status: Incomplete
Importance: High
Assigned to: Unassigned

Bug Description

Juju Version: migration from 2.9.46 to 3.3.3
Cloud Type: LXD
Reproduction steps: Unclear, may happen during migration and subsequent model upgrade.

Description:

During migration from a 2.9.46 controller to a 3.3.3 controller, one of the models that migrated without errors has a unit which reports "agent lost". On inspection, the container is still running and the agent binaries have been upgraded to the correct 3.3.3 version, but:

1. There are no links for hooks in /var/lib/juju/tools/3.3.3-ubuntu-amd64 (see the check sketched after this list).
2. The unit logs report that the model does not exist.
3. The unit logs stopped several days ago with a complaint that the unit was unauthorized to access the controller.
4. Originally the agent.conf file for the unit agent listed the wrong controller IP address, but it has since been updated to the correct one.
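For reference, on a healthy unit the versioned tools directory normally contains the jujud binary plus a symlink for each hook tool pointing back at jujud, so a quick check is something like the following (a sketch; the expected listing is an assumption based on the default layout):

ls -l /var/lib/juju/tools/3.3.3-ubuntu-amd64/
# expect jujud plus hook-tool symlinks such as juju-log, relation-get, status-set -> jujud
# on the affected unit, those hook-tool links are missing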

Revision history for this message
Ian Booth (wallyworld) wrote :

From the unit logs

2024-03-13 09:55:15 INFO juju unit_agent.go:289 Starting unit workers for "polkadot/0"
2024-03-13 09:55:15 ERROR juju.worker.apicaller connect.go:209 Failed to connect to controller: invalid entity name or password (unauthorized access)

We have seen this randomly in various scenarios and have not yet been able to isolate the cause. There would also be corresponding logs on the controller side with more info - not sure if those are available or not.

Ideally, we'd like to be able to reproduce this so we can diagnose further, but have not had much luck.

Can you provide some more detail on your setup:

- how was the controller configured; was it LXD on MAAS? Can we get a juju status --format yaml?
- what was deployed to the model being migrated; can you share the bundle that was used?
- for large models, we recommend tweaking the controller API request throttle parameters; was this done here? (a sketch of how these can be set follows this list)
- how often does the "agent lost" issue occur? One unit or several per model?
- what do the source and target controller logs show for the relevant period? Is it possible to set logging config on the source and target controllers to include "#migration=DEBUG" so we can get extra logs if it fails?
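For reference, a minimal sketch of how the extra migration logging and a throttle parameter pair could be set (whether agent-ratelimit-max/agent-ratelimit-rate are the throttle keys meant here is an assumption, and the values are illustrative, not tuned recommendations):

# extra migration logging on the source and target controllers
juju model-config -m controller logging-config="<root>=INFO;#migration=DEBUG"

# controller connection throttle parameters (illustrative values)
juju controller-config agent-ratelimit-max=100 agent-ratelimit-rate=100ms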

If we can reproduce we can look at the logs ourselves etc.

Changed in juju:
status: New → Incomplete
importance: Undecided → High
Revision history for this message
Ian Booth (wallyworld) wrote :

I'll mark as Incomplete pending further info....

Revision history for this message
Adrian Wennström (awnns) wrote :

To answer some of the questions:

1. (Controller Configuration) The controllers are bootstrapped on manually provisioned lxd instances. They manage a number of lxd clouds. No MAAS involved.
2. (Migrated Model) I'll enclose the yaml.
3. (Large Models) None of the models are particularly large. For various reasons we have a large number of very small models with 2 or 3 units. No tweaking of anything was done.
4. (Prevalence) It has only happened to one unit per model to date. Out of the migrated models, we have found 9 such models (out of around 360).
5. (Controller logs) I'm having some trouble knowing exactly what you are looking for. I've enclosed the machine-0.log entries from the controller, grepped for the date and hour in question (a sketch of that kind of filter follows this list). The model UUID is 8b9d2a87-6f57-43b8-843d-cf9eda140e3a. I'll try to enable the "#migration=DEBUG" logging for the one remaining controller we have to migrate, and we'll see what happens.
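A sketch of that kind of filter (the date, path and model UUID come from this comment; the short UUID prefix matches the migrationmaster worker names seen in the controller log, and the exact expression is illustrative):

grep '2024-03-13 09' /var/log/juju/machine-0.log | grep -e '8b9d2a87-6f57-43b8-843d-cf9eda140e3a' -e 'migrationmaster.8b9d2a'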

Revision history for this message
Ian Booth (wallyworld) wrote :

Thanks for the info. There's a logsink.log file in /var/log/juju, which is the firehose for that controller. For HA, there will be one on each controller machine.

The other way of getting what's needed is to run: juju debug-log -m controller --replay > controller.log

Both of the above, if possible, would be good.

Revision history for this message
Ian Booth (wallyworld) wrote :

There's a lot to unpack here, and several questions.

TL;DR: there's a fatal error with one model - deleted user issue.
There's also a missing charm / offer issue.

These would affect those particular models; the other models should have migrated OK. I don't see anything that explains the unit connection issue.

Looking at source controller logs, it seems many, many model migrations are being initiated all at once.
Lots of these type of logs:

2024-03-13 09:29:47 INFO juju.worker.migrationmaster.e3a33a worker.go:300 performing source prechecks
2024-03-13 09:29:47 INFO juju.worker.migrationmaster.e3a33a worker.go:300 performing target prechecks
2024-03-13 09:29:47 INFO juju.worker.migrationmaster.e3a33a worker.go:264 setting migration phase to IMPORT
2024-03-13 09:29:47 INFO juju.worker.migrationmaster.e3a33a worker.go:300 exporting model
2024-03-13 09:29:47 INFO juju.worker.migrationmaster.e3a33a worker.go:300 importing model into target controller
2024-03-13 09:29:47 INFO juju.api apiclient.go:686 connection established to "wss://192.168.208.49:17070/api"

We see one of them failed:

2024-03-13 09:29:36 INFO juju.worker.migrationmaster.4fb3a1 worker.go:264 setting migration phase to IMPORT
2024-03-13 09:29:36 INFO juju.worker.migrationmaster.4fb3a1 worker.go:300 exporting model
2024-03-13 09:29:36 INFO juju.worker.migrationmaster.4fb3a1 worker.go:300 importing model into target controller
2024-03-13 09:29:36 INFO juju.api apiclient.go:686 connection established to "wss://192.168.208.49:17070/api"
2024-03-13 09:29:36 ERROR juju.worker.migrationmaster.4fb3a1 worker.go:300 model data transfer failed, failed to import model into target controller: granting admin permission to the owner: user "ankan14" is permanently deleted
2024-03-13 09:29:36 INFO juju.worker.migrationmaster.4fb3a1 worker.go:264 setting migration phase to ABORT
2024-03-13 09:29:36 INFO juju.worker.migrationmaster.4fb3a1 worker.go:300 aborted, removing model from target controller: model data transfer failed, failed to import model into target controller: granting admin permission to the owner: user "ankan14" is permanently deleted

This should have also appeared in status on the model after the failure.

So issue #1 - Can you confirm that the "ankan14" user was deleted? Juju should have checked this.

Issue #2 is this error

2024-03-13 09:36:16 ERROR juju.worker.dependency engine.go:695 "environ-tracker" manifold worker returned unexpected error: model "api-starknet-mainnet-20" (ffdd553c-967b-4186-8fc8-1db85f0af57c): reading model config: model "ffdd553c-967b-4186-8fc8-1db85f0af57c": settings not found (not found)
2024-03-13 09:36:16 ERROR juju.worker.dependency engine.go:695 "log-forwarder" manifold worker returned unexpected error: model "ffdd553c-967b-4186-8fc8-1db85f0af57c": settings not found (not found)
2024-03-13 09:36:16 INFO juju.worker.migrationmaster.ffdd55 worker.go:264 setting migration phase to DONE
2024-03-13 09:36:16 ERROR juju.worker.dependency engine.go:695 "migration-master" manifold worker returned unexpected error: failed to set phase: could not get migration: model "ffdd553c-967b-4186-8fc8-1db85f0af57c" not found (not found)
2024-03-13 09:36:16 INFO juju.wo...


Revision history for this message
Ian Booth (wallyworld) wrote :

Also, we'd need logs from the affected unit agent - the one that couldn't connect. Plus the logsink firehose.
This will be like looking for a needle in a haystack, so we might not be able to get to a root cause very easily.
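For reference, the usual locations for those logs (paths assume the default layout; the unit log filename is inferred from the unit name in the log excerpt above):

/var/log/juju/unit-polkadot-0.log   (on the machine hosting the affected unit)
/var/log/juju/logsink.log           (on each controller machine; the firehose mentioned earlier)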
