migrating: aborted, removing model from target controller

Bug #1915511 reported by Aymen Frikha
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

Trying to migrate a model from one controller to another in order to do a series upgrade from xenial to bionic. The migration didn't succeed and reported this status: migrating: aborted, removing model from target controller: validating, some agents reported failure

Using juju 2.7.8 to do this migration.

Here are the logs from the destination controller: https://pastebin.canonical.com/p/JDCqY8CSNK/

Pen Gale (pengale) wrote :

Grepping for "trace" and "error" in the logs, I see a lot of messages about network connections being refused. They're all connections to mongo on localhost, though.

Is there a problem w/ disk space or some super strict firewall rules on the controller? It's really unhappy about something, but it's not clear from the logs what the underlying issue is.

Aymen Frikha (aym-frikha) wrote :

No Peter, I have 100G of disk space and don't have any firewall rules. I created the new controller using this command: juju bootstrap --bootstrap-series=bionic --bootstrap-constraints "tags=new-juju" maas_cloud new_maas_cloud_controller
I then upgraded the agent version from 2.7.6 to 2.7.8 to be on the same version as the original controller.
Then I ran the migration with this command: juju migrate default new_maas_cloud_controller

I also tested the workaround in this bug: https://bugs.launchpad.net/juju/+bug/1882827

but had the same issue

Ian Booth (wallyworld) wrote :

Can we get logs from the source controller as well?

Aymen Frikha (aym-frikha) wrote :

Attached are the logs for source controller 0

Ian Booth (wallyworld) wrote :

The logs from the source model say that one unit, nrpe/42, failed to validate after the migration was done, so the migration was rolled back. The rollback also failed.

2021-02-12 01:16:58 ERROR juju.worker.migrationmaster.87bbf5 worker.go:734 agents failed phase "validating" (units: nrpe/42)
2021-02-12 01:16:58 ERROR juju.worker.migrationmaster.87bbf5 worker.go:286 validating, some agents reported failure
2021-02-12 01:17:06 WARNING juju.worker.migrationmaster.87bbf5 worker.go:626 failed to remove model from target controller, cannot log in: context deadline exceeded

There's nothing in the destination controller logs that show why nrpe/42 was unhappy. But some logs may be missing. The pastebin for the destination controller ends at "2021-02-12 01:12:26" whereas the source controller logs show migration occurred around "2021-02-12 01:16:58". So it's hard to know what happened. Is nrpe/42 functioning normally?

It would also be good to try this on 2.8.8 since some migration issues have been fixed since 2.7.8 was released.
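With a large model it can help to pull the failing agents out of the migrationmaster log mechanically. The following is a minimal Python sketch that extracts the phase and unit names from a line like the one quoted above. The function name failed_units is hypothetical, and the assumption that several units would appear comma-separated after "units:" is a guess based on the single example in this thread, not confirmed Juju behaviour.

```python
import re

# Matches: agents failed phase "validating" (units: nrpe/42)
PHASE_FAILURE = re.compile(
    r'agents failed phase "(?P<phase>[^"]+)" \((?P<detail>[^)]*)\)'
)
# Assumption: unit names are listed after "units:", separated by commas.
UNITS = re.compile(r"units: (?P<units>[^;]+)")


def failed_units(log_line):
    """Return (phase, [unit names]) or None if the line is not a phase failure."""
    match = PHASE_FAILURE.search(log_line)
    if match is None:
        return None
    units = UNITS.search(match.group("detail"))
    names = []
    if units:
        names = [name.strip() for name in units.group("units").split(",")]
    return match.group("phase"), names


if __name__ == "__main__":
    line = (
        '2021-02-12 01:16:58 ERROR juju.worker.migrationmaster.87bbf5 '
        'worker.go:734 agents failed phase "validating" (units: nrpe/42)'
    )
    print(failed_units(line))
```

The names it returns tell you which unit agent logs to look at first on the target side.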

Aymen Frikha (aym-frikha) wrote :

I tried to migrate the model to a 2.8.8 controller.
Here is the juju status output: https://pastebin.canonical.com/p/ny93JWhmtV/

I will also attach the destination controller logs

Aymen Frikha (aym-frikha) wrote :
Aymen Frikha (aym-frikha) wrote :

Here are the source controller logs

Ian Booth (wallyworld) wrote :

The status from comment #6 shows the reason for aborting was "unit nrpe/23 not idle or executing".

Migration requires that there be no activity in the model, i.e. no running hooks or actions. The status shows nrpe/23 as idle now, so it may just be a case of trying again now that the model has settled down.

Ian Booth (wallyworld) wrote :

Online conversation also revealed some units had hook errors.
Migration will only work if the model is idle and there are no unresolved hook errors. The checks are done on the source model and again once the model is on the target controller.
Migration will be aborted if there are any such errors.
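These prerequisites can be checked before starting a migration. The sketch below, in Python, walks the parsed output of juju status --format=json and lists units that are not idle or that are in an error state. It assumes the usual layout of that JSON (applications, then units, each with "workload-status", "juju-status" and optional "subordinates"); the function name blocking_units and the sample data are made up for illustration, and this is not the check Juju itself performs.

```python
def blocking_units(status):
    """Return {unit name: reason} for units likely to block a migration.

    `status` is the dict parsed from `juju status --format=json`.
    Assumed layout: applications -> units -> workload-status / juju-status,
    with subordinate units nested under "subordinates".
    """
    problems = {}

    def check(name, unit):
        workload = unit.get("workload-status", {}).get("current", "unknown")
        agent = unit.get("juju-status", {}).get("current", "unknown")
        if workload == "error":
            problems[name] = "unresolved hook error"
        elif agent != "idle":
            problems[name] = f"agent is {agent}, not idle"
        # Subordinates such as nrpe are listed under their principal unit.
        for sub_name, sub in unit.get("subordinates", {}).items():
            check(sub_name, sub)

    for app in status.get("applications", {}).values():
        for name, unit in app.get("units", {}).items():
            check(name, unit)
    return problems


if __name__ == "__main__":
    # Made-up sample shaped like the assumed JSON layout.
    sample = {
        "applications": {
            "mysql": {
                "units": {
                    "mysql/0": {
                        "workload-status": {"current": "active"},
                        "juju-status": {"current": "idle"},
                        "subordinates": {
                            "nrpe/23": {
                                "workload-status": {"current": "active"},
                                "juju-status": {"current": "executing"},
                            }
                        },
                    }
                }
            }
        }
    }
    print(blocking_units(sample))
```

In practice the input would come from something like json.loads of the output of juju status -m default --format=json. An empty result only says the model looked settled at that moment.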

Aymen Frikha (aym-frikha) wrote :

Thanks Ian for the explanation, but I think this needs to be added to the "For migration to work" section of this documentation: https://juju.is/docs/migrating-models

I also tried again with all the units active/idle (green), but still have this issue. Here are the logs of one of the failed agents: https://pastebin.canonical.com/p/52JfN42mBz/

Ian Booth (wallyworld) wrote :

The source controller log shows this error:

11:06:42 ERROR juju.worker.migrationminion worker.go:197 validation failed: failed to open API to target controller: try again (try again)
2021-02-19 11:06:42 INFO juju.worker.migrationminion worker.go:139 migration phase is now: ABORT

One of the migrated machine or unit agents is trying to check that things went OK by first connecting to the target controller, and the API connection could not be opened. Is there anything in the logs on the target side, i.e. the controller or any of the machine/unit agent logs? What does show-model say for the model which failed to migrate?

We can also get the doc updated to add to the prerequisite checks that need to pass.

Aymen Frikha (aym-frikha) wrote :

I ran another test, Ian, and got these logs. I was not able to identify the root cause just from the logs:

- Destination controller: https://pastebin.canonical.com/p/Rt7TTyypDp/
- Source controller (only the logs after starting the migration): https://pastebin.canonical.com/p/zB5Kn76cMj/
- One of the failed units: https://pastebin.canonical.com/p/bJdJNWnHJW/
- show-model output: https://pastebin.canonical.com/p/fbYZXYzdVQ/

Drew Freiberger (afreiberger) wrote :

This looks like agents can't reach the new controller. Have you ensured that the new controller is on the same network(s) as the original controller and reachable/pingable from all units within the network?

Oddly, the logs don't seem to show which API address is being attempted for the "target controller". The other possibility is that the "connected to" upstream API addresses are all correct and what you actually have is a hung new controller that is not willing to provide data to the endpoints.

2021-02-22 08:06:30 ERROR juju.worker.migrationminion worker.go:197 validation failed: failed to open API to target controller: try again (try again)

Ramon Grullon (rgrullon) wrote :

Adding juju status -m output for the old and new controllers:

Old controller - https://pastebin.canonical.com/p/5cQb5mckGB/
New controller - https://pastebin.canonical.com/p/VkbpwfTYPc/

Ian Booth (wallyworld) wrote :

The logs from the failed unit in comment #13 confirm the problem.

The issue is as per https://bugs.launchpad.net/juju/+bug/1882827 which we discussed earlier.
This can be seen in this log line:

ERROR juju.worker.migrationminion worker.go:197 validation failed: failed to open API to target controller: try again (try again)

The "try again" error is what happens when an agent connection gets rejected because rate limiting has been applied by the controller.

As explained in the bug, you can adjust the agent-ratelimit-max and agent-ratelimit-rate controller config.

As suggested earlier, you should also upgrade the source controller to 2.8.8. The above bug has a fix which landed in 2.8 and makes the agents a little more tolerant of a busy controller during the validation phase.
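To get a feel for how the two settings interact, here is a rough Python sketch. It assumes they behave like a standard token bucket: agent-ratelimit-max is the bucket size and agent-ratelimit-rate is the time taken to add one token. It also assumes every agent tries to connect at the same moment after the migration, which is a simplification. The function name burst_estimate and the figure of 60 machine agents are hypothetical; the thread only states that the model has 167 units.

```python
def burst_estimate(agents, bucket_max, refill_seconds):
    """Estimate the effect of a connection burst on a token-bucket limiter.

    Returns (admitted_immediately, rejected, drain_seconds), where
    drain_seconds is roughly how long the rejected agents need to get in
    if one token is added every `refill_seconds`.
    """
    admitted_immediately = min(agents, bucket_max)
    rejected = agents - admitted_immediately
    # Each rejected agent needs one freshly added token.
    drain_seconds = rejected * refill_seconds
    return admitted_immediately, rejected, drain_seconds


if __name__ == "__main__":
    units = 167      # from this bug report
    machines = 60    # hypothetical, not stated in the thread
    agents = units + machines

    # Bucket of 200, one new token per minute.
    print(burst_estimate(agents, bucket_max=200, refill_seconds=60))
    # Bucket of 500, one new token every 10 seconds.
    print(burst_estimate(agents, bucket_max=500, refill_seconds=10))
```

Under these assumptions, a bucket smaller than the total number of agents leaves some of them receiving "try again" until enough tokens have been added, and a slow refill stretches that out. A bucket larger than the agent count lets the whole burst through at once.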

Ian Booth (wallyworld) wrote :

To clarify, the source model should be running 2.8.8

Ramon Grullon (rgrullon) wrote :

Running juju 2.8.9 on both the source (xenial) and destination (bionic) controllers, the migration fails with the following error: 'migrating: aborted, removing model from target controller: validating, some agents reported failure'.

migration-start: 12 minutes ago

The model has 167 units.

The destination controller is configured with the following settings:

juju controller-config -c new_maas_cloud_controller agent-ratelimit-rate=1m
juju controller-config -c new_maas_cloud_controller agent-ratelimit-max=200

All of the units are currently in the active/idle state.

Ramon Grullon (rgrullon) wrote :

With the following configuration on the 167-unit model, I got a successful model migration:

- Units: 167
- Source controller: Xenial
- Destination controller: Bionic
- agent-ratelimit-max: 500
- agent-ratelimit-rate: 10s
- Time for the model to stabilize after migration: 45m (when the migration occurs, all agents are lost initially)

John A Meinel (jameinel) wrote :

I feel like we added code to get the migration to work (setting the rate at which they try to connect), and there isn't much to follow up on this bug.

Changed in juju:
status: New → Triaged
importance: Undecided → High
status: Triaged → Fix Released