juju migrate from bionic to focal based controller fails at precheck

Bug #1942421 reported by Gábor Mészáros
This bug affects 1 person
Affects: juju
Status: Expired
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

Source controller version: 2.8.11
Destination controller version: 2.8.11

I have tried restarting the machine and unit daemons that the migration process was complaining about, but they just kept looping.
The log I see from machine-0:

2021-09-01 14:44:01 ERROR juju.apiserver.common resource.go:118 error stopping *state.docWatcher resource: txn watcher sync error
2021-09-01 14:44:01 ERROR juju.apiserver.common resource.go:118 error stopping *state.docWatcher resource: txn watcher sync error
2021-09-01 14:44:01 ERROR juju.apiserver.common resource.go:118 error stopping *state.docWatcher resource: txn watcher sync error
2021-09-01 14:44:01 ERROR juju.apiserver.common resource.go:118 error stopping *state.docWatcher resource: txn watcher sync error
2021-09-01 14:44:01 ERROR juju.apiserver.common resource.go:118 error stopping *state.notifyCollWatcher resource: txn watcher sync error
2021-09-01 14:44:01 ERROR juju.apiserver.common resource.go:118 error stopping *state.docWatcher resource: txn watcher sync error
2021-09-01 14:44:01 ERROR juju.apiserver.common resource.go:118 error stopping *state.docWatcher resource: txn watcher sync error
2021-09-01 14:44:01 ERROR juju.apiserver.common resource.go:118 error stopping *state.docWatcher resource: txn watcher sync error
2021-09-01 14:44:01 ERROR juju.apiserver.common resource.go:118 error stopping *uniter.subRelationsWatcher resource: txn watcher sync error
panic: close of closed channel

goroutine 2865647 [running]:
github.com/juju/juju/state.OpenStatePool.func3(0x4a27b6c, 0x8, 0x0, 0x0)
        /home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/state/pool.go:183 +0x2e
github.com/juju/pubsub.(*subscriber).loop(0xc0105d0f50)
        /home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/vendor/github.com/juju/pubsub/subscriber.go:78 +0x22c
created by github.com/juju/pubsub.newSubscriber
        /home/jenkins/workspace/BuildJuju-amd64/_build/src/github.com/juju/juju/vendor/github.com/juju/pubsub/subscriber.go:40 +0x139
2021-09-01 14:44:02 INFO juju.cmd supercommand.go:54 running jujud [2.8.11 0 5e99fae0eff8e18081a8f734eab4680378b08608 gc go1.14.15]
2021-09-01 14:44:02 DEBUG juju.cmd supercommand.go:55 args: []string{"/var/lib/juju/tools/machine-0/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "0", "--debug"}
2021-09-01 14:44:02 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 8
2021-09-01 14:44:02 DEBUG juju.agent agent.go:582 read agent config, format "2.0"
2021-09-01 14:44:02 INFO juju.cmd.jujud agent.go:139 setting logging config to "<root>=WARNING;unit=DEBUG"
2021-09-01 14:44:02 WARNING juju.mongo service.go:326 configuring mongod with --noauth flag enabled
2021-09-01 14:44:02 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [e022a0] "machine-0" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
2021-09-01 14:44:06 WARNING juju.cmd.jujud machine.go:880 determining kvm support: INFO: Your CPU does not support KVM extensions
KVM acceleration can NOT be used
: exit status 1
no kvm containers possible
2021-09-01 14:44:08 WARNING juju.core.raftlease store.go:285 command Command(ver: 1, op: setTime, old time: 0001-01-01 00:00:00 +0000 UTC, new time: 0001-01-01 00:00:01.00235858 +0000 UTC): clock update attempt by out-of-sync caller, retry
2021-09-01 14:44:11 WARNING juju.core.raftlease store.go:300 response timeout waiting for Command(ver: 1, op: claim, ns: singular-controller, model: e022a0, lease: 888d8769-3ce6-43c2-809e-597f35decf7a, holder: machine-0) to be processed
2021-09-01 14:46:55 ERROR juju.worker.migrationmaster.4278df worker.go:294 source prechecks failed: machine 0 agent not functioning at this time (down)
2021-09-01 14:47:00 WARNING juju.worker.migrationmaster.4278df worker.go:652 failed to remove model from target controller, model "0ea1d53a-5384-4225-8b94-80c3fc4278df" not found (not found)
2021-09-01 14:47:00 WARNING juju.core.raftlease store.go:285 command Command(ver: 1, op: claim, ns: application-leadership, model: 0ea1d5, lease: infinidat-tools, holder: infinidat-tools/17): lease already held
2021-09-01 14:47:00 WARNING juju.core.raftlease store.go:285 command Command(ver: 1, op: claim, ns: application-leadership, model: 0ea1d5, lease: landscape-client, holder: landscape-client/99): lease already held
2021-09-01 14:47:00 WARNING juju.core.raftlease store.go:285 command Command(ver: 1, op: claim, ns: application-leadership, model: 0ea1d5, lease: telegraf, holder: telegraf/33): lease already held
2021-09-01 14:47:00 WARNING juju.core.raftlease store.go:285 command Command(ver: 1, op: claim, ns: application-leadership, model: 0ea1d5, lease: landscape-client, holder: landscape-client/955): lease already held
2021-09-01 14:47:00 WARNING juju.core.raftlease store.go:285 command Command(ver: 1, op: claim, ns: application-leadership, model: 0ea1d5, lease: telegraf, holder: telegraf/946): lease already held
2021-09-01 14:47:00 WARNING juju.core.raftlease store.go:285 command Command(ver: 1, op: claim, ns: application-leadership, model: 0ea1d5, lease: landscape-client, holder: landscape-client/3): lease already held

CLI output of migration and restarts:
$ machine='114'; juju ssh -m foundations-maas:openstack ${machine} -qt sudo systemctl restart jujud-machine-${machine//\//-}.service
$ juju migrate --show-log foundations-maas:openstack foundations-maas-focal
14:41:20 INFO juju.cmd supercommand.go:54 running juju [2.8.11 0 5e99fae0eff8e18081a8f734eab4680378b08608 gc go1.14.15]
14:41:20 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.70:17070 10.64.5.68:17070]
14:41:20 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.70:17070/api"
14:41:20 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.69:17070]
14:41:20 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.69:17070/api"
14:41:20 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.70:17070 10.64.5.68:17070]
14:41:20 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.70:17070/api"
$ machine='canonical-livepatch/12'; juju ssh -m foundations-maas:openstack ${machine} -qt sudo systemctl restart jujud-unit-${machine//\//-}.service
$ juju migrate --show-log foundations-maas:openstack foundations-maas-focal
14:41:49 INFO juju.cmd supercommand.go:54 running juju [2.8.11 0 5e99fae0eff8e18081a8f734eab4680378b08608 gc go1.14.15]
14:41:49 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.68:17070 10.64.5.70:17070]
14:41:49 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.70:17070/api"
14:41:49 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.69:17070]
14:41:49 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.69:17070/api"
14:41:49 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.70:17070 10.64.5.68:17070]
14:41:49 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.70:17070/api"
ERROR source prechecks failed: unit ntp/1042 not idle or executing (lost)
$ machine='ntp/1042'; juju ssh -m foundations-maas:openstack ${machine} -qt sudo systemctl restart jujud-unit-${machine//\//-}.service
$ juju migrate --show-log foundations-maas:openstack foundations-maas-focal
14:42:05 INFO juju.cmd supercommand.go:54 running juju [2.8.11 0 5e99fae0eff8e18081a8f734eab4680378b08608 gc go1.14.15]
14:42:05 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.70:17070 10.64.5.68:17070]
14:42:05 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.68:17070/api"
14:42:05 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.69:17070]
14:42:05 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.69:17070/api"
14:42:05 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.68:17070 10.64.5.70:17070]
14:42:05 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.68:17070/api"
ERROR source prechecks failed: unit sysconfig-compute/617 not idle or executing (lost)
$ machine='sysconfig-compute/617'; juju ssh -m foundations-maas:openstack ${machine} -qt sudo systemctl restart jujud-unit-${machine//\//-}.service
$ juju migrate --show-log foundations-maas:openstack foundations-maas-focal
14:42:24 INFO juju.cmd supercommand.go:54 running juju [2.8.11 0 5e99fae0eff8e18081a8f734eab4680378b08608 gc go1.14.15]
14:42:24 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.68:17070 10.64.5.70:17070]
14:42:24 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.68:17070/api"
14:42:24 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.69:17070]
14:42:24 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.69:17070/api"
14:42:24 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.68:17070 10.64.5.70:17070]
14:42:24 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.68:17070/api"
ERROR source prechecks failed: unit sysconfig-compute-dpdk/194 not idle or executing (lost)
$ machine='sysconfig-compute-dpdk/194'; juju ssh -m foundations-maas:openstack ${machine} -qt sudo systemctl restart jujud-unit-${machine//\//-}.service
$ juju migrate --show-log foundations-maas:openstack foundations-maas-focal
14:42:38 INFO juju.cmd supercommand.go:54 running juju [2.8.11 0 5e99fae0eff8e18081a8f734eab4680378b08608 gc go1.14.15]
14:42:38 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.68:17070 10.64.5.70:17070]
14:42:38 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.70:17070/api"
14:42:38 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.69:17070]
14:42:38 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.69:17070/api"
14:42:38 INFO juju.juju api.go:67 connecting to API addresses: [10.64.5.70:17070 10.64.5.68:17070]
14:42:38 INFO juju.api apiclient.go:648 connection established to "wss://10.64.5.68:17070/api"
Migration started with ID "0ea1d53a-5384-4225-8b94-80c3fc4278df:4"
14:43:57 INFO cmd supercommand.go:544 command finished
< yet it failed again>
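
For reference, one rough way to list every agent that is not idle/executing in a single pass, instead of discovering them one precheck failure at a time (a sketch assuming jq is available; the field names follow the Juju 2.x JSON status output, where subordinate units are nested under their principals):

juju status -m foundations-maas:openstack --format=json | jq -r '
  .applications[]
  | (.units // {})
  | to_entries[]
  | (., (.value.subordinates // {} | to_entries[]))
  | .value["juju-status"].current as $agent
  | select($agent != "idle" and $agent != "executing")
  | "\(.key): \($agent)"
'

Machine agents can be checked the same way under .machines (their juju-status should be "started").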

tags: added: cpe-onsite
Gábor Mészáros (gabor.meszaros) wrote :

Adding field-crit as this is a customer deployment upgrade blocker now.

Harry Pidcock (hpidcock)
Changed in juju:
status: New → Triaged
importance: Undecided → Critical
Ian Booth (wallyworld) wrote :

Juju 2.9 has a fix which can recover the txn watcher in the event there's a sync error - sometimes, for unknown reasons (e.g. a network blip), the mongo connection can get interrupted, and the 2.9 fix allows the underlying worker thread to recover.

It looks like what's happening here (one of the things) is that the txn watcher is bouncing and republishing a "started" event - there's a pubsub subscriber which listens for this event and closes a channel. The close is happening multiple times and that's causing the panic. We need to fix that. It might be the case that the extra robustness in 2.9 mitigates this issue but it should be addressed regardless, which we'll do in 2.9 to start with.

Is the txn watcher issue the root cause of the migration failure? That's unclear. Model migration does require that all agents are running and idle. Can you confirm whether you are able to bring the models back to the stable/idle state? If so, is it the migration initiation which triggers the issue, or is the system unstable outside of any migration attempt? Was the status of both models error free prior to running the first migration attempt? What's the status --format yaml of each model now? Are the txn sync errors still being logged?

Ian Booth (wallyworld) wrote :

This 2.9 PR fixes the txn watcher panic

https://github.com/juju/juju/pull/13312

If we could get a little more information as per comment #2 that would be helpful.

Changed in juju:
milestone: none → 2.9.13
Changed in juju:
milestone: 2.9.13 → 2.9.14
John A Meinel (jameinel) wrote :

Switching to Field High as it is important, but it doesn't need round-the-clock support.

Changed in juju:
milestone: 2.9.14 → 2.9.15
Gábor Mészáros (gabor.meszaros) wrote :

"""Juju 2.9 has a fix which can recover the txn watcher in the event there's a sync error - sometimes for unknown reasons (eg a network blip) the mongo connect can get interrupted and the 2.9 fix allows the underlying worker thread to recover."""

Should I upgrade to 2.9/stable (2.9.12) and try again?

"""It looks like what's happening here (one of the things) is that the txn watcher is bouncing and republishing a "started" event - there's a pubsub subscriber which listens for this event and closes a channel. The close is happening multiple times and that's causing the panic. We need to fix that. It might be the case that the extra robustness in 2.9 mitigates this issue but it should be addressed regardless, which we'll do in 2.9 to start with."""

The last txn error happened 12 days ago, and it is the same one reported here. Since then I have upgraded the whole environment from bionic to focal, excluding the juju controllers.

"""Is the txn watcher issue the root cause of the migration failure? That's unclear. Model migration does require that all agents are running and idle. Can you confirm whether you are able to bring the models back to the stable/idle state? If so, is it the migration initiation which causes triggers an issue, or is the system unstable outside of any migration attempt? Was the status of both models error free prior to running the first migration attempt? What's the status --format yaml of each model now? Are the txn sync errors still being logged?"""

Not all agents are in active/idle right now, as there were some issues during the distro upgrade that we are still fixing. I have uploaded the status regardless:
https://pastebin.canonical.com/p/42wWsp7S4X/
valid for 1yr, customer specifics redacted/replaced.

John A Meinel (jameinel) wrote :

This feels like the symptoms we encountered in https://bugs.launchpad.net/juju/+bug/1918680, which should be fixed in the 2.9 series and is likely fixed in the tip of 2.8, but we have not done a 2.8.12 release with it.

Ian Booth (wallyworld) wrote :

Upgrading to 2.9 latest release would be beneficial in that the known issues with the txn watcher, the panic, and the migration issue are all fixed. This will give a stable base and we can then decide on next steps.
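
Roughly, one way to do that would be the following (a sketch only; the exact commands and flags depend on the client version and how it was installed, and upgrade-juju may need an explicit --agent-version):

sudo snap refresh juju --channel=2.9/stable               # upgrade the client first
juju upgrade-juju -m foundations-maas:controller          # then the source controller
juju upgrade-juju -m foundations-maas-focal:controller    # and the target controller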

Joseph Phillips (manadart) wrote :

The bug represented by LP:1918680 looks a little different. There, the agent goes into the "failed" state briefly. Here, it is reported as "lost":

ERROR source prechecks failed: unit sysconfig-compute-dpdk/194 not idle or executing (lost)

In any case, if we back-ported the panic fix from https://github.com/juju/juju/pull/13312, there would be little to do here but decide whether a 2.8.12 is worthwhile.

Changed in juju:
milestone: 2.9.15 → 2.9.16
John A Meinel (jameinel)
Changed in juju:
importance: Critical → High
Changed in juju:
milestone: 2.9.16 → 2.9.17
Changed in juju:
milestone: 2.9.17 → 2.9.18
John A Meinel (jameinel) wrote :

As Ian pointed out in https://bugs.launchpad.net/juju/+bug/1942421/comments/7, I think the easiest way forward would be to first upgrade locally to 2.9.* and then attempt the migration again.

However, it may be something that we should be engaging with at an Engineering level, having one of our engineers work directly with you to understand what is going on. (If you are getting txn-watcher sync errors, that means Mongo is not reliably responding to us, which is not a great time to be trying to do a large migration. There may be steps like ensuring the replica set is healthy before we do other work.)
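
As a rough sketch, one way to check that the replica set is healthy before attempting the migration (run on a controller machine, e.g. via "juju ssh -m foundations-maas:controller 0"; the mongo client location and auth details vary by series and Juju version):

user=$(sudo awk '/^tag:/ {print $2}' /var/lib/juju/agents/machine-*/agent.conf)
pass=$(sudo awk '/^statepassword:/ {print $2}' /var/lib/juju/agents/machine-*/agent.conf)
mongo 127.0.0.1:37017/admin --ssl --sslAllowInvalidCertificates \
  -u "$user" -p "$pass" \
  --eval 'rs.status().members.forEach(function(m) { print(m.name, m.stateStr, "health:", m.health) })'

All members should report PRIMARY or SECONDARY with health 1 before kicking off a large migration.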

Changed in juju:
milestone: 2.9.18 → 2.9.19
Changed in juju:
milestone: 2.9.19 → 2.9.20
Changed in juju:
milestone: 2.9.20 → 2.9.21
Changed in juju:
milestone: 2.9.21 → 2.9.22
tags: added: aubergine
Changed in juju:
milestone: 2.9.22 → 2.9.23
Arif Ali (arif-ali) wrote :

Hi all,

I have been working with Gabor on this, tested it in my lab environment, and struggled too.

I bootstrapped a bionic controller using 2.8.13 and deployed a simple OpenStack cloud which mimicked a similar environment. I then bootstrapped a 2.8.13 focal controller.

I was continuously hitting the same issue: it would try to migrate, but never completed. The following two messages were always the last ones I would get.

* migrating: validating, waiting for agents to report back
* migrating: aborted, removing model from target controller: validating, some agents reported failure

The pastebin [1] shows the debug log of one of the units, to help understand what is going on.

After various debugging and searching around, I found LP bug [2], which showed very similar symptoms.

I tried various values for agent-ratelimit-rate and agent-ratelimit-max, but didn't have much success.

In the end, the parameters below were set on the target controller, and the migration worked.

juju controller-config -c home-maas-focal agent-ratelimit-rate=50ms
juju controller-config -c home-maas-focal agent-ratelimit-max=100

Once migrated, it did take some time for the agents to settle.
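
To confirm what the target controller is actually running with, the active values can be listed with something like:

juju controller-config -c home-maas-focal | grep agent-ratelimit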

So, I guess the question is: how do we know the right values to set to get this working?

[1] https://paste.ubuntu.com/p/4px8gnPqsT/
[2] https://bugs.launchpad.net/juju/+bug/1915511

Ian Booth (wallyworld) wrote :

It's hard to predict what values are needed for any given deployment, as what's needed depends not only on the size of the model (application, unit and machine counts) but also on the processing capacity of the controller. The default values are:

agent-ratelimit-max=10
agent-ratelimit-rate=250ms

The values used in comment #10 increase the allowed rate of agent connections, but the controller needs to be large enough to handle this.

What we can do is enhance the model migration docs found at https://juju.is/docs/olm/migrate-a-model to provide some rough guidelines.
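
For sizing purposes, my reading of these settings is that they behave like a token bucket: agent-ratelimit-rate is the interval at which a new connection slot is added and agent-ratelimit-max is the burst size, so the defaults allow roughly 4 new agent connections per second with a burst of 10, while the values from comment #10 allow roughly 20 per second with a burst of 100. A sketch of raising them only for the migration window and then restoring the defaults (controller names taken from this bug's description):

juju controller-config -c foundations-maas-focal agent-ratelimit-rate=50ms agent-ratelimit-max=100
juju migrate foundations-maas:openstack foundations-maas-focal
# once agents have settled on the target controller, restore the defaults
juju controller-config -c foundations-maas-focal agent-ratelimit-rate=250ms agent-ratelimit-max=10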

Changed in juju:
milestone: 2.9.23 → 2.9.24
Changed in juju:
milestone: 2.9.24 → 2.9.25
Changed in juju:
milestone: 2.9.25 → 2.9.26
Changed in juju:
milestone: 2.9.26 → 2.9.27
Changed in juju:
milestone: 2.9.27 → 2.9.28
Changed in juju:
milestone: 2.9.28 → 2.9.29
Changed in juju:
milestone: 2.9.29 → 2.9.30
Ian Booth (wallyworld) wrote :

The docs mention suggested values for large models - marking the bug as Incomplete for now, but we can revisit if there are still issues.

Changed in juju:
milestone: 2.9.30 → 2.9-next
milestone: 2.9-next → none
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired