Comment 2 for bug 1942421

Ian Booth (wallyworld) wrote :

Juju 2.9 has a fix which can recover the txn watcher in the event of a sync error. Sometimes, for unknown reasons (e.g. a network blip), the mongo connection can get interrupted, and the 2.9 fix allows the underlying worker thread to recover.

It looks like one of the things happening here is that the txn watcher is bouncing and republishing a "started" event. There's a pubsub subscriber which listens for this event and closes a channel. Because the event is published more than once, the close happens multiple times, and that's causing the panic. We need to fix that. The extra robustness in 2.9 might mitigate this issue, but it should be addressed regardless, which we'll do in 2.9 to start with.
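
To illustrate the failure mode: in Go, closing a channel that is already closed panics, so a handler that closes a channel on every "started" event will blow up the second time the event arrives. One common way to make such a handler safe is to guard the close with sync.Once. The sketch below is only an illustration under that assumption; the startedNotifier type and its methods are hypothetical names, not Juju's actual code, and the real fix may well take a different shape.

```go
package main

import (
	"fmt"
	"sync"
)

// startedNotifier is a hypothetical stand-in for the pubsub subscriber
// that reacts to the txn watcher's "started" event. It is not Juju's type.
type startedNotifier struct {
	once    sync.Once
	started chan struct{}
}

func newStartedNotifier() *startedNotifier {
	return &startedNotifier{started: make(chan struct{})}
}

// onStarted is the handler invoked for every "started" event.
// sync.Once ensures the channel is closed only on the first event, so a
// republished event after the watcher bounces is a no-op, not a panic.
func (n *startedNotifier) onStarted() {
	n.once.Do(func() { close(n.started) })
}

// isStarted reports, without blocking, whether the channel has been closed.
func (n *startedNotifier) isStarted() bool {
	select {
	case <-n.started:
		return true
	default:
		return false
	}
}

func main() {
	n := newStartedNotifier()
	fmt.Println("started before event:", n.isStarted())
	n.onStarted() // first "started" event
	n.onStarted() // republished event after the watcher restarts
	fmt.Println("started after events:", n.isStarted())
}
```

Without the sync.Once guard, the second onStarted call above would panic with "close of closed channel", which matches the symptom described here.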

Is the txn watcher issue the root cause of the migration failure? That's unclear. Model migration does require that all agents are running and idle. Can you confirm whether you are able to bring the models back to a stable/idle state? If so, is it initiating the migration that triggers the issue, or is the system unstable outside of any migration attempt? Was the status of both models error-free prior to the first migration attempt? What does "juju status --format yaml" show for each model now? Are the txn sync errors still being logged?