Upgrade from 2.4.7 to 2.5.4 blocked on raft lease migration
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | Christian Muirhead |
Bug Description
Hi,
Upgrading a (single, non-HA) controller from 2.4.7 to 2.5.4 failed with the following repeated over and over in the machine agent log:
2019-05-02 09:11:05 ERROR juju.upgrade upgrade.go:140 upgrade step "migrate legacy leases into raft" failed: no log entries, expected at least one for configuration
2019-05-02 09:11:05 ERROR juju.worker.
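The "no log entries" error reflects a raft invariant: bootstrapping a cluster writes an initial configuration entry into the log, and recovery expects to find at least one such entry. A minimal toy model of that invariant (all names here are invented for the sketch; this is not Juju's or hashicorp/raft's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// entry and logStore are a toy model of a raft log store.
type entry struct {
	index uint64
	kind  string // "configuration" or "command"
}

type logStore struct{ entries []entry }

// bootstrap appends the initial configuration entry that a freshly
// bootstrapped raft cluster writes at index 1.
func bootstrap(s *logStore) {
	if len(s.entries) == 0 {
		s.entries = append(s.entries, entry{index: 1, kind: "configuration"})
	}
}

// recoverConfig mimics the failing check: with zero entries there is no
// configuration to recover, which is essentially the error logged above.
func recoverConfig(s *logStore) error {
	if len(s.entries) == 0 {
		return errors.New("no log entries, expected at least one for configuration")
	}
	return nil
}

func main() {
	var s logStore
	fmt.Println(recoverConfig(&s)) // empty log: fails as in the bug
	bootstrap(&s)
	fmt.Println(recoverConfig(&s)) // after bootstrap: succeeds
}
```

In this model the upgrade failure corresponds to a log store that exists on disk but was never bootstrapped, so `recoverConfig` can never succeed.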
Functionality was severely limited in this state: 'juju status' would run, but with all agents showing as lost, and commands like controller-config and model-config would not run at all.
I confirmed that the /var/lib/juju/raft directory existed on the controller machine; it contained a 32 KB binary file named 'logs' and an empty subdirectory named 'snapshots'.
I managed to work around this by backing up and then emptying the 'leases' mongo collection and restarting the machine agent, but it looks like the raft engine is still not functioning correctly; there is lots of this repeated over and over in the logs post-upgrade:
2019-05-02 11:48:46 INFO juju.core.raftlease store.go:248 timeout
General functionality is restored, but I suspect we may run into more issues in future due to the above.
Changed in juju:
status: Triaged → In Progress

Changed in juju:
status: In Progress → Fix Committed

Changed in juju:
milestone: none → 2.5.6
status: Fix Committed → Fix Released

Changed in juju:
milestone: 2.5.6 → 2.5.7
This is somewhat related to bug #1822454. The issue seems to be that the upgrade thinks the raft directory is in a consistent state, but the logs etc. indicate that there are 0 records in the directory.
There is a BootstrapRaft function that is run during 'upgrade to 2.4.0', but it starts with:

```go
func BootstrapRaft(context Context) error {
	agentConfig := context.AgentConfig()
	storageDir := raftDir(agentConfig)
	_, err := os.Stat(storageDir)
	// If the storage dir already exists we shouldn't run again. (If
	// we statted the dir successfully, this will return nil.)
	if !os.IsNotExist(err) {
		return err
	}
	_, transport := raft.NewInmemTransport(raft.ServerAddress("notused"))
	defer transport.Close()
```
So if the directory exists, it won't try to do anything (though in this case it wouldn't have run anyway, because we were upgrading from 2.4.7 to 2.5.4, not to 2.4.0).
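The guard above can be contrasted with a stricter one that also inspects the log, which is roughly what this report argues for. An illustrative sketch with invented function names (not Juju code):

```go
package main

import "fmt"

// shouldBootstrap mirrors the quoted guard: it only asks whether the
// storage directory exists, so a present-but-empty raft directory is
// treated as already bootstrapped.
func shouldBootstrap(dirExists bool) bool {
	return !dirExists
}

// shouldBootstrapStrict is the stricter variant suggested by this bug:
// also re-bootstrap when the directory exists but the log has no entries.
func shouldBootstrapStrict(dirExists bool, logEntries int) bool {
	return !dirExists || logEntries == 0
}

func main() {
	// The failure mode in this bug: directory present, zero log records.
	fmt.Println(shouldBootstrap(true))          // skips, leaving the broken state
	fmt.Println(shouldBootstrapStrict(true, 0)) // would re-bootstrap
}
```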
It would be good if we had a way to manually trigger reinitialization of the raft directory.