Upgrade from 2.4.7 to 2.5.4 blocked on raft lease migration

Bug #1827371 reported by Barry Price on 2019-05-02
This bug affects 4 people
Affects: juju
Importance: High
Assigned to: Christian Muirhead

Bug Description

Hi,

Upgrading a (single, non-HA) controller from 2.4.7 to 2.5.4 failed with the following repeated over and over in the machine agent log:

2019-05-02 09:11:05 ERROR juju.upgrade upgrade.go:140 upgrade step "migrate legacy leases into raft" failed: no log entries, expected at least one for configuration
2019-05-02 09:11:05 ERROR juju.worker.upgradesteps worker.go:382 upgrade from 2.4.7 to 2.5.4 for "machine-0" failed (giving up): migrate legacy leases into raft: no log entries, expected at least one for configuration

Functionality was severely limited in this state - 'juju status' would run, but with all agents showing as lost. Commands like controller-config and model-config would not run.

I confirmed that the /var/lib/juju/raft directory existed on the controller machine. It contained a 32 KB binary file named 'logs' and an empty subdirectory named 'snapshots'.

I managed to work around this by backing up and then emptying the 'leases' mongo collection and restarting the machine agent, but it looks like the raft engine is still not functioning correctly - lots of this repeated over and over in the logs post-upgrade:

2019-05-02 11:48:46 INFO juju.core.raftlease store.go:248 timeout

General functionality is restored, but I suspect we may run into more issues in future due to the above.

John A Meinel (jameinel) wrote :

This is somewhat related to bug #1822454. The issue seems to be that the upgrade thinks the raft directory is in a consistent state, but the logs, etc. seem to indicate that there are 0 records in the directory.

There is a BootstrapRaft function that is run during 'upgrade to 2.4.0', but it starts with:
func BootstrapRaft(context Context) error {
        agentConfig := context.AgentConfig()
        storageDir := raftDir(agentConfig)
        _, err := os.Stat(storageDir)
        // If the storage dir already exists we shouldn't run again. (If
        // we statted the dir successfully, this will return nil.)
        if !os.IsNotExist(err) {
                return err
        }
        _, transport := raft.NewInmemTransport(raft.ServerAddress("notused"))
        defer transport.Close()

So if the directory exists, it won't try to do anything (but it wouldn't anyway because we were upgrading from 2.4.7 to 2.5.4).

It would be good if we just had a way to manually trigger reinitialization of the raft directory.

Changed in juju:
status: New → Triaged
importance: Undecided → High
Christian Muirhead (2-xtian) wrote :

I can make a binary to delete the raft dir and rerun BootstrapRaft. I'll put in a check that reports whether the log has entries, with a --force flag for when we want to blow it away anyway.

It seems like the sequence was:
* upgrading to 2.4
* check the raft dir doesn't already exist
* create the log and snapshot stores in raft dir
* call raft.BootstrapCluster which fails for some reason
* upgrade step fails and gets retried
* raft directory exists so the upgrade step thinks it's already bootstrapped and returns early.
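
As an illustration of the sequence above, here is a minimal Go sketch that reproduces the shape of the problem using only the standard library. It is not the juju code: bootstrapStep, simulateRetry, errBootstrap and the failBootstrap switch are hypothetical names, a plain file stands in for the real binary log store, and the failure is simulated because the actual reason BootstrapCluster failed is not known.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// errBootstrap stands in for whatever made the real bootstrap call fail;
// the actual cause is unknown.
var errBootstrap = errors.New("simulated bootstrap failure")

// bootstrapStep mimics the shape of the 2.4 upgrade step: it returns early
// when the raft directory already exists, otherwise it creates the stores
// and then bootstraps. failBootstrap is a hypothetical switch used only to
// simulate a failing bootstrap.
func bootstrapStep(dir string, failBootstrap bool) error {
	_, err := os.Stat(dir)
	if !os.IsNotExist(err) {
		return err // nil when the directory is already there
	}
	// The stores are created before the bootstrap call is made.
	if err := os.MkdirAll(filepath.Join(dir, "snapshots"), 0o700); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(dir, "logs"), nil, 0o600); err != nil {
		return err
	}
	if failBootstrap {
		// The directory is left behind without a configuration entry.
		return errBootstrap
	}
	// Placeholder for the configuration entry a successful bootstrap writes.
	return os.WriteFile(filepath.Join(dir, "logs"), []byte("configuration\n"), 0o600)
}

// simulateRetry runs the step twice against a fresh temporary directory and
// reports both results, plus whether the log ended up with any content.
func simulateRetry() (first, second error, logHasEntries bool) {
	base, err := os.MkdirTemp("", "raft-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(base)
	dir := filepath.Join(base, "raft")

	first = bootstrapStep(dir, true)
	second = bootstrapStep(dir, true) // the retry
	info, err := os.Stat(filepath.Join(dir, "logs"))
	if err != nil {
		panic(err)
	}
	return first, second, info.Size() > 0
}

func main() {
	first, second, hasEntries := simulateRetry()
	fmt.Println("first attempt:", first)
	fmt.Println("retry:", second)
	fmt.Println("log has entries:", hasEntries)
}
```

In this sketch the retry reports success even though nothing was ever written to the log, which matches the state found on the affected controller: a raft directory that exists but holds no configuration entry.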

Do you have logs that go far back enough to show the upgrade to 2.4? It would be good to know why the BootstrapCluster call fails.

At this point it's probably shutting the door after the horse has bolted, but I'll change the 2.4 upgrade step to use the check for a configuration entry used in the MigrateLegacyLeases step, rather than just checking for the raft directory.

I'll also change the MigrateLegacyLeases 2.5 step to run BootstrapCluster if it can't find the configuration entry. It's probably better than requiring people to run the emergency rebootstrap tool if we can detect the problem.
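
A rough sketch of that idea in Go, under stated assumptions: the decision to bootstrap is keyed on whether a configuration entry can be found, not on whether the directory exists. The names hasConfigEntry, ensureBootstrapped, fakeBootstrap and demo are hypothetical and do not come from the juju code base, and a non-empty plain file stands in for a real check against the raft log store.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hasConfigEntry is a hypothetical stand-in for the real check. Here a
// non-empty "logs" file counts as "configuration entry present"; a real
// implementation would inspect the raft log store instead.
func hasConfigEntry(dir string) (bool, error) {
	info, err := os.Stat(filepath.Join(dir, "logs"))
	if os.IsNotExist(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return info.Size() > 0, nil
}

// ensureBootstrapped calls bootstrap only when no configuration entry is
// found, and reports whether it ran. An existing but empty directory no
// longer counts as "already bootstrapped".
func ensureBootstrapped(dir string, bootstrap func(dir string) error) (bool, error) {
	ok, err := hasConfigEntry(dir)
	if err != nil || ok {
		return false, err
	}
	if err := bootstrap(dir); err != nil {
		return false, err
	}
	return true, nil
}

// fakeBootstrap writes a placeholder entry so the example can run without
// a real raft library.
func fakeBootstrap(dir string) error {
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "logs"), []byte("configuration\n"), 0o600)
}

// demo builds the broken state (directory present, empty log) in a
// temporary location and calls ensureBootstrapped twice.
func demo() (ranFirst, ranSecond bool) {
	base, err := os.MkdirTemp("", "raft-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(base)
	dir := filepath.Join(base, "raft")
	if err := os.MkdirAll(dir, 0o700); err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(dir, "logs"), nil, 0o600); err != nil {
		panic(err)
	}

	ranFirst, err = ensureBootstrapped(dir, fakeBootstrap)
	if err != nil {
		panic(err)
	}
	ranSecond, err = ensureBootstrapped(dir, fakeBootstrap)
	if err != nil {
		panic(err)
	}
	return ranFirst, ranSecond
}

func main() {
	first, second := demo()
	fmt.Println("bootstrap ran on broken dir:", first)
	fmt.Println("bootstrap ran again:", second)
}
```

The point of the design is that the check is idempotent: it repairs a directory left behind by a failed bootstrap, and it becomes a no-op once a configuration entry exists.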

Changed in juju:
assignee: nobody → Christian Muirhead (2-xtian)
Changed in juju:
status: Triaged → In Progress
Christian Muirhead (2-xtian) wrote :

PR to fix the upgrade steps: https://github.com/juju/juju/pull/10135

This doesn't provide the binary to rebootstrap - I'll work on that next.
After discussing with Ian, we wanted to get this in first since we're doing a 2.5 release very soon.

Barry Price (barryprice) wrote :

Sorry Christian, the oldest upgrade in the local logs is from 2.4.5 to 2.4.7 (December 6th 2018).

Do let me know if I can provide anything else of use though.

Christian Muirhead (2-xtian) wrote :

No worries - I figured that was a long shot. At least with this change it'll be made apparent at upgrade time.

Here's a repository with code for a binary that will rebootstrap the raft directory on the controller you have with a bad one.
https://github.com/juju/rebootstrap-raft

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
milestone: none → 2.5.6
status: Fix Committed → Fix Released
Changed in juju:
milestone: 2.5.6 → 2.5.7