after a successful update and reboot, the bootloader snappy_mode is set to 'try'

Bug #1498293 reported by Leo Arias on 2015-09-22
This bug affects 2 people.

Affects: Snappy
Importance: Low
Assigned to: Unassigned

Bug Description

After a snappy update, reboot and successful boot, it's possible to run commands before the boot-ok service has run and updated the bootloader config file.

This makes the rollback test misbehave: snappy rollback updates the bootloader config, and then boot-ok runs and overwrites the values in the file.

It shouldn't be possible to log in before boot-ok has finished running.

This doesn't affect manual scenarios, because a person is slow enough when doing something like snappy rollback. It does affect the integration suite, because after ssh all the commands are executed very quickly.

I have proposed a branch with a workaround for the integration suite. Also a branch with an assertion that exposes the problem. Once this is fixed, the second branch should be merged and the workaround removed.
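The workaround amounts to waiting, after each reboot, until boot-ok has flipped the bootloader environment back to its steady state before running any further snappy commands. A minimal sketch of that idea (the helper names, `GRUBENV` default and one-second poll interval are illustrative assumptions, not the contents of the proposed branch):

```shell
#!/bin/sh
# Poll a grubenv-style file until boot-ok has confirmed the boot,
# i.e. until snappy_mode is back to "regular".

GRUBENV=${GRUBENV:-/boot/grub/grubenv}

# Print the current snappy_mode value from a grubenv-style file.
snappy_mode_of() {
    sed -n 's/^snappy_mode=//p' "$1"
}

# Wait up to $2 seconds (default 30) for snappy_mode in file $1
# to become "regular"; return non-zero on timeout.
wait_for_boot_ok() {
    file=$1; timeout=${2:-30}; waited=0
    while [ "$(snappy_mode_of "$file")" != "regular" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A test could call `wait_for_boot_ok "$GRUBENV" 30` right after reconnecting over ssh and only then proceed with `snappy rollback`.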

Original report
===============
To reproduce:

$ go run ./_integration-tests/main.go --filter rollbackSuite

****** Resuming rollbackSuite.TestRollbackMustRebootToOtherVersion after reboot
PASS: <autogenerated>:52: rollbackSuite.SetUpTest 0.000s

snappy list
Name Date Version Developer
ubuntu-core 2015-09-21 180 ubuntu
generic-amd64 2015-09-22 1.4 canonical
/home/elopio/workspace/go/src/launchpad.net/snappy/_integration-tests/tests/rollback_test.go:51:
    ...open /home/elopio/workspace/go/src/launchpad.net/snappy/_integration-tests/tests/rollback_test.go: no such file or directory
... obtained bool = false
... expected bool = true

The test expects ubuntu core to be 179, the previous version. But it boots on 180.
I think this is something that the test causes, and not a snappy issue.

Related branches

Federico Gimenez (fgimenez) wrote :

I'm getting this randomly; most of the time it works. It seems to be related to the value of snappy_trial_boot in /boot/grub/grubenv. When it succeeds, the file has:

# GRUB Environment Block
snappy_mode=try
snappy_ab=a
snappy_trial_boot=0

and when it fails:

# GRUB Environment Block
snappy_mode=try
snappy_ab=a
snappy_trial_boot=1

Leo, can you please confirm?
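On a grub system the live values can be dumped with `grub-editenv /boot/grub/grubenv list`. To compare many captured runs automatically, one could extract just the suspect key; a small sketch (the helper name is made up for illustration):

```shell
# Print the snappy_trial_boot value from a grubenv-style file.
# "1" means a trial boot is still pending confirmation by boot-ok.
trial_boot_of() {
    sed -n 's/^snappy_trial_boot=//p' "$1"
}
```

Running `trial_boot_of /boot/grub/grubenv` after each phase of the test would show exactly where the flag fails to get cleared.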

Federico Gimenez (fgimenez) wrote :

This is the complete log of a failed execution, with calls to show the contents of /boot/grub/grubenv [1]. It seems that after the first update snappy_trial_boot is not properly unset, and when trying the rollback the code at [2] switches the boot partition.

[1] http://paste.ubuntu.com/12603352/
[2] http://bazaar.launchpad.net/~snappy-dev/snappy-hub/snappy-systems/view/head:/generic-amd64/grub.cfg#L24
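Put differently, the partition switch at [2] behaves like the following rule (a shell paraphrase of the behaviour described in this thread, not the actual grub.cfg): in try mode with snappy_trial_boot already set to 1, GRUB concludes the previous trial boot was never confirmed and falls back to the other partition. That is why a stale snappy_trial_boot=1 makes the rollback land on the wrong side.

```shell
# Paraphrase of the fallback decision attributed to grub.cfg [2] in this bug.
# Given snappy_mode, snappy_ab and snappy_trial_boot, print the partition
# that will actually be booted. Illustrative only.
choose_partition() {
    mode=$1 ab=$2 trial=$3
    if [ "$mode" = "try" ] && [ "$trial" = "1" ]; then
        # A trial boot was started but never confirmed good: fall back.
        if [ "$ab" = "a" ]; then echo b; else echo a; fi
    else
        echo "$ab"
    fi
}
```

With the pre-rollback values from a failed run (try, a, 1) this yields b, matching the wrong boot observed below.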

Leo Arias (elopio) wrote :

I've run this about 30 times this evening. I haven't been able to reproduce it.

Take a look at this from your paste:

kpartx -ds error: device-mapper: remove ioctl on loop0p5 failed: Device or resource busy
loop deleted : /dev/loop0

That is https://bugs.launchpad.net/ubuntu/+source/goget-ubuntu-touch/+bug/1496484
When that happens to me, I can't ssh into the machine. Can you please check whether your udf run failed every time you hit the rollback error?
On your paste, there's also this problem: https://bugs.launchpad.net/ubuntu/+source/goget-ubuntu-touch/+bug/1473333
udf fails, but doesn't stop the run.

Federico Gimenez (fgimenez) wrote :

After some executions I've been able to get the error without any reported failures from kpartx [1]. It seems to be more frequent when the kpartx failure is present, but I've also seen successful executions with it [2].

[1] http://paste.ubuntu.com/12609842/
[2] http://paste.ubuntu.com/12609512/

Leo Arias (elopio) wrote :

Many runs today again.

I can't reproduce the error if I call kvm to start the vm manually. I can reproduce it when I let adt-run start the vm.

Here's the grubenv on a successful run:

boot at 187

snappy_mode=regular
snappy_ab=a
snappy_trial_boot=0

fake update

snappy_mode=try
snappy_ab=b
snappy_trial_boot=0

reboot

snappy_mode=regular
snappy_ab=b
snappy_trial_boot=0

rollback

snappy_mode=try
snappy_ab=a
snappy_trial_boot=0

reboot

snappy_mode=regular
snappy_ab=a
snappy_trial_boot=0

Here's on a failed run:

boot at 187

snappy_mode=regular
snappy_ab=a
snappy_trial_boot=0

fake update

snappy_mode=try
snappy_ab=b
snappy_trial_boot=0

reboot

snappy_mode=try
snappy_ab=b
snappy_trial_boot=1

rollback

snappy_mode=try
snappy_ab=a
snappy_trial_boot=1

reboot

snappy_mode=regular
snappy_ab=b
snappy_trial_boot=0

So the problem seems to start after the first reboot: we are on the b partition successfully, but the file still says snappy_mode=try and snappy_trial_boot=1. After a successful boot, those values are wrong.

Leo Arias (elopio) wrote :

There's this service that should run on a successful boot:
http://bazaar.launchpad.net/~snappy-dev/snappy/snappy/view/head:/debian/ubuntu-snappy.boot-ok.service

If I break after the first reboot, systemctl status ubuntu-snappy.boot-ok shows no errors, but there's obviously something wrong. I added a call to sudo snappy booted before the rollback; that fixes the env file, but it still boots to b and fails, which makes no sense. I think we need help from somebody who understands the boot-ok service better.
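For reference, the confirmation step that boot-ok (or `sudo snappy booted`) is expected to perform is simply rewriting the env back to its steady state, as seen in the successful runs above. A sketch of the effect only (the real tool goes through the bootloader's own env tooling such as grub-editenv, not sed):

```shell
# Mark the current boot as good in a grubenv-style file: switch back to
# regular mode and clear the trial flag. Sketch of the observable effect,
# not the service's actual implementation.
mark_boot_ok() {
    sed -i -e 's/^snappy_mode=.*/snappy_mode=regular/' \
           -e 's/^snappy_trial_boot=.*/snappy_trial_boot=0/' "$1"
}
```

If this rewrite happens after the test has already run `snappy rollback`, it clobbers the values the rollback just wrote, which is exactly the race described above.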

Leo Arias (elopio) on 2015-09-30
summary: - fake rollback integration test fails
+ after a successful boot, the bootloader snappy_mode is set to 'try'
summary: - after a successful boot, the bootloader snappy_mode is set to 'try'
+ after a successful update and reboot, the bootloader snappy_mode is set
+ to 'try'
description: updated
Changed in snappy:
status: New → Confirmed
Leo Arias (elopio) wrote :

Thanks to Chipaca and Federico we found that boot-ok is slow, causing havoc for this test. Description updated, branches linked.

Michael Vogt (mvo) wrote :

I think we also need to look into this outside of the tests; a normal user who sshes into the machine and then runs commands (or something like ansible) could hit this scenario as well.

Changed in snappy:
status: Confirmed → Triaged
importance: Undecided → Critical
Leo Arias (elopio) wrote :

I'm confirming now that on the bbb it is extra slow.
I can log in through ssh, or normally through serial, run fw_printenv snappy_mode and see that the value is try. I still have time to run snappy rollback ubuntu_core and reboot, which is what produces the funny results.

So to answer mvo: yes, scripts are likely to hit this. But it can even happen for a user entering commands manually.
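A script could defend itself with the same check Leo describes: read snappy_mode and refuse to touch the boot state while it still says try. A sketch that parses fw_printenv-style output (it assumes the `name=value` line format shown above; the helper name is made up):

```shell
# fw_printenv prints lines like "snappy_mode=try". Given one such line,
# succeed only when the boot has been confirmed (mode is "regular").
safe_to_proceed() {
    [ "${1#snappy_mode=}" = "regular" ]
}
```

For example: `safe_to_proceed "$(fw_printenv snappy_mode)" || exit 1` at the top of a provisioning script would close the window mvo describes.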

Michael Vogt (mvo) wrote :

We are in a much better position now. As soon as snapd runs (which is relatively early), the boot is considered done. Still, there is a small window where it's not.

Changed in snappy:
importance: Critical → Low