after a successful update and reboot, the bootloader snappy_mode is set to 'try'

Bug #1498293 reported by Leo Arias
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
snapd
Triaged
Low
Unassigned

Bug Description

After a snappy update, reboot and successful boot, it's possible to run commands before the boot-ok service has run and updated the bootloader config file.

This causes the rollback test to go crazy because snappy rollback updates the bootloader conf, then boot-ok runs and changes the values of the file.

It shouldn't be possible to log in before boot-ok has finished running.

This doesn't affect manual scenarios because for a person it's slow to do something like snappy rollback. It affects the integration suite because after ssh all the commands are executed very fast.

I have proposed a branch with a workaround for the integration suite. Also a branch with an assertion that exposes the problem. Once this is fixed, the second branch should be merged and the workaround removed.

Original report
===============
To reproduce:

$ go run ./_integration-tests/main.go --filter rollbackSuite

****** Resuming rollbackSuite.TestRollbackMustRebootToOtherVersion after reboot
PASS: <autogenerated>:52: rollbackSuite.SetUpTest 0.000s

snappy list
Name Date Version Developer
ubuntu-core 2015-09-21 180 ubuntu
generic-amd64 2015-09-22 1.4 canonical
/home/elopio/workspace/go/src/launchpad.net/snappy/_integration-tests/tests/rollback_test.go:51:
    ...open /home/elopio/workspace/go/src/launchpad.net/snappy/_integration-tests/tests/rollback_test.go: no such file or directory
... obtained bool = false
... expected bool = true

The test expects ubuntu core to be 179, the previous version. But it boots on 180.
I think this is something that the test causes, and not a snappy issue.

Related branches

Revision history for this message
Federico Gimenez (fgimenez) wrote :

I'm getting this randomly, most of the time it's working. It seems to be related to the value of snappy_trial_boot on /boot/grub/grubenv, when it succeeds the file has:

# GRUB Environment Block
snappy_mode=try
snappy_ab=a
snappy_trial_boot=0

and when it fails:

# GRUB Environment Block
snappy_mode=try
snappy_ab=a
snappy_trial_boot=1

Leo, can you please confirm?

Revision history for this message
Federico Gimenez (fgimenez) wrote :

This is the complete log of a failed execution with calls to show the contents of /boot/grub/grubenv [1]. It seems that after the first update snappy_trial_boot is not properly unset and when trying the rollback the code at [2] switches the boot partition.

[1] http://paste.ubuntu.com/12603352/
[2] http://bazaar.launchpad.net/~snappy-dev/snappy-hub/snappy-systems/view/head:/generic-amd64/grub.cfg#L24

Revision history for this message
Leo Arias (elopio) wrote :

I've ran this like 30 times this evening. I haven't been able to reproduce it.

Take a look at this from your paste:

kpartx -ds error: device-mapper: remove ioctl on loop0p5 failed: Device or resource busy
loop deleted : /dev/loop0

That is https://bugs.launchpad.net/ubuntu/+source/goget-ubuntu-touch/+bug/1496484
when that happens to me, I can't ssh into the machine. Can you please look if everytime you have the rollback error your udf failed?
On your paste, there's also this problem: https://bugs.launchpad.net/ubuntu/+source/goget-ubuntu-touch/+bug/1473333
udf fails, but doesn't stop the run.

Revision history for this message
Federico Gimenez (fgimenez) wrote :

After some executions I've been able to get the error without any reported failures from kpartx [1]. It seems to be more frequent when the kpartx failure is present, but I've also seen successful executions with them [2]

[1] http://paste.ubuntu.com/12609842/
[2] http://paste.ubuntu.com/12609512/

Revision history for this message
Leo Arias (elopio) wrote :

Many runs today again.

I can't reproduce the error if I call kvm to start the vm manually. I can reproduce it when I let adt-run start the vm.

Here's the grubenv on a successful run:

boot at 187

snappy_mode=regular
snappy_ab=a
snappy_trial_boot=0

fake update

snappy_mode=try
snappy_ab=b
snappy_trial_boot=0

reboot

snappy_mode=regular
snappy_ab=b
snappy_trial_boot=0

rollback

snappy_mode=try
snappy_ab=a
snappy_trial_boot=0

reboot

snappy_mode=regular
snappy_ab=a
snappy_trial_boot=0

Here's on a failed run:

boot at 187

snappy_mode=regular
snappy_ab=a
snappy_trial_boot=0

fake update

snappy_mode=try
snappy_ab=b
snappy_trial_boot=0

reboot

snappy_mode=try
snappy_ab=b
snappy_trial_boot=1

rollback

snappy_mode=try
snappy_ab=a
snappy_trial_boot=1

reboot

snappy_mode=regular
snappy_ab=b
snappy_trial_boot=0

So the problem seems to start after the first reboot. We are in the b partition successfully, but it still says snappy_mode=try and
snappy_trial_boot=1. After a successful reboot, these values are wrong.

Revision history for this message
Leo Arias (elopio) wrote :

There's this service that should run on a successful boot:
http://bazaar.launchpad.net/~snappy-dev/snappy/snappy/view/head:/debian/ubuntu-snappy.boot-ok.service

If I break after the first reboot, systemctl status ubuntu-snappy.boot-ok show no errors, but well, there's obviously something wrong. I added a call to sudo snappy booted before the rollback. That fixes the env file, but it still boots to b and fails. That makes no sense. I think we need help from somebody who understands this boot-ok service better.

Leo Arias (elopio)
summary: - fake rollback integration test fails
+ after a successful boot, the bootloader snappy_mode is set to 'try'
summary: - after a successful boot, the bootloader snappy_mode is set to 'try'
+ after a successful update and reboot, the bootloader snappy_mode is set
+ to 'try'
description: updated
Changed in snappy:
status: New → Confirmed
Revision history for this message
Leo Arias (elopio) wrote :

Thanks to Chipaca and Federico we found that boot-ok is slow, causing havok for this test. Description updated, branches linked.

Revision history for this message
Michael Vogt (mvo) wrote :

I think we need to also look into this outside of the tests, i.e. a normal user that ssh into the machine and then run stuff (or something like ansible) could hit this scenario as well.

Changed in snappy:
status: Confirmed → Triaged
importance: Undecided → Critical
Revision history for this message
Leo Arias (elopio) wrote :

I'm confirming now that on bbb is extra slow.
I can log in through ssh or normally through serial, run fw_printenv snappy_mode and see that the value is try. I still have some time to do snappy rollback ubuntu_core and reboot, which is what produces the funny results.

So to answer mvo: yes, scripts are likely to hit this. But it can even happen for a user entering commands manually.

Revision history for this message
Michael Vogt (mvo) wrote :

We are in a much better position now. As soon as snapd runs (which is relatively early) the boot is considered done. Still, there is a smal window where its not.

Changed in snappy:
importance: Critical → Low
Michael Vogt (mvo)
affects: snappy → snapd
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.