uefirtvariable hangs and never exits

Bug #1308574 reported by Jeff Lane 
This bug affects 5 people
Affects: Firmware Test Suite
Status: Fix Released
Importance: High
Assigned to: Ivan Hu

Bug Description

We attempted to run fwts in 14.04 pre-cert yesterday on a server that boots via UEFI.

During the overnight test run, the test got as far as running fwts' uefirtvariable test. This test started and then just hung, with no progress and no output.

I was able to recreate this manually using the cert suite's fwts_test wrapper like so:

fwts_test -t uefirtvariable

which just calls fwts and tells it to run that one test via python subprocess.

In the end, we had to edit the test list to remove this test from being run; all the other fwts batch tests we execute seem to run fine.

Colin Ian King (colin-king) wrote :

I think that this hang suggests that the UEFI run time service is jamming and not fwts per se.

Rod Smith (rodsmith) wrote :

I've tested this (via direct run of fwts_test) on two more systems, with the same result: An apparently-hung process. One (an Intel Decathlete server) has been stuck for about 40 minutes now, the other only about ten. I'm attaching a log file from one of these systems -- of course, it's incomplete. (The other's is identical except for time stamps.) If it should really be taking close to an hour (or more) for these tests, then I'm jumping the gun. If not, I'd say a 3-for-3 failure rate means something's wrong in the test.

Colin Ian King (colin-king) wrote :

I will assert the fact that the test is exercising a UEFI run time service, that is, it is calling into firmware. If it hangs, then I am pretty sure it's firmware at fault. Not the test.

Colin Ian King (colin-king) wrote :

Does the kernel log contain any messages?

Rod Smith (rodsmith) wrote :

Colin, here's the output of "dmesg | grep -i efi" on one of the systems. There's nothing else in kernel ring buffer that seems related, although I could give you more logs if you like.

Colin Ian King (colin-king) wrote :

just dmesg will be more useful - thanks.

Rod Smith (rodsmith) wrote :

Here's the whole dmesg output, then.

Colin Ian King (colin-king) wrote :

Thanks, so the kernel isn't oopsing, so that's a good sign. Can you run strace on fwts, e.g.

sudo strace fwts uefirtvariable

and capture the output. I am expecting to see it break on an ioctl() call to the fwts UEFI kernel test driver, in which case it will clearly mean that the UEFI run time service is the root cause.

Thanks!
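
A minimal sketch of how a saved strace capture could be checked for that pattern: a system call that completed is printed as "name(args) = result", so a final line without a result is the call that never returned. The function name and the sample lines are hypothetical; they are not taken from the attached log.

```python
def unfinished_syscall(strace_text):
    """Return the name of the last syscall if it never returned, else None."""
    lines = [ln.strip() for ln in strace_text.splitlines() if ln.strip()]
    if not lines:
        return None
    last = lines[-1]
    # Exit and signal notices ("+++ ... +++", "--- ... ---") are not syscalls.
    if last.startswith(("+++", "---")):
        return None
    # A completed call is printed as "name(args) = result".
    if ") = " in last:
        return None
    return last.split("(", 1)[0]


# Hypothetical capture: the last ioctl() has no return value.
sample = '''open("/dev/hypothetical_device", O_RDWR) = 3
ioctl(3, 0xc0000000, 0x7ffc00000000'''
print(unfinished_syscall(sample))  # ioctl
```

If this reports ioctl for the real capture, that matches the expectation above that control went into the kernel driver and did not come back.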

Rod Smith (rodsmith) wrote :

Here's the strace output from the command you suggested, on one computer (the Intel Decathlete).

Rod Smith (rodsmith) wrote :

Additional data: This test succeeded on a Lenovo IdeaPad U530 (my personal laptop), on an OCP Windmill server, and under VirtualBox. This supports Colin's assertion that the problem is with the firmware on the computers that are failing (an Intel Decathlete, a Lenovo E530 laptop, and whatever IBM model Jeff is testing). OTOH:

1) A 50% success rate with this test is disturbing. That's not to say that the test is wrong to fail in such cases, but it's something that needs to be addressed.
2) It would be preferable for the test to fail without hanging.
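
On point 2, one option that does not require any change to fwts is a deadline in the calling harness. The sketch below is an assumption about how a wrapper such as fwts_test could do this, not its actual code; run_with_deadline is a made-up name and the child command is a stand-in for ["fwts", "uefirtvariable"]. If the process is stuck inside a firmware call, the kill may have no effect, so the sketch stops waiting after a short grace period instead of blocking on the child.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s, grace_s=5):
    """Run cmd; return ("finished", returncode) or ("timeout", None)."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return ("finished", proc.wait(timeout=timeout_s))
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            # Reap the child if the kill worked.
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            # The child did not die (for example, it is stuck in the
            # kernel); give up on reaping so the harness can move on.
            pass
        return ("timeout", None)


# Stand-in for a hanging test: a child that sleeps far past the deadline.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_deadline(hanging, timeout_s=1))  # ('timeout', None)
```

This only lets the rest of the test run continue and record a failure; it does not recover the hung call itself.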

Jeff Lane (bladernr) wrote :

Colin,

Can you explain a bit about what this test is actually for? Should this issue (the test hanging silently and indefinitely) gate a certification, or could it lead to catastrophic failures on the test machine?

For now, this basically gates quite a few certifications... but if you can give us an idea of whether it's catastrophic or not, or if you think it should gate them, I can release them tomorrow when 14.04 is announced.

David Duffey (dduffey) wrote :

I think Kent tested one or two OCP machines w/ EFI today.

I would be +1 on making an exception here, since it is a new test for an LTS and it doesn't affect any other testing of actual workloads (I/O, Network, etc.)

David

Colin Ian King (colin-king) wrote :

The test is exercising the UEFI run time get variable service. The fact that it locks up in an ioctl shows that control has passed from fwts to the kernel, that the kernel has passed control to the UEFI run time service, and that the service never returned.

Such services are critical to ensuring features such as setting UEFI variables when installing the boot loader or updating UEFI variables. If they hang, then it's a UEFI issue. fwts cannot break out of the hang, because we are waiting for the processor to return back from the service, which it does not.

So:

1. It's not fwts' fault if it blocks indefinitely.
2. In my opinion, the firmware is broken and not fit for purpose.
3. I cannot make fwts return from a hung call like this; it really is impossible.

We need to escalate this to the BIOS experts in HWE for their opinion. But I am firmly declaring that any firmware that fails to execute such services is really unfit for use.

Rod Smith (rodsmith) wrote :

FWIW, I found a function called dup_variable_bug() in the drivers/firmware/efi/vars.c file of the kernel. It includes the following comment:

        /*
         * Disable the workqueue since the algorithm it uses for
         * detecting new variables won't work with this buggy
         * implementation of GetNextVariableName().
         */

This suggests that there's already one bug workaround related to this function in the kernel (but I haven't traced the logic to figure out what's going on). Maybe this workaround is inadequate, or maybe this is some other EFI bug that's not so easily worked around.

Rod Smith (rodsmith) wrote :

I couldn't find references to the GetNextVariableName() EFI call in the efibootmgr source code or in GRUB (although GRUB does seem to implement its own version, efiemu_get_next_variable_name(), for its own limited EFI-emulation code). I'd expect a consistent hang as seen in the test suite would prevent installation and/or booting of the server, too. There may be calls I haven't tracked down in other tools, though. Also, there might be more subtle flaws that can't be detected by the test because it's hanging before it might detect such problems.

Kent Baxley (kentb) wrote :

Looks like we are hanging on my OCP Decathlete here as well at the same testing point. I've been sitting at the fwts test for several minutes now:

$ tail .cache/plainbox/sessions/pbox-autzawp2.session/CHECKBOX_DATA/fwts_results.log
This test run on 16/04/14 at 21:56:39 on host Linux decathlete 3.13.0-23-generic
#45-Ubuntu SMP Fri Apr 4 06:58:38 UTC 2014 x86_64.

Command: "fwts -q --stdout-summary -r /home/ubuntu/.cache/plainbox/sessions
/pbox-autzawp2.session/CHECKBOX_DATA/fwts_results.log uefirtvariable".
Running tests: uefirtvariable.

uefirtvariable: UEFI Runtime service variable interface tests.
--------------------------------------------------------------------------------
Test 1 of 7: Test UEFI RT service get variable interface.

ubuntu@decathlete:~$ cat /sys/class/dmi/id/bios_version
SE5C600.86B.02.01.0002.082220131453
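
Since the behaviour differs between machines, it helps to record the firmware identity alongside each result. A small sketch that reads the same sysfs field as the cat command above, plus a few neighbouring ones; read_dmi_fields is a hypothetical helper and it assumes a Linux system that exposes /sys/class/dmi/id.

```python
from pathlib import Path


def read_dmi_fields(
    names=("sys_vendor", "product_name", "bios_version", "bios_date"),
    base="/sys/class/dmi/id",
):
    """Return {field: value}; value is None if the field cannot be read."""
    fields = {}
    for name in names:
        try:
            fields[name] = (Path(base) / name).read_text().strip()
        except OSError:
            fields[name] = None
    return fields


for field, value in read_dmi_fields().items():
    print(f"{field}: {value}")
```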

Kent Baxley (kentb) wrote :

attaching sosreport for my decathlete

Colin Ian King (colin-king) wrote :

Comment #15:

1. If there is a bug in the firmware, do we expect the kernel to work around it? If so, is it a kernel bug or a firmware bug that needs resolving?

2. If there is a bug in the firmware that causes the machine to lock up, how can the kernel work around it if we're stuck inside the firmware and can't get back out?

Is this bug report against fwts, the kernel or various bits of broken firmware? I'm not sure what the exact focus is at the moment?

@IvanHu, care to look at this as you know this code better than I do?

Ivan Hu (ivan.hu) wrote :

It looks like the firmware is blocking the return to the efi_runtime driver during the runtime service tests.
A runtime service provided by the firmware should not block on return, even if it is passed bad parameters.
It should return an error such as EFI_INVALID_PARAMETER or EFI_NOT_FOUND.
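
For reference, a sketch of how such a status value decodes on a 64-bit system: error codes have the top bit set, and the low bits carry the code number. The numbers below follow the UEFI specification's status code table; only a few codes are listed, and efi_status_name is a made-up helper, not part of fwts.

```python
EFI_ERROR_BIT = 1 << 63  # high bit marks an error on 64-bit platforms

# Partial list of error codes (number -> name).
EFI_ERRORS = {
    1: "EFI_LOAD_ERROR",
    2: "EFI_INVALID_PARAMETER",
    3: "EFI_UNSUPPORTED",
    5: "EFI_BUFFER_TOO_SMALL",
    7: "EFI_DEVICE_ERROR",
    9: "EFI_OUT_OF_RESOURCES",
    14: "EFI_NOT_FOUND",
}


def efi_status_name(status):
    """Map a 64-bit EFI_STATUS value to a readable name."""
    if status == 0:
        return "EFI_SUCCESS"
    if status & EFI_ERROR_BIT:
        code = status & (EFI_ERROR_BIT - 1)
        return EFI_ERRORS.get(code, f"unlisted error {code}")
    return f"warning {status}"


print(efi_status_name(0x8000000000000002))  # EFI_INVALID_PARAMETER
print(efi_status_name(0x800000000000000E))  # EFI_NOT_FOUND
```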

As Colin commented, if the UEFI runtime variable interfaces fail, UEFI variables cannot be set or deleted, which affects UEFI boot, the boot path, secure boot, etc.

The test stops at the first subtest of uefirtvariable. This subtest sets a variable and then gets it back through the UEFI runtime service interfaces SetVariable and GetVariable. It would be helpful to test other runtime service interfaces as well, such as:

sudo fwts uefirttime -- tests the SetTime and GetTime interfaces

efibootmgr can be used to add a boot path variable through the interfaces provided by the efivar driver, which in turn uses the SetVariable runtime service provided by the firmware. It would be useful to add a boot path variable with efibootmgr to see whether it works normally.
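
As a read-only companion to the efibootmgr check, a variable can also be read through efivarfs, assuming it is mounted at /sys/firmware/efi/efivars (the thread does not say whether it is on these systems). Each efivarfs file holds a 4-byte little-endian attribute word followed by the variable data. The sketch below parses that layout from bytes; parse_efivarfs is a hypothetical name and the sample bytes are synthetic.

```python
import struct

# Attribute bits from the UEFI specification (partial list).
ATTRIBUTES = {
    0x1: "NON_VOLATILE",
    0x2: "BOOTSERVICE_ACCESS",
    0x4: "RUNTIME_ACCESS",
}


def parse_efivarfs(raw):
    """Split efivarfs file content into (attribute names, data bytes)."""
    if len(raw) < 4:
        raise ValueError("efivarfs content must start with 4 attribute bytes")
    (attrs,) = struct.unpack("<I", raw[:4])
    names = [name for bit, name in ATTRIBUTES.items() if attrs & bit]
    return names, raw[4:]


# Synthetic example: attributes 0x6 followed by a 16-bit value of 1.
names, data = parse_efivarfs(b"\x06\x00\x00\x00\x01\x00")
print(names, data.hex())  # ['BOOTSERVICE_ACCESS', 'RUNTIME_ACCESS'] 0100
```

On a real system the input would come from reading a file such as BootCurrent-8be4df61-93ca-11d2-aa0d-00e098032b8c in that directory as root. On firmware with the problem described here, such a read may itself hang, so it is best tried on a machine that can be power-cycled.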

Ivan Hu (ivan.hu) wrote :

This might be because the efi_runtime driver passes a userspace pointer to the firmware, which causes the firmware to block.

I've built a test fwts version, V14.04.00.01.

Please check whether it fixes the issue, either by

switching to fwts-efi-runtime-dkms v14.04.00.01 (attached)

or by

updating fwts to the test version from the Scratch PPA:

sudo apt-add-repository ppa:firmware-testing-team/ppa-fwts-stable
sudo apt-get update
sudo apt-get install fwts

Changed in fwts:
assignee: nobody → Ivan Hu (ivan.hu)

Rod Smith (rodsmith) wrote :

Ivan,

Thanks! That seems to have gotten around the hang. Here's a copy of the results.log that I get when running "sudo fwts uefirtvariable" on one of the affected computers. Several of the tests are now returning EFI_INVALID_PARAMETER. I don't know if this indicates an EFI bug or a bug in fwts. If you need access to one of the systems that's failing to do further diagnostics yourself, that can be arranged; just say so.

Gary Gaydos (gaydos) wrote :

Ivan:
I've retested using the fwts-efi-runtime-dkms package you attached above. The two systems under test were the ones that Jeff, the original submitter, was referring to. After applying your new package the fwts no longer hangs. I tested both with the misc tests only using canonical-certification-server, and fwts uefirtvariable in a loop. The results.html file is attached.

Does your new package only correct the test suite hang, but not the underlying bug? Or does the new package correct the root cause?

Regards, Gary

Rod Smith (rodsmith) wrote :

FWIW, I've tried on another system (a desktop with an ASUS P8H77-I motherboard). It hangs with the stock 14.04 GA code, but with Ivan's modified version, it not only does not hang, but it passes the tests. I'm appending the results.log file.

That still leaves the question of why the Intel Decathlete is failing -- a firmware bug or a bug in the fwts program....

Ivan Hu (ivan.hu) wrote :

The test package only fixes the blocking issue in fwts; the other failures appear to come from the firmware itself.

Ivan Hu (ivan.hu) wrote :

The patch was sent by Matt and has been committed to fwts:
https://lists.ubuntu.com/archives/fwts-devel/2014-April/004633.html

Changed in fwts:
milestone: none → 14.05.00
importance: Undecided → High
status: New → Fix Committed
Ivan Hu (ivan.hu)
Changed in fwts:
status: Fix Committed → Fix Released

Rod Smith (rodsmith) wrote :

Is this fix different from the one in post #20? If so, I can test it on one of the systems that showed problems with the original.

Ivan Hu (ivan.hu) wrote :

It is included in the new formal release of fwts. You can test with fwts version V14.05.00 by running:
sudo apt-add-repository ppa:firmware-testing-team/ppa-fwts-stable
sudo apt-get update
sudo apt-get install fwts

Rod Smith (rodsmith) wrote :

Ivan, I've tried 14.05.00 and had no problems with it -- that is, it produced the same results as the test version on the test computer, and did not hang it. (I ran both standalone and as part of c-c-s.)
