i386 emulation unreliable since commit b76f0d8c2e3eac94bc7fd90a510cb7426b2a2699

Bug #1127369 reported by Andreas Gustafsson on 2013-02-16
This bug affects 2 people
Affects: QEMU
Importance: Undecided
Assigned to: Unassigned

Bug Description

I am running daily automated tests of the qemu git mainline that
involve building qemu on a Linux host (32-bit), booting a NetBSD guest
in qemu-system-i386, and running the NetBSD operating system test
suite on the guest.

Since commit b76f0d8c2e3eac94bc7fd90a510cb7426b2a2699, there has been
a marked increase in the number of failing test cases. Before that
commit, the number of failing test cases was typically in the range 3
to 6, but since that commit, test runs often show 10 or more failed
tests, or they end prematurely due to a segmentation fault in the test
framework itself.

To aid in reproducing the problem, I have prepared a disk image
containing a NetBSD 6.0.1 system configured to automatically run
the test suite on boot.

To reproduce the problem, run the following shell commands:

  wget http://www.gson.org/bugs/qemu/NetBSD-6.0.1-i386-test.img.gz
  gunzip NetBSD-6.0.1-i386-test.img.gz
  qemu-system-i386 -m 32 -nographic -snapshot -hda NetBSD-6.0.1-i386-test.img

The disk image is about 144 MB in size and uncompresses to 2 GB. The
test run typically takes a couple of hours, printing progress messages
to the terminal as it goes. When it finishes, the virtual machine
will be automatically powered down, causing qemu to exit.

Near the end of the output, before the shutdown messages, there should
be a summary of the test results. The expected output looks like this:

  Summary for 500 test programs:
      2958 passed test cases.
      5 failed test cases.
      45 expected failed test cases.
      70 skipped test cases.

A number of "failed test cases" in the range 3 to 6 should be
considered normal. Please ignore the "expected failed test cases".
Using a version of qemu affected by the bug, the summary will look
more like this:

  Summary for 500 test programs:
      2951 passed test cases.
      12 failed test cases.
      45 expected failed test cases.
      69 skipped test cases.

Or it may end with a segmentation fault like this:

   p2k_ffs_race: atf-report: ERROR: 10912: Unexpected token `<<EOF>>'; expected end of test case or test case's stdout/stderr line
[1] Segmentation fault (core dumped) atf-run |
      Done(1) atf-report
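
The criteria above can be checked mechanically against a captured console log. The following is a minimal sketch, not part of the original report: the function name classify_run and the labels it prints are hypothetical, and treating any count above 6 as excessive is an assumption based on the "3 to 6" range stated above.

```shell
# Hypothetical helper: classify one test run from its console log.
# Reads the log on stdin and prints "normal", "excessive", or "incomplete".
classify_run() {
    # Extract N from the plain "N failed test cases." line only.
    # "N expected failed test cases." does not match, because the
    # pattern requires " failed" directly after the number.
    # The trailing [[:space:]]* tolerates carriage returns from a serial console.
    failed=$(sed -n 's/^[[:space:]]*\([0-9][0-9]*\) failed test cases\.[[:space:]]*$/\1/p' | tail -n 1)

    if [ -z "$failed" ]; then
        # No summary was printed, e.g. atf-run crashed or the guest panicked.
        echo incomplete
    elif [ "$failed" -le 6 ]; then
        # Assumption: up to 6 failures is the normal baseline.
        echo normal
    else
        echo excessive
    fi
}
```

One possible use, assuming the console output was saved by appending "| tee console.log" to the qemu command line, is:

  classify_run < console.log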

The problem goes away if the "-m 32" is omitted from the qemu command line,
which leads me to suspect that the problem may be related to paging or
swapping activity in the guest.

The revision listed in the subject, b76f0d8c2e3eac94bc7fd90a510cb7426b2a2699,
is the first one exhibiting the excessive test failures, but the bug may already
have been introduced in the previous commit, fdbb84d1332ae0827d60f1a2ca03c7d5678c6edd.
If I attempt to run the test on fdbb84d1332ae0827d60f1a2ca03c7d5678c6edd, the
guest fails to boot. The revision before that, 32761257c0b9fa7ee04d2871a6e48a41f119c469,
works as expected.
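
The three revisions above correspond to the good, bad, and skip states used by git bisect. As a rough sketch only: the helper below is hypothetical (its name and the outcome labels are not from this report) and shows how the result of one test run might be turned into the exit codes that "git bisect run" interprets, where 0 means good, 125 means skip, and other codes from 1 to 127 mean bad.

```shell
# Hypothetical helper for a script driven by "git bisect run".
# $1 describes one test run: "normal", "excessive", "incomplete", or "noboot".
bisect_verdict() {
    case "$1" in
        normal)               return 0 ;;    # good: failure count in the usual range
        noboot)               return 125 ;;  # skip: revision cannot be tested
        excessive|incomplete) return 1 ;;    # bad: too many failures or a crashed run
        *)                    return 128 ;;  # unknown outcome: codes above 127 abort the bisection
    esac
}
```

The same states can also be marked by hand:

  git bisect start b76f0d8c2e3eac94bc7fd90a510cb7426b2a2699 32761257c0b9fa7ee04d2871a6e48a41f119c469
  git bisect skip fdbb84d1332ae0827d60f1a2ca03c7d5678c6edd

If fdbb84d1332ae0827d60f1a2ca03c7d5678c6edd is the only commit between the two endpoints, as described above, bisection cannot narrow the regression further than those two candidates.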
--
Andreas Gustafsson, <email address hidden>

Aurelien Jarno (aurel32) wrote :

This has been fixed in commit d6e839e718c2540b880ac9d2d7a49fb7ade02cfb

Changed in qemu:
status: New → Fix Committed

Richard Jones (rjones-redhat) wrote :

Thanks for the detailed test case and fix. Unfortunately, I cannot find
d6e839e718 in the current qemu git. Is it possible the commit hash changed
because of a rebase when it was committed?

Aurelien Jarno (aurel32) wrote :

Oops, sorry. The correct commit hash is 52ae646d4a3ebdcdcc973492c6a56f2c49b6578f

Andreas Gustafsson (gson) wrote :

Thank you. Now if someone could also fix bug 1154328, my automated tests might run again...

Richard Jones (rjones-redhat) wrote :

Thanks - fix committed to Fedora. Hopefully this will squash the rare and random segfaults in the libguestfs test suite.

Andreas Gustafsson (gson) wrote :

Now that bug 1154328 has been fixed, the NetBSD OS test suite successfully starts, but it still does not work as expected; actually things have gone from bad to worse. Every test run since the 1154328 fix has either timed out after running only a fraction of the tests, or has ended in a guest kernel panic. For example, here is part of the console output from a recent test run using qemu git revision 93b48c201eb6c0404d15550a0eaa3c0f7937e35e:

msdosfs_symlink_zerolen: [1.172395s] Skipped: symlinks not supported by file system
    nfs_access_simple: [9.423184s] Failed: child died
    nfs_attrs: fatal double fault in user mode
trap type 269 code 80000000 eip c010cca2 cs 8 eflags ed1fe cr2 b83fd834 ilevel 0
panic: trap
cpu0: Begin traceback...
printf_nolog(c0ba9e6b,c34adf7c,c34adf7c,c010cca2,8,ed1fe,b83fd834,0,0,0) at netbsd:printf_nolog
trap_tss() at netbsd:trap_tss
--- trap via task gate ---
b83ff74c:
cpu0: End traceback...

So either the supposed fix actually made things worse, or some unrelated regression has been introduced while the tests were inoperable due to bug 1154328.

Andreas Gustafsson (gson) wrote :

My tests are now working again. The point in time when they started working is consistent with this having been fixed by commit 38ebb396c955ceb2ef7e246248ceb7f8bfe1b774, "target-i386: ROR r8/r16 imm instruction fix". Many thanks to everyone involved in fixing it.

Aurelien Jarno (aurel32) on 2013-05-20
Changed in qemu:
status: Fix Committed → Fix Released