Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]

Bug #1745312 reported by David Lindsay
This bug affects 3 people
Affects   Status    Importance   Assigned to   Milestone
QEMU      Expired   Undecided    Unassigned

Bug Description

[Headsup: This report is long-ish due to the amount of detail I've stumbled on along the way that I think is relevant to include. I can't speak as to the complexity of the actual bugs, but the size of this report should not suggest that the reproduction process is particularly headache-inducing.]

Hi!

I recently needed to fire up some ancient software for research purposes and got very distracted discovering and playing with old versions of Windows :). In the process I've discovered some glitches with disk I/O.

I believe I've stumbled on two completely separate issues that coincidentally surfaced at the same time. It's possible that components of this report will be re-filed as more specific new bugs, but I'm not an authority on QEMU internals or how to narrow down/categorize what I've found.

- The first bug only surfaces when the "isapc" machine type is used. It intermittently produces "General failure {read,writ}ing drive _" under MS-DOS 6.22, and also somehow interferes with early bootstrap of Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux) appears to make no difference whatsoever, which may help with debugging.

- The second issue involves
  - a WinNT4 disk image
  - created by running through a bog-standard NT4 install inside QEMU 2.9.0
  - which will now fail to boot in any version of QEMU - even version 1.0
    - but which VirtualBox will boot fine
      - but only if I point VirtualBox at QEMU's raw disk image via a
        hacked-together VMDK file
      - if the raw image is converted to VHD(X), VirtualBox will also fail
        to boot the image with exactly the same error as QEMU
      - this state of affairs is not affected by image sparseness (which makes
        sense)

I'm confident I've bisected the first issue.

I wasn't able to bisect the second issue (as all tested versions of QEMU behaved identically), but I've figured out a working repro testcase and I believe I've managed to pin down a solid root cause.

== #1: Intermittent I/O issues when `-M isapc` is used =====

These symptoms sometimes take a small amount of time and fiddling to trigger, but I AM able to consistently surface them on my machine after a short while. (I am very very interested to hear if others cannot reproduce them.)

So, first of all:

https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
  (Jul 30 2013): the last version that works

https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
  (Oct 30 2013): the first version that intermittently fails

Maybe lift out and build these branches while reading. *shrug*
(How to do this can be found at the end of this report - along with a time-saving ./configure line, FWIW)

Here are the changelists between these two revisions:

https://github.com/qemu/qemu/compare/306ec6c...e689f7c
(Compare direction: OLD to NEW) (Commits: 166 Files changed: 192)

https://github.com/qemu/qemu/compare/e689f7c...306ec6c
(Compare direction: NEW to OLD) (Commits: 30 Files changed: 22)

(Someone else more familiar with Git might know why GitHub returns results for both compare directions, and/or if the 2nd link is useful information. The first link returns a lot more results than the 2nd one, at least. Does comparing new>old return deletions?)
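
A note on the question above: GitHub's three-dot compare lists the commits reachable from the right-hand revision that are not reachable from the left-hand one. If both directions return commits, neither revision is an ancestor of the other, i.e. the two histories diverge. The sketch below is one way to check this locally. `count_only_in` is a hypothetical helper name and the checkout path is a placeholder; it only assumes a local clone of the QEMU repository.

```shell
#!/bin/bash
# Hypothetical helper: count commits reachable from revision $3 but not
# from revision $2, in the repository at path $1.
# This is the set that `git rev-list A..B` enumerates.
count_only_in() {
    git -C "$1" rev-list --count "$2..$3"
}

# Example usage against a local QEMU checkout (path is a placeholder):
#   count_only_in /path/to/qemu 306ec6c e689f7c   # commits only in the newer revision
#   count_only_in /path/to/qemu e689f7c 306ec6c   # commits only in the older revision
# If the second number is non-zero, the older revision is not an ancestor
# of the newer one.
```

If the second count is non-zero, the two revisions are not a simple "before and after" pair on one line of history, which is worth keeping in mind when reading the bisection result.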

---

Now on to the symptoms. In a moment I'll describe reproduction.

# MS-DOS 6.22

The first symptom I discovered was that trivial read and write operations under MS-DOS would sometimes fail:

  C:\>echo test > hi

  General failure writing drive C
  Abort, Retry, Fail?

Anything else that exercises the disk behaves similarly:

  C:\>dir /s > nul

  General failure reading drive C
  Abort, Retry, Fail?

(Note that the above demonstrates both write and read failures)

(Also, FWIW, `dir /s` == `ls -R`)

The behavior of the I/O errors is not possible to characterise as it fluctuates so much. For example something as simple as DIR can produce wildly differing results: in one run, poking around with DIR ended with DOS deciding C:\ was empty at one point; at another point in a different run C:\ mysteriously dropped 50% of its contents only to magically gain it all back moments later after some poking around in one of the subdirectories that was still visible.

The time it takes to trigger these errors is also highly variable. QEMU may fall over as early as hanging forever at "Starting MS-DOS...", or I might get all the way into Windows 3.1 before it triggers (in which case Win3.1 reports vague memory errors - of all things).

Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard Disk..." "Boot failed: could not read the boot disk" ... "No bootable device.", and on one occasion I even got "Non-System disk or disk error" "Replace and strike any key when ready"!

# WinNT 4 Terminal Server

Most of the time, NTLDR will fire up normally. But every so often...

  SeaBIOS (version rel-1.7.3-117-g31b8b4e-20131206_080705-nilsson.home.kraxel.org)

  Booting from Hard Disk...
  A disk read error occurred.
  Insert a system diskette and restart
  the system.

(NB. You're seeing the old SeaBIOS version included with e689f7c, which was the first buggy commit.)

If NT gets past this point without erroring out (ie, it makes it to the boot menu), the rest of the system is 100% fine and there are no other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was able to enable disk compression, answer "Yes" to "Compress entire disk now?" and have the process fully complete. No hitches.

This makes me wonder whether the problem is related to LBA and/or Int 13h, or something in that general area of functionality. (I'm woefully ignorant about such low-level details.) Perhaps DOS/Win3.1 are stuck using a disk access mode that QEMU implements incorrectly, while NT 4 (once NTOSKRNL is up and running) is able to use a different mode or access mechanism.

I'm really interested to get some understanding of what the root issue is here, when this is fixed. (I wonder if it's a timing thing?)

I've observed some unusual behavior with repeated restarts. In one case, I attempted to start NT4 multiple times, and QEMU consistently failed with "No bootable device" each time. So, I removed `-M isapc`, promptly got a boot menu, hit ^C, readded `-M isapc` - and continued to get a boot menu. Yep. I'll accept "really really big coincidence" but I do very much wonder if something else is going on here. I've observed many similar incidents. It makes me wonder whether the contents of memory or some other system state is an influence. Very probably not, but still...

-- Reproduction --------------------------------------------

First of all, there was unfortunately no way for me to avoid having to post entire disk images, but I've managed to compress everything down to 174MB total download size.

FWIW, WinWorld and many other sites seem to have no operational issues providing clear pointers to CD keys; I consider my distribution of my installed HDD images an extension of the apparent status quo.

That being said, I've put everything on Google Drive so nobody has to headscratch about Launchpad/Canonical/etc's stance on hosting this data.

So, this folder contains the disk images: https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c ("Download all" at the top-right will create a ZIP file, but FWIW downloading the individual files simultaneously would implement a rough form of download acceleration)

File meta info:

Compressed  Apparent  Actual  File
38M         200M      103M    win31.img.xz
82M         1G        289M    wnt4ts-broken.img.xz
55M         350M      146M    wnt4ts-intermittent.img.xz

SHA-256s:

win31: 8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
broken: a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa
intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784

(Wanted to keep the checksum lines within 80 columns)
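
For convenience, a minimal sketch of checking a download against the values above. `verify_sha256` is a hypothetical helper name, and it assumes `sha256sum` from GNU coreutils is installed. The report does not say whether the checksums cover the compressed .xz files or the decompressed images, so it may be necessary to try both.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the SHA-256 of file $1 equals the
# hex digest given as $2.
verify_sha256() {
    local actual
    actual=$(sha256sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Example usage (if the .xz does not match, try the decompressed .img):
#   verify_sha256 win31.img.xz \
#     8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2 \
#     && echo OK || echo MISMATCH
```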

And, since I can't figure out where else in this report to put this: wnt4ts-broken.img's password is "admin" (although something seems to have happened to the disk and NT doesn't actually boot properly :( ), and wnt4ts-intermittent.img's password is "1234". (These were set up as test images. Now I'm _really_ glad I used simple passwords! :) )

---

I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.

# MS-DOS

DOS is the simplest. It basically consists of

$ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm

And then literally just playing around. Things to try include creating files (`echo blah > file`), repeatedly seeking across the entire FAT (`dir /s > nul` or `dir /s`), and launching Windows (`win`).

win31.img is not special (as far as I can tell) and merely consists of the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC. I've basically just included the image for convenience.

Generally no single "run" is immune to starting Win3.1 and then launching File Manager; if that doesn't generate an error, something is definitely up.

The second best trigger is creating new files. That very very frequently produces "General Failure ...", but not always.

# WinNT 4

Windows NT 4 is a bit more complicated. Because this error only occurs at presumably a single small point very early in boot, the window of opportunity for the glitch to surface within is much much narrower and thus often requires a larger number of tries.

Anecdotally I've had QEMU hit the boot error at the first try/run, and after as many as 63 "successful" boots.

I made a small test harness that automates the launch process. It consists of two shell scripts and requires tmux (and netcat). (*Potential epilepsy warning*: if you use a light-colored terminal background, the terminal QEMU is repeatedly invoked from will continuously flash rapidly from white to black.)

One of the scripts is run inside a tmux session in one terminal, while the other script is run in its own terminal (without any tmux).

I named this one `run-qemu-loop`:

--8<--------------------------------------------------------

#!/bin/bash

# ---

qemu=/path/to/qemu-system-i386
#or, alternatively: (I used the following line myself so I
#could tab-complete my way to different qemu executables)
#qemu="$1"

disk=/path/to/wnt4ts-intermittent.img

# ---

port=4444

rm -f STOP itercount

itercount=0

while :; do

 [ -f STOP ] && break

 ((itercount++))
 echo $itercount > itercount

 $qemu \
  -enable-kvm -vga cirrus -curses -M isapc \
  -drive file="$disk",format=raw \
  -chardev socket,id=mon0,host=localhost,port=$port,server,nowait \
  -mon chardev=mon0,mode=readline

 #point to an otherwise-unused terminal if you like (see also: `tty`)
 #echo "$itercount run(s)" > /dev/pts/__

done

------------------------------------------------------------

Not much logic above; this just repeatedly runs QEMU for as long as
the file `STOP` does not exist in the current directory.

The key "magic" bit is that QEMU is launched in -curses mode.

The other key bit is that the above script is run inside tmux.

Here's `tmux-ctl-loop`:

--8<--------------------------------------------------------

#!/bin/bash

port=4444

tmux=./tmux

printf -v l '%0.0s-' {0..25}
h1="$l/ buffer dump begin \\$l"
h2="$l-\\ buffer dump end /-$l"

while :; do

 while :; do
  echo | nc localhost $port -q0 -w1 > /dev/null && break
  echo 'Start qemu!'
 done

 buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"

 echo "$h1"
 [[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
 echo "$h2"

 if echo "$buffer" | grep -q 'A disk read error occurred.'; then

  s="<Crashed after $(< itercount) runs.>"
  echo "$s"
  echo "$s" >> stats

  touch STOP

  #echo q | nc localhost $port -q0 > /dev/null

  exit

 elif echo "$buffer" | grep -q 'OS Loader V4.00'; then

  echo '<Booted successfully, trying again>'

  echo q | nc localhost $port -q0 > /dev/null

 else

  echo '<Waiting for boot>'

 fi

done

------------------------------------------------------------

Nothing particularly amazing going on here either.

While `run-qemu-loop` is running inside tmux in the first terminal, this is running in the 2nd one.

The small infinite loop at the top only breaks when it can successfully ping QEMU and it knows it's running.

Then, a screendump of the contents of the terminal QEMU is in is fetched from tmux, and the buffer's content is analyzed.

- If NTLDR fails, the script creates `STOP` to halt run-qemu-loop and then
  exits. (The line that would send `q` to QEMU is commented out, so the failed
  QEMU instance keeps running.)

- If NTLDR loads successfully, the script sends `q` to QEMU through netcat and
  continues looping. (run-qemu-loop will not find the `STOP` file, and so
  restarts qemu.)

The scripts run very quickly, with 2-3 iterations per second on my i3 box.

# Usage

Save the two scripts above to the same directory as wnt4ts-intermittent.img,
then:

- (If port 4444 doesn't work, the value needs to be changed in both scripts.)

- In the first terminal, run `tmux -S <file>`, where <file> names the socket
  tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
  (with `tmux=./tmux`, the command would be `tmux -S tmux`)

- Still in the first terminal (and now also inside tmux), enter
  `./run-qemu-loop`, passing the path to qemu if you're using that approach
  (refer to the first few lines of the script). Don't hit enter yet.

- Now, in the 2nd terminal, type `./tmux-ctl-loop`

- Hit enter in both terminals.

Rationale for timing of Enter key:

- Running run-qemu-loop first will start QEMU, and if NTLDR starts
  successfully it will immediately begin counting down from 30. If NT actually
  starts to boot and is then hard-shut-down this /may/ affect the disk image.

- tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' until
  run-qemu-loop is running.

- Starting both scripts at "more or less" the same time (no rush) works out
  well.

Hopefully potential script modifications are obvious; for example

- changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
  yourself
  (NB, if `STOP` is not created, when qemu finally exits it will of course
  promptly be relaunched)

- pointing run-qemu-loop to a modified qemu binary

== #2: QEMU-vs-VirtualBox image issue ======================

I was initially completely stumped by this issue, perhaps unsurprisingly so. :)

wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked NTFS and installed everything (which took a while).

NT setup reboots a number of times during the install process, and IIRC those all went just fine. However, at some point, the image began to consistently bomb out with "A disk read error occurred. ...", and stubbornly refused to boot, regardless of the number of boot attempts I tried.

QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build on my system) all consistently hit "disk read error occurred".

I tried compiling QEMU 1.0 using clang so I could build for 32-bit on my 64-bit system (GCC 7 died with "Frame pointer required, but reserved"). The resulting qemu completely crashed if I didn't enable KVM (ie, TCG was (understandably) broken); with KVM enabled qemu didn't crash, but NTLDR halted with the same error as on 64-bit qemu. (TL;DR, no difference whatsoever.)

My initial reaction at this point was to try the image on another virtualization platform. My first pick was VirtualBox.

So, I followed the official instructions for pointing VirtualBox to physical disk images, except I substituted a /dev/loopN device I'd pointed to the image file via losetup.

And... VirtualBox picked the image up fine and Just Worked(TM). Yay! - but not yay. What gives?!

Confused, I then tried to convert the disk image to VHD format. Unfortunately, for some reason, if I try `qemu-img convert ... -O vhdx ...`, VirtualBox chokes on the result:

-----

VD: error VERR_NOT_SUPPORTED opening image file
'/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).

Result Code: NS_ERROR_FAILURE (0x80004005)
Component: MediumWrap
Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)

-----

Welp.

Well, a bit more digging later, and I found I could do

$ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd

but... as soon as I pointed VirtualBox to this, it too began to choke with "A disk read error occurred".

And yet, the VMDK->raw image setup worked just fine.

I found I could even replace the loop device with the path of the .img file itself and that worked just fine too.

At my wits' end, I followed some online instructions to learn about manual CHS configuration so I could try and get the image working in Bochs. "A disk read error occurred". I wasn't surprised.

It was at this point I began to give up, but I decided to try One Last Thing(TM) before properly throwing in the towel.

:)

I decided to learn a bit more about how `VBoxManage internalcommands createrawvmdk` worked, and try one thing in particular: I can edit the .vmdk file, but can I point `createrawvmdk` at the .img file directly too?

Turns out, yes you can.

It also turns out that this promptly caused VirtualBox to bomb out.

Interesting.

For reference, here's the VMDK file I initially created (by pointing `createrawvmdk` at /dev/loopN) and then later edited to point straight to the .img file, with both approaches resulting in successful boot.

--8<--------------------------------------------------------

# Disk DescriptorFile
version=1
CID=e35b9a45
parentCID=ffffffff
createType="fullDevice"

# Extent description
RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0

# The disk Data Base
#DDB

ddb.virtualHWVersion = "4"
ddb.adapterType="ide"
ddb.geometry.cylinders="1523"
ddb.geometry.heads="16"
ddb.geometry.sectors="63"
ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
ddb.geometry.biosCylinders="761"
ddb.geometry.biosHeads="32"
ddb.geometry.biosSectors="63"

------------------------------------------------------------

Here's the _diff_ of what happens if I point `createrawvmdk` at wnt4ts-broken.img directly:

--8<--------------------------------------------------------

ddb.geometry.cylinders="2080"
ddb.geometry.heads="16"
ddb.geometry.sectors="63"

------------------------------------------------------------

:D

Naturally,

$ qemu-system-i386 -drive file=wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63 -M isapc -sdl

will boot happily on 2.9.0 (notwithstanding the occasional "disk read error occurred" documented above).

It will also boot in 1.6.0.

(POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank 640x480 window and use 0% CPU if I specify `-M isapc`.)

And, of course, using these CHS values in Bochs also results in successful boot as well (after setting the CPU type to pentium).

Unfortunately, I have no idea what sequence of events caused the creation of the VMDK file above. No invocation of `createrawvmdk` is producing a VMDK file with the CHS settings above.

I've only just begun to learn about the intricacies of CHS. Am I to understand that these values are stored amongst the first 512 bytes of the disk? If this is the case, then I wonder what changed the data, and why. I was initially only using QEMU 2.9.0, and didn't move the image to different VMs or QEMU versions. Perhaps Windows NT got confused about the disk CHS and rewrote it?
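
One piece of arithmetic that may help here, offered as a sketch rather than a diagnosis: with 16 heads and 63 sectors per track, a reported cylinder count is commonly just the total sector count divided by 16*63, rounded down. `cyls_for` is a hypothetical helper. The first sector count is taken from the `RW 1536000` extent line in the VMDK above; the second assumes the image is exactly 1 GiB of 512-byte sectors.

```shell
#!/bin/bash
# Hypothetical helper: cylinders = total_sectors / (heads * sectors_per_track),
# using integer (floor) division.
cyls_for() {
    echo $(( $1 / ($2 * $3) ))
}

cyls_for 1536000 16 63   # extent size from the VMDK "RW 1536000" line
cyls_for 2097152 16 63   # assumed: 1 GiB image / 512-byte sectors
```

These print 1523 and 2080, matching the two `ddb.geometry.cylinders` values. That would be consistent with the working VMDK having been generated while the loop device exposed only 1536000 sectors (750 MiB), but this is an inference from the numbers, not something the report confirms.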

== Sporadic BIOS-level boot failure ========================

I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No bootable device" (et al), even with the above manually-applied CHS settings.

Commit e689f7c also presents such errors.

Commit 306ec6c does not suffer from intermittent breakage of any kind:

- No SeaBIOS flake-outs
- No "Non-system disk or disk error"
- No "A disk read error occurred"
- No "General failure ..."

While most of my confidence in commit 306ec6c is based on anecdotal evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level I/O stability and left this modified version running for a few minutes.

--8<--------------------------------------------------------

#!/bin/bash

port=4444

tmux=./tmux

printf -v l '%0.0s-' {0..25}
h1="$l/ buffer dump begin \\$l"
h2="$l-\\ buffer dump end /-$l"

while :; do

 while :; do
  echo | nc localhost $port -q0 -w1 > /dev/null && break
  echo 'Start qemu!'
 done

 buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"

 echo "$h1"
 [[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * )'
 echo "$h2"

 if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
  grep -q 'No bootable device'
 then

  s="<Hit error after $(< itercount) runs.>"
  echo "$s"
  echo "$s" >> stats

  touch STOP

  #echo q | nc localhost $port -q0 > /dev/null

  exit

 elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
  grep -q 'A disk read error'
 then

  echo '<Boot did not hang at BIOS, trying again>'

  echo q | nc localhost $port -q0 > /dev/null

 else

  echo '<Waiting for boot>'

 fi

done

------------------------------------------------------------

For the above to work, the top of run-qemu-loop must also be modified to read something along the lines of

disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63

(Suggestion: modify copies of both scripts)

One small terminal-flicker-headache (and a 57°C CPU) later, I was able to carefully observe just over 350 successful runs in which QEMU commit 306ec6c only ever produced a boot menu. No other hitches.

** Important: **

However, commit 306ec6c will fail to boot, ever, if the CHS geometry is not set to the values VirtualBox "discovered". (Of note is the fact that QEMU (2.9.0) was what initially created this image. I must admit that I don't remember what sequence of QEMU versions I fed the image to - and I maybe, possibly, didn't think to back the file up (sorry), so maybe something mangled something somewhere. But VirtualBox figured it out nonetheless!)

Furthermore, feeding /dev/loopN to any QEMU version will NOT result in correct CHS discovery (and successful boot).

This is what leads me to conclude that I've discovered two separate issues.

== Appendix: How to build the branches =====================

It's very simple.

First, `git clone https://github.com/qemu/qemu` somewhere if you don't already have a local copy. If you have an old git checkout that's from 2014 or later, you can use that old checkout instead. (If you want to test an old checkout you have, the commands below will either work perfectly or completely bomb out with no side effects.)

A full checkout is a ~183MB download. Sorry.

Next, create two new directories somewhere. Name them what you like, eg `qemu-working` and `qemu-broken`.

Now, cd into the checkout directory, and run:

$ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | tar xC /path/to/qemu-working/

$ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | tar xC /path/to/qemu-broken/

The paths can be relative.

Now, run this in both of the new directories:

$ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp --disable-usb-redir --disable-guest-agent --disable-libiscsi --disable-spice --disable-smartcard-nss --disable-vhost-net --disable-docs --disable-attr --disable-cap-ng --disable-vde --disable-user --disable-bluez --disable-vnc-ws --disable-xen --disable-brlapi --enable-debug --target-list=i386-softmmu --disable-fdt

$ make -j64

You can open two terminals and configure and build both simultaneously if you like.

On my decent but very basic (2-core+HT) i3 box, -j64 actually works out - make doesn't actually launch too many gcc processes. You *will* see your system load spike to ~20 though :)
(NB. Do. not. use. -j64. with. the. linux. kernel.)

On my system, a single build with -j64 takes only about 35 seconds. C FTW. (Although this has increased to 1min20sec for more recent builds.)

Most of the configure arguments remove functionality I'll never use (in this situation) and which will only slow down the build.

Once QEMU is built, run qemu-system-i386 directly from where it has been built.

$ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
$ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...

Again, the paths can be relative.

Tags: disk io
Revision history for this message
Stefan Hajnoczi (stefanha) wrote : Re: [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]

On Thu, Jan 25, 2018 at 07:18:52AM -0000, i336_ wrote:
> Public bug reported:
>
> [Headsup: This report is long-ish due to the amount of detail I've
> stumbled on along the way that I think is relevant to include. I can't
> speak as to the complexity of the actual bugs, but the size of this
> report should not suggest that the reproduction process is particularly
> headache-inducing.]

I've CCed people who may be able to help.

I don't have time to read through everything you've posted.

> Hi!
>
> I recently needed to fire up some ancient software for research purposes
> and got very distracted discovering and playing with old versions of
> Windows :). In the process I've discovered some glitches with disk I/O.
>
> I believe I've stumbled on two completely separate issues that
> coincidentally surfaced at the same time. It's possible that components
> of this report will be re-filed as more specific new bugs, but I'm not
> an authority on QEMU internals or how to narrow down/categorize what
> I've found.
>
> - The first bug only surfaces when the "isapc" machine type is used. It
> intermittently produces "General failure {read,writ}ing drive _" under
> MS-DOS 6.22, and also somehow interferes with early bootstrap of Windows
> NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux) appears to
> make no difference whatsoever, which may help with debugging.

Is this using the IDE disk controller? In that case John Snow can help
you debug what's going on at the IDE level.

> - The second issue involves
> - a WinNT4 disk image
> - created by running through a bog-standard NT4 install inside QEMU 2.9.0
> - which will now fail to boot in any version of QEMU - even version 1.0
> - but which VirtualBox will boot fine
> - but only if I point VirtualBox at QEMU's raw disk image via a
> hacked-together VMDK file
> - if the raw image is converted to VHD(X), VirtualBox will also fail
> to boot the image with exactly the same error as QEMU
> - this state of affairs is not affected by image sparseness (which makes
> sense)

VMDK stores the disk geometry (cylinders, heads, sectors), which may
affect guest software. I've CCed Fam Zheng.

Fam Zheng (famz) wrote :

QEMU ignores the CHS numbers in VMDK images. From the report, it seems VirtualBox uses it.

So like what you've discovered, for QEMU the right thing to do for such a guest would be setting the correct values explicitly from the command line, rather than let it decide (guess).

I have no idea about the first issue, though.

John Snow (jnsnow) wrote :

Can you post your commandline for the MSDOS 6.22 issue? NT is known to have a few problems and may be out of scope for what I can help with, but I was under the assumption that MSDOS 6.22 was well-behaved in QEMU.

Commandline and steps to reproduce the error may be helpful (any particular kind of command, workflow, etc that helps trigger the IO errors? How big is the hard disk you are using? etc)

Thanks,
--John

Mario (mario1992-deactivatedaccount) wrote :

I have a similar bug: 1674114

Mdasoh Kyaeppd (mdasoh) wrote :

Can confirm the DOS issue is present. Here are some steps to recreate:
wget http://www.freedos.org/download/download/FD12CD.iso
apt-get install mbr fdisk parted dosfstools qemu-system-x86
# dd if=/dev/zero of=dos.img bs=512 count=1032192
# losetup /dev/loop0 dos.img
# fdisk -u=cylinders /dev/loop0
command: x
expert: h
heads: 16 (you can try different values, 16, 32, 64, 128, 255)
expert: c
cylinders (default 1024):
expert: r
command: c
DOS compatibility flag is set...
command: n
select: p
partition (default 1):
first cylinder (default 1):
last cylinder (default 1024):
command: a
command: t
selected partition 1
type: 6
command: w
# partprobe /dev/loop0
# install-mbr -f /dev/loop0
# mkdosfs -F 16 /dev/loop0p1
# qemu-system-i386 -drive file=/dev/loop0,cache=none,format=raw,index=0 \
-drive file=FD12CD.iso,cache=none,media=cdrom,if=ide,format=raw,index=1 -boot d \
-machine isapc
--------
qemu comes up
"install to harddisk"
select your preferred language
"yes - continue with the installation"
drive C does not appear to be formatted
"yes - please erase and format drive c:"
lbacache flush write error 0c80/chs#0001
...
etc etc etc.

John Snow (jnsnow)
Changed in qemu:
assignee: nobody → John Snow (jnsnow)
John Snow (jnsnow) wrote :

I will try to debug as time permits, but the priority of MS-DOS bugs is not ... measurable with casual tools. There are, however, a lot of other IDE bugs on my plate that are very important, so I am hoping to grab a bunch of IDE bugs at once, but no promises here.

Notably, our geometry detection is not very good, it's more than possible we are misreporting values and confusing DOS. Our IDE disks are also not very consistent about what standard of the spec they are trying to emulate, so there are likely other problems there, too.

If you'd like to debug on your own, I'd recommend enabling tracing and enabling some of the IDE trace points; some of them can be quite verbose -- don't enable the data dumping ones. The control flow ones can be informational sometimes to guess when the guest OS got confused and then walk your way back to a register read that would have picked up some error bits, or to detect busy-waits on registers not changing and try to guess what it was waiting for.

https://github.com/qemu/qemu/blob/master/docs/devel/tracing.txt
https://github.com/qemu/qemu/blob/master/hw/ide/trace-events

Ignore the AHCI and ATAPI traces, and don't use the ide_data_* traces unless you are booting a custom firmware that only performs a strict few IO accesses -- otherwise you'll get flooded off the map.
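
The advice above could be put into practice roughly as follows. This is a sketch under assumptions: QEMU must be built with a trace backend that prints somewhere (for example "log"), the event name patterns are inferred from the comment above and from hw/ide/trace-events, and `ide_trace_events` and `ide-events.txt` are hypothetical names. In an events file passed via `-trace events=<file>`, a line starting with `-` disables matching events instead of enabling them.

```shell
#!/bin/bash
# Hypothetical helper: print a trace-events selection that enables the
# ide_* family and then disables the noisy data-dump and ATAPI events.
# A leading '-' on a line disables events matching that pattern.
ide_trace_events() {
    printf '%s\n' 'ide_*' '-ide_data_*' '-ide_atapi_*'
}

# Example usage (disk image path is a placeholder):
#   ide_trace_events > ide-events.txt
#   qemu-system-i386 -M isapc -drive file=win31.img,format=raw \
#       -trace events=ide-events.txt
```

The AHCI events use an `ahci_` prefix, so the `ide_*` pattern should not pick them up. Check the patterns against the trace-events file of the QEMU version actually being tested, since event names change between releases.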

Thomas Huth (th-huth) wrote :

The QEMU project is currently considering moving its bug tracking to
another system. For this we need to know which bugs are still valid
and which could be closed already. Thus we are setting older bugs to
"Incomplete" now.

If you still think this bug report here is valid, then please switch
the state back to "New" within the next 60 days, otherwise this report
will be marked as "Expired". Or please mark it as "Fix Released" if
the problem has been solved with a newer version of QEMU already.

Thank you and sorry for the inconvenience.

Changed in qemu:
status: New → Incomplete
Thomas Huth (th-huth)
tags: removed: qemu
Thomas Huth (th-huth) wrote : Moved bug report

This is an automated cleanup. This bug report has been moved
to QEMU's new bug tracker on gitlab.com and thus gets marked
as 'expired' now. Please continue with the discussion here:

 https://gitlab.com/qemu-project/qemu/-/issues/56

Changed in qemu:
assignee: John Snow (jnsnow) → nobody
status: Incomplete → Expired
Lev Kujawski (lkujaw) wrote :

Hi,

Thanks to everyone who contributed information to this report. As far as issue #1 from David, I cannot reproduce the intermittent MS-DOS or Windows NT 4 I/O failures with the latest git revision (a74c66b1). I am similarly unable to reproduce Mdasoh's issue.

For the NT 4 testing script, I had to substitute '-display curses' for '-curses' to accommodate the changes in QEMU, and match against 'Please select' from the boot loader menu rather than 'OS Loader V4.00', which disappears too quickly.
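
A minimal sketch of how the screen-matching step in `tmux-ctl-loop` might look with that adjustment. `classify_screen` is a hypothetical helper name, not part of the original scripts; it accepts either boot-success marker so it works with both older and newer QEMU builds.

```shell
#!/bin/bash
# Hypothetical helper: classify a captured screen buffer (passed as $1).
# Prints "error", "booted", or "waiting".
# grep -F matches fixed strings, so the dots in "V4.00" are literal.
classify_screen() {
    if grep -qF 'A disk read error occurred' <<< "$1"; then
        echo error
    elif grep -qF -e 'Please select' -e 'OS Loader V4.00' <<< "$1"; then
        echo booted
    else
        echo waiting
    fi
}

# Example usage inside the polling loop:
#   case "$(classify_screen "$buffer")" in
#     error)  touch STOP; exit ;;
#     booted) echo q | nc localhost $port -q0 > /dev/null ;;
#   esac
```

In `run-qemu-loop`, the matching change is to pass `-display curses` where the original script passes `-curses`.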

For issue #2, the root seems to be that both SeaBIOS and QEMU default to LARGE/ECHS disk translation for small disks (<4 GiB). If you apply the patch at

https://<email address hidden>/

you should be able to get to the NT 4 boot loader using

qemu-system-i386 -blockdev node-name=hda,driver=file,filename=./wnt4ts-broken.img -device ide-hd,drive=hda,bus=ide.0,unit=0,bios-chs-trans=lba
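
For readers unfamiliar with the translation modes mentioned above, here is a simplified sketch of bit-shift (LARGE/ECHS-style) translation: cylinders are halved and heads doubled until the cylinder count fits within the 1024 limit of the classic Int 13h interface. `echs_translate` is a hypothetical helper, and real firmware applies additional checks, so treat this as an illustration of the idea rather than SeaBIOS's or QEMU's exact logic.

```shell
#!/bin/bash
# Simplified sketch of bit-shift (LARGE/ECHS-style) CHS translation.
# Arguments: physical cylinders, heads, sectors per track.
# Halves cylinders and doubles heads until cylinders <= 1024
# (or heads can no longer be doubled).
echs_translate() {
    local cyls=$1 heads=$2 secs=$3
    while (( cyls > 1024 && heads < 128 )); do
        (( cyls >>= 1, heads <<= 1 ))
    done
    echo "$cyls $heads $secs"
}

echs_translate 1523 16 63
echs_translate 2080 16 63
```

The first call prints `761 32 63`, which matches the `ddb.geometry.bios*` values in the VMDK from the original report; the second prints `520 64 63`.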
