risc-v doubles getting clobbered somehow

Bug #1872945 reported by Colin Ian King on 2020-04-15
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
qemu (Ubuntu)
High
Unassigned

Bug Description

I've been noticing that doubles get clobbered in stress-ng. I managed to whittle it down to the following reproducer that can trip the issue on a risc-v system running the Linux risc-v-qemu 5.4.0-24-generic kernel. This also occurs on other older kernels too. I've tested this with gcc-9, gcc-10 and clang-9 too.

#include <sys/time.h>
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>
#include <errno.h>

double timeval_to_double(const struct timeval *tv)
{
 return (double)tv->tv_sec + ((double)tv->tv_usec / 1000000.0);
}

int main(void)
{
 struct timeval tv2, tv1;
 double t1;
 bool fail = false;
 gettimeofday(&tv1, NULL);
 t1 = timeval_to_double(&tv1);

 for (;;) {
  double t2;
  int tmp;

  gettimeofday(&tv2, NULL);
  t2 = timeval_to_double(&tv2);
  tmp = errno;

  if (t2 - t1 < 0.0) {
   printf("%f %f %f\n", t1, t2, t2 - t1);
   printf("START : %f %ld %ld (%lx %lx) errno=%d\n",
     t1,
     tv1.tv_sec, tv1.tv_usec,
     tv1.tv_sec, tv1.tv_usec,
     tmp);
   printf("NOW BAD: %f %ld %ld (%lx %lx) errno=%d\n",
     t2,
     tv2.tv_sec, tv2.tv_usec,
     tv2.tv_sec, tv2.tv_usec,
     tmp);
   fail = true;
  } else {
   if (fail) {
    printf("NOW OK: %f %ld %ld (%lx %lx) errno=%d\n",
     t2,
     tv2.tv_sec, tv2.tv_usec,
     tv2.tv_sec, tv2.tv_usec,
     tmp);
   }
   fail = false;
  }
 }
}

Run the code (compiled with -O0, -O1 or even O2) and after some (long) random-ish time I see:

./a.out
1575050049.059796 0.000000 -1575050049.059796
START : 1575050049.059796 1575050049 59796 (5de15b41 e994) errno=0
NOW BAD: 0.000000 1575050053 437619 (5de15b45 6ad73) errno=0
NOW OK: 1575050053.460829 1575050053 460829 (5de15b45 7081d) errno=0

So the t2 value is zero which is a bit weird. The integer values for the tv struct are sane though, so it appears not to be a bad struct value returned from the kernel.

Colin Ian King (colin-king) wrote :

Does this occur on real H/W? Can somebody with risc-v H/W test this out?

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in qemu (Ubuntu):
importance: Undecided → High

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1872945

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: focal
Colin Ian King (colin-king) wrote :

OK, not a kernel bug. I'm getting random clobbering when printing via the printf() family of calls when passing a double. Sometimes it's corrupting passed variables, it looks like this only occurs when using va_args and doubles.

no longer affects: linux (Ubuntu)
Colin Ian King (colin-king) wrote :

This does not happen when executing the *same* binary with qemu-riscv64-static

Colin Ian King (colin-king) wrote :

A bit more debugging with the same issue on stress-ng and it seems to occur during logging when using the vsnprintf family of va_arg printing. My current hypothesis is that the va_args handling with formatting doubles to a string is randomly clobbering doubles somehow.

William Grant (wgrant) wrote :

I've been running your test code on a HiFive Unleashed board with our 5.4.0-24 for a couple of hours so far, and no issues yet.

I saw widespread crashes with qemu 4.2 that went away when downgrading (though I went straight back to 3.1), so I'm pretty confident there is a qemu bug, but I haven't had time to dig into it.

Colin Ian King (colin-king) wrote :

Thanks William for verifying this. At least I can stop worrying about registers getting trashed by context switching by the kernel.

William Grant (wgrant) wrote :

Past 22 hours now and still no issues, even though I've intermittently been saturating all cores with builds. So yeah, the kernel is probably fine.

Test via https://people.ubuntu.com/~wgrant/riscv64/

Talking with Wgrant there also was an atomic fix that he has on his build.
The upstream version of it is [1] and some other bits in that series.
That is in since 4.0.

Another testcase (unsure if it is the same root cause, but easier to test for now).
Start via:
$ qemu-system-riscv64 -machine virt -m 8192 -smp 4 -kernel bbl-vmlinuz-5.4.0-23-generic -initrd initrd.img-5.4.0-23-generic -append "console=ttyS0 rw noquiet root=/dev/vda1" -device virtio-blk-device,drive=vda -drive file=focal-riscv64-20200411.qcow2,id=vda -device virtio-net-device,netdev=eth0 -netdev user,id=eth0,hostfwd=tcp::22222-:22

Check SSH which will hang:
$ ssh -v root@localhost -p 22222
...
debug1: SSH2_MSG_SERVICE_ACCEPT received

Wgrant reports that this also worked with his patched 3.1 but fails with 4.2 as in Ubuntu.

I tried that with:
- qemu 4.2 Ubuntu archive - failing
- git master (5.0-rc) - working
- git 4.0 - working
- git 4.1 - working
- git 4.2 - failing

So it is not in the Ubuntu Delta but between 5.1 and 4.2.
I think I can bisect from here - once we know what it is we can cross check if it is the same root case that also caused ckings failure.

[1]: https://git.qemu.org/?p=qemu.git;a=commit;h=c13b169f1a3dd158d6c75727cdc388f95988db39

mistyped before:
- broken 4.1 -> 4.2
- some fix in 4-2 -> master

Download full text (4.1 KiB)

FYI v5.0.0-rc3 - working as well

Bisect log (looking for the fix):

git bisect start
# old: [b0ca999a43a22b38158a222233d3f5881648bb4f] Update version for v4.2.0 release
git bisect old b0ca999a43a22b38158a222233d3f5881648bb4f
# new: [20038cd7a8412feeb49c01f6ede89e36c8995472] Update version for v5.0.0-rc3 release
git bisect new 20038cd7a8412feeb49c01f6ede89e36c8995472
# new: [971b2a1e5b1a8cc8f597ac5d7016908f9fa880de] Merge remote-tracking branch 'remotes/palmer/tags/riscv-for-master-5.0-sf2' into staging
git bisect new 971b2a1e5b1a8cc8f597ac5d7016908f9fa880de
# old: [bd5de61e7b926b5bb9dae254630bf3c8be575bc0] migration: add savevm_state_handler_remove()
git bisect old bd5de61e7b926b5bb9dae254630bf3c8be575bc0
# new: [cd45c506c8ec37c05fdfe06441ad350ab8e19138] hw/s390x/ipl: replace deprecated qdev_reset_all registration
git bisect new cd45c506c8ec37c05fdfe06441ad350ab8e19138
# old: [cdc497c6925be745bc895355bd4674a17a4b2a8b] virtiofsd: prevent FUSE_INIT/FUSE_DESTROY races
git bisect old cdc497c6925be745bc895355bd4674a17a4b2a8b
# new: [760df0d121a836dcbf3726b80b820115aef21b30] Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging
git bisect new 760df0d121a836dcbf3726b80b820115aef21b30
# old: [4f67d30b5e74e060b8dbe10528829b47345cd6e8] qdev: set properties with device_class_set_props()
git bisect old 4f67d30b5e74e060b8dbe10528829b47345cd6e8
# old: [c0248b36d8d190933a43919b9f71013a255e866c] Merge remote-tracking branch 'remotes/kraxel/tags/ui-20200123-pull-request' into staging
git bisect old c0248b36d8d190933a43919b9f71013a255e866c
# old: [a4abb6f2061f3a40f01cfe4c79c1527db655ac89] hw/misc/stm32f4xx_syscfg: Fix copy/paste error
git bisect old a4abb6f2061f3a40f01cfe4c79c1527db655ac89
# new: [ba2ed84fe6a78f64b2da441750fc6e925d94106a] Merge remote-tracking branch 'remotes/palmer/tags/riscv-for-master-5.0-sf1' into staging
git bisect new ba2ed84fe6a78f64b2da441750fc6e925d94106a
# old: [1d59b1b210d7c3b0bdf4b10ebe0bb1fccfcb8b95] virtiofsd: add some options to the help message
git bisect old 1d59b1b210d7c3b0bdf4b10ebe0bb1fccfcb8b95
# old: [613fa160e19abe8e1fe44423fcfa8ec73d3d48e5] target/riscv: Fix tb->flags FS status
git bisect old 613fa160e19abe8e1fe44423fcfa8ec73d3d48e5
# old: [a37f21c27d3e2342c2080aafd4cfe7e949612428] riscv: Set xPIE to 1 after xRET
git bisect old a37f21c27d3e2342c2080aafd4cfe7e949612428
# new: [82f014671cf057de51c4a577c9e2ad637dcec6f9] target/riscv: update mstatus.SD when FS is set dirty
git bisect new 82f014671cf057de51c4a577c9e2ad637dcec6f9
# old: [a59796eb6d59bbd74ce28ddbddb1b83e60674e96] target/riscv: fsd/fsw doesn't dirty FP state
git bisect old a59796eb6d59bbd74ce28ddbddb1b83e60674e96
# first new commit: [82f014671cf057de51c4a577c9e2ad637dcec6f9] target/riscv: update mstatus.SD when FS is set dirty

$ git bisect old
82f014671cf057de51c4a577c9e2ad637dcec6f9 is the first new commit
commit 82f014671cf057de51c4a577c9e2ad637dcec6f9
Author: ShihPo Hung <email address hidden>
Date: Tue Jan 14 22:17:33 2020 -0800

    target/riscv: update mstatus.SD when FS is set dirty

    remove the check becuase SD bit should summarize FS and XS fields
    unconditionally.

    Signed-off-by: ShihPo Hung <shihpo.hung...

Read more...

William Grant (wgrant) on 2020-04-20
Changed in qemu (Ubuntu):
status: New → Triaged

Trying Ckings testcase on a 4.2 build with those five patches on top.

I can reproduce it on 4.2
- 1st Try ~ 3 seconds
- 2st Try ~ 50 seconds

On 4.2 plus the fixes it works for ~30 minutes now

I'll run a few more to be sure but it looks good.

Started 4 of ckings programs.
- Broken qemu ~1 hit per 15 seconds
- Fixed qemu good for ~30 minutes

I think we can add those patches ...

As coordinated with Wgrant who is working on riscv in general I uploaded 4.2-3ubuntu6 to focal-unapproved including this fix.

Pre-build uploaded to:
https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4015/+packages
General pre-builds (before this fix) are also in
https://launchpad.net/~wgrant/+archive/ubuntu/nonvirt/+packages

Ryan Finnie (fo0bar) wrote :

Confirmed qemu 4.2-3ubuntu6 fixes the OpenSSH sshd hang on riscv64 guest.

Colin Ian King (colin-king) wrote :

Updated with the new version in the PPA:

dpkg -S qemu-system-riscv64
qemu-system-misc: /usr/share/man/man1/qemu-system-riscv64.1.gz
qemu-system-misc: /usr/bin/qemu-system-riscv64
cking@skylake:~/risc-v-ubuntu$ dpkg -l | grep qemu-system-misc
ii qemu-system-misc 1:4.2-3ubuntu6 amd64 QEMU full system emulation binaries (miscellaneous)

and tested the reproducer:

[00:02:05.328] cking@ubuntu:~$ ./a.out
[00:04:29.189] 1587408297.940333 0.000000 -1587408297.940333
[00:04:29.190] START : 1587408297.940333 1587408297 940333 (5e9deda9 e592d) errno=0
[00:04:29.191] NOW BAD: 0.000000 1587408440 627126 (5e9dee38 991b6) errno=0
[00:04:29.192] NOW OK: 1587408440.633365 1587408440 633365 (5e9dee38 9aa15) errno=0

so still an issue for me.

William Grant (wgrant) wrote :

I think you got the wrong version. PPA version 1:4.2-3ubuntu6~ppa5 has the fix, archive version 1:4.2-3ubuntu6 has the fix, but PPA version 1:4.2-3ubuntu6 (which was to be archive -3ubuntu6, but while it was waiting in the queue we found this fix) doesn't have the fix.

Ryan Finnie (fo0bar) wrote :

Sorry, I said 4.2-3ubuntu6 but meant 1:4.2-3ubuntu6~ppa5 from the PPA (which works). Sounds like Colin got 1:4.2-3ubuntu6 from the PPA; I will test 1:4.2-3ubuntu6 from -proposed to be sure.

$ apt-cache policy qemu-system-misc
qemu-system-misc:
  Installed: 1:4.2-3ubuntu6~ppa5
  Candidate: 1:4.2-3ubuntu6~ppa5
  Version table:
 *** 1:4.2-3ubuntu6~ppa5 500
        500 http://ppa.launchpad.net/ci-train-ppa-service/4015/ubuntu focal/main amd64 Packages
        100 /var/lib/dpkg/status
     1:4.2-3ubuntu5 500
        500 http://us.archive.ubuntu.com/ubuntu focal/universe amd64 Packages

Ryan Finnie (fo0bar) wrote :

Yep, 1:4.2-3ubuntu6 from focal-proposed test case works for me.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:4.2-3ubuntu6

---------------
qemu (1:4.2-3ubuntu6) focal; urgency=medium

  [ Christian Ehrhardt ]
  * enable riscv build (LP: #1872931)
    [ changes picked from Debian ]
    - enable support for riscv64 hosts
    - only enable librbd on architectures where it is built
    - ceph: do not list librados-dev as we only use librbd-dev and the latter
      depends on the former
    - seccomp grew up, no need in versioned build-dep
    - enable seccomp only on architectures where it can be built
  * d/p/ubuntu/lp-1872931-*: fix build on non KVM platforms
  * d/p/ubuntu/lp-1872945-*: fix riscv emulation errors that e.g. hung ssh
    and clobbered doubles (LP: #1872945)

  [ William Grant ]
  * d/control-in: disable rbd support unavailable on riscv (LP: 1872931)

 -- Christian Ehrhardt <email address hidden> Wed, 15 Apr 2020 14:27:15 +0200

Changed in qemu (Ubuntu):
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers