testsuite fails under qemu (SIGILL) works fine on real hw [missing getrandom 384 syscall]

Bug #1707409 reported by Gianfranco Costamagna on 2017-07-29
Affects             Importance   Assigned to
launchpad-buildd    Undecided    Unassigned
linux (Ubuntu)      High         Unassigned
qemu (Ubuntu)       High         Unassigned

Bug Description

Hello, after spending a lot of time debugging notmuch failure under armhf, we came to a conclusion:

it started to fail when the infra moved from kernel 3.2 to 4.2 and into a qemu/kvm environment.

the latest successful build, created on 2016-03-13, is here: https://launchpad.net/ubuntu/+source/notmuch/0.21-3ubuntu2/+build/9344826

and the first bad one, started on 2016-08-31, is this: https://launchpad.net/ubuntu/+source/notmuch/0.22.1-2ubuntu1/+build/10600002

Kernel version: Linux kishi10 3.2.0-98-highbank #138-Ubuntu SMP PREEMPT Mon Jan 11 13:24:41 UTC 2016 armv7l
Kernel version: Linux bos01-arm64-024 4.2.0-42-generic #49-Ubuntu SMP Tue Jun 28 21:24:20 UTC 2016 aarch64

so, in the first case armhf was run on top of an armv7 kernel, in the other case on top of an arm64 one
this might not even be a regression in qemu/kvm, but rather a change in the buildd system that exposed a new bug

doing a xenial build failed as well, so I presume this is not a toolchain regression (also because it works fine on real HW), but a qemu/linux bug.

I ran the tests under strace/valgrind. I can't do much more testing, but I hope the logs are useful for you:
https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/locutusofborg-ppa/+build/13169431
https://launchpadlibrarian.net/331134898/buildlog_ubuntu-artful-armhf.notmuch_0.25-2ubuntu1_BUILDING.txt.gz

You can see the strace/valgrind outputs between "BEGIN" and "END" keywords

I'm assigning launchpad, maybe somebody can try notmuch/armhf with an updated qemu or a downgraded kernel :)

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in qemu (Ubuntu):
importance: Undecided → High
description: updated

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1707409

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: precise

Yes, the nature of the bug makes it impossible to do as requested by the bot…

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: bot-stop-nagging xenial
removed: precise
Colin Watson (cjwatson) wrote :

I think it's very unlikely indeed that this is a Launchpad bug, and we're not here to go on fishing expeditions for you testing random things. qemu/kernel developers are generally better placed to be able to bisect this sort of thing. Of course you can reopen this if there turns out to be some evidence that this is in fact a Launchpad bug.

affects: launchpad → launchpad-buildd
Changed in launchpad-buildd:
status: New → Invalid

TL;DR:
- you can use the pbuilder + static qemu setup to debug
- qemu/libvirt throw no error
- the tests do not "consider" the unsupported syscall
- I get just the same issues on Artful, but with more context pointing to the missing syscall
- You can use the setup described above (or sbuild) to debug further as there are 1-2 issues which seem to have other reasons e.g. "Xapian exception: read only files"
- the testcases might need several fixes, but one of them is surely to test not only if gdb is installed, but if it is working.
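
As a rough sketch of the last point (checking that gdb works, not just that it is installed), a prerequisite check could look something like this. The function name `debugger_works` is made up for illustration and is not part of the notmuch test suite; the gdb options are the ones used by the failing test invocation quoted in this report. As far as the gdb manual describes it, `--return-child-result` yields -1 when the child never runs, which a shell sees as 255.

```sh
# Hypothetical helper (not from notmuch): succeed only if the debugger can
# actually start and finish a trivial child process.
debugger_works() {
    dbg=${1:-gdb}
    # Is the debugger installed at all?
    command -v "$dbg" >/dev/null 2>&1 || return 1
    # Run /bin/true under the debugger. With --return-child-result the exit
    # status is the child's, so 0 means the child really ran to completion.
    "$dbg" --batch-silent --return-child-result -ex run --args /bin/true \
        >/dev/null 2>&1
}

if debugger_works; then
    echo "gdb usable: run gdb-based tests"
else
    echo "gdb missing or not functional: skip gdb-based tests"
fi
```

Under an emulator that cannot service the debugger's syscalls, a check like this would be expected to take the skip branch instead of letting the gdb-based tests fail with exit code 255.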

---

Hi Gianfranco,
Lacking a working arm system atm to test any further, I tried this in qemu static.
Might be an odd setup, but I had it around from verifying another bug.

pbuilder-dist artful armhf create
pbuilder-dist artful armhf login
apt install ubuntu-dev-tools vim-nox fakeroot
pull-lp-source notmuch
# enable deb-src in /etc/apt/sources.list
apt update
apt-get build-dep notmuch
cd notmuch
debuild -uc -us

(tried the same in Trusty with the old version of notmuch)
Both had a few testcase errors - but not exactly the same ones you had.

But after the build, running dh_auto_test manually in that dir shows more output that can help.

With that approach I think I saw some issues that might be related, as with a local dh_auto_test you get more details.
- Xapian exception: read only files - that could be your DB errors

I quite often saw tests skipped for gdb missing, on an older build environment that was around.
So for a test I installed gdb and reran the tests and found what might be related.
Gdb in the virtual environment (at least my qemu-static armhf one) could not work due to an unsupported syscall.
The retval of gdb in that was 255, and that reminded me of your log:
FAIL success exit with --keep when add_message returns READ_ONLY_DATABASE
 --- T070-insert.33.expected 2016-08-31 07:10:21.960346786 +0000
 +++ T070-insert.33.output 2016-08-31 07:10:21.960346786 +0000
 @@ -1 +1 @@
 -0
 +255
While I saw:
 FAIL success exit with --keep when add_message returns READ_ONLY_DATABASE
        exit code 255, expected 0 gdb --batch-silent --return-child-result -ex 'set args insert --keep < /usr/notmuch-0.25/test/tmp.T070-insert/mail/msg-018' -x index-file-READ_ONLY_DATABASE.gdb notmuch
qemu: Unsupported syscall: 26

Not exactly the same, but the question is what is throwing the 255 in your case, but as I said my setup seems insufficient for that. Maybe the "full" qemu running arm on arm supports that system call but fails differently?

I wondered, in your buildlog [1] I see shell syntax errors like:
./T380-atomicity.sh: line 79: ((: i < : syntax error: operand expected (error token is "< ")
But when running locally (before installing gdb) I saw:
 missing prerequisites: gdb(1)
 SKIP all tests in T380-atomicity
With gdb installed I get exactly your error on the "./T380-atomicity.sh: line 79:" case.
I also saw that at least back in Yakkety gdb was a build dep but with various arch restrictions:
Build-Depends: gdb-minimal, gdb [!s390x !ia64 !armel !ppc64el !mips !mipsel !mips64el]
Build-Conflicts: ruby1.8, gdb-minimal, gdb [s390x ia64 armel ppc64el mips mipsel mips64el]
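
The `((: i < : syntax error: operand expected` message quoted above is what bash prints when the right-hand operand of an arithmetic comparison expands to nothing. One plausible reading, which is an assumption and not confirmed anywhere in this report, is that the loop bound is derived from gdb output and is empty when gdb fails. A minimal sketch of a guard follows; `safe_count` and `outcount` are hypothetical names, not taken from T380-atomicity.sh.

```sh
# Hypothetical guard: print the argument if it is a non-negative integer,
# otherwise print 0, so that a loop bound is never empty.
safe_count() {
    case ${1:-} in
        ''|*[!0-9]*) echo 0 ;;
        *)           echo "$1" ;;
    esac
}

# Simulate a helper (for example a gdb run) that produced no output.
outcount=""

# In bash, `for ((i = 0; i < $outcount; i++))` would fail at this point with
# "((: i < : syntax error: operand expected". With the guard, the loop below
# simply runs zero times.
n=$(safe_count "$outcount")
i=0
while [ "$i" -lt "$n" ]; do
    echo "checking snapshot $i"
    i=$((i + 1))
done
echo "iterations: $n"
```

A guard like this only hides the symptom. If the bound is empty because gdb could not run, skipping the test is probably the more honest outcome.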

Note - the same on a Trusty pbuil...


Changed in qemu (Ubuntu):
status: New → Incomplete

>And with that finally I found:
>override_dh_auto_test:
>ifeq ($(DEB_HOST_ARCH),armhf)
> TERM=vt100 dh_auto_test || true
>Which means it is not meant/expected to work properly on armhf.

sure, I put that *because* of this bug :) it is quite the opposite: since qemu or something else throws an illegal instruction, I had to disable the testsuite.

G.

please use the debian package when doing things, the Ubuntu one has the gdb missing and other hacks (and please use the latest version)

(I'm trying with the same setup as you, but I don't think I have the knowledge to trace down this bug further)

 +qemu: Unsupported syscall: 384
 +qemu: Unsupported syscall: 26
 +exit status: 255

I would say that the missing syscalls 384 and 26 are the culprits?
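
To see which syscalls are involved across a whole test log, the "Unsupported syscall" lines can be tallied with a few standard tools. This is only a sketch: `unsupported_syscalls` and `arm_syscall_name` are made-up helper names, and the sample input is the three lines quoted above. The thread itself only identifies 384 (getrandom); the name ptrace for 26 is taken from the 32-bit ARM syscall table and would be consistent with gdb being the program that fails.

```sh
# Hypothetical helper: read log text on stdin and print one line per
# unsupported syscall number in the form "<number> <count>".
unsupported_syscalls() {
    sed -n 's/.*qemu: Unsupported syscall: \([0-9][0-9]*\).*/\1/p' |
        sort -n | uniq -c | awk '{ print $2, $1 }'
}

# Names for the two numbers seen in this thread (32-bit ARM numbering).
# Anything else is reported as unknown instead of being guessed.
arm_syscall_name() {
    case $1 in
        26)  echo ptrace ;;
        384) echo getrandom ;;
        *)   echo unknown ;;
    esac
}

printf '%s\n' \
    ' +qemu: Unsupported syscall: 384' \
    ' +qemu: Unsupported syscall: 26' \
    ' +exit status: 255' |
    unsupported_syscalls |
    while read -r nr count; do
        echo "syscall $nr ($(arm_syscall_name "$nr")): $count occurrence(s)"
    done
```

Syscall numbers differ between architectures, so the mapping only applies to armhf logs.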

Changed in qemu (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
status: Confirmed → Incomplete

well, some tests are failing just because 384 is not implemented on artful
T150-tagging: Testing "notmuch tag"
 FAIL Xapian exception: read only files
 --- T150-tagging.24.expected 2017-08-08 10:28:25.000000000 +0000
 +++ T150-tagging.24.output 2017-08-08 10:28:25.000000000 +0000
 @@ -1 +1 @@
 -A Xapian exception occurred opening database
 +

qemu: Unsupported syscall: 384

so, 26 is not a problem

384 is getrandom?

summary: - testsuite fails under qemu (SIGILL) works fine on real hw
+ testsuite fails under qemu (SIGILL) works fine on real hw [missing
+ getrandom 384 syscall]

I agree on 384 being getrandom and there are other cases like [1].
I don't know details but would consider the pure lack of the system call support a feature request to upstream qemu.
We could add a qemu upstream task here for that FR - in the worst case we are told why we are wrong - opinions?

On the case itself for now I'd recommend to get back the section:
ifeq ($(DEB_HOST_ARCH),armhf)
        TERM=vt100 dh_auto_test || true

Which maybe was there for similar reasons.

[1]: https://users.rust-lang.org/t/missing-system-calls-when-running-tests-under-qemu-arm-at-travisci/5013

>I agree on 384 being getrandom and there are other cases like [1].

I did the same research :)

>I don't know details but would consider the pure lack of the system call support a feature request to >upstream qemu.
>We could add a qemu upstream task here for that FR - in the worst case we are told why we are wrong - >opinions?

yes please!

>On the case itself for now I'd recommend to get back the section:
>ifeq ($(DEB_HOST_ARCH),armhf)
> TERM=vt100 dh_auto_test || true
>
>Which maybe was there for similar reasons.

I think this used to hang the builders, IIRC, but seems to pass now
https://launchpadlibrarian.net/332584378/buildlog_ubuntu-artful-armhf.notmuch_0.25-4ubuntu1_BUILDING.txt.gz

uploaded in artful

G.

Riku Voipio (riku-voipio) wrote :

There are two issues being mixed up here

1) launchpad buildd changes.

The notmuch build system appears to be confused by the new environment. It appears Ubuntu has chosen the "armhf chroot on arm64 machine" approach, which would mean qemu system call emulation is not involved. If it is, that is a buildd configuration error.

My suspicion is that notmuch testsuite gets confused in armhf-on-arm64 setup. This setup is a bit shoddy and launchpad should really run the armhf builders with armhf kernel, which kvm on arm64 host can easily do.

2) The "unsupported syscall" error.

The pbuilder-based cross-build uses qemu linux-user, so the build env is not equivalent to what Launchpad uses.

syscall 384 is getrandom, which qemu does support. Your qemu may be too old.

Changed in launchpad-buildd:
status: Invalid → New
Changed in linux (Ubuntu):
status: Incomplete → Invalid

Well, LP builders have qemu 2.5, so this might be true for the syscall.

However, I appreciate the first point, this is in-line with my expectations and might be solvable by launchpad buildd team admins.

Can you please have a deeper look now?

thanks!

Colin Watson (cjwatson) wrote :

qemu isn't involved. We intentionally run armhf builds on an arm64 kernel (with an appropriate personality set) because this allows us to make denser use of our build resources; I don't expect this to change.
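
For anyone trying to tell these setups apart from inside a build, `uname -m` and the userland word size are a quick first check. The sketch below is an illustration only: `machine_kind` is a hypothetical helper, and its interpretation of each string describes what these kernels typically report, not anything specific to Launchpad's configuration.

```sh
# Hypothetical helper: interpret the machine string reported by `uname -m`.
machine_kind() {
    case $1 in
        aarch64) echo "arm64 kernel, native 64-bit personality" ;;
        armv8l)  echo "arm64 kernel, 32-bit personality" ;;
        armv7l)  echo "32-bit ARM kernel" ;;
        *)       echo "other machine: $1" ;;
    esac
}

echo "uname -m:      $(uname -m)"
echo "interpreted:   $(machine_kind "$(uname -m)")"
# Word size of the userland tools; 32 is expected inside an armhf chroot.
echo "userland bits: $(getconf LONG_BIT)"
```

Comparing this output from a failing builder with the same commands run under `linux32` (the util-linux wrapper that sets a 32-bit personality) on an arm64 machine might help when trying to reproduce the build environment locally.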

Interesting, so somehow the system is triggering an illegal arm64 instruction? Could it be that gdb needs somebody to tell it to use the armhf platform?

Setting qemu task to invalid per Colin's explanation that the LP case doesn't use it.

Changed in qemu (Ubuntu):
status: Confirmed → Invalid