guest OS catches a page fault bug when running dotnet

Bug #1866892 reported by Robert Henry
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
QEMU
Expired
Undecided
Unassigned

Bug Description

The linux guest OS catches a page fault bug when running the dotnet application.

host = metal = x86_64
host OS = ubuntu 19.10
qemu emulation, without KVM, with "tiny code generator" tcg; no plugins; built from head/master
guest emulation = x86_64
guest OS = ubuntu 19.10
guest app = dotnet, running any program

qemu sha=7bc4d1980f95387c4cc921d7a066217ff4e42b70 (head/master Mar 10, 2020)

qemu invocation is:

qemu/build/x86_64-softmmu/qemu-system-x86_64 \
  -m size=4096 \
  -smp cpus=1 \
  -machine type=pc-i440fx-5.0,accel=tcg \
  -cpu Skylake-Server-v1 \
  -nographic \
  -bios OVMF-pure-efi.fd \
  -drive if=none,id=hd0,file=ubuntu-19.10-server-cloudimg-amd64.img \
  -device virtio-blk,drive=hd0 \
  -drive if=none,id=cloud,file=linux_cloud_config.img \
  -device virtio-blk,drive=cloud \
  -netdev user,id=user0,hostfwd=tcp::2223-:22 \
  -device virtio-net,netdev=user0

Here's the guest kernel console output:

[ 2834.005449] BUG: unable to handle page fault for address: 00007fffffffc2c0
[ 2834.009895] #PF: supervisor read access in user mode
[ 2834.013872] #PF: error_code(0x0001) - permissions violation
[ 2834.018025] IDT: 0xfffffe0000000000 (limit=0xfff) GDT: 0xfffffe0000001000 (limit=0x7f)
[ 2834.022242] LDTR: NULL
[ 2834.026306] TR: 0x40 -- base=0xfffffe0000003000 limit=0x206f
[ 2834.030395] PGD 80000000360d0067 P4D 80000000360d0067 PUD 36105067 PMD 36193067 PTE 8000000076d8e867
[ 2834.038672] Oops: 0001 [#4] SMP PTI
[ 2834.042707] CPU: 0 PID: 13537 Comm: dotnet Tainted: G D 5.3.0-29-generic #31-Ubuntu
[ 2834.050591] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[ 2834.054785] RIP: 0033:0x1555547eaeda
[ 2834.059017] Code: d0 00 00 00 4c 8b a7 d8 00 00 00 4c 8b af e0 00 00 00 4c 8b b7 e8 00 00 00 4c 8b bf f0 00 00 00 48 8b bf b0 00 00 00 9d 74 02 <48> cf 48 8d 64 24 30 5d c3 90 cc c3 66 90 55 4c 8b a7 d8 00 00 00
[ 2834.072103] RSP: 002b:00007fffffffc2c0 EFLAGS: 00000202
[ 2834.076507] RAX: 0000000000000000 RBX: 00001554b401af38 RCX: 0000000000000001
[ 2834.080832] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fffffffcfb0
[ 2834.085010] RBP: 00007fffffffd730 R08: 0000000000000000 R09: 00007fffffffd1b0
[ 2834.089184] R10: 0000155555331dd5 R11: 00001555553ad8d0 R12: 0000000000000002
[ 2834.093350] R13: 0000000000000001 R14: 0000000000000001 R15: 00001554b401d388
[ 2834.097309] FS: 0000155554fa5740 GS: 0000000000000000
[ 2834.101131] Modules linked in: isofs nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ppdev input_leds serio_raw parport_pc parport sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper virtio_net psmouse net_failover failover virtio_blk floppy
[ 2834.122539] CR2: 00007fffffffc2c0
[ 2834.126867] ---[ end trace dfae51f1d9432708 ]---
[ 2834.131239] RIP: 0033:0x14d793262eda
[ 2834.135715] Code: Bad RIP value.
[ 2834.140243] RSP: 002b:00007ffddb4e2980 EFLAGS: 00000202
[ 2834.144615] RAX: 0000000000000000 RBX: 000014d6f402acb8 RCX: 0000000000000002
[ 2834.148943] RDX: 0000000001cd6950 RSI: 0000000000000000 RDI: 00007ffddb4e3670
[ 2834.153335] RBP: 00007ffddb4e3df0 R08: 0000000000000001 R09: 00007ffddb4e3870
[ 2834.157774] R10: 000014d793da9dd5 R11: 000014d793e258d0 R12: 0000000000000002
[ 2834.162132] R13: 0000000000000001 R14: 0000000000000001 R15: 000014d6f402d040
[ 2834.166239] FS: 0000155554fa5740(0000) GS:ffff97213ba00000(0000) knlGS:0000000000000000
[ 2834.170529] CS: 0033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2834.174751] CR2: 000014d793262eb0 CR3: 0000000036130000 CR4: 00000000007406f0
[ 2834.178892] PKRU: 55555554

I run the application from a shell with `ulimit -s unlimited` (unlimited stack to size).

The application creates a number of threads, and those threads make a lot of calls to sigaltstack() and mprotect(); see the relevant source for dotnet here https://github.com/dotnet/runtime/blob/15ec69e47b4dc56098e6058a11ccb6ae4d5d4fa1/src/coreclr/src/pal/src/thread/thread.cpp#L2467

using strace -f on the app shows that no alt stacks come anywhere near the failing address; all alt stacks are in the heap, as expected. None of the mmap/mprotect/munmap syscalls were given arguments in the high memory 0x7fffffff0000 and up.

gdb (with default signal stop/print/pass semantics) does not report any signals prior to the kernel bug being tripped, so I doubt the alternate signal stack is actually used.

When I run the same dotnet binary on the host (eg, on "bare metal"), the host kernel seems happy and dotnet runs as expected.

I have not tried different qemu or guest or host O/S.

Revision history for this message
Robert Henry (mhodog) wrote :

A simpler case seems to produce the same error. See https://bugs.launchpad.net/qemu/+bug/1824344

Revision history for this message
Robert Henry (mhodog) wrote :

I have confirmed that the dotnet guest application is executing a "iretq" instruction when this guest kernel bug is hit. A first round of analysis shows nothing unreasonable at the point the iretq is executed. The $rsp points into the middle of a mapped in page, the returned-to $rip looks reasonable, etc. We continue our analysis of qemu and the dotnet runtime.

Revision history for this message
Peter Maydell (pmaydell) wrote :

Is dotnet intentionally doing an iret? It seems like an odd thing for a userspace program to do, given it's basically "return from interrupt".

Revision history for this message
Robert Henry (mhodog) wrote :
Revision history for this message
Peter Maydell (pmaydell) wrote :

Thanks, that's pretty clear. I expect you'll find the bug is just that QEMU doesn't get the semantics of an iret from userspace correct. The helper_iret_protected() function is probably a good place to look.

Revision history for this message
Robert Henry (mhodog) wrote :
Download full text (3.3 KiB)

I've stepped/nexted from the helper_iret_protected, going deep into the bowels of the TLB, MMU and page table engine. None of which I understand. The helper_ret_protected faults in the first POPQ_RA. I'll investigate the value of sp at the time of the POPQ_RA.

Here's the POPQ_RA in i386/seg_helper.c:2140

    sp = env->regs[R_ESP];
    ssp = env->segs[R_SS].base;
    new_eflags = 0; /* avoid warning */
#ifdef TARGET_X86_64
    if (shift == 2) {
        POPQ_RA(sp, new_eip, retaddr);
        POPQ_RA(sp, new_cs, retaddr);
        new_cs &= 0xffff;
        if (is_iret) {
            POPQ_RA(sp, new_eflags, retaddr);
        }

and here's the stack. Note some of the logical intermediate frames are optimized out due to -O3 and inline. (the value of env-errorcode is 1)

0 0x0000555555a370c0 in raise_interrupt2
    (env=env@entry=0x5555566ef200, intno=14, is_int=is_int@entry=0, error_code=1, next_eip_addend=next_eip_addend@entry=0, retaddr=retaddr@entry=140736367565663) at /mnt/robhenry/qemu_robhenry_amd64/qemu/include/exec/cpu-all.h:426
#1 0x0000555555a377f9 in raise_exception_err_ra
    (env=env@entry=0x5555566ef200, exception_index=<optimized out>, error_code=<optimized out>, retaddr=retaddr@entry=140736367565663) at /mnt/robhenry/qemu_robhenry_amd64/qemu/target/i386/excp_helper.c:127
#2 0x0000555555a37d69 in x86_cpu_tlb_fill
    (cs=0x5555566e69a0, addr=140727872411616, size=<optimized out>, access_type=MMU_DATA_LOAD, mmu_idx=0, probe=<optimized out>, retaddr=140736367565663) at /mnt/robhenry/qemu_robhenry_amd64/qemu/target/i386/excp_helper.c:697
#3 0x0000555555952295 in tlb_fill
    (cpu=0x5555566e69a0, addr=140727872411616, size=8, access_type=MMU_DATA_LOAD, mmu_idx=0, retaddr=140736367565663)
    at /mnt/robhenry/qemu_robhenry_amd64/qemu/accel/tcg/cputlb.c:1017
#4 0x0000555555956320 in load_helper
    (full_load=0x555555956140 <helper_le_ldq_mmu>, code_read=false, op=MO_64, retaddr=93825010692608, oi=48, addr=140727872411616, env=0x5555566ef200) at /mnt/robhenry/qemu_robhenry_amd64/qemu/include/exec/cpu-all.h:426
#5 0x0000555555956320 in helper_le_ldq_mmu
    (env=env@entry=0x5555566ef200, addr=addr@entry=140727872411616, oi=oi@entry=48, retaddr=retaddr@entry=140736367565663)
    at /mnt/robhenry/qemu_robhenry_amd64/qemu/accel/tcg/cputlb.c:1688
#6 0x0000555555956dc0 in cpu_load_helper
    (full_load=0x555555956140 <helper_le_ldq_mmu>, op=MO_64, retaddr=140736367565663, mmu_idx=<optimized out>, addr=140727872411616, env=0x5555566ef200) at /mnt/robhenry/qemu_robhenry_amd64/qemu/accel/tcg/cputlb.c:1752
#7 0x0000555555956dc0 in cpu_ldq_mmuidx_ra
    (env=env@entry=0x5555566ef200, addr=addr@entry=140727872411616, mmu_idx=<optimized out>, ra=ra@entry=140736367565663)
--Type <RET> for more, q to quit, c to continue without paging--
    at /mnt/robhenry/qemu_robhenry_amd64/qemu/accel/tcg/cputlb.c:1799
#8 0x0000555555a4ff09 in helper_ret_protected
    (env=env@entry=0x5555566ef200, shift=shift@entry=2, is_iret=is_iret@entry=1, addend=addend@entry=0, retaddr=140736367565663)
    at /mnt/robhenry/qemu_robhenry_amd64/qemu/target/i386/seg_helper.c:2140
#9 0x0000555555a50ff5 in helper_iret_protected (env=0x5555566ef200, shift=2, next_eip=-999...

Read more...

Revision history for this message
Robert Henry (mhodog) wrote :

Peter: I think your intuition is right. The POPQ_RA (pop quad, passing through return address handle) is only called from helper_ret_protected, and it suspiciously calls cpu_ldq_kernel_ra which calls cpu_mmu_index_kernel which only is prepared for kernel space iretq (and of course the substring _kernel in the function name tells us that too).

Revision history for this message
Peter Maydell (pmaydell) wrote :

That function is for "do a guest memory load as if from the kernel" (ie with the permissions and page table settings that guest kernel-mode accesses would use). We'd need to look at what the required semantics for the instruction are for user-mode iretq: are the loads supposed to be done with only the user-mode access and page tables?

Revision history for this message
Richard Henderson (rth) wrote :

Robert, please try replacing s/_kernel_/_data_/g.

That will use the current address space and permissions as opposed to cpl=0 address space and permissions.

Revision history for this message
Robert Henry (mhodog) wrote :

The change should only be dynamically visible when doing an iretq from and to the same protection level, AFAICT. The code clearly[sic] works now for the interrupt return that is used by the linux kernel, presumably {from=kernel, to=kernel} or {from=kernel, to=user}. I would claim that to make this code correct it needs to work for all 4*4 state changes, as windows (notoriously) uses all 4 protection levels. I'm still digging, but your suggestion is certainly on the path forward.

Revision history for this message
Thomas Huth (th-huth) wrote :

The QEMU project is currently considering to move its bug tracking to
another system. For this we need to know which bugs are still valid
and which could be closed already. Thus we are setting older bugs to
"Incomplete" now.

If you still think this bug report here is valid, then please switch
the state back to "New" within the next 60 days, otherwise this report
will be marked as "Expired". Or please mark it as "Fix Released" if
the problem has been solved with a newer version of QEMU already.

Thank you and sorry for the inconvenience.

Changed in qemu:
status: New → Incomplete
Revision history for this message
Peter Maydell (pmaydell) wrote :

The QEMU code implementing the iret insn is still doing loads/stores as if kernel mode, so this is still a bug.

Changed in qemu:
status: Incomplete → Confirmed
Revision history for this message
Thomas Huth (th-huth) wrote : Moved bug report

This is an automated cleanup. This bug report has been moved to QEMU's
new bug tracker on gitlab.com and thus gets marked as 'expired' now.
Please continue with the discussion here:

 https://gitlab.com/qemu-project/qemu/-/issues/249

Changed in qemu:
status: Confirmed → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.