segfaults/unusable system with 4.13 kernel under VMware

Bug #1746562 reported by Dan Streetman
This bug affects 1 person
Affects         Status   Importance  Assigned to    Milestone
linux (Ubuntu)  Invalid  High        Dan Streetman
Artful          Invalid  High        Dan Streetman

Bug Description

[impact]

VMware guests ran fine on the 4.4 kernel. After an upgrade to the HWE 4.13 kernel and a guest reboot, some of the VMware guests report errors and most of their userspace programs segfault at random. The system is unusable.

The kernel log shows many WARNINGs such as:

[ 0.000000] WARNING: CPU: 0 PID: 0 at /build/linux-hwe-7c8uoR/linux-hwe-4.13.0/arch/x86/include/asm/fpu/internal.h:340 fpu__init_system_xstate+0x538/0x87d
...
[ 1.560428] WARNING: CPU: 0 PID: 95 at /build/linux-hwe-7c8uoR/linux-hwe-4.13.0/arch/x86/include/asm/fpu/internal.h:373 fpu__clear+0xf6/0x100
...
[ 1.569011] WARNING: CPU: 0 PID: 95 at /build/linux-hwe-7c8uoR/linux-hwe-4.13.0/arch/x86/include/asm/fpu/internal.h:358 __switch_to+0x50f/0x530
...etc...

Userspace program segfaults also appear, such as:
[ 109.399376] gnome-screensav[2575]: segfault at 0 ip 00007efc58809f5e sp 00007ffd54133380 error 6 in libxcb.so.1.1.0[7efc58800000+21000]
[ 118.647544] lsb_release[2809]: segfault at 7fc1d909a700 ip 00007fc1d8fe5730 sp 00007fff83052f10 error 4 in libm-2.23.so[7fc1d8f80000+108000]
[ 719.675965] unity-greeter[3868]: segfault at 7f0c0d0f7940 ip 00007f0c2d069275 sp 00007ffd68b21808 error 4 in libm-2.23.so[7f0c2d046000+108000]
[ 746.783766] dbus-daemon[4063]: segfault at ffffff0000000018 ip 00007f95d82bc29d sp 00007ffd0f7cf400 error 5 in libdbus-1.so.3.14.6[7f95d8288000+4a000]
[ 940.161586] unity-settings-[4393]: segfault at 8 ip 00007fbe83129aaa sp 00007ffdd7b29a10 error 4 in libpower.so[7fbe8311a000+19000]
[ 1007.871238] traps: grep[4553] general protection ip:7f92bf1941c8 sp:7fff7a57c648 error:0 in libdl-2.23.so[7f92bf193000+3000]
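
To triage many guests for the symptoms quoted above, the kernel log can be summarized mechanically. The following is only a minimal sketch, not part of the original report: the function name `summarize_kernel_log` and both regular expressions are illustrative assumptions, written against the log line shapes shown in this bug rather than against any documented kernel log format.

```python
import re
from collections import Counter

# Assumed pattern: WARNING lines whose source file is the x86 FPU header
# seen in this report (asm/fpu/internal.h), capturing line number and symbol.
FPU_WARNING = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at \S*asm/fpu/internal\.h:(\d+) (\S+)"
)

# Assumed pattern: userspace fault lines of the two shapes quoted above,
# "name[pid]: segfault at ..." and "traps: name[pid] general protection ...".
USER_FAULT = re.compile(
    r"(?:traps: )?(\S+)\[(\d+)\]:? (segfault|general protection)"
)


def summarize_kernel_log(lines):
    """Count FPU warning sites and faulting program names in log lines."""
    fpu_sites = Counter()
    faulting_programs = Counter()
    for line in lines:
        m = FPU_WARNING.search(line)
        if m:
            # Drop the "+offset/size" suffix so each function counts once.
            symbol = m.group(2).split("+")[0]
            fpu_sites[(int(m.group(1)), symbol)] += 1
            continue
        m = USER_FAULT.search(line)
        if m:
            faulting_programs[m.group(1)] += 1
    return fpu_sites, faulting_programs
```

A guest whose log yields a non-empty `fpu_sites` counter together with faults spread across many unrelated programs would match the pattern described in this report; the lines could come from saved `dmesg` output.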

[Test Case]

This was reported to me, so I can't directly reproduce it; but the reporter has almost 2000 VMware guests total, and this happens only on around 50 of those. It's unclear what specific configuration is causing this, but reverting affected systems back to the 4.4 kernel "fixes" the problem.

[regression potential]

The fix is not yet known, so the regression potential is currently unknown.

[other info]

Booting with the "nopti" kernel parameter does not help.

This may be related to, or the same as, Debian bug 844446:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=844446

Possible upstream fixes:

commit d5c8028b4788f62b31fb79a331b3ad3e041fa366
commit 0852b374173bb57f870d78e6c6839c77b339be5f

Dan Streetman (ddstreet)
Changed in linux (Ubuntu):
assignee: nobody → Dan Streetman (ddstreet)
importance: Undecided → High
status: New → In Progress
Changed in linux (Ubuntu Artful):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Dan Streetman (ddstreet)
Dan Streetman (ddstreet)
description: updated
Dan Streetman (ddstreet) wrote :

The cause was that the guest was running under an older version of VMware that didn't correctly support the new features of the hypervisor CPU, yet passed those CPU features through to the guest. So when the guest tried to set up xsave/xrstor, it failed to even initialize, even though the (emulated) CPU features reported it as supported. The kernel didn't expect this initialization failure (if the CPU reports support for a feature, it must at least be able to initialize it), so the failure caused an error in the kernel, followed by repeated errors every time xsave/xrstor was used later, resulting in an unstable OS.

Technically, the kernel could be updated to check for xsave/xrstor initialization failure and, if detected, disable the use of xsave/xrstor completely (a test kernel was made for the reporter, and it did fix/work around their problem). However, since the real cause of this issue is a broken hypervisor (older VMware), simply upgrading VMware fixed the problem for the reporter, and adding error checking to the kernel seems unnecessary.

Closing this as invalid.
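
For anyone comparing affected and unaffected guests, it may help to record which xsave-related features each guest's virtual CPU advertises. The sketch below is an illustration only and was not part of the analysis above: the helper name `advertised_xsave_flags` and the chosen flag list are assumptions. It shows only what the CPU reports, which in this bug was exactly what could not be trusted.

```python
# Flag names as they appear on the "flags" line of /proc/cpuinfo on x86.
# This particular selection is an assumption made for illustration.
XSAVE_FLAGS = ("xsave", "xsaveopt", "xsavec", "xgetbv1", "xsaves")


def advertised_xsave_flags(cpuinfo_text):
    """Return the xsave-related flags listed for the first CPU entry."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the plain "flags" key is used; other keys are ignored.
        if sep and key.strip() == "flags":
            present = set(value.split())
            return [flag for flag in XSAVE_FLAGS if flag in present]
    return []
```

On a guest, the function could be fed the contents of `/proc/cpuinfo`. A guest that advertises these flags and still logs the `fpu/internal.h` warnings at boot would be consistent with the hypervisor problem described above.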

Changed in linux (Ubuntu Artful):
status: In Progress → Invalid
Changed in linux (Ubuntu):
status: In Progress → Invalid