segfaults/unusable system with 4.13 kernel under VMware
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Invalid | High | Dan Streetman |
Artful | Invalid | High | Dan Streetman |
Bug Description
[Impact]
With the 4.4 kernel, VMware guests ran fine. After an upgrade to the HWE 4.13 kernel and a guest reboot, some of the VMware guests report errors and most of their userspace programs segfault at random, leaving the system unusable.
The kernel log shows many WARNINGs such as:
[ 0.000000] WARNING: CPU: 0 PID: 0 at /build/
...
[ 1.560428] WARNING: CPU: 0 PID: 95 at /build/
...
[ 1.569011] WARNING: CPU: 0 PID: 95 at /build/
...etc...
Program segfaults appear such as:
[ 109.399376] gnome-screensav
[ 118.647544] lsb_release[2809]: segfault at 7fc1d909a700 ip 00007fc1d8fe5730 sp 00007fff83052f10 error 4 in libm-2.
[ 719.675965] unity-greeter[
[ 746.783766] dbus-daemon[4063]: segfault at ffffff0000000018 ip 00007f95d82bc29d sp 00007ffd0f7cf400 error 5 in libdbus-
[ 940.161586] unity-settings-
[ 1007.871238] traps: grep[4553] general protection ip:7f92bf1941c8 sp:7fff7a57c648 error:0 in libdl-2.
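To get a quick picture of how badly a guest is affected, the kernel log can be tallied by event type. The following is only a minimal sketch, not part of the original report: the function name `summarize_kernel_log` and the regular expressions are assumptions based on the line shapes quoted above, and lines that do not match (for example, truncated ones) are simply ignored.

```python
import re
from collections import Counter

# Patterns are assumptions modelled on the log lines quoted in this report.
# e.g. "[ 118.647544] lsb_release[2809]: segfault at ..."
SEGFAULT_RE = re.compile(r"\]\s+(?P<prog>[^\s\[]+)\[\d+\]: segfault at ")
# e.g. "[ 1007.871238] traps: grep[4553] general protection ..."
GPF_RE = re.compile(r"\]\s+traps: (?P<prog>[^\s\[]+)\[\d+\] general protection ")
# e.g. "[ 0.000000] WARNING: CPU: 0 PID: 0 at ..."
WARNING_RE = re.compile(r"\]\s+WARNING: CPU: \d+ PID: \d+ at ")


def summarize_kernel_log(lines):
    """Count WARNING, segfault, and general-protection lines.

    Returns (events, programs): events maps an event kind to its count,
    programs maps a crashing program name to its number of crashes.
    """
    events = Counter()
    programs = Counter()
    for line in lines:
        if WARNING_RE.search(line):
            events["warning"] += 1
            continue
        for kind, pattern in (("segfault", SEGFAULT_RE),
                              ("general protection", GPF_RE)):
            match = pattern.search(line)
            if match:
                events[kind] += 1
                programs[match.group("prog")] += 1
                break
    return events, programs
```

The function accepts any iterable of lines, so a saved copy of the `dmesg` output (a hypothetical `dmesg.txt`, say) can be passed as an open file object.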
[Test Case]
This was reported to me, so I can't directly reproduce it; but the reporter has almost 2000 VMware guests total, and this happens only on around 50 of those. It's unclear what specific configuration is causing this, but reverting affected systems back to the 4.4 kernel "fixes" the problem.
[Regression Potential]
The fix is not yet known, so the regression potential is currently unknown.
[Other Info]
Booting with the "nopti" kernel parameter does not help.
This may be related to, or the same as, Debian bug 844446:
https:/
Possible upstream fixes:
commit d5c8028b4788f62
commit 0852b374173bb57
Changed in linux (Ubuntu):
assignee: nobody → Dan Streetman (ddstreet)
importance: Undecided → High
status: New → In Progress
Changed in linux (Ubuntu Artful):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Dan Streetman (ddstreet)
description: updated
The cause was that the guest was running under an older version of VMware that did not correctly support the new features of the hypervisor's CPU, yet passed those CPU features through to the guest. So when the guest tried to set up xsave/xrstor, it failed even to initialize, although the (emulated) CPU features reported that it was supported. The kernel did not expect this initialization failure (if the CPU reports support for a feature, it must at least be able to initialize it), so the failure caused an error in the kernel, followed by repeated errors every time xsave/xrstor was used later, resulting in an unstable OS.
Technically, the kernel could be updated to check for xsave/xrstor initialization failure and, if detected, disable the use of xsave/xrstor completely; a test kernel built for the reporter did fix (work around) their problem. However, the real cause of this issue is a broken hypervisor (older VMware), and simply upgrading VMware fixed the problem for the reporter, so adding error checking to the kernel seems unnecessary.
Closing this as invalid.
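For anyone triaging a large fleet for the same condition, a first pass might be to list the guests that run under a hypervisor and are told the CPU supports xsave. The sketch below is an illustration only, assuming the x86 Linux `/proc/cpuinfo` layout with a `flags` line; the helper names `cpu_flags` and `xsave_under_hypervisor` are hypothetical. As this bug shows, an advertised flag does not prove the feature can actually be initialized, so a match only marks a guest as a candidate to examine, not as broken.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def xsave_under_hypervisor(cpuinfo_text):
    """True if the CPU reports both the 'hypervisor' and 'xsave' flags.

    This only reflects what the (possibly emulated) CPU advertises; it
    cannot tell whether xsave/xrstor initialization actually succeeds.
    """
    flags = cpu_flags(cpuinfo_text)
    return "hypervisor" in flags and "xsave" in flags
```

On a guest, the text would come from reading `/proc/cpuinfo`, for example `xsave_under_hypervisor(open("/proc/cpuinfo").read())`.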