reboot does a shutdown instead of rebooting (OCP v2)

Bug #1154749 reported by Samantha Jian-Pielak
This bug affects 6 people
Affects                   Status     Importance  Assigned to  Milestone
OEM Priority Project      Invalid    Medium      Unassigned
The Open Compute Project  New        Critical    Unassigned
linux (Ubuntu)            Confirmed  Undecided   Unassigned

Bug Description

On an OCP v2 compliant system (Windmill OCP motherboard v2), the system hangs at the start of a reboot. This issue happens on 12.04.1, 12.04.2 and 12.10.

The workaround described in http://ubuntuforums.org/showthread.php?t=2024096, blacklisting the mei module, works. Note that the first reboot doesn't work, but subsequent reboots do.
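
A minimal sketch of the blacklisting workaround, assuming the standard modprobe.d mechanism. The helper name `write_blacklist` and the file name `blacklist-mei.conf` are illustrative choices, not taken from the forum thread.

```shell
#!/bin/sh
# Hypothetical helper: write one "blacklist <module>" line per named module
# into <dir>/blacklist-mei.conf (the file name is an arbitrary choice).
write_blacklist() {
    dir=$1
    shift
    for mod in "$@"; do
        printf 'blacklist %s\n' "$mod"
    done > "$dir/blacklist-mei.conf"
}

# Real use needs root and the system directory, for example:
#   write_blacklist /etc/modprobe.d mei
# Blacklisting does not unload a module that is already loaded.
```

Because an already-loaded module stays loaded, the blacklist only takes effect from the next boot onward.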

mei in 12.04.1
$ modinfo mei
filename: /lib/modules/3.2.0-38-generic/kernel/drivers/staging/mei/mei.ko
version: 7.1.20.1
license: GPL v2
description: Intel(R) Management Engine Interface
author: Intel Corporation
srcversion: 9252880194B02634B5F0427

Kernel mei messages before blacklisting:
kernel: [ 6.961623] mei: module is from the staging directory, the quality is unknown, you have been warned.
kernel: [ 6.962226] mei 0000:00:16.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
kernel: [ 6.962235] mei 0000:00:16.0: setting latency timer to 64
kernel: [ 6.962310] mei 0000:00:16.0: irq 104 for MSI/MSI-X

mei in 12.04.2
$ modinfo mei
filename: /lib/modules/3.5.0-25-generic/kernel/drivers/misc/mei/mei.ko
license: GPL v2
description: Intel(R) Management Engine Interface
author: Intel Corporation
srcversion: 0580619CB5DEF6AB7228EBE

Kernel mei messages before blacklisting:
kernel: [ 6.543470] mei 0000:00:16.0: setting latency timer to 64
kernel: [ 6.543547] mei 0000:00:16.0: irq 104 for MSI/MSI-X
kernel: [ 6.547516] mei 0000:00:16.0: wd: failed to find the client

mei in 12.10
$ modinfo mei
filename: /lib/modules/3.5.0-21-generic/kernel/drivers/misc/mei/mei.ko
license: GPL v2
description: Intel(R) Management Engine Interface
author: Intel Corporation
srcversion: 0580619CB5DEF6AB7228EBE

Kernel mei messages before blacklisting:
kernel: [ 4.731997] mei 0000:00:16.0: setting latency timer to 64
kernel: [ 4.736801] mei 0000:00:16.0: irq 104 for MSI/MSI-X
kernel: [ 4.741695] mei 0000:00:16.0: wd: failed to find the client

Changed in opencompute:
importance: Undecided → Critical
Changed in oem-priority:
importance: Undecided → High
importance: High → Medium
Jason Sievert (jsievert) wrote :

A little more info on this. When the system is cycled or shut down, whether from a reboot command (shutdown -r now or reboot), a shutdown, or the reboot button on the system, the box hangs and the only way to recover it is a full power cycle (IPMI power off, press and hold the power button, or yank the mains). According to the debug card, the system is deadlocked with a POST code of "00" and stays in that state until it is power cycled. Once the module is blacklisted, the system responds normally.

James M. Leddy (jm-leddy) wrote :

Hi, would you please give the latest mainline kernel builds a try? It is possible that upstream has already fixed this issue without our knowledge.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1154749

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Jason Sievert (jsievert) wrote :

I will give the latest stable (3.8.5) as well as mainline (3.9-rc5) a go and report back.

Jason Sievert (jsievert) wrote :

Just tried 3.9-rc5 from the kernel-ppa mainline branch and I do see the same result. When the mei kernel module is enabled I am unable to reboot the server. It does seem to go down OK but it never comes back. Next up is 3.8.5 from the kernel-ppa mainline repo.

Jason Sievert (jsievert) wrote :

Confirmed with 3.8.5-030805-generic from the kernel-ppa mainline repo. Next up I am going to blacklist the mei module just to ensure that the behavior is the same.

Jason Sievert (jsievert) wrote :

Blacklisting the mei module with 3.9.0-030900rc5-generic and 3.8.5-030805-generic allows the system to be rebooted without issue.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
James M. Leddy (jm-leddy) wrote :

Hi Jason, thanks for that input.

Samantha Jian-Pielak (samantha-jian) wrote :

There is a different mei module from bug 1156667 to try. I have successfully compiled and loaded it on Ubuntu 12.04.2 / OCP v2. Because I am checking these systems remotely, I will wait until I have a system with working ipmitool before I reboot, so that if this mei module doesn't have the fix, I can bring the system back without someone physically pushing the power button.
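
A rough sketch of the remote recovery path mentioned here, assuming the BMC is reachable over the network and accepts the `lanplus` interface. The wrapper name `bmc_power`, the `DRY_RUN` switch, and the host and user values are hypothetical; `chassis power` and the `-I`, `-H`, `-U`, `-E` options are standard ipmitool usage.

```shell
#!/bin/sh
# Hypothetical wrapper around ipmitool's remote "chassis power" commands.
# With DRY_RUN=1 it only prints the command line instead of running it.
# -E makes ipmitool read the password from the IPMI_PASSWORD environment variable.
bmc_power() {
    host=$1 user=$2 action=$3   # action: status | cycle | off | on
    set -- ipmitool -I lanplus -H "$host" -U "$user" -E chassis power "$action"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Example (host and user are placeholders):
#   DRY_RUN=1 bmc_power bmc.example.com admin status
```

Checking `status` first confirms that the out-of-band path works before a risky reboot is attempted.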

Changed in oem-priority:
status: New → Invalid
Samantha Jian-Pielak (samantha-jian) wrote :

The mei v7.1.21.4.S driver from the attached tarball compiles and loads fine, and with it the system reboots successfully.

penalvch (penalvch)
tags: added: needs-kernel-logs needs-upstream-testing precise quantal
Rod Smith (rodsmith) wrote :

This bug still exists in Trusty, at least when the OCPv2 Windmill is booted in EFI mode. Blacklisting the mei module does *NOT* work around the problem with Trusty in EFI mode. Testing in BIOS mode will have to wait a few more days....

Rod Smith (rodsmith) wrote :

Update: I've verified this bug in a BIOS-mode boot in Trusty, with a twist: BOTH the mei AND the mei_me modules must be blacklisted for a reboot to work. When both are blacklisted, rebooting also works when the system is booted in EFI mode -- I suspect a failure to blacklist mei_me caused the EFI-mode reboot failures noted in my previous report.
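
One way to confirm that neither module is loaded after blacklisting is to look at the first column of /proc/modules. This is only a sketch; the function name `loaded_mei_modules` is made up for illustration.

```shell
#!/bin/sh
# Print any of mei / mei_me that appear in a /proc/modules-style listing.
# The first column of /proc/modules is the module name.
loaded_mei_modules() {
    awk '$1 == "mei" || $1 == "mei_me" { print $1 }' "${1:-/proc/modules}"
}

# Example: report whether the blacklist took effect on the running system.
#   if [ -z "$(loaded_mei_modules)" ]; then echo "mei and mei_me are not loaded"; fi
```

An empty result means both modules are absent, which is the state in which reboots were reported to work.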

Rod Smith (rodsmith) wrote :

I've traced this bug to changes introduced with the 3.2.0 kernel; kernels up to the 3.1.9 kernel reboot just fine.

The driver provided by Samantha earlier is very similar to the 3.0.x drivers.

Tim Gardner (timg-tpi) wrote :

So I messed around with this machine for a good part of the day. I built and booted 3.14-rc4, then instrumented the mei and mei-me modules to show the various setup and teardown stages. I think they are correct. What I did find were some issues with MSI. In fact, if you boot with MSI disabled (pci=nomsi) on the kernel command line, then this machine won't _reboot_ without a power cycle. This seems indicative of a BIOS problem.

I then removed pci=nomsi and instrumented the MEI module such that it does not attempt to use MSI (instead uses normal threaded interrupts) and voila it started rebooting.

I guess the solution is to either add a module parameter to the MEI module that allows one to selectively disable MSI, or get the platform BIOS vendors to fix their MSI support. Note that there are other consumers of MSI on this platform such as the igb NIC driver.
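
To see which interrupts are currently MSI or MSI-X vectors, /proc/interrupts can be filtered for "PCI-MSI" entries. The sketch below assumes the driver registers its IRQ under a name that appears as a whole word on the line (for example its module name); `msi_lines_for` is a hypothetical helper name.

```shell
#!/bin/sh
# List interrupt lines that are MSI/MSI-X vectors and mention the given name.
# Reads /proc/interrupts by default; a file path can be passed for saved logs.
msi_lines_for() {
    grep 'PCI-MSI' "${2:-/proc/interrupts}" | grep -w -- "$1"
}

# Example:
#   msi_lines_for mei_me
# "lspci -vv -s 00:16.0" also shows the MSI capability state ("MSI: Enable+" or "Enable-").
```

No output for the mei device suggests it is not using MSI, which is the condition under which the reboot succeeded in the test above.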

Tim Gardner (timg-tpi) wrote :

rodsmith - please try the test kernel at http://kernel.ubuntu.com/~rtg/3.13.0-16.36-mei/ - After installing you can add the module parameter options thusly:

echo "options mei-me disable_msi=1" | sudo tee /etc/modprobe.d/mei-me.conf

This option should behave the same as in my testing, i.e., it disallows the MSI initialization that was preventing reboot.
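
Before relying on the option, it may be worth checking that the installed mei_me module actually exposes it. The sketch below filters `modinfo -p` output, whose lines start with the parameter name followed by a colon; the helper name `has_module_param` is illustrative.

```shell
#!/bin/sh
# Succeed if a "modinfo -p"-style parameter listing (read from stdin)
# contains the named parameter. Each listing line starts with "<name>:".
has_module_param() {
    grep -q "^$1:"
}

# Example on a system with the test kernel installed:
#   modinfo -p mei_me | has_module_param disable_msi && echo "disable_msi is available"
```

On a kernel without the patch the check fails, and the line in /etc/modprobe.d/mei-me.conf would refer to an option the module does not know.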

Rod Smith (rodsmith) wrote :

This does seem to help, but with some very big caveats:

First, I was unable to test the -generic kernel because it lacked the mei and mei_me modules. I therefore did my testing with the -lowlatency version of the kernel.

Second, whether the disable_msi=1 module option was used or not, the system has frequently been coming up without access to the i350 Ethernet hardware. This isn't happening 100% of the time, but it is happening quite often -- maybe 90% of the time.

Finally, I'm seeing the following kernel errors when using dcmitool in-band:

[ 256.029897] BUG: scheduling while atomic: dcmitool/2056/0x00000002
[ 256.029956] Modules linked in: dcmi(OF) snd_hda_codec_hdmi snd_hda_intel snd_hda_codec intel_rapl radeon snd_hwdep snd_pcm snd_page_alloc snd_timer x86_pkg_temp_thermal intel_powerclamp snd coretemp soundcore kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel sb_edac ppdev lp aes_x86_64 ttm drm_kms_helper psmouse drm edac_core serio_raw joydev parport_pc mei_me parport lrw gf128mul glue_helper ablk_helper cryptd mei lpc_ich mac_hid igb i2c_algo_bit hid_generic isci dca e1000e usbhid libsas ptp hid ahci libahci pps_core scsi_transport_sas
[ 256.030019] CPU: 9 PID: 2056 Comm: dcmitool Tainted: GF W O 3.13.0-17-lowlatency #37
[ 256.030022] Hardware name: Quanta Freedom/Windmill-EP, BIOS F03_3A07 03/02/2012
[ 256.030025] ffff882c7fc34600 ffff881212f0bc50 ffffffff81715e6d 7fffffffffffffff
[ 256.030034] ffff881212f0bc60 ffffffff8170feec ffff881212f0bcc0 ffffffff817198de
[ 256.030041] ffff881209a76040 ffff881212f0bfd8 0000000000014600 0000000000014600
[ 256.030049] Call Trace:
[ 256.030058] [<ffffffff81715e6d>] dump_stack+0x4d/0x6f
[ 256.030064] [<ffffffff8170feec>] __schedule_bug+0x4c/0x5a
[ 256.030070] [<ffffffff817198de>] __schedule+0x6de/0x7f0
[ 256.030077] [<ffffffff81719a19>] schedule+0x29/0x70
[ 256.030083] [<ffffffff81718c39>] schedule_timeout+0x279/0x320
[ 256.030089] [<ffffffff81094cbb>] ? ttwu_stat+0x9b/0x110
[ 256.030096] [<ffffffff8171ae1c>] wait_for_completion+0x9c/0x100
[ 256.030103] [<ffffffff8109ae90>] ? wake_up_state+0x20/0x20
[ 256.030110] [<ffffffff8108bfda>] kthread_stop+0x4a/0x130
[ 256.030118] [<ffffffffa0277f7d>] dcmi_transport_close+0x6d/0x120 [dcmi]
[ 256.030124] [<ffffffffa02776d9>] dcmi_disconnect+0x29/0x40 [dcmi]
[ 256.030130] [<ffffffffa0278d0e>] dcmi_release+0x2e/0xa0 [dcmi]
[ 256.030137] [<ffffffff811beb13>] __fput+0xd3/0x250
[ 256.030143] [<ffffffff811becde>] ____fput+0xe/0x10
[ 256.030154] [<ffffffff810888f4>] task_work_run+0xc4/0xe0
[ 256.030162] [<ffffffff8106a816>] do_exit+0x2a6/0xa90
[ 256.030169] [<ffffffff8109e8a4>] ? vtime_account_user+0x54/0x60
[ 256.030175] [<ffffffff8106b07f>] do_group_exit+0x3f/0xa0
[ 256.030180] [<ffffffff8106b0f4>] SyS_exit_group+0x14/0x20
[ 256.030187] [<ffffffff817252ff>] tracesys+0xe1/0xe6

Once, the computer spontaneously rebooted a few seconds after I used dcmitool. I don't know if that's connected to the kernel error messages, though.

Tim Gardner (timg-tpi) wrote :

rodsmith - I don't think you installed linux-image-extra-3.13.0-17-generic_3.13.0-17.37_amd64.deb which is why you are missing mei.ko and mei_me.ko. The lowlatency flavour has still got some issues that we're working through.
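
A quick way to check whether the mei modules are installed for a given kernel is to search its module tree. This is a sketch only; `find_mei_modules` is a hypothetical helper, and the on-disk file for the mei_me module is assumed to be named mei-me.ko.

```shell
#!/bin/sh
# List mei*.ko files under a kernel module tree.
# Defaults to the running kernel's tree; pass a directory to inspect another one.
find_mei_modules() {
    find "${1:-/lib/modules/$(uname -r)}" -type f -name 'mei*.ko' | sort
}

# Example:
#   find_mei_modules /lib/modules/3.13.0-17-generic
# "dpkg -S mei-me.ko" reports which installed package owns the file.
```

An empty result for the -generic tree would be consistent with the linux-image-extra package not being installed.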

Tim Gardner (timg-tpi) wrote :

The module parameter described in #15 has been uploaded to linux 3.13.0-17.37.

Rod Smith (rodsmith) wrote :

I can confirm that this has worked much better with recent kernels, so my comment in #16 can be ignored.
