kipmi0 process hangs with ipmitool

Bug #1383921 reported by andrew bezella on 2014-10-21
This bug affects 7 people
Affects            Importance   Assigned to
linux (Ubuntu)     Medium       Unassigned
  Precise          Undecided    Unassigned
  Trusty           Medium       Chris J Arges

Bug Description

we're in the process of migrating from 12.04 to 14.04 and are noticing a problem apparently related to the kernel and ipmi. after an indeterminate period of regular ipmi queries (e.g., nagios checks using ipmitool) the kipmi0 process pegs a cpu to ~100% usage and further ipmitool commands hang. the former is not a huge problem as the process is niced and its cpu usage can be limited using the /sys/module/ipmi_si/parameters/kipmid_max_busy_us interface. however, ipmitool's hanging severely degrades hardware monitoring.
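The CPU-limiting step mentioned above could be sketched roughly as follows. The function name, its optional second argument (there only so the function can be tried against a scratch file) and the example value of 100 are assumptions for illustration and do not come from the report. Only the sysfs path does.

```shell
#!/bin/sh
# Hypothetical helper: write a busy-wait limit (in microseconds) for kipmi0.
# The default path is the one named in the report; the second argument is
# only an override for dry runs against an ordinary file.
set_kipmid_max_busy_us() {
    value="$1"
    param_file="${2:-/sys/module/ipmi_si/parameters/kipmid_max_busy_us}"

    # Reject anything that is not a plain non-negative integer.
    case "$value" in
        ''|*[!0-9]*)
            echo "value must be a non-negative integer" >&2
            return 2
            ;;
    esac

    # Writing the real sysfs file requires root.
    if [ ! -w "$param_file" ]; then
        echo "cannot write $param_file" >&2
        return 1
    fi

    printf '%s\n' "$value" > "$param_file"
}
```

Run as root, `set_kipmid_max_busy_us 100` would write the value to the live parameter. This only caps the CPU usage; it does nothing about the hanging ipmitool commands.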

this problem initially reared its head in our environment on a handful of 12.04 hosts on which hwe kernels were installed. of the dozen deployed 14.04 hosts, four have started displaying these symptoms in the past week. all of the 12.04+hwe hosts were eventually affected; i believe that given enough time all of the 14.04 hosts would be, as well. a reboot clears it up until its recurrence.

red hat has a bug logged that appears to match this: https://bugzilla.redhat.com/show_bug.cgi?id=1090619 unfortunately the work-around and proposed fix (from a duplicate bug) are currently non-public. it does look like they were able to identify the "ipmi: simplify locking" patch from commit id f60adf42ad55405d1b17e9e5c33fdb63f1eb8861 as the culprit. i have just finished building a kernel from linux-source-3.13.0=3.13.0-37.64 w/this patch reversed and will deploy it to see if the problem is alleviated.

thank you for your time and effort.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-3.13.0-37-generic 3.13.0-37.64
ProcVersionSignature: Ubuntu 3.13.0-37.64-generic 3.13.11.7
Uname: Linux 3.13.0-37-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Oct 18 13:11 seq
 crw-rw---- 1 root audio 116, 33 Oct 18 13:11 timer
AplayDevices: aplay: device_list:268: no soundcards found...
ApportVersion: 2.14.1-0ubuntu3.5
Architecture: amd64
ArecordDevices: arecord: device_list:268: no soundcards found...
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory: 'iw'
Date: Tue Oct 21 12:55:43 2014
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: Supermicro X8DT6
PciMultimedia:

ProcEnviron:
 TERM=rxvt-unicode-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/usr/bin/zsh
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-37-generic root=/dev/md0 ro consoleblank=0 console=tty0 console=ttyS2,115200n8 nomdmonddf nomdmonisw bootdegraded=true
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-37-generic N/A
 linux-backports-modules-3.13.0-37-generic N/A
 linux-firmware 1.127.7
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 05/15/2012
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2.0c
dmi.board.asset.tag: 1234567890
dmi.board.name: X8DT6
dmi.board.vendor: Supermicro
dmi.board.version: 1234567890
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 17
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 1234567890
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2.0c:bd05/15/2012:svnSupermicro:pnX8DT6:pvr1234567890:rvnSupermicro:rnX8DT6:rvr1234567890:cvnSupermicro:ct17:cvr1234567890:
dmi.product.name: X8DT6
dmi.product.version: 1234567890
dmi.sys.vendor: Supermicro

andrew bezella (abezella) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Tim Gardner (timg-tpi) on 2014-10-22
Changed in linux (Ubuntu Precise):
status: New → Invalid
Changed in linux (Ubuntu Trusty):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu Utopic):
assignee: nobody → Tim Gardner (timg-tpi)
status: Confirmed → In Progress
Changed in linux (Ubuntu Vivid):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Tim Gardner (timg-tpi) wrote :

Please try the test kernel at http://kernel.ubuntu.com/~rtg/lp1383921-ipmi/, e.g.,

wget http://kernel.ubuntu.com/~rtg/lp1383921-ipmi/linux-image-3.13.0-38.65-generic_3.13.0-38.65.lp1383921_amd64.deb
wget http://kernel.ubuntu.com/~rtg/lp1383921-ipmi/linux-image-extra-3.13.0-38.65-generic_3.13.0-38.65.lp1383921_amd64.deb
sudo dpkg -i linux-image*lp1383921*.deb

This kernel has commit f60adf42ad55405d1b17e9e5c33fdb63f1eb8861 (ipmi: simplify locking) reverted.

Changed in linux (Ubuntu Trusty):
importance: Undecided → Medium
Changed in linux (Ubuntu Utopic):
importance: Undecided → Medium
Changed in linux (Ubuntu Vivid):
importance: Undecided → Medium
andrew bezella (abezella) wrote :

i've installed and rebooted 16 of the 14.04 hosts so that they are now running w/the 3.13.0-38.65-generic #lp1383921 kernel. i'll update the ticket if i see the symptoms recur on this set.

the remaining 8 can't be rebooted until early next week.

thanks!

andrew bezella (abezella) wrote :

the workaround red hat suggests (forcing a hot-add/hot-remove cycle of the IPMI interface) appears to work. the same page claims the issue was resolved w/kernel-2.6.32-504.el6 from Errata RHSA-2014-1392 (released 2014-10-14). however, i'm unable to pinpoint the resolution in the release or technical notes.

Tim Gardner (timg-tpi) wrote :

I assume you performed the workaround suggested by Red Hat on servers running the stock 3.13.0-38.65 kernel. I'm looking at the code to see if I can spot a locking mistake (which it will almost certainly turn out to be).

andrew bezella (abezella) wrote :

the hot-add/hot-remove workaround was performed on a mishmash of hosts where this has crept up. the majority were precise hosts w/a 3.5.0 hwe kernel. one was trusty w/3.13.0-37-generic and the remainder are running locally built kernels from linux-source 3.13.0-35.62 w/the patch from lp#1368991 applied (these are the hosts i have to wait for reboot).

thanks!

andrew bezella (abezella) wrote :

the reboot was delayed but 7 of the 8 remaining hosts are now running the
3.13.0-38.65-generic #lp1383921
kernel. i have not yet seen a recurrence of the ipmi hang on any hosts running this version.

Tim Gardner (timg-tpi) wrote :

Two weeks since comment #4 with no errors seems like pretty good evidence that f60adf42ad55405d1b17e9e5c33fdb63f1eb8861 (ipmi: simplify locking) is the root cause.

andrew bezella (abezella) wrote :

agreed. thank you for providing the test kernel!

Matt Jarvis (matt-jarvis) wrote :

Is there any news on this issue? We have many nodes in our OpenStack cluster starting to suffer from this problem, as we have a puppet fact which runs ipmitool lan print and so triggers the bug often. Could you also tell us what the workaround was for RHEL, as we can't access the page on Red Hat's site? The workaround we have found is to remove and re-add the ipmi_si module, but removing it takes a VERY long time, i.e. many many hours.
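For periodic callers such as a puppet fact or a nagios check, one way to keep a hung ipmitool from stalling the caller might be to run it under a deadline. The sketch below assumes GNU coreutils `timeout`. The function name `ipmi_probe` and the OK/HUNG/FAILED labels are made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with a time limit and classify the result.
# timeout(1) exits with status 124 when the limit is reached.
ipmi_probe() {
    limit="$1"
    shift

    # Capture the exit status without tripping "set -e" in the caller.
    timeout "$limit" "$@" >/dev/null 2>&1 && status=0 || status=$?

    case "$status" in
        0)   echo "OK" ;;
        124) echo "HUNG" ;;
        *)   echo "FAILED" ;;
    esac
    return "$status"
}
```

A check could call `ipmi_probe 30 ipmitool lan print` and alert on HUNG. If the hung process is stuck in uninterruptible sleep, the signal sent by `timeout` may not end it, so this should be treated as a detection aid and not as a fix.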

Tim Gardner (timg-tpi) wrote :

Matt - I've proposed a revert for the Trusty kernel: https://lists.ubuntu.com/archives/kernel-team/2014-November/050780.html

Changed in linux (Ubuntu Utopic):
assignee: Tim Gardner (timg-tpi) → Chris J Arges (arges)
Changed in linux (Ubuntu Vivid):
assignee: Tim Gardner (timg-tpi) → Chris J Arges (arges)
Chris J Arges (arges) wrote :

Looking through the code, I see the following commits that may fix the issue:
89986496de141213206d49450ffdd36098d41209
0dfe6e7ed47feeb22f3cf8c7d8ac7e65bd4e87f5
48e8ac2979920ffa39117e2d725afa3a749bfe8d

Overall these are all in 3.16, and I'd like to see if this kernel is affected by this issue.
Can someone who can reproduce try running with the 3.16 kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-utopic/

Thanks,
--chris

andrew bezella (abezella) wrote :

Matt - i am using the following (takes advantage of the ipmi_si module's hot add and remove):
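KIPMI_INST="0" # the web suggests this should increment but i've always seen 0
# read the interface parameters once so the same string is used for remove and add
IPMI_PARAMS="$(cat /proc/ipmi/"${KIPMI_INST}"/params)"
# pipe through sudo tee so the write runs as root without nesting double quotes
echo "remove,${IPMI_PARAMS}" | sudo tee /sys/module/ipmi_si/parameters/hotmod > /dev/null
sleep 5
echo "add,${IPMI_PARAMS}" | sudo tee /sys/module/ipmi_si/parameters/hotmod > /dev/null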

takes a while (30min+) but that sounds better than "many many hours"

andrew bezella (abezella) wrote :

@chris - i have 10 nodes that i just installed w/3.16.0-031600-generic #201408031935

i will increase that number tomorrow and update this bug report if i see ipmi hangs on these hosts.

Chris J Arges (arges) on 2014-11-19
Changed in linux (Ubuntu Trusty):
assignee: Tim Gardner (timg-tpi) → Chris J Arges (arges)
Tim Gardner (timg-tpi) on 2014-11-21
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Matt Jarvis (matt-jarvis) wrote :

The 3.13.0-41-generic kernel, which is in -proposed, does not appear to have any of the ipmi modules. Are we doing something wrong?

root@adam:/lib/modules# uname -r
3.13.0-41-generic

root@adam:/lib/modules# modprobe ipmi_si
modprobe: FATAL: Module ipmi_si not found.

root@adam:/lib/modules# ipmitool lan print
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory
Get Channel Info command failed
Invalid channel: 0

Matt Jarvis (matt-jarvis) wrote :

Actually forget that, just worked out they are in the extras package
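A quick way to tell whether a module such as ipmi_si is installed for a given kernel might look like the sketch below. The function name and the optional directory argument are assumptions for illustration. The check only looks for a module file on disk, so a driver built into the kernel image would not be found.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a loadable module file exists for the kernel.
# The second argument defaults to the running kernel's module tree and is
# overridable so the function can be tried against any directory.
module_available() {
    name="$1"
    moddir="${2:-/lib/modules/$(uname -r)}"

    # Matches NAME.ko as well as compressed variants such as NAME.ko.xz.
    [ -n "$(find "$moddir" -name "${name}.ko*" 2>/dev/null | head -n 1)" ]
}
```

If `module_available ipmi_si` fails on a Trusty -generic kernel, installing the matching extras package (for example `linux-image-extra-$(uname -r)`) is the likely remedy, as found above.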

Matt Jarvis (matt-jarvis) wrote :

With the 3.13.0-41-generic kernel, I can confirm that we are now able to run warm and cold resets of the mc via ipmitool correctly. On the previous kernel those commands would cause kipmi to use 100% cpu time and require the ipmi modules to be unloaded and reloaded before ipmitool would operate correctly. With this kernel, once the mc is reset, ipmitool works properly. This would seem to fix the issue for us, but we're not the original bug reporter.

Chris J Arges (arges) wrote :

@matt-jarvis
Thanks for verifying this!

tags: added: verification-done-trusty
removed: verification-needed-trusty
andrew bezella (abezella) wrote :

@arges - fyi the 20 nodes running
3.16.0-031600-generic #201408031935
have not encountered this problem in the ~2 weeks since install & reboot.

@matt-jarvis - thx for the verification, i was out-of-office most of this week.

i just took the 20 nodes that had been running
3.13.0-38.65-generic #lp1383921
and upgraded them to the -proposed kernel
3.13.0-41-generic #70-Ubuntu

i will report back if they encounter ipmi issues but tend to concur that this is resolved.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-41.70

---------------
linux (3.13.0-41.70) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1396112

  [ Chris J Arges ]

  * [Config] CONFIG_SCOM_DEBUGFS=y for powerpc/powerpc64-smp
    ppc64el/generic
    - LP: #1395855

  [ Upstream Kernel Changes ]

  * Revert "KVM: x86: Handle errors when RIP is set during far jumps"
    - LP: #1393477
  * Revert "net/macb: add pinctrl consumer support"
    - LP: #1393477
  * Revert "iwlwifi: mvm: treat EAPOLs like mgmt frames wrt rate"
    - LP: #1393477
  * Revert "ipmi: simplify locking"
    - LP: #1383921
  * ACPI / blacklist: add Win8 OSI quirks for some Dell laptop models
    - LP: #1339456
  * ACPI / battery: Accelerate battery resume callback
    - LP: #838543
  * tools: cpu-hotplug fix unexpected operator error
  * netlink: reset network header before passing to taps
    - LP: #1393477
  * rtnetlink: fix VF info size
    - LP: #1393477
  * myri10ge: check for DMA mapping errors
    - LP: #1393477
  * tcp: don't use timestamp from repaired skb-s to calculate RTT (v2)
    - LP: #1393477
  * sit: Fix ipip6_tunnel_lookup device matching criteria
    - LP: #1393477
  * tcp: fix tcp_release_cb() to dispatch via address family for
    mtu_reduced()
    - LP: #1393477
  * tcp: fix ssthresh and undo for consecutive short FRTO episodes
    - LP: #1393477
  * packet: handle too big packets for PACKET_V3
    - LP: #1393477
  * openvswitch: fix panic with multiple vlan headers
    - LP: #1393477
  * vxlan: fix incorrect initializer in union vxlan_addr
    - LP: #1393477
  * l2tp: fix race while getting PMTU on PPP pseudo-wire
    - LP: #1393477
  * bonding: fix div by zero while enslaving and transmitting
    - LP: #1393477
  * bridge: Check if vlan filtering is enabled only once.
    - LP: #1393477
  * bridge: Fix br_should_learn to check vlan_enabled
    - LP: #1393477
  * net: allow macvlans to move to net namespace
    - LP: #1393477
  * tg3: Work around HW/FW limitations with vlan encapsulated frames
    - LP: #1393477
  * tg3: Allow for recieve of full-size 8021AD frames
    - LP: #1393477
  * xfrm: Generate blackhole routes only from route lookup functions
    - LP: #1393477
  * xfrm: Generate queueing routes only from route lookup functions
    - LP: #1393477
  * macvtap: Fix race between device delete and open.
    - LP: #1393477
  * gro: fix aggregation for skb using frag_list
    - LP: #1393477
  * hyperv: Fix a bug in netvsc_start_xmit()
    - LP: #1393477
  * ip6_gre: fix flowi6_proto value in xmit path
    - LP: #1393477
  * team: avoid race condition in scheduling delayed work
    - LP: #1393477
  * sctp: handle association restarts when the socket is closed.
    - LP: #1393477
  * tcp: fixing TLP's FIN recovery
    - LP: #1393477
  * sparc64: Do not disable interrupts in nmi_cpu_busy()
    - LP: #1393477
  * sparc64: Fix pcr_ops initialization and usage bugs.
    - LP: #1393477
  * sparc32: dma_alloc_coherent must honour gfp flags
    - LP: #1393477
  * sparc64: sun4v TLB error power off events
    - LP: #1393477
  * sparc64: Fix corrupted thread fault code.
    - LP: #1393477
  * sparc64: find_node adjustment
   ...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
Chris J Arges (arges) wrote :

Marked as not affecting 3.16+. And we've fixed this in trusty.
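To audit a fleet against this outcome, a rough check of the running kernel could look like the sketch below. It assumes GNU `sort -V`, and the function names are made up for illustration. The 3.13.0-41 threshold comes from the fixed package version 3.13.0-41.70. The comparison is only meaningful for Trusty 3.13 kernels and later: it does not recognise locally built or test kernels that carry the revert, and it says nothing useful about kernels older than 3.13.

```shell
#!/bin/sh
# True when version $1 sorts at or after version $2 under sort -V ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical check: is this kernel release at or beyond the Trusty ABI
# (3.13.0-41) that shipped the revert of "ipmi: simplify locking"?
kernel_has_ipmi_revert() {
    release="${1:-$(uname -r)}"   # e.g. 3.13.0-41-generic
    abi="${release%-*}"           # strip the flavour suffix, e.g. 3.13.0-41
    version_ge "$abi" "3.13.0-41"
}
```

Called with no argument, `kernel_has_ipmi_revert` inspects the running kernel. Passing a release string such as `3.13.0-37-generic` checks that string instead.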

Changed in linux (Ubuntu Utopic):
status: In Progress → Fix Released
assignee: Chris J Arges (arges) → nobody
Changed in linux (Ubuntu Vivid):
assignee: Chris J Arges (arges) → nobody
no longer affects: linux (Ubuntu Vivid)
no longer affects: linux (Ubuntu Utopic)