Comment 93 for bug 1505564

Rafael David Tinoco (rafaeldtinoco) wrote : Re: [Bug 1505564] Re: Soft lockup with "block nbdX: Attempted send on closed socket" spam

Hello Junien,
(recommendations are marked with *)
I'm replying to you and to the LP bug so it gets proper documentation.
Under comment #91:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1505564/comments/91
You can see my kernel dump analysis, which shows that the
OS is stuck in a "migration thread", possibly because of a lack of
IPI synchronisation (maybe even a lost IPI). We have already
seen cases like this, especially in nested virtualisation environments,
and they have been discussed on LKML.
Before we go further, I need you to follow these "best
practices" for Proliant Servers:
1 - NMIs during the MWAIT instruction (caused by the intel_idle module):
& HP Proliant Servers - Kernel Panic - NMI - DL360 & DL380 - HPWDT module loaded
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1417580)
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1432837)
* Firmware: Limit CPU power savings to a maximum C-state of C3 (CPU C-STATES)
* Firmware: Disable packed CPU c-state
* Firmware: Disable Cooperative Power Management
* Make sure NOT TO LOAD the HPWDT kernel module (LP: #1432837, Fix Released
in 3.13.0-49.81)
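
To confirm the last item on a running host, something like the sketch below could be used. The helper name module_loaded is hypothetical; the only assumption is the usual /proc/modules layout, where every line starts with the module name.

```python
def module_loaded(name, modules_text):
    """Return True if `name` is listed as a module in /proc/modules text."""
    for line in modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0] == name:
            return True
    return False
```

Feed it the contents of /proc/modules, for example module_loaded("hpwdt", open("/proc/modules").read()); the result should be False on a server that follows the recommendation.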
2 - Recently discovered NMIs caused by a BUG in Intel microcode
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1416414)
* If you have Intel-based Proliant Servers, use at least kernel
3.13.0-35.61 because of the Intel microcode issue.
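
A minimal sketch of how the "at least 3.13.0-35.61" check could be scripted across many hosts. The names version_key and kernel_ok are hypothetical, and I am assuming you pass in the full Ubuntu kernel version (ABI plus upload number, like 3.13.0-49.81), not the shorter "uname -r" string.

```python
import re

MINIMUM = "3.13.0-35.61"  # minimum kernel version recommended above


def version_key(version):
    """Turn '3.13.0-35.61' into (3, 13, 0, 35, 61) for comparison."""
    # Keep only the first five numeric fields; suffixes such as
    # '-generic' contain no digits and are ignored.
    return tuple(int(n) for n in re.findall(r"\d+", version)[:5])


def kernel_ok(version, minimum=MINIMUM):
    """Return True if `version` is the same as or newer than `minimum`."""
    return version_key(version) >= version_key(minimum)
```

The comparison is purely numeric, field by field, so it only makes sense between versions of the same kernel series.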
3 - X2APIC support for HP Proliant Servers
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1398497)
* For Proliant prior to G8 (<= G7) add "nox2apic intremap=off" to the grub cmdline
* For Proliant G8 add "intremap=no_x2apic_optout" to the grub cmdline
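
After changing the grub cmdline (on Ubuntu usually GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot), the running kernel's parameters can be checked against the two cases above. This is only a sketch: the function name missing_params and the "pre-g8"/"g8" labels are hypothetical, and the input is expected to be the contents of /proc/cmdline.

```python
# Parameters recommended above, keyed by a hypothetical generation label.
REQUIRED = {
    "pre-g8": ["nox2apic", "intremap=off"],
    "g8": ["intremap=no_x2apic_optout"],
}


def missing_params(cmdline, generation):
    """Return the recommended parameters that are absent from `cmdline`."""
    present = set(cmdline.split())
    return [p for p in REQUIRED[generation] if p not in present]
```

An empty list means that the server booted with everything recommended for its generation.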
4 - HP Proliant Latest Firmware
MOST IMPORTANT
Upgrade the server firmware to the latest version.
There have been numerous firmware fixes from HP.
---> If we are facing a firmware problem (related to IPIs, the
inter-processor interrupts, being missed), we have to make sure it
is reproducible on the latest firmware in order to work together with
the HP ROM engineering team.
Summary:
Could you follow all these steps and provide feedback? I understand
this might take a while if you have a large number of servers. If so,
I would take a statistical approach here: change only half of
the servers and keep the other half unchanged as the "control group"
for future comparisons.
Is this feasible? Looking forward to hearing your feedback.
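
If it helps, the split into a changed half and a control half could be done at random, so that neither group is biased towards particular racks or workloads. A minimal sketch, with a hypothetical helper name and a fixed seed so the same split can be reproduced later:

```python
import random


def split_servers(hostnames, seed=0):
    """Randomly split hosts into (changed, control) halves."""
    # Sort first so the result depends only on the set of names and the seed.
    hosts = sorted(hostnames)
    random.Random(seed).shuffle(hosts)
    half = len(hosts) // 2
    return hosts[:half], hosts[half:]
```

With an odd number of servers the control group gets the extra one.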
Best Regards
Rafael Tinoco
Sustaining Engineering