Comment 6 for bug 1486180

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-01-11 10:34 EDT-------
Status update:

The root cause has been found and a patch is available.
The problem happens when a DLPAR add of a PCI device is done in an LPAR that had no PCI devices present at boot time. When DDW is being enabled (specifically in the function query_ddw()), a NULL pointer dereference occurs because a member of struct eeh_dev is NULL.

This happens because EEH is not initialized correctly: the PCI devices are not probed as expected, so the eeh_dev struct is never initialized.

Commit 89a51df5ab1d ("powerpc/eeh: Fix crash in eeh_add_device_early() on Cell") added a check to eeh_add_device_early() to avoid an oops on the Cell architecture. This function is used to probe PCI devices during hotplug/DLPAR operations. The check evaluates the return value of the eeh_enable() function.

The issue arises because, with no PCI device present at boot time, EEH is not enabled, so this check fails in eeh_add_device_early() and the device is not probed. Our patch changes how the architecture check is done, so the bug no longer occurs.
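
To make the failure mode above easier to follow, here is a minimal user-space sketch in C. It is not kernel code and not the submitted patch: every type and function name in it (model_platform, probe_gated_on_enabled, probe_gated_on_arch, read_config_addr) is a hypothetical stand-in, and the second gate is only an assumption about the general shape of an architecture-based check. See the patch link below for the real change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the kernel's definitions. */
struct model_eeh_dev {
    int config_addr;
};

struct model_pci_dev {
    struct model_eeh_dev *edev;   /* stays NULL if the device is never probed */
};

struct model_platform {
    bool eeh_enabled_at_boot;     /* false when the LPAR booted with no PCI devices */
    bool arch_supports_eeh;       /* assumed: true on pSeries, false on Cell */
};

/* Dummy per-device EEH state handed out by a successful probe. */
static struct model_eeh_dev model_edev = { .config_addr = 0x10 };

/* Gate modelled on the behaviour described above: probing is skipped
 * unless EEH was already enabled, which never happens on an LPAR that
 * booted without PCI devices. */
bool probe_gated_on_enabled(const struct model_platform *p,
                            struct model_pci_dev *dev)
{
    assert(p != NULL && dev != NULL);
    if (!p->eeh_enabled_at_boot)
        return false;             /* device left unprobed, edev stays NULL */
    dev->edev = &model_edev;
    return true;
}

/* Assumed alternative: gate on whether the platform supports EEH at all,
 * independent of whether a device happened to be present at boot. */
bool probe_gated_on_arch(const struct model_platform *p,
                         struct model_pci_dev *dev)
{
    assert(p != NULL && dev != NULL);
    if (!p->arch_supports_eeh)
        return false;             /* still skips platforms without EEH */
    dev->edev = &model_edev;
    return true;
}

/* Stand-in for a later consumer such as query_ddw(). It returns -1 at the
 * point where code without a NULL check would dereference a NULL pointer. */
int read_config_addr(const struct model_pci_dev *dev)
{
    if (dev->edev == NULL)
        return -1;
    return dev->edev->config_addr;
}
```

With a platform where eeh_enabled_at_boot is false and arch_supports_eeh is true, the first gate leaves edev NULL and read_config_addr() reports the failure, while the second gate probes the hot-added device normally.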

The patch was submitted upstream. I don't know the exact procedure regarding Canonical; I think we should wait for upstream acceptance and then ask Canonical to add the patch to the Ubuntu 14.04.4/15.10/16.04 kernels.
The patch description provides a bit more detail on the issue and the proposed solution.

Link to patch on ppc-dev list: https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-January/137695.html

Thanks Shryia for all the help provided.
Cheers,

Guilherme