Comment 0 for bug 1761104

Colin Ian King (colin-king) wrote :

== SRU Justification, ARTFUL ==

The fix for bug #1747069 introduced a regression for NVIDIA drivers on ppc64el platforms: GPU device memory can no longer be probed and fully onlined. According to Will Davis at NVIDIA:

"- The original patch 3d79a728f9b2e6ddcce4e02c91c4de1076548a4c changed the call to arch_add_memory in mm/memory_hotplug.c to call with the boolean argument set to true instead of false, and inverted the semantics of that argument in the arch layers.

- The revert patch 4fe85d5a7c50f003fe4863a1a87f5d8cc121c75c reverted the semantic change in the arch layers, but didn't revert the change to the arch_add_memory call in mm/memory_hotplug.c"
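
To make the mismatch described in these two points concrete, here is a minimal C sketch. It is a simplified model, not kernel code: arch_add_memory_model(), enum zone_choice_model and the two caller functions are hypothetical names, and the idea that a true argument selects device-memory handling in the arch layer is an assumption based on the argument's role before the original patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the zone decision made in the arch layer. */
enum zone_choice_model {
    ZONE_CHOICE_NORMAL,  /* ordinary hotplugged memory */
    ZONE_CHOICE_DEVICE   /* memory treated as device memory */
};

/*
 * Arch layer after the revert: the boolean is assumed to mean
 * "this memory is for a device" again (the pre-patch semantics).
 */
static enum zone_choice_model arch_add_memory_model(bool for_device)
{
    return for_device ? ZONE_CHOICE_DEVICE : ZONE_CHOICE_NORMAL;
}

/*
 * Caller as left behind by the partial revert: it still passes true,
 * which was only correct under the inverted semantics.
 */
static enum zone_choice_model hotplug_caller_partial_revert(void)
{
    return arch_add_memory_model(true);
}

/* Caller with the argument restored to false, matching the arch layer. */
static enum zone_choice_model hotplug_caller_fixed(void)
{
    return arch_add_memory_model(false);
}
```

In this model the partially reverted caller ends up on the device-memory path for ordinary hotplugged memory, while the fixed caller does not. The real call site in mm/memory_hotplug.c takes more arguments than shown here.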

And also:

"It looks like the problem here is that the online_type is _MOVABLE but
can_online_high_movable(nid=255) is returning false:

        if ((zone_idx(zone) > ZONE_NORMAL ||
            online_type == MMOP_ONLINE_MOVABLE) &&
            !can_online_high_movable(pfn_to_nid(pfn)))

This check was removed by upstream commit
57c0a17238e22395428248c53f8e390c051c88b8, and I've verified that if I apply
that commit (partially) to the 4.13.0-37.42 tree along with the previous
arch_add_memory patch to make the probe work, I can fully online the GPU device
memory as expected.

Commit 57c0a172.. implies that the can_online_high_movable() checks weren't
useful anyway, so in addition to the arch_add_memory fix, does it make sense to
revert the pieces of 4fe85d5a7c50f003fe4863a1a87f5d8cc121c75c that added back
the can_online_high_movable() check?"
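
The effect of the quoted check can be modelled in a few lines of C. This is a sketch under stated assumptions rather than the kernel implementation: the enum values and online_refused_with_check() are hypothetical names, and the result of can_online_high_movable() is passed in as a plain boolean instead of being computed from the node state.

```c
#include <stdbool.h>

/* Hypothetical, simplified zone ordering: anything above NORMAL is "high". */
enum zone_idx_model { ZONE_DMA_MODEL, ZONE_NORMAL_MODEL, ZONE_MOVABLE_MODEL };

/* Hypothetical stand-ins for the online types requested by user space. */
enum online_type_model {
    ONLINE_KEEP_MODEL,
    ONLINE_KERNEL_MODEL,
    ONLINE_MOVABLE_MODEL
};

/*
 * Mirrors the shape of the quoted condition: the request is refused when it
 * targets a zone above NORMAL or asks for movable onlining, and the node is
 * reported as unable to online high or movable memory.
 */
static bool online_refused_with_check(enum zone_idx_model zone,
                                      enum online_type_model type,
                                      bool can_online_high_movable)
{
    return (zone > ZONE_NORMAL_MODEL || type == ONLINE_MOVABLE_MODEL) &&
           !can_online_high_movable;
}
```

With type set to ONLINE_MOVABLE_MODEL and can_online_high_movable false, which corresponds to the nid=255 case in the report, the model refuses the request. Dropping the check removes that refusal path entirely.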

== Fix ==

The backport for bug #1747069 was only a partial revert. Complete it by removing the can_online_high_movable() check and by restoring the boolean argument passed to arch_add_memory() in mm/memory_hotplug.c to false.

== Regression Potential ==

This fixes a regression introduced by the original fix, so the regression potential is the same as for the previously SRU'd fix for bug #1747069, namely:

"Reverting this commit does remove some functionality, however this does not regress the kernel compared to previous releases and having a working reliable memory hotplug is the preferred option. This fix does touch some memory hotplug, so there is a risk that this may break this functionality that is not covered by the kernel regression testing."