ERAT Multihit machine checks

Bug #1352995 reported by bugproxy
This bug report is a duplicate of: Bug #1357014: Add THP fixes in 14.10 kernel.
Affects                 Status        Importance  Assigned to  Milestone
linux (Ubuntu)          Fix Released  Medium      Tim Gardner  -
linux (Ubuntu Trusty)   In Progress   Undecided   Tim Gardner  -
linux (Ubuntu Utopic)   Fix Released  Medium      Tim Gardner  -

Bug Description

-- Problem Description --
Our project involves porting a 3rd-party out-of-tree module to LE Ubuntu on Power.

We've been seeing occasional ERAT Multihit machine checks with kernels ranging from the LE Ubuntu 14.04 3.13-based kernel through the very latest 3.16-rc5 mainline kernels.

Our kernels are running directly on top of OPAL/Sapphire in PowerNV mode, with no intervening KVM host.

FSP dumps captured at the time of the ERAT detection show that there are duplicate mappings in force for the same real page, with the duplicate mappings being for different sized pages. So, for example, the same 4K real page will be referred to by a 4K mapping and an overlapping 16M mapping.

Aneesh has been working with us on this.

We are currently testing this patchset (git format-patch --stdout format). We are still seeing ERAT machine checks with these changes. Most of these changes have already been posted externally; some of them were updated after that.

Current status: when we hit a multi-hit ERAT, I don't find duplicate hash PTE entries, which possibly indicates a missed flush or a race.

Dar value is 3fff7d0f0000 psize 0
slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
Dump the rest of 256 entries
Dar value is 3fff7d0f0000 psize 0
slot = 453664 v = 40001f0d74ff7d01 r = 7ca0f0196 with b_size = 15 a_size = -1
Done..
Dump the rest of 256 entries
Done..
Found hugepage shift 0 for ea 3fff7d0f0000 with ptep 1f283d8000383
Severe Machine check interrupt [Recovered]
  Initiator: CPU
  Error type: ERAT [Multihit]
    Effective address: 00003fff7d0f0000

That is what I am finding on the machine check. I search the hash PTEs with base page sizes 4K and 64K and print the matching hash table entries. b_size = 15 and a_size = -1 both indicate 4K.

-aneesh
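
For reference, a minimal sketch of the kind of hash-group scan this debug output suggests. This is a reconstruction, not the actual debug patch: hpt_hash(), htab_address, htab_hash_mask, mmu_psize_defs[] and HPTE_V_COMPARE() are real symbols from the 3.13-3.16 hash-MMU code, but the helper itself is assumed, and a complete search would also cover the secondary hash group.

/*
 * Hypothetical reconstruction of a dump_hash_pte_group()-style helper:
 * scan the primary hash group for valid entries matching the expected
 * AVPN for this vpn/psize/ssize and print them.
 */
static void dump_hash_pte_group(unsigned long vpn, int psize, int ssize)
{
        unsigned long want_v = hpte_encode_avpn(vpn, psize, ssize);
        unsigned long hash, slot;
        int i;

        hash = hpt_hash(vpn, mmu_psize_defs[psize].shift, ssize);
        slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;

        for (i = 0; i < HPTES_PER_GROUP; i++, slot++) {
                struct hash_pte *hptep = htab_address + slot;
                unsigned long v = be64_to_cpu(hptep->v);
                unsigned long r = be64_to_cpu(hptep->r);

                if (!(v & HPTE_V_VALID))
                        continue;
                if (HPTE_V_COMPARE(v, want_v))
                        pr_err("slot = %lu v = %lx r = %lx\n", slot, v, r);
        }
}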

I guess we now have a race in the unmap path. I am auditing the hpte_slot_array usage.

We do check for hpte_slot_array != NULL in the invalidate path. But if we hit two pmdp_splitting_flush calls, one will skip the invalidate as per the current code and go ahead and mark hpte_slot_array NULL. I have a patch in the repo that tries to work around that, but I am not sure whether we can really have two pmdp_splitting_flush calls running simultaneously, because we call that under the pmd lock.

Still need to look at the details.

-aneesh
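
To make the suspected window concrete, here is a minimal sketch of the workaround idea (hypothetical code, not the actual repo patch: get_hpte_slot_array() is the real accessor from that era, while set_hpte_slot_array() is a setter assumed purely for illustration).

/*
 * Detach the slot array atomically under the PMD lock, so that of two
 * racing pmdp_splitting_flush() calls exactly one owns the hash
 * invalidate and the other sees NULL and bails out.
 */
static char *take_hpte_slot_array(pmd_t *pmdp, spinlock_t *pmd_ptl)
{
        char *slot_array;

        assert_spin_locked(pmd_ptl);            /* caller holds the pmd lock */

        slot_array = get_hpte_slot_array(pmdp);
        if (!slot_array)
                return NULL;                    /* someone already invalidated */

        set_hpte_slot_array(pmdp, NULL);        /* assumed setter */
        return slot_array;
}

The caller would then invalidate every hashed slot recorded in the array before releasing it. Whether two splitting flushes can actually race like this is exactly the open question above, since both run under the pmd lock.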

I added more debug prints, and this is what I found. Before a hugepage flush, I dump the hash table to see whether we are failing to clear any hash table entries. After every update, the hash table appears to have been updated correctly. For one MCE, the relevant parts of the logs are:

pmd_hugepage_update dumping entries for 0x3fff71000000 with clr = 0xffffffffffffffff set = 0x0
.....

.....

dump_hash_pte_group dumping entries for 0x3fff7191da8c with clr = 0x0 set = 0x0
func = dump_hash_pte_group, addr = 3fff7191da8c psize = 0 slot = 1174024 v = 4001a9245cff7181 r = 7dfb5d193 with b_size = 0 a_size = 0 count = 2333
func = dump_hash_pte_group, addr = 3fff71000000 psize = 0 slot = 1155808 v = 4001a9245cff7105 r = 7cc038196 with b_size = 0 a_size = 9 count = 0
func = dump_hash_pte_group, addr = 3fff710a2000 psize = 0 slot = 1157104 v = 4001a9245cff7105 r = 7cc038116 with b_size = 0 a_size = 9 count = 162
func = dump_hash_pte_group, addr = 3fff710e6000 psize = 0 slot = 1156560 v = 4001a9245cff7105 r = 7cc038196 with b_size = 0 a_size = 9 count = 230
func = dump_hash_pte_group, addr = 3fff71378000 psize = 0 slot = 1161504 v = 4001a9245cff7105 r = 7cc038116 with b_size = 0 a_size = 9 count = 888

So we end up clearing the huge PMD at 0x3fff71000000, and at that point we didn't have anything in the hash table. That is the last pmdp_splitting_flush or pmd_hugepage_update event on that address.

Can we audit the driver code to understand its large/huge page usage and whether it makes any x86 assumptions around the page table accessors? The ppc64 rules around page table access are stricter than x86's: we don't have flush_tlb_* functions, and we need to make sure we hold the PTL while updating the page table and also flush the hash PTE while holding the lock.

Attaching the log also

-aneesh
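
To make the ppc64 rule concrete, here is a minimal sketch of the pattern a driver should follow if it really must replace a user mapping (generic kernel APIs of the 3.13-3.16 era; the helper itself is hypothetical).

#include <linux/mm.h>

/* Hypothetical helper: replace one user mapping the ppc64-safe way. */
static int remap_one_page(struct mm_struct *mm, unsigned long addr,
                          unsigned long pfn, pgprot_t prot)
{
        spinlock_t *ptl;
        pte_t *ptep;

        /* Walk (and allocate) down to the PTE; returns with the PTL held. */
        ptep = get_locked_pte(mm, addr, &ptl);
        if (!ptep)
                return -ENOMEM;

        /*
         * On hash-MMU ppc64, ptep_get_and_clear() goes through
         * pte_update(), which flushes the stale hash PTE while we still
         * hold the lock. Open-coding the store and flushing "later",
         * x86 style, leaves a window with two live translations of the
         * same real page, which is the ERAT multihit signature above.
         */
        ptep_get_and_clear(mm, addr, ptep);
        set_pte_at(mm, addr, ptep, pfn_pte(pfn, prot));

        pte_unmap_unlock(ptep, ptl);
        return 0;
}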

Aneesh writes:

Can we audit the driver code to understand its large/huge page usage and whether it makes any x86 assumptions around the page table accessors? The ppc64 rules around page table access are stricter than x86's: we don't have flush_tlb_* functions, and we need to make sure we hold the PTL while updating the page table and also flush the hash PTE while holding the lock.

Yes, we can do that (all the driver code that's specific to Linux is in the kernel-interface subdirectory, so you can take a look as well). But I'm not quite sure what we'd be looking for.

The driver doesn't have any explicit awareness of huge pages; it doesn't intend or expect to interact with them in any way. And I wouldn't expect the driver to update the kernel's page tables itself, but rather to use some set of (relatively safe) kernel services to do that.

So if you can tell us what we might want to look for in the driver code, we'll be happy to do that.

I do notice a couple of uses of __flush_tlb() and global_flush_tlb(), but those are under x86 ifdefs and won't be compiled in for Power. The intent of the code using those is to flush the caches when the driver changes the cache attribute of memory regions between cached and uncached.
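
The call sites in question follow roughly this shape (a paraphrase of the pattern, not the driver's literal code; the exact config guards in the driver may differ):

#if defined(CONFIG_X86_32) || defined(CONFIG_X86_64)
        /*
         * Flush after switching a region between cached and uncached.
         * global_flush_tlb() only ever existed on older x86 kernels, so
         * none of this is compiled in for Power.
         */
        global_flush_tlb();
#endif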

The driver's Linux kernel interface code does contain references to updating "pte", but those should all be the PTEs used by the adapter, not the Linux kernel's page table entries.

After some additional looking, I see that there are some code paths in the driver's kernel interface layer that at least refer to the kernel page table structures (see the references to pte_t, pmd_t, pgd_t, etc.) in kernel_interface/nv-linux.h and nv.c.

But again, these are code paths that should only be compiled in for x86 (and in this case only for kernel versions < 2.6.1), as far as I can see.

Can you try the new patchset? I was able to run recreat1.sh in a loop more than 8 times now. I will leave it running for the rest of the day and check again tomorrow morning.

I still need to get clarification from the hardware folks on calling tlbie in a loop for huge pages.

-aneesh

Revision history for this message
bugproxy (bugproxy) wrote : current patchset

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-113570 severity-high targetmilestone-inin1410
Revision history for this message
bugproxy (bugproxy) wrote : debug message on machine check.

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : debug logs tracking hash pte flush

Default Comment by Bridge

Luciano Chavez (lnx1138)
affects: ubuntu → linux (Ubuntu)
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1352995

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Luciano Chavez (lnx1138)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: ppc64el
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2014-08-14 21:19 EDT-------
The finalized set of patches for this has been submitted upstream and will be marked to go into the stable kernels as well.

The 3.13-based kernel shipped in Ubuntu 14.04 is susceptible to this problem, so the changes should probably be back-ported for 14.04 as well as being included in 14.10.

Revision history for this message
Diane Brent (drbrent) wrote :

I thought this fix was included in the 14.10 build from last Friday-Saturday? No?

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2014-08-21 13:51 EDT-------
I'm wondering the same. I believe that this fix was included in the 3.16.0-9 kernel that Canonical made available sometime over last weekend, but that's just based on discussions with Breno. I was surprised not to see a "fix released" note hit this bug via the LP item.

Revision history for this message
Tim Gardner (timg-tpi) wrote :

All of these patches are in http://bugs.launchpad.net/bugs/1357014 and were released with Ubuntu-3.16.0-9.14. I'll start working on backporting them to 14.04.

Changed in linux (Ubuntu Utopic):
assignee: nobody → Tim Gardner (timg-tpi)
status: Confirmed → In Progress
status: In Progress → Fix Released
Changed in linux (Ubuntu Trusty):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2014-08-21 19:39 EDT-------
Thank you for confirming!
