Comment 75 for bug 1640518

William J. Schmidt (wschmidt-g) wrote :

Given the results of Aaron's experiments over the weekend, here's a summary of what we think we're seeing:
 - There is an unsafe interaction between two threads.
 - This interaction can only be observed in a very small time window, and then relatively rarely.
 - The interaction is observable only when TLE (transactional lock elision) is enabled. We posit this is because TLE speeds up lock/unlock enough for the two threads to land inside that time window.
 - The time window is sufficiently small that a relatively fast syscall to mprotect is sufficient to disrupt the timing.
 - If all threads are forced to run on separate processors (SMT=1), this is sufficient to disrupt the timing.

We believe this is an application problem. Further "evidence" against a problem in TLE itself is that TLE has been enabled for ppc64el on Ubuntu for over 18 months without any similar reports.

Unfortunately, I don't think we have any way to directly debug the problem, due to the narrow time window. We have considered hardware watchpoints. There is a DAWR (data address watch register) for each thread, but using it in this case seems impractical. GDB's implementation of hardware watchpoints in a multithreaded environment is such that, when a hardware watchpoint is set, it is set to the same address for all threads. So even if we were to script the setting of a hardware watchpoint under GDB, the time required for GDB to set up the watchpoint address on the fly would surely exceed the critical time window. You could try it, but I wouldn't expect much.

Further debugging seems to require application knowledge or a code crawl of some kind. The setting of two flag bytes is the only clue we have. To my knowledge we have never seen only a single byte clobbered in the canary, and the two bytes appear to be aligned on a 16-bit boundary. So the code doing the clobbering very likely contains a store-halfword (sth or, less likely, sthx or sthu) instruction. You could examine the disassembly of the application for occurrences of sth to narrow the field of search.

It could be two individual stores, in which case you'd be looking for two instructions very close together of the form:

  stb r<x>,<n>(r<b>)
  stb r<y>,<n+1>(r<b>)

Example:

  stb r5,0(r9)
  stb r6,1(r9)

Obviously this is a huge application, so this doesn't help much by itself, but it could be useful if you've already narrowed the problem down somewhat.
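If it helps, the search for both patterns could be scripted along the following lines. This is only a rough sketch, not something we have run against the application. It assumes the input is text produced by `objdump -d` for a ppc64el binary, with four instruction bytes per line and operands printed in the usual `rS,D(rA)` form. The function name `scan` and the `window` parameter are made up for illustration. It is a purely static heuristic: it cannot tell whether the runtime address is halfword-aligned, and it will miss byte stores that go through different base registers or through library routines.

```python
import re

# One instruction line of `objdump -d` output (assumed layout), e.g.
#   "    10000a3c:\t00 00 a9 98 \tstb     r5,0(r9)"
INSN = re.compile(r"^\s*([0-9a-f]+):\s+(?:[0-9a-f]{2} ){4}\s*(\S+)\s*(.*)$")
# D-form store operands: source register, displacement, base register.
# The "r" prefix is optional because disassembler settings may omit it.
DFORM = re.compile(r"^r?(\d+),(-?\d+)\(r?(\d+)\)")

HALFWORD_STORES = {"sth", "sthu", "sthx", "sthux"}


def scan(lines, window=4):
    """Return (kind, address, text) tuples for candidate stores.

    kind is "halfword" for any store-halfword instruction, or "stb-pair"
    for two stb instructions at most `window` instructions apart that use
    the same base register and displacements differing by exactly 1.
    """
    hits = []
    recent_stb = []  # (instruction index, address, displacement, base, text)
    index = 0
    for line in lines:
        m = INSN.match(line)
        if not m:
            continue  # symbol labels, section headers, blank lines
        index += 1
        addr, mnem, ops = m.group(1), m.group(2), m.group(3).strip()
        text = f"{mnem} {ops}".strip()
        if mnem in HALFWORD_STORES:
            hits.append(("halfword", addr, text))
        elif mnem == "stb":
            d = DFORM.match(ops)
            if d is None:
                continue
            offset, base = int(d.group(2)), d.group(3)
            # Forget byte stores that are too far back to count as "close".
            recent_stb = [s for s in recent_stb if index - s[0] <= window]
            for _, p_addr, p_off, p_base, p_text in recent_stb:
                if p_base == base and abs(p_off - offset) == 1:
                    hits.append(("stb-pair", p_addr, f"{p_text} ; {text}"))
            recent_stb.append((index, addr, offset, base, text))
    return hits
```

With the disassembly saved to a file, something like `for hit in scan(open("app.dis")): print(hit)` would list the candidates (the file name is a placeholder). On a binary this size the "halfword" list will still be long, so it is probably most useful restricted to the functions you already suspect.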

I am running low on ideas for how we can help you debug the problem. We'll continue discussing it; if any better thoughts arise, we'll be sure to let you know.

Bill