eglibc newer than 2.12.1 in natty results in alignment errors, SIGILL and segfaults on tegra2 systems

Bug #739374 reported by Oliver Grawert
This bug affects 5 people
Affects: eglibc (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Upgrading from a maverick to a natty rootfs on tegra2 hardware results in a mostly nonworking system: apps segfault or die with SIGILL, and dmesg is full of alignment error messages.

Apparently Tegra 2 processors have a bug in the read path of bit 20 of the CP15 c13, 3 register (used for software thread-local storage).

There is a Tegra erratum (657451) workaround for the kernel as well as for the Android bionic lib that seems to address the issue:
http://gitorious.org/replicant/android_bionic/commit/e88cc3d8cb2989f66624d018a6f5fa559c51460b?diffmode=sidebyside

In maverick, libc did not have this issue; it only regressed in natty, due to a change either in libc or in the toolchain.
Pinning libc to 2.12.1 and doing a dist-upgrade makes everything work fine.

Tags: armel
Oliver Grawert (ogra)
tags: added: armel
Peter Maydell (pmaydell) wrote :

Note that the approach taken by that patch is that when writing the TLS register we move bit 20 down into bit 0, and then on reading we move bit 0 back up into bit 20. So this requires changes to everything that reads or writes the TLS register. There are Android patches that do this for libc and the kernel; however gcc will happily emit inline TLS register accesses for __thread variables if it is compiling for armv7, because it knows the CP15 register must exist. Presumably for Android the idea is that code going onto the device can be controlled sufficiently to mandate compiling with non-inline TLS accesses. Unfortunately I don't think that's going to fly for a generic Linux...
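
The bit shuffle described above can be sketched in C as follows. This is only an illustration, not the code from the Android patch: the names `tls_encode`, `tls_decode`, `ERRATUM_BIT` and `SPARE_BIT` are made up here, and the sketch assumes the TLS pointer is at least 2-byte aligned so that bit 0 is free to carry the saved bit.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 20 is the bit affected on reads, per the bug description. */
#define ERRATUM_BIT (UINT32_C(1) << 20)
/* Bit 0 is assumed to be unused because TLS pointers are aligned (assumption). */
#define SPARE_BIT   UINT32_C(1)

/* Value to store in the TLS register: bit 20 moved down into bit 0. */
static uint32_t tls_encode(uint32_t tls)
{
    assert((tls & SPARE_BIT) == 0);   /* precondition: aligned pointer */
    uint32_t bit20 = (tls >> 20) & 1u;
    return (tls & ~ERRATUM_BIT) | bit20;
}

/* Value recovered after reading the register: bit 0 moved back up into bit 20. */
static uint32_t tls_decode(uint32_t raw)
{
    uint32_t bit0 = raw & SPARE_BIT;
    return (raw & ~(ERRATUM_BIT | SPARE_BIT)) | (bit0 << 20);
}
```

Every reader and writer of the register has to agree on this encoding, which is why inline accesses emitted by the compiler are a problem: they would use the raw register value without calling anything like `tls_decode`.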

Dr. David Alan Gilbert (davidgil-uk) wrote :

I've mailed someone at Nvidia asking for more details of the erratum (the <email address hidden> address bounced for me).
Depending on how it fails, I wondered whether it would be possible to align the TLS allocation to avoid corruption.

Dave

Tobin Davis (gruemaster) wrote :

Could you add info on the package version of eglibc that fails and possibly some steps to reproduce this error? Thanks.

Changed in eglibc (Ubuntu):
status: New → Incomplete
martin brook (martinbrook) wrote :

We had this issue while porting MeeGo to Tegra2.

Not ideal I know but we use -mtp=soft when building glibc

vgrade
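
For reference, a small sketch of what that flag changes. GCC's ARM option `-mtp=soft` makes the compiler obtain the thread pointer by calling the `__aeabi_read_tp` helper instead of reading CP15 inline. The file name, variable and function below are made up for illustration, and the build line is only one plausible invocation.

```c
/* tls_demo.c (hypothetical file name)
 *
 * Build for ARM with, for example:
 *   gcc -O2 -march=armv7-a -mtp=soft -c tls_demo.c
 *
 * With -mtp=soft, accesses to `counter` fetch the thread pointer via
 * __aeabi_read_tp rather than an inline
 * "mrc p15, 0, rX, c13, c0, 3" read of the TLS register.
 */
static __thread int counter;   /* one copy per thread */

/* Increment this thread's counter and return the new value. */
int bump_counter(void)
{
    return ++counter;
}
```

Comparing the disassembly of this file built with `-mtp=soft` and with `-mtp=cp15` should show whether a given toolchain emits the inline register read.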

Dr. David Alan Gilbert (davidgil-uk) wrote :

Stephen Warren of Nvidia tells me the bit always reads as 0 - so that does suggest if we could get the TLS data aligned we could keep using the register.

Dave
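
If the bit really does always read back as 0, the effect on a stored thread pointer can be modelled as below. The helper names are invented for this sketch, and the "read" is a plain software model, not a real CP15 access.

```c
#include <stdbool.h>
#include <stdint.h>

#define ERRATUM_BIT (UINT32_C(1) << 20)

/* Software model of a register read where bit 20 always comes back as 0. */
static uint32_t model_faulty_read(uint32_t stored)
{
    return stored & ~ERRATUM_BIT;
}

/* A thread pointer survives such a read unchanged only if bit 20 is
 * already clear in the value that was stored. */
static bool tls_value_is_safe(uint32_t tls)
{
    return model_faulty_read(tls) == tls;
}
```

Under this model, a thread pointer with bit 20 set comes back exactly 1 MiB too low, while one with bit 20 clear is unaffected.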

Tobin Davis (gruemaster) wrote :

Note that I have not seen any issues running natty on a Tegra2 devkit with 2.6.32 kernel. I don't think this kernel would support the AC100 though.

Michael Hope (michaelh1) wrote :

I can reproduce this in a Natty chroot on my AC100:

root@vela:/# uname -a
Linux vela 2.6.29-arm2-ac100 #1 SMP PREEMPT Mon Oct 11 12:31:40 CEST 2010 armv7l armv7l armv7l GNU/Linux
root@vela:/# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 11.04
Release: 11.04
Codename: natty

root@vela:/# aptitude
Illegal instruction

Incidentally, running aptitude again doesn't show this fault.

Launchpad Janitor (janitor) wrote :

[Expired for eglibc (Ubuntu) because there has been no activity for 60 days.]

Changed in eglibc (Ubuntu):
status: Incomplete → Expired
Peter Maydell (pmaydell) wrote :

For the record, the general consensus on the #ac100 irc channel seems to be:

(1) if you have a mismatched kernel and eglibc, where the kernel has its half of the Android erratum workaround enabled but the libc does not, then you are going to get segfaults (purely as a result of the mismatch and without requiring any kind of hardware bug to manifest itself) if libc tries to do TLS by direct use of the cp15 register. Maverick eglibc was OK because it always deferred to the kernel to do TLS. I believe the segfault Michael reports in comment #7 is this "mismatched libc/kernel" kind.

(2) if you did want to try to work around this bug in a way which didn't require unpleasant and impractical things like "compile everything to avoid the cp15 TLS register", the only approach we could think of was to make eglibc always allocate TLS data such that the value to be stored in the TLS register has bit 20 clear...

(3) ...however, if you have a stock eglibc and a kernel with the erratum workaround disabled/removed, then things seem in practice to work OK. Speculation is that perhaps the erratum is only a problem in marginal situations (eg if the core is very hot).

So what we've ended up doing is disabling the kernel workaround and crossing our fingers.
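
Option (2) above amounts to choosing the thread pointer so that bit 20 is clear. A minimal sketch of that address calculation, assuming a 32-bit address space and a reservation with up to 1 MiB of slack. `next_safe_address` is a hypothetical helper, not an eglibc function:

```c
#include <stdint.h>

#define ERRATUM_BIT (UINT32_C(1) << 20)         /* bit 20, i.e. 1 MiB */
#define WINDOW_MASK ((UINT32_C(1) << 21) - 1)   /* offset within a 2 MiB window */

/* Return the lowest address >= addr whose bit 20 is clear.
 * Within each 2 MiB-aligned window the lower 1 MiB has bit 20 clear,
 * so an unsafe address is bumped to the start of the next window.
 * Wrap-around at the top of the address space is not handled. */
static uint32_t next_safe_address(uint32_t addr)
{
    if ((addr & ERRATUM_BIT) == 0)
        return addr;                   /* already safe */
    return (addr | WINDOW_MASK) + 1;   /* start of the next 2 MiB window */
}
```

Only the value stored in the TLS register needs bit 20 clear. The adjustment skips less than 1 MiB in the worst case, which is the cost of this approach per thread.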
