Bionic domain name functions are not thread-safe on pre-3.0 Android

Bug #856072 reported by Zach Pfeffer on 2011-09-22
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Linaro Android
Won't Fix
High
Unassigned
Mozilla Firefox
Fix Released
High

Bug Description

In bionic (ugh), domain name functions (getaddrinfo, gethostbyname_r, et al) are not thread-safe because stdio is not thread-safe... These functions rely on stdio for reading from the /etc/hosts file.

AFAIK in Gecko, we launch worker threads which call these domain functions, and we crash inside bionic when we run into such a race condition in stdio. This is the #2 top crasher in 7.0 Beta and #1 in 6.0.

I haven't seen single-core devices running into these crashes, but on dual-core devices they are a lot more severe. The ugly fix would be using locks around getaddrinfo calls. Another workaround would be setting the CPU affinity on worker threads; doesn't fix the issue, but should make it a lot less common. Also, might be possible to not use getaddrinfo, but that seems to have implications in IPv6 support (Bug 626866);

Complete thread on: https://bugzilla.mozilla.org/show_bug.cgi?id=687367

Offshoot from Bug 662936, which was the same bug occurring on dual-core Tegras running talos.

In bionic (ugh), domain name functions (getaddrinfo, gethostbyname_r, et al) are not thread-safe because stdio is not thread-safe... These functions rely on stdio for reading from the /etc/hosts file.

AFAIK in Gecko, we launch worker threads which call these domain functions, and we crash inside bionic when we run into such a race condition in stdio. This is the #2 top crasher in 7.0 Beta and #1 in 6.0.

I haven't seen single-core devices running into these crashes, but on dual-core devices they are a lot more severe. The ugly fix would be using locks around getaddrinfo calls. Another workaround would be setting the CPU affinity on worker threads; doesn't fix the issue, but should make it a lot less common. Also, might be possible to not use getaddrinfo, but that seems to have implications in IPv6 support (Bug 626866); I don't know enough to say.

lets add locking around the domain resolution and ifdef it for #android. This should only be done for pre-honeycomb (i think). jim, want to throw up a patch?

(In reply to Doug Turner (:dougt) from comment #1)
> lets add locking around the domain resolution and ifdef it for #android.
> This should only be done for pre-honeycomb (i think). jim, want to throw up
> a patch?

wow. That could have such a huge negative impact. We need parallelism on high latency operations.

Do you have a pointer to the android getaddrinfo() implementation in question? I can't find it because android.git.kernel.org is still offline. Is the io conditional on anything - a lazy init perhaps?

Maybe we can clone that code and remove the stdio (or just lock it?)?

If we really are forced to go down this path I would prefer MAX_RESOLVER_THREADS_FOR_ANY_PRIORITY, and MAX_RESOLVER_THREADS_FOR_HIGH_PRIORITY were changed into prefs that could be set to {1,0} for this case rather than the lock. But I really think that's a big step back for phones as modern as gingerbread!

slow is better than crashy. You're right, we also could just move the bionic code into nspr (or necko), but I am not sure how much that code fans out. That would be better than locking for sure.

(In reply to Patrick McManus from comment #2)
> (In reply to Doug Turner (:dougt) from comment #1)
> > lets add locking around the domain resolution and ifdef it for #android.
> > This should only be done for pre-honeycomb (i think). jim, want to throw up
> > a patch?
>
> wow. That could have such a huge negative impact. We need parallelism on
> high latency operations.
>
> Do you have a pointer to the android getaddrinfo() implementation in
> question? I can't find it because android.git.kernel.org is still offline.
> Is the io conditional on anything - a lazy init perhaps?
>
Here's a mirror (code is forked from netbsd): https://github.com/android/platform_bionic/blob/master/libc/netbsd/net/getaddrinfo.c

fopen() is called inside _sethtent() and fclose() in _endhtent(). Both are called by _files_getaddrinfo() which is responsible for reading /etc/hosts. I don't see a way to skip that (unless you count the workaround for our testing infra which is to delete /etc/hosts and make fopen() fail; but obviously a dirty hack used internally)

(In reply to Doug Turner (:dougt) from comment #3)
> slow is better than crashy. You're right, we also could just move the
> bionic code into nspr (or necko), but I am not sure how much that code fans
> out. That would be better than locking for sure.

Yeah. Also setting affinity will keep parallelism while reducing chance of race condition to pre-dual-core levels, which wasn't really an issue back then (I don't think).

> Here's a mirror (code is forked from netbsd):
> https://github.com/android/platform_bionic/blob/master/libc/netbsd/net/
> getaddrinfo.c
>
> fopen() is called inside _sethtent() and fclose() in _endhtent(). Both are
> called by _files_getaddrinfo() which is responsible for reading /etc/hosts.

Thanks Jim!

Obviously this code can crash on single cpu devices too, it just doesn't happen often. right? It seems worth trying to fix that. I don't really think process affinity is a replacement for a data safety model.

> I don't see a way to skip that

One thing we could do would be to cache the contents of the file in ram in a read-only threadsafe buffer. Update it under lock for the same conditions we call res_ninit() right now.

A slightly uglier approach would be to just put a lock around _files_get_addrinfo. If STDIO is really not threadsafe (independent of file) couldn't either of these still have races against the content process saving a download, for instance? Or is the issue that all the threads are reading /etc/hosts?

Either way means pretty much importing the c file you referenced and calling that version of getaddrinfo on android < honeycomb, but that actually looks relatively easy to do to me. (almost everything in there is static, so just drag it along and make the necessary changes).

(In reply to Patrick McManus from comment #5)
> > I don't see a way to skip that
>
> One thing we could do would be to cache the contents of the file in ram in a
> read-only threadsafe buffer. Update it under lock for the same conditions we
> call res_ninit() right now.
>
> A slightly uglier approach would be to just put a lock around
> _files_get_addrinfo. If STDIO is really not threadsafe (independent of file)
> couldn't either of these still have races against the content process saving
> a download, for instance? Or is the issue that all the threads are reading
> /etc/hosts?

Hm, good point; actually stdio is not thread-safe in general :(

There might be other race conditions, but the one I found happens when allocating FILE handles: (the netbsd code had locks but they were removed in the android fork; boo)
https://github.com/android/platform_bionic/blob/master/libc/stdio/findfp.c#L95

So in theory, even if we lock getaddrinfo, we could race against other stdio usages.

> Either way means pretty much importing the c file you referenced and calling
> that version of getaddrinfo on android < honeycomb, but that actually looks
> relatively easy to do to me. (almost everything in there is static, so just
> drag it along and make the necessary changes).

Thanks for the ideas!

Sorry for the long list of questions here.

Do we know why other apps on Android running in parallel do not hit this crash? Do they do all their IO through Java? Or does it have to do with each Android app being a separate unix user somehow?

Also, do we know why we are not hitting this with our two-process model? I guess we simply don't do much stdio stuff in the child? Surely we do plenty of it indirectly though, through various libraries (font loading, for example)?

How sure are we that the problem is limited to getaddrinfo and gethostbyname? Do we run a risk of this crash with every file opened, for example? Or do we feel it is limited to those two functions?

If we pull this code and fix it in our codebase, wouldn't we run into potential problems with different versions of bionic on different phones? (I mean, bionic could be patched for some issue on some device, and our single fixed implementation would not have that stuff.)

Finally, how does the actual threading issue happen: On one side we are calling getaddrinfo or gethostbyname, what is the other thread doing that causes the problem, do we know?

(In reply to Alon Zakai (:azakai) from comment #7)

> Finally, how does the actual threading issue happen: On one side we are
> calling getaddrinfo or gethostbyname, what is the other thread doing that
> causes the problem, do we know?

I would assume, from the way this is described, the other thread is also commonly doing getaddrinfo(). There are (up to) 8 different dedicated DNS threads because getaddrinfo() is a blocking network operation that we need to perform in parallel. The system libraries do not provide an async API for it, so we do it in parallel.

It would also be common for those parallel lookups to start almost simultaneously, maximizing the chances to be executing the same stdio code simultaneously, because they are discovered in the same HTML parse (perhaps as sharded domain names for images or perhaps as DNS pefetches discovered by parsing links).

My plan is to pull in the gingerbread bionic implementation of getaddrinfo() (in libc/netbsd/net/getaddrinfo.c) and reimplement _gethtent() without stdio calls. Note that freeaddrinfo() and gai_strerror() will also be provided, to ensure that they match the getaddrinfo() implementation.

I intend to add getaddrinfo.o to libmozutils, inside /other-licenses/android, and to rename the public symbols to __wrap_getaddrinfo(), etc. so they will shadow the libc symbols at link time.

Zach Pfeffer (pfefferz) on 2011-09-22
visibility: private → public
affects: linaro-landing-team-ti → lucidspice
Fathi Boudra (fboudra) on 2011-09-22
description: updated
Fathi Boudra (fboudra) on 2011-09-22
affects: lucidspice → firefox
Changed in firefox:
importance: Undecided → Unknown
status: New → Unknown
Changed in firefox:
importance: Unknown → High
status: Unknown → In Progress
Fathi Boudra (fboudra) on 2011-09-22
Changed in linaro-android:
importance: Undecided → High

medwards will try to fix this today for android, we need a separate patch for B2G later

Michael, can you please post a status update?

I'm setting up to push the candidate fix to the TryChooser. Should have a testable build this afternoon.

I couldn't find any try server runs with this. Can you please link it?

Also, can you please attach the patch?

Thanks.

Note that I have hacked out the code in getaddrinfo.c that looks aside to dnsproxyd, since the DNS proxy is not enabled in Android 2.3 or earlier, according to http://code.google.com/p/android/issues/detail?id=15722 .

(see android_getaddrinfo_proxy() in the original code; I copied http://git.android-x86.org/?p=platform/bionic.git;a=blob;f=libc/netbsd/net/getaddrinfo.c;h=edb4f707e730057d01c10872f01c932c31f42fbf;hb=refs/heads/gingerbread-x86)

Created attachment 562974
WIP: Android 2.3 getaddrinfo(), with stdio replaced by mmap(), adjusted for use with --wrap

(Lightly tested in a local build; bsmith is helping with tryserver run)

Thanks for your try submission (http://hg.mozilla.org/try/pushloghtml?changeset=4974bec3a8d3). It's the best!

Watch https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=4974bec3a8d3 for your results to come in.

Builds and logs will be available at http://<email address hidden>.

This directory won't be created until the first builds are uploaded, so please be patient.

Created attachment 562978
diff android 2.3 bionic code vs WIP patch

Download full text (8.9 KiB)

Comment on attachment 562978
diff android 2.3 bionic code vs WIP patch

>--- getaddrinfo.c 2011-09-27 23:03:51.000000000 -0700
>+++ /Users/gal/workspace/B2G/./glue/gonk/bionic/libc/netbsd/net/getaddrinfo.c 2011-08-16 00:45:19.000000000 -0700
>@@ -31,16 +31,6 @@
> */
>
> /*
>- * This version of getaddrinfo.c is derived from Android 2.3 "Gingerbread",
>- * which contains uncredited changes by Android/Google developers. It has
>- * been modified in 2011 for use in the Android build of Mozilla Firefox by
>- * Mozilla contributors (including Michael Edwards <email address hidden>).
>- * These changes are offered under the same license as the original NetBSD
>- * file, whose copyright and license are unchanged above.
>- */
>-#define ANDROID_CHANGES 1
>-
>-/*
> * Issues to be discussed:
> * - Thread safe-ness must be checked.
> * - Return values. There are nonstandard return values defined and used
>@@ -87,14 +77,10 @@
> * friends.
> */
>
>-#include <fcntl.h>
> #include <sys/cdefs.h>
> #include <sys/types.h>
>-#include <sys/stat.h>
> #include <sys/param.h>
> #include <sys/socket.h>
>-#include <sys/un.h>
>-#include <sys/mman.h>
> #include <net/if.h>
> #include <netinet/in.h>
> #include <arpa/inet.h>
>@@ -102,10 +88,10 @@
> #include <assert.h>
> #include <ctype.h>
> #include <errno.h>
>-#include <fcntl.h>
> #include <netdb.h>
> #include "resolv_private.h"
> #include <stddef.h>
>+#include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>@@ -114,87 +100,6 @@
> #include <stdarg.h>
> #include "nsswitch.h"
>
>-#ifdef ANDROID_CHANGES
>-#include <sys/system_properties.h>
>-#endif /* ANDROID_CHANGES */
>-
>-typedef struct _pseudo_FILE {
>- int fd;
>- off_t maplen;
>- void* mapping;
>- off_t offset;
>-} _pseudo_FILE;
>-
>-#define _PSEUDO_FILE_INITIALIZER { -1, 0, MAP_FAILED, 0 }
>-
>-static void
>-_pseudo_fclose(_pseudo_FILE* fp)
>-{
>- fp->offset = 0;
>- if (fp->mapping != MAP_FAILED) {
>- (void) munmap(fp->mapping, fp->maplen);
>- fp->mapping = MAP_FAILED;
>- }
>- fp->maplen = 0;
>- if (fp->fd != -1) {
>- (void) close(fp->fd);
>- fp->fd = -1;
>- }
>-}
>-
>-static _pseudo_FILE*
>-_pseudo_fopen_r(_pseudo_FILE* fp, const char* fname)
>-{
>- struct stat statbuf;
>- fp->fd = open(fname, O_RDONLY);
>- if (fp->fd < 0) {
>- fp->fd = -1;
>- return NULL;
>- }
>- if ((0 != fstat(fp->fd, &statbuf)) || (statbuf.st_size <= 0)) {
>- close(fp->fd);
>- fp->fd = -1;
>- return NULL;
>- }
>- fp->maplen = statbuf.st_size;
>- fp->mapping = mmap(NULL, fp->maplen, PROT_READ, MAP_PRIVATE, fp->fd, 0);
>- if (fp->mapping == MAP_FAILED) {
>- close(fp->fd);
>- fp->fd = -1;
>- return NULL;
>- }
>- fp->offset = 0;
>- return fp;
>-}
>-
>-static void
>-_pseudo_rewind(_pseudo_FILE* fp)
>-{
>- fp->offset = 0;
>-}
>-
>-static char*
>-_pseudo_fgets(char* buf, int bufsize, _pseudo_FILE* fp)
>-{
>- char* current;
>- char* endp;
>- int maxcopy = fp->maplen - fp->offset;
>- if (fp->mapping == MAP_FAILED)
>- return NULL;
>- if (maxcopy > bufsize - 1)
>- maxcopy = bufsize - 1;
>- ...

Read more...

Created attachment 562979
diff android 2.3 bionic code vs WIP patch

Comment on attachment 562974
WIP: Android 2.3 getaddrinfo(), with stdio replaced by mmap(), adjusted for use with --wrap

Review of attachment 562974:
-----------------------------------------------------------------

See also https://bug687367.bugzilla.mozilla.org/attachment.cgi?id=562979, which shows the delta between the new files and the Android 2.3 fork of NetBSD from which they were derived.

dougt, can you land when you r+, medwards has no commit access yet. thanks

Comment on attachment 562974
WIP: Android 2.3 getaddrinfo(), with stdio replaced by mmap(), adjusted for use with --wrap

Review of attachment 562974:
-----------------------------------------------------------------

Discussed on irc. I wanted to see the ofllow:

* add new files, in patched state
* patch that removes the DNS proxy lookaside from the original Android 2.3 files
* patch, incremental, that replaces stdio with mmap
* changes to configure.in and Makefile.in to compile the new file and do the wrap mechanics
* pre-honeycomb only

Detecting >= Honeycomb can only be done from Java. (See http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsSystemInfo.cpp#146 and Christopher's comments on http://stackoverflow.com/questions/2441680/how-do-i-know-what-android-ndk-version-im-running-in .) Are you OK with polling it during startup and poking it into a global? (Yuck, I know, but a callback into Java on every call to getaddrinfo()/freeaddrinfo() sounds painful.)

JNI can do this pretty easily, right?

how about this Michael... lets do this for all versions of android, then follow up making this pre-honeycomb only.

Created attachment 563867
Updated patch, with Honeycomb handling

Comment on attachment 563867
Updated patch, with Honeycomb handling

Review of attachment 563867:
-----------------------------------------------------------------

looking better. Please fixup. mwu should also look at these changes when the patch is cleaned up.

mwu - are you comfortable with the license here? is it similar to the APKOpen stuff?

::: memory/mozutils/Makefile.in
@@ +83,5 @@
> ifeq (Android, $(OS_TARGET))
> # Add Android linker
> EXTRA_DSO_LDOPTS += $(ZLIB_LIBS)
> SHARED_LIBRARY_LIBS += $(call EXPAND_LIBNAME_PATH,android,$(DEPTH)/other-licenses/android)
> +WRAP_LDFLAGS = -Wl,--wrap=getaddrinfo,--wrap=freeaddrinfo,--wrap=gai_strerror

Why is this needed, but wasn't needed to wrap malloc?

::: other-licenses/android/002-replace-stdio-with-mmap.patch
@@ +28,5 @@
> +@@ -94,7 +105,6 @@
> + #include <netdb.h>
> + #include "resolv_private.h"
> + #include <stddef.h>
> +-#include <stdio.h>

why this change?

@@ +54,5 @@
> ++
> ++#define _PSEUDO_FILE_INITIALIZER { -1, 0, MAP_FAILED, 0 }
> ++
> ++static void
> ++_pseudo_fclose(_pseudo_FILE* fp)

test for a null fp? and elsewhere

@@ +111,5 @@
> ++ if (maxcopy > bufsize - 1)
> ++ maxcopy = bufsize - 1;
> ++ if (maxcopy <= 0)
> ++ return NULL;
> ++ //fprintf(stderr, "_pseudo_fgets(): copying up to %d bytes\n", maxcopy);

remove

@@ +259,5 @@
> +@@ -1373,8 +1508,6 @@
> + const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> + if (IN6_IS_ADDR_LOOPBACK(&addr6->sin6_addr)) {
> + return 0;
> +- } else if (IN6_IS_ADDR_ULA(&addr6->sin6_addr)) {

what is this change about?

@@ +268,5 @@
> +@@ -1414,8 +1547,6 @@
> + const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> + if (IN6_IS_ADDR_LOOPBACK(&addr6->sin6_addr)) {
> + return 60;
> +- } else if (IN6_IS_ADDR_ULA(&addr6->sin6_addr)) {

and this?

@@ +278,5 @@
> + }
> +
> + static void
> +-_sethtent(FILE **hostf)
> ++_sethtent(_pseudo_FILE *hostf)

can't you just

#undef FILE
#define FILE _pseudo_FILE

for this and similar redefs. at the top of this file? It will safe a bunch of these changes.

@@ +313,5 @@
> + const char *addr;
> + char hostbuf[8*1024];
> +
> +-// fprintf(stderr, "_gethtent() name = '%s'\n", name);
> ++ //fprintf(stderr, "_gethtent() name = '%s'\n", name);

revert this change.

@@ +335,5 @@
> + tname = cp;
> + if ((cp = strpbrk(cp, " \t")) != NULL)
> + *cp++ = '\0';
> +-// fprintf(stderr, "\ttname = '%s'", tname);
> ++ //fprintf(stderr, "\ttname = '%s'\n", tname);

same.

@@ +368,5 @@
> + name = va_arg(ap, char *);
> + pai = va_arg(ap, struct addrinfo *);
> +
> +-// fprintf(stderr, "_files_getaddrinfo() name = '%s'\n", name);
> ++ //fprintf(stderr, "_files_getaddrinfo() name = '%s'\n", name);

same.

::: xpcom/base/nsSystemInfo.cpp
@@ +48,5 @@
> #ifdef ANDROID
> #include "AndroidBridge.h"
> +
> +extern "C" {
> +extern volatile int android_sdk_version;

volatile keyword not needed. you should not use extern here either. (this is the implicit definition). you will need extern for the explicit declaration (where it is used outside of this file)

Download full text (4.7 KiB)

(In reply to Doug Turner (:dougt) from comment #27)
> Comment on attachment 563867 [diff] [details] [review]
> Updated patch, with Honeycomb handling
>
> Review of attachment 563867 [diff] [details] [review]:
> -----------------------------------------------------------------
>
> looking better. Please fixup. mwu should also look at these changes when
> the patch is cleaned up.
>
> mwu - are you comfortable with the license here? is it similar to the
> APKOpen stuff?
>
> ::: memory/mozutils/Makefile.in
> @@ +83,5 @@
> > ifeq (Android, $(OS_TARGET))
> > # Add Android linker
> > EXTRA_DSO_LDOPTS += $(ZLIB_LIBS)
> > SHARED_LIBRARY_LIBS += $(call EXPAND_LIBNAME_PATH,android,$(DEPTH)/other-licenses/android)
> > +WRAP_LDFLAGS = -Wl,--wrap=getaddrinfo,--wrap=freeaddrinfo,--wrap=gai_strerror
>
> Why is this needed, but wasn't needed to wrap malloc?

Now that we call __real_getaddrinfo() from __wrap_getaddrinfo(), libmozutils linking breaks without it.

> ::: other-licenses/android/002-replace-stdio-with-mmap.patch
> @@ +28,5 @@
> > +@@ -94,7 +105,6 @@
> > + #include <netdb.h>
> > + #include "resolv_private.h"
> > + #include <stddef.h>
> > +-#include <stdio.h>
>
> why this change?

Because the point of the patch is to remove all use of stdio from getaddrinfo(). The commented-out debug log lines, if restored, should be changed from stderr to the Android log mechanism (which is useful on a non-rooted device, unlike stderr).

> @@ +54,5 @@
> > ++
> > ++#define _PSEUDO_FILE_INITIALIZER { -1, 0, MAP_FAILED, 0 }
> > ++
> > ++static void
> > ++_pseudo_fclose(_pseudo_FILE* fp)
>
> test for a null fp? and elsewhere

Really just needs a "restrict" keyword, since it's never valid to pass it a NULL pointer.

> @@ +111,5 @@
> > ++ if (maxcopy > bufsize - 1)
> > ++ maxcopy = bufsize - 1;
> > ++ if (maxcopy <= 0)
> > ++ return NULL;
> > ++ //fprintf(stderr, "_pseudo_fgets(): copying up to %d bytes\n", maxcopy);
>
> remove

remove what? just the commented-out log line?

> @@ +259,5 @@
> > +@@ -1373,8 +1508,6 @@
> > + const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> > + if (IN6_IS_ADDR_LOOPBACK(&addr6->sin6_addr)) {
> > + return 0;
> > +- } else if (IN6_IS_ADDR_ULA(&addr6->sin6_addr)) {
>
> what is this change about?

That macro isn't defined by the older headers in the NDK we're using. It's a side effect of another patch that was already applied to the version of getaddrinfo.c I started from; we want the rest of that patch.

> @@ +268,5 @@
> > +@@ -1414,8 +1547,6 @@
> > + const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> > + if (IN6_IS_ADDR_LOOPBACK(&addr6->sin6_addr)) {
> > + return 60;
> > +- } else if (IN6_IS_ADDR_ULA(&addr6->sin6_addr)) {
>
> and this?

ditto

> @@ +278,5 @@
> > + }
> > +
> > + static void
> > +-_sethtent(FILE **hostf)
> > ++_sethtent(_pseudo_FILE *hostf)
>
> can't you just
>
> #undef FILE
> #define FILE _pseudo_FILE
>
> for this and similar redefs. at the top of this file? It will safe a bunch
> of these changes.

No. Note the different number of *s. I changed the APIs so that the _pseudo_FILE struct is caller-allocated, in order to avo...

Read more...

(In reply to Doug Turner (:dougt) from comment #27)
> mwu - are you comfortable with the license here? is it similar to the
> APKOpen stuff?

The license looks fine to me but IANAL. You can always open up a bug to have legal check. The APKOpen stuff was also lifted from bionic.

It would probably be nice to have *just* the bionic files come in as their own check-in and then the actual changes/integration in another patch.

(In reply to Michael K. Edwards from comment #28)
> (In reply to Doug Turner (:dougt) from comment #27)
> > Comment on attachment 563867 [diff] [details] [review] [diff] [details] [review]
> > Updated patch, with Honeycomb handling
> >
> > Review of attachment 563867 [diff] [details] [review] [diff] [details] [review]:
> > -----------------------------------------------------------------
> >
> > looking better. Please fixup. mwu should also look at these changes when
> > the patch is cleaned up.
> >
> > mwu - are you comfortable with the license here? is it similar to the
> > APKOpen stuff?
> >
> > ::: memory/mozutils/Makefile.in
> > @@ +83,5 @@
> > > ifeq (Android, $(OS_TARGET))
> > > # Add Android linker
> > > EXTRA_DSO_LDOPTS += $(ZLIB_LIBS)
> > > SHARED_LIBRARY_LIBS += $(call EXPAND_LIBNAME_PATH,android,$(DEPTH)/other-licenses/android)
> > > +WRAP_LDFLAGS = -Wl,--wrap=getaddrinfo,--wrap=freeaddrinfo,--wrap=gai_strerror
> >
> > Why is this needed, but wasn't needed to wrap malloc?
>
> Now that we call __real_getaddrinfo() from __wrap_getaddrinfo(), libmozutils
> linking breaks without it.

Just call getaddrinfo() from __wrap_getaddrinfo(), and you don't need -Wl,--wrap... like __wrap_dl* functions call dl* (see dlfcn.h)

(In reply to Mike Hommey [:glandium] from comment #31)
> Just call getaddrinfo() from __wrap_getaddrinfo(), and you don't need
> -Wl,--wrap... like __wrap_dl* functions call dl* (see dlfcn.h)

err, dlfcn.c.

How are we doing with a new patch here?

Created attachment 564480
Updated patch, addressing review comments

This updated patch isn't properly tested yet. I also didn't address Mike Hommey's comment, because I think it's safer to use __real_getaddrinfo() and thus require that direct users of getaddrinfo() use the --wrap flags. (Indirect users don't need them.)

michael, did you look at the debug test failures? Are they random, or were they caused by this patch?

I looked at them with people on #mobile (mbrubeck, IIRC) on Friday, who declared them to be symptoms of perma-orange issues on Android, unrelated to my patch.

(In reply to Michael K. Edwards from comment #35)
> This updated patch isn't properly tested yet. I also didn't address Mike
> Hommey's comment, because I think it's safer to use __real_getaddrinfo() and
> thus require that direct users of getaddrinfo() use the --wrap flags.
> (Indirect users don't need them.)

There could be a concern if libmozutils was much bigger and was doing much more. Anyways, if ted is fine with that (i.e. using __real_getaddrinfo and --wrap flags), then so be it.

I defer to glandium here.

ooook. So, the only calls intended for libc's getaddrinfo() from libmozutils, at the moment, are from that wrapper code. I don't see any reason in the near future (and even longer term) for something from libmozutils to call getaddrinfo(). I don't think there's a strong case for cluttering the build rules with additional --wrap flags (for libmozutils, that is ; other uses are required).

The point is not that other code in libmozutils needs to call getaddrinfo(); it doesn't. The point is that libraries that call getaddrinfo() need to be linked against libmozutils, with the appropriate --wrap flags, in order to ensure that their calls to getaddrinfo() get rewritten as calls to __wrap_getaddrinfo(). Otherwise they will get the wrong getaddrinfo() (the bionic one) when the library is loaded via dlopen(). The only occurrence of the getaddrinfo() symbol in the final set of binaries should be the one inside the implementation of __wrap_getaddrinfo(); and the "official" way to achieve that is to call __real_getaddrinfo() in the source code and let the linker rewrite that to getaddrinfo().

the "official" way to achieve that is for when all your code compiles to a program or library. Here, we have libmozutils, and the rest. The rest needs --wrap so that getaddrinfo() calls are properly redirected to __wrap_getaddrinfo and linked against the corresponding symbols in libmozutils. Libmozutils, on the other hand, just needs to declare __wrap_getaddrinfo functions, and these functions can just call getaddrinfo. No need for __real_getaddrinfo, no need for additional --wrap in memory/mozutils/Makefile.in. That was my point.

Fathi Boudra (fboudra) on 2011-10-05
Changed in linaro-android:
status: New → Triaged

medwards is heads down in B2G work and mwu agreed to take this over the finish line and land it. Reassigning.

The backout was backed out until we can figure out which of several patches broke Android tests:
https://hg.mozilla.org/mozilla-central/rev/bcf12565b95b

Leaving this bug open, as we still plan to back it out ASAP.

The crashes look exotic to me, in that they are segfaults due to an assortment of unreasonable values for the "name" pointer, at what happens to be the first place where it is dereferenced after it is received through the inter-thread dispatch mechanism. Either it doesn't make it through dispatch correctly or it gets corrupted thereafter. The only reasonable thing I can think of to do is to dereference the pointer in the caller (nsHostResolver::ThreadFunc) and re-land the patch (with the fixes from 693280), and see whether the crashes appear in ThreadFunc or down in strlen. (One could also add a couple more dereferences along the way, in __wrap_getaddrinfo and _dns_getaddrinfo.)

This was also re-landed and then re-backed out on inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a5b88a4500aa
https://hg.mozilla.org/integration/mozilla-inbound/rev/09ca3ba9ccb0

(These should be a no-op in the next merge from inbound to m-c. Inbound sheriff, you can just leave this bug OPEN after merging these two changesets.)

jchen points out that it could be strlen(domain) that fails, and that the domain pointer is extracted from a structure returned by a function internal to the NetBSD resolver code (__res_get_state()), which we haven't copied. If this is where the problem originates -- due to incompatible layout of struct _res_state, or some such -- we're going to have to pull in res_state.c, and perhaps more (if we also need to replace res_ninit()). Does someone who can reproduce the crashes want to check this hypothesis before we dive further into it?

Changed in firefox:
status: In Progress → Confirmed

This seems to be quite a rat hole here. Is there an easier fix? Can we do DNS lookups from one thread only? Can we ferry the lookups to the Java layer? (which would give us free caching, btw). Doug? Brian?

I suggested that we just put a lock around the call site in comment #1, then fix it for real. This allows us to unfuck our users. We can fix it correctly post that commit.

medwards was argueing that might not help since the crash is in stdio, which is called from other paths as well. Though, since this is all one huge clusterfuck in Android, I would be down to just giving that a try and see whether the crash frequency goes down. Also, we should check what webkit does.

I do think that throwing a lock around the use of getaddrinfo() will reduce the crash frequency, because the contention on the stdio FILE table is most often caused by concurrent DNS requests initiated from two or more worker threads. It won't *fix* the problem; but it will probably stem the bleeding.

(In reply to Patrick McManus from comment #2)
> If we really are forced to go down this path I would prefer
> MAX_RESOLVER_THREADS_FOR_ANY_PRIORITY, and
> MAX_RESOLVER_THREADS_FOR_HIGH_PRIORITY were changed into prefs that could be
> set to {1,0} for this case rather than the lock. But I really think that's a
> big step back for phones as modern as gingerbread!

Let's stop the crash now and deal with the performance fallout later.

Patrick, could you take this bug? It seems to be something that urgently needs to be fixed and you seem to know best what needs to be done. If you cannot take it up right away, please assign it to me and I will figure it out.

Created attachment 566450
patch

Lets get this minimal fix landed. I heard we only have around 10k honeycomb users, so we can take a couple days for a follow-up fix with dynamic detection.

Comment on attachment 566450
patch

Review of attachment 566450:
-----------------------------------------------------------------

I got any/high backwards in my bugzilla comment.

I'll post a different patch when I test it (shortly!).

bz will probably be the right reviewer.

Created attachment 566517
no-concurrent-android-dns patch 2

This will create concurrency of 1 and disable dns prefetch for android.

A followup bug is really important.

Comment on attachment 566517
no-concurrent-android-dns patch 2

r=me

http://hg.mozilla.org/integration/mozilla-inbound/rev/332bc69d0f2d

please file a high priority followup bug. Resolutions that timeout will severely gum up the works on mobile now.

(In reply to Andreas Gal :gal from comment #57)
> Created attachment 566450 [diff] [details] [review]
> patch
>
> Lets get this minimal fix landed. I heard we only have around 10k honeycomb
> users, so we can take a couple days for a follow-up fix with dynamic
> detection.

FWIW, we have >120K honeycomb users. Not that it changes the type of fix.

Sorry I misspoke. I meant we only have 10k gingerbread (3.0) users. Is that correct Mark?

(In reply to Andreas Gal :gal from comment #64)
> Sorry I misspoke. I meant we only have 10k gingerbread (3.0) users. Is that
> correct Mark?

"Gingerbread (3.0)" is confusing me, but the breakdown:

Gingerbread (2.3.*): >1.4 million users
Honeycomb (3.*): >120K users

Ok, thats curious. In the tablet planning meetings we were using wrong numbers than. Where is the 120k number coming from?

(In reply to Andreas Gal :gal from comment #66)
> Ok, thats curious. In the tablet planning meetings we were using wrong
> numbers than. Where is the 120k number coming from?

The Android Market developer portal

Note that the Android Market gives us numbers of "Active Installs", which I believe counts devices that are active (i.e. still pinging the Market server) and currently have Firefox installed (i.e. have not uninstalled it).

This is very different from our own "Active Daily Users" metric based on blocklist pings, which only counts actual usage.

We currently have a total of about 2,700,000 "Active Installs" but only about 330,000 "Active Daily Users". Comparing usage metrics is impossible unless you actually specify which metric you are talking about.

also note that we typically use a multiplier on ADUs to estimate our actual install base since not everyone uses there browser every day or long enough to get a block list ping. I believe the current multiplier is around 6x. So if we have 10K ADUs we'd estimate 60k active users, which would be half our 120k current installs.

I see. Thanks.

Changed in firefox:
status: Confirmed → Fix Released

Noticed that this was identified as one of the top 3 crashes for Android - adjusted tracking. There should be no stability downside to the patch, but there is also no good way to determine what if any performance regression there would be

Note: patch fixes crash, but does so by disabling parallelism for Android DNS lookups.

Comment on attachment 566517
no-concurrent-android-dns patch 2

[Triage Comment]
* Approving for Aurora given this is a top crasher
* Leaving approval‑mozilla‑beta as "?" since we'd reconsider if another bug causes us to re-roll

Will check into mozilla-aurora after finishing XPCOM proxy patch fixups.

Thanks, Alex, Brian.

We respun for a different issue so decided to take this on beta (and I transplanted to aurora):

http://hg.mozilla.org/releases/mozilla-aurora/rev/4b3528fc67a8
http://hg.mozilla.org/releases/mozilla-beta/rev/dc48fbe54c4a

Adding [qa+] for bug verification tracking.

We'll look at crash-stats before and after the fix, but it would be great if anyone has suggestions on how to verify this bug directly.

Juan, reproducing the bug seems to be difficult, and I don't have a lot to offer other than it was happening mostly on dual core devices, pre-honeycomb. I never observed it myself.

Jchen might be able to offer some more help - he added a stack trace highlighting the weakness in thise code in bug 662936 comment 54.

I think Bug 689989 should be enough. After removing the original workaround, our tests on the Tegra boards should be able to verify the fix.

Amit Pundir (pundiramit) wrote :

Marking old (Gingerbread) open bugs to "Won't fix" state (on Linaro-Android helpdesk) as fixes are done only on ICS track now. Please raise this bug again against latest Linaro-ICS release if it still exist.

Changed in linaro-android:
status: Triaged → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.