Bionic domain name functions are not thread-safe on pre-3.0 Android
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Linaro Android |
Won't Fix
|
High
|
Unassigned | ||
Mozilla Firefox |
Fix Released
|
High
|
Bug Description
In bionic (ugh), domain name functions (getaddrinfo, gethostbyname_r, et al) are not thread-safe because stdio is not thread-safe... These functions rely on stdio for reading from the /etc/hosts file.
AFAIK in Gecko, we launch worker threads which call these domain functions, and we crash inside bionic when we run into such a race condition in stdio. This is the #2 top crasher in 7.0 Beta and #1 in 6.0.
I haven't seen single-core devices running into these crashes, but on dual-core devices they are a lot more severe. The ugly fix would be using locks around getaddrinfo calls. Another workaround would be setting the CPU affinity on worker threads; doesn't fix the issue, but should make it a lot less common. Also, might be possible to not use getaddrinfo, but that seems to have implications in IPv6 support (Bug 626866);
Complete thread on: https:/
In Mozilla Bugzilla #687367, Jimnchen+bmo (jimnchen+bmo) wrote : | #1 |
In Mozilla Bugzilla #687367, Doug-turner (doug-turner) wrote : | #2 |
lets add locking around the domain resolution and ifdef it for #android. This should only be done for pre-honeycomb (i think). jim, want to throw up a patch?
In Mozilla Bugzilla #687367, Patrick McManus (mcmanus-ducksong) wrote : | #3 |
(In reply to Doug Turner (:dougt) from comment #1)
> lets add locking around the domain resolution and ifdef it for #android.
> This should only be done for pre-honeycomb (i think). jim, want to throw up
> a patch?
wow. That could have such a huge negative impact. We need parallelism on high latency operations.
Do you have a pointer to the android getaddrinfo() implementation in question? I can't find it because android.
Maybe we can clone that code and remove the stdio (or just lock it?)?
If we really are forced to go down this path I would prefer MAX_RESOLVER_
In Mozilla Bugzilla #687367, Doug-turner (doug-turner) wrote : | #4 |
slow is better than crashy. You're right, we also could just move the bionic code into nspr (or necko), but I am not sure how much that code fans out. That would be better than locking for sure.
In Mozilla Bugzilla #687367, Jimnchen+bmo (jimnchen+bmo) wrote : | #5 |
(In reply to Patrick McManus from comment #2)
> (In reply to Doug Turner (:dougt) from comment #1)
> > lets add locking around the domain resolution and ifdef it for #android.
> > This should only be done for pre-honeycomb (i think). jim, want to throw up
> > a patch?
>
> wow. That could have such a huge negative impact. We need parallelism on
> high latency operations.
>
> Do you have a pointer to the android getaddrinfo() implementation in
> question? I can't find it because android.
> Is the io conditional on anything - a lazy init perhaps?
>
Here's a mirror (code is forked from netbsd): https:/
fopen() is called inside _sethtent() and fclose() in _endhtent(). Both are called by _files_
(In reply to Doug Turner (:dougt) from comment #3)
> slow is better than crashy. You're right, we also could just move the
> bionic code into nspr (or necko), but I am not sure how much that code fans
> out. That would be better than locking for sure.
Yeah. Also setting affinity will keep parallelism while reducing chance of race condition to pre-dual-core levels, which wasn't really an issue back then (I don't think).
In Mozilla Bugzilla #687367, Patrick McManus (mcmanus-ducksong) wrote : | #6 |
> Here's a mirror (code is forked from netbsd):
> https:/
> getaddrinfo.c
>
> fopen() is called inside _sethtent() and fclose() in _endhtent(). Both are
> called by _files_
Thanks Jim!
Obviously this code can crash on single cpu devices too, it just doesn't happen often. right? It seems worth trying to fix that. I don't really think process affinity is a replacement for a data safety model.
> I don't see a way to skip that
One thing we could do would be to cache the contents of the file in ram in a read-only threadsafe buffer. Update it under lock for the same conditions we call res_ninit() right now.
A slightly uglier approach would be to just put a lock around _files_
Either way means pretty much importing the c file you referenced and calling that version of getaddrinfo on android < honeycomb, but that actually looks relatively easy to do to me. (almost everything in there is static, so just drag it along and make the necessary changes).
In Mozilla Bugzilla #687367, Jimnchen+bmo (jimnchen+bmo) wrote : | #7 |
(In reply to Patrick McManus from comment #5)
> > I don't see a way to skip that
>
> One thing we could do would be to cache the contents of the file in ram in a
> read-only threadsafe buffer. Update it under lock for the same conditions we
> call res_ninit() right now.
>
> A slightly uglier approach would be to just put a lock around
> _files_
> couldn't either of these still have races against the content process saving
> a download, for instance? Or is the issue that all the threads are reading
> /etc/hosts?
Hm, good point; actually stdio is not thread-safe in general :(
There might be other race conditions, but the one I found happens when allocating FILE handles: (the netbsd code had locks but they were removed in the android fork; boo)
https:/
So in theory, even if we lock getaddrinfo, we could race against other stdio usages.
> Either way means pretty much importing the c file you referenced and calling
> that version of getaddrinfo on android < honeycomb, but that actually looks
> relatively easy to do to me. (almost everything in there is static, so just
> drag it along and make the necessary changes).
Thanks for the ideas!
In Mozilla Bugzilla #687367, Azakai (azakai) wrote : | #8 |
Sorry for the long list of questions here.
Do we know why other apps on Android running in parallel do not hit this crash? Do they do all their IO through Java? Or does it have to do with each Android app being a separate unix user somehow?
Also, do we know why we are not hitting this with our two-process model? I guess we simply don't do much stdio stuff in the child? Surely we do plenty of it indirectly though, through various libraries (font loading, for example)?
How sure are we that the problem is limited to getaddrinfo and gethostbyname? Do we run a risk of this crash with every file opened, for example? Or do we feel it is limited to those two functions?
If we pull this code and fix it in our codebase, wouldn't we run into potential problems with different versions of bionic on different phones? (I mean, bionic could be patched for some issue on some device, and our single fixed implementation would not have that stuff.)
Finally, how does the actual threading issue happen: On one side we are calling getaddrinfo or gethostbyname, what is the other thread doing that causes the problem, do we know?
In Mozilla Bugzilla #687367, Patrick McManus (mcmanus-ducksong) wrote : | #9 |
(In reply to Alon Zakai (:azakai) from comment #7)
> Finally, how does the actual threading issue happen: On one side we are
> calling getaddrinfo or gethostbyname, what is the other thread doing that
> causes the problem, do we know?
I would assume, from the way this is described, the other thread is also commonly doing getaddrinfo(). There are (up to) 8 different dedicated DNS threads because getaddrinfo() is a blocking network operation that we need to perform in parallel. The system libraries do not provide an async API for it, so we do it in parallel.
It would also be common for those parallel lookups to start almost simultaneously, maximizing the chances to be executing the same stdio code simultaneously, because they are discovered in the same HTML parse (perhaps as sharded domain names for images or perhaps as DNS pefetches discovered by parsing links).
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #10 |
My plan is to pull in the gingerbread bionic implementation of getaddrinfo() (in libc/netbsd/
I intend to add getaddrinfo.o to libmozutils, inside /other-
visibility: | private → public |
affects: | linaro-landing-team-ti → lucidspice |
description: | updated |
affects: | lucidspice → firefox |
Changed in firefox: | |
importance: | Undecided → Unknown |
status: | New → Unknown |
Changed in firefox: | |
importance: | Unknown → High |
status: | Unknown → In Progress |
Changed in linaro-android: | |
importance: | Undecided → High |
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #11 |
medwards will try to fix this today for android, we need a separate patch for B2G later
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #12 |
Michael, can you please post a status update?
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #13 |
I'm setting up to push the candidate fix to the TryChooser. Should have a testable build this afternoon.
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #14 |
I couldn't find any try server runs with this. Can you please link it?
Also, can you please attach the patch?
Thanks.
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #15 |
Note that I have hacked out the code in getaddrinfo.c that looks aside to dnsproxyd, since the DNS proxy is not enabled in Android 2.3 or earlier, according to http://
(see android_
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #16 |
Created attachment 562974
WIP: Android 2.3 getaddrinfo(), with stdio replaced by mmap(), adjusted for use with --wrap
(Lightly tested in a local build; bsmith is helping with tryserver run)
In Mozilla Bugzilla #687367, Bsmith-mozilla (bsmith-mozilla) wrote : | #17 |
Thanks for your try submission (http://
Watch https:/
Builds and logs will be available at http://<email address hidden>.
This directory won't be created until the first builds are uploaded, so please be patient.
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #18 |
Created attachment 562978
diff android 2.3 bionic code vs WIP patch
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #19 |
Comment on attachment 562978
diff android 2.3 bionic code vs WIP patch
>--- getaddrinfo.c 2011-09-27 23:03:51.000000000 -0700
>+++ /Users/
>@@ -31,16 +31,6 @@
> */
>
> /*
>- * This version of getaddrinfo.c is derived from Android 2.3 "Gingerbread",
>- * which contains uncredited changes by Android/Google developers. It has
>- * been modified in 2011 for use in the Android build of Mozilla Firefox by
>- * Mozilla contributors (including Michael Edwards <email address hidden>).
>- * These changes are offered under the same license as the original NetBSD
>- * file, whose copyright and license are unchanged above.
>- */
>-#define ANDROID_CHANGES 1
>-
>-/*
> * Issues to be discussed:
> * - Thread safe-ness must be checked.
> * - Return values. There are nonstandard return values defined and used
>@@ -87,14 +77,10 @@
> * friends.
> */
>
>-#include <fcntl.h>
> #include <sys/cdefs.h>
> #include <sys/types.h>
>-#include <sys/stat.h>
> #include <sys/param.h>
> #include <sys/socket.h>
>-#include <sys/un.h>
>-#include <sys/mman.h>
> #include <net/if.h>
> #include <netinet/in.h>
> #include <arpa/inet.h>
>@@ -102,10 +88,10 @@
> #include <assert.h>
> #include <ctype.h>
> #include <errno.h>
>-#include <fcntl.h>
> #include <netdb.h>
> #include "resolv_private.h"
> #include <stddef.h>
>+#include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>@@ -114,87 +100,6 @@
> #include <stdarg.h>
> #include "nsswitch.h"
>
>-#ifdef ANDROID_CHANGES
>-#include <sys/system_
>-#endif /* ANDROID_CHANGES */
>-
>-typedef struct _pseudo_FILE {
>- int fd;
>- off_t maplen;
>- void* mapping;
>- off_t offset;
>-} _pseudo_FILE;
>-
>-#define _PSEUDO_
>-
>-static void
>-_pseudo_
>-{
>- fp->offset = 0;
>- if (fp->mapping != MAP_FAILED) {
>- (void) munmap(fp->mapping, fp->maplen);
>- fp->mapping = MAP_FAILED;
>- }
>- fp->maplen = 0;
>- if (fp->fd != -1) {
>- (void) close(fp->fd);
>- fp->fd = -1;
>- }
>-}
>-
>-static _pseudo_FILE*
>-_pseudo_
>-{
>- struct stat statbuf;
>- fp->fd = open(fname, O_RDONLY);
>- if (fp->fd < 0) {
>- fp->fd = -1;
>- return NULL;
>- }
>- if ((0 != fstat(fp->fd, &statbuf)) || (statbuf.st_size <= 0)) {
>- close(fp->fd);
>- fp->fd = -1;
>- return NULL;
>- }
>- fp->maplen = statbuf.st_size;
>- fp->mapping = mmap(NULL, fp->maplen, PROT_READ, MAP_PRIVATE, fp->fd, 0);
>- if (fp->mapping == MAP_FAILED) {
>- close(fp->fd);
>- fp->fd = -1;
>- return NULL;
>- }
>- fp->offset = 0;
>- return fp;
>-}
>-
>-static void
>-_pseudo_
>-{
>- fp->offset = 0;
>-}
>-
>-static char*
>-_pseudo_
>-{
>- char* current;
>- char* endp;
>- int maxcopy = fp->maplen - fp->offset;
>- if (fp->mapping == MAP_FAILED)
>- return NULL;
>- if (maxcopy > bufsize - 1)
>- maxcopy = bufsize - 1;
>- ...
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #20 |
Created attachment 562979
diff android 2.3 bionic code vs WIP patch
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #21 |
Comment on attachment 562974
WIP: Android 2.3 getaddrinfo(), with stdio replaced by mmap(), adjusted for use with --wrap
Review of attachment 562974:
-------
See also https:/
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #22 |
dougt, can you land when you r+, medwards has no commit access yet. thanks
In Mozilla Bugzilla #687367, Doug-turner (doug-turner) wrote : | #23 |
Comment on attachment 562974
WIP: Android 2.3 getaddrinfo(), with stdio replaced by mmap(), adjusted for use with --wrap
Review of attachment 562974:
-------
Discussed on irc. I wanted to see the ofllow:
* add new files, in patched state
* patch that removes the DNS proxy lookaside from the original Android 2.3 files
* patch, incremental, that replaces stdio with mmap
* changes to configure.in and Makefile.in to compile the new file and do the wrap mechanics
* pre-honeycomb only
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #24 |
Detecting >= Honeycomb can only be done from Java. (See http://
In Mozilla Bugzilla #687367, Doug-turner (doug-turner) wrote : | #25 |
JNI can do this pretty easily, right?
In Mozilla Bugzilla #687367, Doug-turner (doug-turner) wrote : | #26 |
how about this Michael... lets do this for all versions of android, then follow up making this pre-honeycomb only.
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #27 |
Created attachment 563867
Updated patch, with Honeycomb handling
In Mozilla Bugzilla #687367, Doug-turner (doug-turner) wrote : | #28 |
Comment on attachment 563867
Updated patch, with Honeycomb handling
Review of attachment 563867:
-------
looking better. Please fixup. mwu should also look at these changes when the patch is cleaned up.
mwu - are you comfortable with the license here? is it similar to the APKOpen stuff?
::: memory/
@@ +83,5 @@
> ifeq (Android, $(OS_TARGET))
> # Add Android linker
> EXTRA_DSO_LDOPTS += $(ZLIB_LIBS)
> SHARED_LIBRARY_LIBS += $(call EXPAND_
> +WRAP_LDFLAGS = -Wl,--wrap=
Why is this needed, but wasn't needed to wrap malloc?
::: other-licenses/
@@ +28,5 @@
> +@@ -94,7 +105,6 @@
> + #include <netdb.h>
> + #include "resolv_private.h"
> + #include <stddef.h>
> +-#include <stdio.h>
why this change?
@@ +54,5 @@
> ++
> ++#define _PSEUDO_
> ++
> ++static void
> ++_pseudo_
test for a null fp? and elsewhere
@@ +111,5 @@
> ++ if (maxcopy > bufsize - 1)
> ++ maxcopy = bufsize - 1;
> ++ if (maxcopy <= 0)
> ++ return NULL;
> ++ //fprintf(stderr, "_pseudo_fgets(): copying up to %d bytes\n", maxcopy);
remove
@@ +259,5 @@
> +@@ -1373,8 +1508,6 @@
> + const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> + if (IN6_IS_
> + return 0;
> +- } else if (IN6_IS_
what is this change about?
@@ +268,5 @@
> +@@ -1414,8 +1547,6 @@
> + const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> + if (IN6_IS_
> + return 60;
> +- } else if (IN6_IS_
and this?
@@ +278,5 @@
> + }
> +
> + static void
> +-_sethtent(FILE **hostf)
> ++_sethtent(
can't you just
#undef FILE
#define FILE _pseudo_FILE
for this and similar redefs. at the top of this file? It will safe a bunch of these changes.
@@ +313,5 @@
> + const char *addr;
> + char hostbuf[8*1024];
> +
> +-// fprintf(stderr, "_gethtent() name = '%s'\n", name);
> ++ //fprintf(stderr, "_gethtent() name = '%s'\n", name);
revert this change.
@@ +335,5 @@
> + tname = cp;
> + if ((cp = strpbrk(cp, " \t")) != NULL)
> + *cp++ = '\0';
> +-// fprintf(stderr, "\ttname = '%s'", tname);
> ++ //fprintf(stderr, "\ttname = '%s'\n", tname);
same.
@@ +368,5 @@
> + name = va_arg(ap, char *);
> + pai = va_arg(ap, struct addrinfo *);
> +
> +-// fprintf(stderr, "_files_
> ++ //fprintf(stderr, "_files_
same.
::: xpcom/base/
@@ +48,5 @@
> #ifdef ANDROID
> #include "AndroidBridge.h"
> +
> +extern "C" {
> +extern volatile int android_
volatile keyword not needed. you should not use extern here either. (this is the implicit definition). you will need extern for the explicit declaration (where it is used outside of this file)
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #29 |
(In reply to Doug Turner (:dougt) from comment #27)
> Comment on attachment 563867 [diff] [details] [review]
> Updated patch, with Honeycomb handling
>
> Review of attachment 563867 [diff] [details] [review]:
> -------
>
> looking better. Please fixup. mwu should also look at these changes when
> the patch is cleaned up.
>
> mwu - are you comfortable with the license here? is it similar to the
> APKOpen stuff?
>
> ::: memory/
> @@ +83,5 @@
> > ifeq (Android, $(OS_TARGET))
> > # Add Android linker
> > EXTRA_DSO_LDOPTS += $(ZLIB_LIBS)
> > SHARED_LIBRARY_LIBS += $(call EXPAND_
> > +WRAP_LDFLAGS = -Wl,--wrap=
>
> Why is this needed, but wasn't needed to wrap malloc?
Now that we call __real_
> ::: other-licenses/
> @@ +28,5 @@
> > +@@ -94,7 +105,6 @@
> > + #include <netdb.h>
> > + #include "resolv_private.h"
> > + #include <stddef.h>
> > +-#include <stdio.h>
>
> why this change?
Because the point of the patch is to remove all use of stdio from getaddrinfo(). The commented-out debug log lines, if restored, should be changed from stderr to the Android log mechanism (which is useful on a non-rooted device, unlike stderr).
> @@ +54,5 @@
> > ++
> > ++#define _PSEUDO_
> > ++
> > ++static void
> > ++_pseudo_
>
> test for a null fp? and elsewhere
Really just needs a "restrict" keyword, since it's never valid to pass it a NULL pointer.
> @@ +111,5 @@
> > ++ if (maxcopy > bufsize - 1)
> > ++ maxcopy = bufsize - 1;
> > ++ if (maxcopy <= 0)
> > ++ return NULL;
> > ++ //fprintf(stderr, "_pseudo_fgets(): copying up to %d bytes\n", maxcopy);
>
> remove
remove what? just the commented-out log line?
> @@ +259,5 @@
> > +@@ -1373,8 +1508,6 @@
> > + const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> > + if (IN6_IS_
> > + return 0;
> > +- } else if (IN6_IS_
>
> what is this change about?
That macro isn't defined by the older headers in the NDK we're using. It's a side effect of another patch that was already applied to the version of getaddrinfo.c I started from; we want the rest of that patch.
> @@ +268,5 @@
> > +@@ -1414,8 +1547,6 @@
> > + const struct sockaddr_in6 *addr6 = (const struct sockaddr_in6 *)addr;
> > + if (IN6_IS_
> > + return 60;
> > +- } else if (IN6_IS_
>
> and this?
ditto
> @@ +278,5 @@
> > + }
> > +
> > + static void
> > +-_sethtent(FILE **hostf)
> > ++_sethtent(
>
> can't you just
>
> #undef FILE
> #define FILE _pseudo_FILE
>
> for this and similar redefs. at the top of this file? It will safe a bunch
> of these changes.
No. Note the different number of *s. I changed the APIs so that the _pseudo_FILE struct is caller-allocated, in order to avo...
In Mozilla Bugzilla #687367, Mwu-mozilla (mwu-mozilla) wrote : | #30 |
(In reply to Doug Turner (:dougt) from comment #27)
> mwu - are you comfortable with the license here? is it similar to the
> APKOpen stuff?
The license looks fine to me but IANAL. You can always open up a bug to have legal check. The APKOpen stuff was also lifted from bionic.
It would probably be nice to have *just* the bionic files come in as their own check-in and then the actual changes/integration in another patch.
In Mozilla Bugzilla #687367, Doug-turner (doug-turner) wrote : | #31 |
pushed to try: https:/
In Mozilla Bugzilla #687367, Mh+mozilla (mh+mozilla) wrote : | #32 |
(In reply to Michael K. Edwards from comment #28)
> (In reply to Doug Turner (:dougt) from comment #27)
> > Comment on attachment 563867 [diff] [details] [review] [diff] [details] [review]
> > Updated patch, with Honeycomb handling
> >
> > Review of attachment 563867 [diff] [details] [review] [diff] [details] [review]:
> > -------
> >
> > looking better. Please fixup. mwu should also look at these changes when
> > the patch is cleaned up.
> >
> > mwu - are you comfortable with the license here? is it similar to the
> > APKOpen stuff?
> >
> > ::: memory/
> > @@ +83,5 @@
> > > ifeq (Android, $(OS_TARGET))
> > > # Add Android linker
> > > EXTRA_DSO_LDOPTS += $(ZLIB_LIBS)
> > > SHARED_LIBRARY_LIBS += $(call EXPAND_
> > > +WRAP_LDFLAGS = -Wl,--wrap=
> >
> > Why is this needed, but wasn't needed to wrap malloc?
>
> Now that we call __real_
> linking breaks without it.
Just call getaddrinfo() from __wrap_
In Mozilla Bugzilla #687367, Mh+mozilla (mh+mozilla) wrote : | #33 |
(In reply to Mike Hommey [:glandium] from comment #31)
> Just call getaddrinfo() from __wrap_
> -Wl,--wrap... like __wrap_dl* functions call dl* (see dlfcn.h)
err, dlfcn.c.
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #34 |
How are we doing with a new patch here?
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #35 |
Created attachment 564480
Updated patch, addressing review comments
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #36 |
This updated patch isn't properly tested yet. I also didn't address Mike Hommey's comment, because I think it's safer to use __real_
In Mozilla Bugzilla #687367, Doug-turner (doug-turner) wrote : | #37 |
michael, did you look at the debug test failures? Are they random, or were they caused by this patch?
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #38 |
I looked at them with people on #mobile (mbrubeck, IIRC) on Friday, who declared them to be symptoms of perma-orange issues on Android, unrelated to my patch.
In Mozilla Bugzilla #687367, Mh+mozilla (mh+mozilla) wrote : | #39 |
(In reply to Michael K. Edwards from comment #35)
> This updated patch isn't properly tested yet. I also didn't address Mike
> Hommey's comment, because I think it's safer to use __real_
> thus require that direct users of getaddrinfo() use the --wrap flags.
> (Indirect users don't need them.)
There could be a concern if libmozutils was much bigger and was doing much more. Anyways, if ted is fine with that (i.e. using __real_getaddrinfo and --wrap flags), then so be it.
In Mozilla Bugzilla #687367, Ted Mielczarek (ted-mielczarek) wrote : | #40 |
I defer to glandium here.
In Mozilla Bugzilla #687367, Mh+mozilla (mh+mozilla) wrote : | #41 |
ooook. So, the only calls intended for libc's getaddrinfo() from libmozutils, at the moment, are from that wrapper code. I don't see any reason in the near future (and even longer term) for something from libmozutils to call getaddrinfo(). I don't think there's a strong case for cluttering the build rules with additional --wrap flags (for libmozutils, that is ; other uses are required).
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #42 |
The point is not that other code in libmozutils needs to call getaddrinfo(); it doesn't. The point is that libraries that call getaddrinfo() need to be linked against libmozutils, with the appropriate --wrap flags, in order to ensure that their calls to getaddrinfo() get rewritten as calls to __wrap_
In Mozilla Bugzilla #687367, Mh+mozilla (mh+mozilla) wrote : | #43 |
the "official" way to achieve that is for when all your code compiles to a program or library. Here, we have libmozutils, and the rest. The rest needs --wrap so that getaddrinfo() calls are properly redirected to __wrap_getaddrinfo and linked against the corresponding symbols in libmozutils. Libmozutils, on the other hand, just needs to declare __wrap_getaddrinfo functions, and these functions can just call getaddrinfo. No need for __real_getaddrinfo, no need for additional --wrap in memory/
Changed in linaro-android: | |
status: | New → Triaged |
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #44 |
medwards is heads down in B2G work and mwu agreed to take this over the finish line and land it. Reassigning.
In Mozilla Bugzilla #687367, Mwu-mozilla (mwu-mozilla) wrote : | #45 |
Landed with wrapping adjusted.
In Mozilla Bugzilla #687367, Blassey-bugs (blassey-bugs) wrote : | #46 |
backed out due to the crashes reported in bug 693103:
https:/
In Mozilla Bugzilla #687367, Mbrubeck-x (mbrubeck-x) wrote : | #47 |
The backout was backed out until we can figure out which of several patches broke Android tests:
https:/
Leaving this bug open, as we still plan to back it out ASAP.
In Mozilla Bugzilla #687367, Mbrubeck-x (mbrubeck-x) wrote : | #48 |
Re-landed the backout:
https:/
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #49 |
The crashes look exotic to me, in that they are segfaults due to an assortment of unreasonable values for the "name" pointer, at what happens to be the first place where it is dereferenced after it is received through the inter-thread dispatch mechanism. Either it doesn't make it through dispatch correctly or it gets corrupted thereafter. The only reasonable thing I can think of to do is to dereference the pointer in the caller (nsHostResolver
In Mozilla Bugzilla #687367, Mbrubeck-x (mbrubeck-x) wrote : | #50 |
This was also re-landed and then re-backed out on inbound:
https:/
https:/
(These should be a no-op in the next merge from inbound to m-c. Inbound sheriff, you can just leave this bug OPEN after merging these two changesets.)
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #51 |
jchen points out that it could be strlen(domain) that fails, and that the domain pointer is extracted from a structure returned by a function internal to the NetBSD resolver code (__res_
In Mozilla Bugzilla #687367, Mak77 (mak77) wrote : | #52 |
Changed in firefox: | |
status: | In Progress → Confirmed |
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #53 |
This seems to be quite a rat hole here. Is there an easier fix? Can we do DNS lookups from one thread only? Can we ferry the lookups to the Java layer? (which would give us free caching, btw). Doug? Brian?
In Mozilla Bugzilla #687367, Doug-turner (doug-turner) wrote : | #54 |
I suggested that we just put a lock around the call site in comment #1, then fix it for real. This allows us to unfuck our users. We can fix it correctly post that commit.
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #55 |
medwards was argueing that might not help since the crash is in stdio, which is called from other paths as well. Though, since this is all one huge clusterfuck in Android, I would be down to just giving that a try and see whether the crash frequency goes down. Also, we should check what webkit does.
In Mozilla Bugzilla #687367, Michael K. Edwards (m-k-edwards) wrote : | #56 |
I do think that throwing a lock around the use of getaddrinfo() will reduce the crash frequency, because the contention on the stdio FILE table is most often caused by concurrent DNS requests initiated from two or more worker threads. It won't *fix* the problem; but it will probably stem the bleeding.
In Mozilla Bugzilla #687367, Bsmith-mozilla (bsmith-mozilla) wrote : | #57 |
(In reply to Patrick McManus from comment #2)
> If we really are forced to go down this path I would prefer
> MAX_RESOLVER_
> MAX_RESOLVER_
> set to {1,0} for this case rather than the lock. But I really think that's a
> big step back for phones as modern as gingerbread!
Let's stop the crash now and deal with the performance fallout later.
Patrick, could you take this bug? It seems to be something that urgently needs to be fixed and you seem to know best what needs to be done. If you cannot take it up right away, please assign it to me and I will figure it out.
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #58 |
Created attachment 566450
patch
Lets get this minimal fix landed. I heard we only have around 10k honeycomb users, so we can take a couple days for a follow-up fix with dynamic detection.
In Mozilla Bugzilla #687367, Patrick McManus (mcmanus-ducksong) wrote : | #59 |
Comment on attachment 566450
patch
Review of attachment 566450:
-------
I got any/high backwards in my bugzilla comment.
I'll post a different patch when I test it (shortly!).
bz will probably be the right reviewer.
In Mozilla Bugzilla #687367, Patrick McManus (mcmanus-ducksong) wrote : | #60 |
Created attachment 566517
no-concurrent-
This will create concurrency of 1 and disable dns prefetch for android.
A followup bug is really important.
In Mozilla Bugzilla #687367, Bzbarsky (bzbarsky) wrote : | #61 |
Comment on attachment 566517
no-concurrent-
r=me
In Mozilla Bugzilla #687367, Patrick McManus (mcmanus-ducksong) wrote : | #62 |
http://
please file a high priority followup bug. Resolutions that timeout will severely gum up the works on mobile now.
In Mozilla Bugzilla #687367, Mak77 (mak77) wrote : | #63 |
In Mozilla Bugzilla #687367, Mark-finkle (mark-finkle) wrote : | #64 |
(In reply to Andreas Gal :gal from comment #57)
> Created attachment 566450 [diff] [details] [review]
> patch
>
> Lets get this minimal fix landed. I heard we only have around 10k honeycomb
> users, so we can take a couple days for a follow-up fix with dynamic
> detection.
FWIW, we have >120K honeycomb users. Not that it changes the type of fix.
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #65 |
Sorry I misspoke. I meant we only have 10k gingerbread (3.0) users. Is that correct Mark?
In Mozilla Bugzilla #687367, Mark-finkle (mark-finkle) wrote : | #66 |
(In reply to Andreas Gal :gal from comment #64)
> Sorry I misspoke. I meant we only have 10k gingerbread (3.0) users. Is that
> correct Mark?
"Gingerbread (3.0)" is confusing me, but the breakdown:
Gingerbread (2.3.*): >1.4 million users
Honeycomb (3.*): >120K users
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #67 |
Ok, thats curious. In the tablet planning meetings we were using wrong numbers than. Where is the 120k number coming from?
In Mozilla Bugzilla #687367, Mark-finkle (mark-finkle) wrote : | #68 |
(In reply to Andreas Gal :gal from comment #66)
> Ok, thats curious. In the tablet planning meetings we were using wrong
> numbers than. Where is the 120k number coming from?
The Android Market developer portal
In Mozilla Bugzilla #687367, Mbrubeck-x (mbrubeck-x) wrote : | #69 |
Note that the Android Market gives us numbers of "Active Installs", which I believe counts devices that are active (i.e. still pinging the Market server) and currently have Firefox installed (i.e. have not uninstalled it).
This is very different from our own "Active Daily Users" metric based on blocklist pings, which only counts actual usage.
We currently have a total of about 2,700,000 "Active Installs" but only about 330,000 "Active Daily Users". Comparing usage metrics is impossible unless you actually specify which metric you are talking about.
In Mozilla Bugzilla #687367, Blassey-bugs (blassey-bugs) wrote : | #70 |
also note that we typically use a multiplier on ADUs to estimate our actual install base since not everyone uses there browser every day or long enough to get a block list ping. I believe the current multiplier is around 6x. So if we have 10K ADUs we'd estimate 60k active users, which would be half our 120k current installs.
In Mozilla Bugzilla #687367, Andreas Gal (gal) wrote : | #71 |
I see. Thanks.
Changed in firefox: | |
status: | Confirmed → Fix Released |
In Mozilla Bugzilla #687367, Sjhworkman (sjhworkman) wrote : | #72 |
Noticed that this was identified as one of the top 3 crashes for Android - adjusted tracking. There should be no stability downside to the patch, but there is also no good way to determine what if any performance regression there would be
Note: patch fixes crash, but does so by disabling parallelism for Android DNS lookups.
In Mozilla Bugzilla #687367, Akeybl (akeybl) wrote : | #73 |
Comment on attachment 566517
no-concurrent-
[Triage Comment]
* Approving for Aurora given this is a top crasher
* Leaving approval‑
In Mozilla Bugzilla #687367, Bsmith-mozilla (bsmith-mozilla) wrote : | #74 |
Will check into mozilla-aurora after finishing XPCOM proxy patch fixups.
In Mozilla Bugzilla #687367, Sjhworkman (sjhworkman) wrote : | #75 |
Thanks, Alex, Brian.
In Mozilla Bugzilla #687367, Clegnitto (clegnitto) wrote : | #76 |
We respun for a different issue so decided to take this on beta (and I transplanted to aurora):
http://
http://
In Mozilla Bugzilla #687367, Jbecerra-mozilla (jbecerra-mozilla) wrote : | #77 |
Adding [qa+] for bug verification tracking.
We'll look at crash-stats before and after the fix, but it would be great if anyone has suggestions on how to verify this bug directly.
In Mozilla Bugzilla #687367, Sjhworkman (sjhworkman) wrote : | #78 |
Juan, reproducing the bug seems to be difficult, and I don't have a lot to offer other than it was happening mostly on dual core devices, pre-honeycomb. I never observed it myself.
Jchen might be able to offer some more help - he added a stack trace highlighting the weakness in thise code in bug 662936 comment 54.
In Mozilla Bugzilla #687367, Jimnchen+bmo (jimnchen+bmo) wrote : | #79 |
I think Bug 689989 should be enough. After removing the original workaround, our tests on the Tegra boards should be able to verify the fix.
Amit Pundir (pundiramit) wrote : | #80 |
Marking old (Gingerbread) open bugs to "Won't fix" state (on Linaro-Android helpdesk) as fixes are done only on ICS track now. Please raise this bug again against latest Linaro-ICS release if it still exist.
Changed in linaro-android: | |
status: | Triaged → Won't Fix |
Offshoot from Bug 662936, which was the same bug occurring on dual-core Tegras running talos.
In bionic (ugh), domain name functions (getaddrinfo, gethostbyname_r, et al) are not thread-safe because stdio is not thread-safe... These functions rely on stdio for reading from the /etc/hosts file.
AFAIK in Gecko, we launch worker threads which call these domain functions, and we crash inside bionic when we run into such a race condition in stdio. This is the #2 top crasher in 7.0 Beta and #1 in 6.0.
I haven't seen single-core devices running into these crashes, but on dual-core devices they are a lot more severe. The ugly fix would be using locks around getaddrinfo calls. Another workaround would be setting the CPU affinity on worker threads; doesn't fix the issue, but should make it a lot less common. Also, might be possible to not use getaddrinfo, but that seems to have implications in IPv6 support (Bug 626866); I don't know enough to say.