wget 1.13.4 crashed with SIGSEGV in malloc_consolidate()

Bug #1022124 reported by Jim Salter
This bug affects 1 person
Affects         Status    Importance  Assigned to  Milestone
wget            Unknown   Unknown
wget (Ubuntu)   New       Undecided   Unassigned

Bug Description

wget on Precise (wget v1.13.4) crashes with a segfault when mirroring sites. THIS IS A REGRESSION: the same behavior does not occur in wget v1.12 as found on Ubuntu Lucid.

The system under test is a fully up-to-date Ubuntu Precise x64 (Server); the comparison machine, which does not segfault, is a fully up-to-date Ubuntu Lucid x64 (Server).

Sample output:

<pre>
me@box:/tmp$ wget -m --delete-after http://www.[redacted]/

[several pages of OK output redacted]

Served from: www.[redacted] @ 2012-07-07 11:56:50 -->] done.
2012-07-07 13:56:50 ERROR 404: Not Found.

Dequeuing http://www.[redacted]/shows-page/ at depth 1
Queue count 144, maxcount 149.
--2012-07-07 13:56:50-- http://www.[redacted]/shows-page/
Disabling further reuse of socket 3.
Closed fd 3
Found www.[redacted] in host_name_addresses_map (0x23fd220)
Segmentation fault
</pre>

Using wget directly on the page which appears to have been processing during the segfault does NOT result in another segfault:

<pre>
root@www:/tmp# wget --delete-after http://www.[redacted]/shows-page/
--2012-07-07 14:00:03-- http://www.[redacted]/shows-page/
Resolving www.[redacted] (www.[redacted])... 173.193.169.42
Connecting to www.[redacted] (www.[redacted])|173.193.169.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html'

    [ <=> ] 84,289 --.-K/s in 0s

2012-07-07 14:00:07 (161 MB/s) - `index.html' saved [84289]

Removing index.html.
</pre>

Jim Salter (jrssnet) wrote :

Ah HA! I don't know about the bug in wget, but I found the oddness in the site being crawled which was *causing* wget to trip *its* bug:

<script src=”http://ajax.googleapis.com/ajax/libs/jquery/1.5/jquery.min.js”></script>

Took forever to spot this: somebody put "pretty quotes" in a CSS file in a WordPress theme - browsers, wget included, don't recognize the pretty quotes as quotes for coding purposes, so you end up trying to fetch a really, really broken URL:

http://www.[redacted]/%E2%80%9Dhttp:/ajax.googleapis.com/ajax/libs/jquery/1.5/jquery.min.js%E2%80%9D

Most browsers just try to get that URL, fail at it, and move on with life: but the new version of wget in 12.04 actually *segfaults* when it encounters that, which of course it only will if recursion is turned on. If it helps: you really do ONLY get this in recursion; an attempt to fetch the botched URL manually - either using the HTML escape codes, or using the prettyquotes directly at the shell - results in the expected 404, not a segfault.
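
As a rough illustration of where the `%E2%80%9D` in that URL comes from: U+201D (the right curly quote) is the three bytes E2 80 9D in UTF-8, and each non-ASCII byte ends up percent-escaped. The sketch below is not wget's code; `escape_non_ascii` is a hypothetical helper that escapes only bytes >= 0x80 and copies everything else unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of wget.
   Percent-encode every byte >= 0x80 in `in`; copy ASCII bytes unchanged.
   Writes a NUL-terminated result to `out`.  Returns 0 on success, or -1
   if `out` (of size `outsz`) is too small to hold the result.  */
int
escape_non_ascii (const char *in, char *out, size_t outsz)
{
  static const char hex[] = "0123456789ABCDEF";
  size_t o = 0;

  assert (in != NULL && out != NULL);
  for (; *in != '\0'; in++)
    {
      unsigned char c = (unsigned char) *in;
      size_t need = (c >= 0x80) ? 3 : 1;

      /* Always leave room for the terminating NUL.  */
      if (o + need + 1 > outsz)
        return -1;
      if (c >= 0x80)
        {
          out[o++] = '%';
          out[o++] = hex[c >> 4];
          out[o++] = hex[c & 0x0F];
        }
      else
        out[o++] = (char) c;
    }
  if (o + 1 > outsz)
    return -1;
  out[o] = '\0';
  return 0;
}
```

Feeding it the bytes `\xE2\x80\x9D` followed by `http:` gives `%E2%80%9Dhttp:`, which is the prefix seen in the broken URL.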

Jason Conti (jconti) wrote :

I just tried creating a sample website where one of the files contained the script line above, and I cannot reproduce the crash (index.html links to base/test.html which has the <script> line). With wget -m (and -r) I just get:

--2012-07-08 14:02:14-- http://localhost:8998/base/%E2%80%9Dhttp://ajax.googleapis.com/ajax/libs/jquery/1.5/jquery.min.js%E2%80%9D
Connecting to localhost (localhost)|127.0.0.1|:8998... connected.
HTTP request sent, awaiting response... 404 File not found
2012-07-08 14:02:14 ERROR 404: File not found.

Could you provide a sample file (or collection of files) that consistently produces the segfault? Or a gdb backtrace of the crash? (Or both?)

Some information on the web server might be useful too. The above test was just against the built-in python http server.

Jim Salter (jrssnet) wrote :

I had trouble reproducing it in a clean environment too, Jason - I made a plain HTML file that had the include line in it and fetched that recursively, but like you, just got a 404.

I can't give you a copy of the "bad" stuff, because it's a production website - database-and-PHP, on Apache, very few actual "files" involved. I could get you a gdb trace, if you'll give me an example command line to follow, though. All I have to do is put the pretty quotes back in, and I'll get segfaulting wget once more - and I don't mind doing that on a dev environment.

description: updated
Jason Conti (jconti) wrote :

Running:

gdb --args wget -m url;

Followed by 'run' until the sigsegv, and then 'bt' should get a decent trace, though having at least libc6-dbg installed as well as the dbgsym package for wget from http://ddebs.ubuntu.com/pool/main/w/wget/ will get us a better trace.

Jim Salter (jrssnet) wrote :

OK. Re-broke the site, installed all that you requested, replicated the sigsegv and got your backtrace.

root@www:/tmp# gdb --args wget -m -nd --delete-after http://www.[redacted]/
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /usr/bin/wget...Reading symbols from /usr/lib/debug/usr/bin/wget...done.
done.
(gdb) run
Starting program: /usr/bin/wget -m -nd --delete-after http://www.[redacted]/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--2012-07-08 15:29:24-- http://www.[redacted]/
Resolving www.[redacted] (www.[redacted])... 127.0.0.1, 173.193.169.42
Connecting to www.[redacted] (www.[redacted])|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html'

    [ <=> ] 120,371 --.-K/s in 0.001s

2012-07-08 15:29:29 (167 MB/s) - `index.html' saved [120371]

Loading robots.txt; please ignore errors.
--2012-07-08 15:29:29-- http://www.[redacted]/robots.txt
Connecting to www.[redacted] (www.[redacted])|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 110 [text/plain]
Saving to: `robots.txt'

100%[=========================================================================>] 110 --.-K/s in 0s

2012-07-08 15:29:29 (9.53 MB/s) - `robots.txt' saved [110/110]

Removing robots.txt.
Removing index.html.

--2012-07-08 15:29:29-- http://www.[redacted]/feed/
Connecting to www.[redacted] (www.[redacted])|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: `index.html'

    [ <=> ] 46,030 --.-K/s in 0s

2012-07-08 15:29:30 (194 MB/s) - `index.html' saved [46030]

Removing index.html.

--2012-07-08 15:29:30-- http://www.[redacted]/feed/atom/
Connecting to www.[redacted] (www.[redacted])|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/atom+xml]
Saving to: `index.html'

    [ <=> ] 50,910 --.-K/s in 0s

2012-07-08 15:29:32 (169 MB/s) - `index.html' saved [50910]

Removing index.html.

--2012-07-08 15:29:32-- http://www.[redacted]/xmlrpc.php
Connecting to www.[redacted] (www.[redacted])|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54 [text/plain]
Saving to: `xmlrpc.php'

100%[=========================================================================>] 54 --.-K/s in 0s

Last-modified header miss...


Jason Conti (jconti)
summary: - segfault in wget 1.13.4
+ wget 1.13.4 crashed with SIGSEGV in malloc_consolidate()
Jason Conti (jconti) wrote :

Thanks for the backtrace. It looks like some sort of memory corruption is going on. So if you fix the quotes in the <script> tag, the segfault doesn't happen? It's interesting because the SIGSEGV isn't actually happening while working on the corrupt URL: we successfully get the 404 and then move on to the /shows-page/ URL, and the SIGSEGV doesn't occur until trying to convert 'www.[redacted]' to the current locale. Out of curiosity, what is the current locale of the machine?

A valgrind log ( https://wiki.ubuntu.com/Valgrind ) may help, but the bug may be difficult to track down without being easily reproducible.

Jim Salter (jrssnet) wrote :

That's correct Jason - when I fix the include in header.php to use normal quotes instead of typographic quotes, I can then mirror the entire site without segfaulting. Put the typographic quotes back in, and poof - segfault, at the same spot every time.

Interestingly, *every single page* has the bad include in it - so wget has already "mirrored" several of the 404s generated by the bad include by the time it segfaults. I don't know why. Some kind of corner case, pretty apparently. (But, again, wget 1.12 does not segfault at all. So it's a regression, not just a weird corner case nobody has ever seen before.)

Jim Salter (jrssnet) wrote :

Here's your valgrind result. Interestingly, it doesn't segfault and crash when run from valgrind - it just keeps on trucking. Nevertheless, you can certainly see where it WAS crashing, and tons of errors occurring there.

--2012-07-08 20:07:22-- http://www.[redacted]/%E2%80%9Dhttp://ajax.googleapis.com/ajax/libs/jquery/1.5/jquery.min.js%E2%80%9D
Reusing existing connection to www.[redacted]:80.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.[redacted]/%E2%80%9Dhttp:/ajax.googleapis.com/ajax/libs/jquery/1.5/jquery.min.js%E2%80%9D [following]
==10570== Invalid read of size 1
==10570== at 0x4C2BFA2: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10570== by 0x42BE73: remote_to_utf8 (iri.c:272)
==10570== by 0x427244: url_parse (url.c:700)
==10570== by 0x4248F9: retrieve_url (retr.c:794)
==10570== by 0x422734: retrieve_tree (recur.c:283)
==10570== by 0x4053BA: main (main.c:1388)
==10570== Address 0x636f0a0 is 0 bytes inside a block of size 94 free'd
==10570== at 0x4C2A82E: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10570== by 0x4248E2: retrieve_url (retr.c:791)
==10570== by 0x422734: retrieve_tree (recur.c:283)
==10570== by 0x4053BA: main (main.c:1388)
==10570==
==10570== Invalid read of size 1
==10570== at 0x4C2BFB4: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10570== by 0x42BE73: remote_to_utf8 (iri.c:272)
==10570== by 0x427244: url_parse (url.c:700)
==10570== by 0x4248F9: retrieve_url (retr.c:794)
==10570== by 0x422734: retrieve_tree (recur.c:283)
==10570== by 0x4053BA: main (main.c:1388)
==10570== Address 0x636f0a1 is 1 bytes inside a block of size 94 free'd
==10570== at 0x4C2A82E: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10570== by 0x4248E2: retrieve_url (retr.c:791)
==10570== by 0x422734: retrieve_tree (recur.c:283)
==10570== by 0x4053BA: main (main.c:1388)
==10570==
==10570== Invalid read of size 1
==10570== at 0x42BE7D: remote_to_utf8 (iri.c:274)
==10570== by 0x427244: url_parse (url.c:700)
==10570== by 0x4248F9: retrieve_url (retr.c:794)
==10570== by 0x422734: retrieve_tree (recur.c:283)
==10570== by 0x4053BA: main (main.c:1388)
==10570== Address 0x636f0a0 is 0 bytes inside a block of size 94 free'd
==10570== at 0x4C2A82E: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10570== by 0x4248E2: retrieve_url (retr.c:791)
==10570== by 0x422734: retrieve_tree (recur.c:283)
==10570== by 0x4053BA: main (main.c:1388)
==10570==
==10570== Invalid read of size 1
==10570== at 0x42BE8C: remote_to_utf8 (iri.c:274)
==10570== by 0x427244: url_parse (url.c:700)
==10570== by 0x4248F9: retrieve_url (retr.c:794)
==10570== by 0x422734: retrieve_tree (recur.c:283)
==10570== by 0x4053BA: main (main.c:1388)
==10570== Address 0x636f0a1 is 1 bytes inside a block of size 94 free'd
==10570== at 0x4C2A82E: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10570== by 0x4248E2: retrieve_url (retr.c:791)
==10570== by 0x422734: retrieve_tree (recur.c:283)
==10570== by 0x4053BA: main (main.c:1388)
==1057...
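
Each report in the log above has the same shape: `retrieve_url` frees a 94-byte block at retr.c:791, and a few lines later (retr.c:794) `url_parse` hands that same address to `remote_to_utf8`, which reads it. The sketch below is not wget's source; it only illustrates that general free-then-read pattern and one ordering that avoids it. `parse_len` and `follow_redirect` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parser that has to read its input,
   the way url_parse -> remote_to_utf8 -> strlen does in the trace.  */
static size_t
parse_len (const char *url)
{
  assert (url != NULL);
  return strlen (url);
}

/* Replace *url with a private copy of `location`, then parse it.
   Returns the parsed length, or 0 if allocation fails (in which case
   *url is left untouched).  */
size_t
follow_redirect (char **url, const char *location)
{
  size_t n;
  char *next;

  assert (url != NULL && location != NULL);

  /* Copy first, so this is safe even if `location` aliases *url.  */
  n = strlen (location) + 1;
  next = malloc (n);
  if (next == NULL)
    return 0;
  memcpy (next, location, n);

  /* Risky ordering (NOT done here): free (*url) and then pass the stale
     pointer to the parser.  That is the shape memcheck reports as
     "Invalid read ... inside a block ... free'd".  */
  free (*url);   /* release the old string ...                   */
  *url = next;   /* ... and drop the stale pointer straight away  */
  return parse_len (*url);
}
```

The point of the ordering is that no code path can see the old pointer after `free`; whether wget's eventual fix takes this form is not something the log can tell us.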

Jason Conti (jconti) wrote :

The troublesome code from the valgrind log in iri.c (and the only major change to iri.c since lucid) is from commit 2f6aa1d7417df1dfc58597777686fbd77179b9fd:

diff --git a/src/iri.c b/src/iri.c
index 08cfde4..9b16639 100644
--- a/src/iri.c
+++ b/src/iri.c
@@ -264,6 +264,21 @@ remote_to_utf8 (struct iri *i, const char *str, const char **new)
   if (!i->uri_encoding)
     return false;

+  /* When `i->uri_encoding' == "UTF-8" there is nothing to convert. But we must
+     test for non-ASCII symbols for correct hostname processing in `idn_encode'
+     function. */
+  if (!strcmp (i->uri_encoding, "UTF-8"))
+    {
+      int i, len = strlen (str);
+      for (i = 0; i < len; i++)
+        if ((unsigned char) str[i] >= (unsigned char) '\200')
+          {
+            *new = strdup (str);
+            return true;
+          }
+      return false;
+    }
+
   cd = iconv_open ("UTF-8", i->uri_encoding);
   if (cd == (iconv_t)(-1))
     return false;

Might be worth seeing if dropping that patch (which was only added to avoid converting to UTF-8 twice, and seems kind of unsafe) and rebuilding wget fixes the issue. If so, might be worth raising a bug upstream so they can work out a proper fix.
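
For reference, the scan the patch adds boils down to "does this string contain any byte with the high bit set?". Below is a standalone sketch of just that check; `has_non_ascii` is a hypothetical name that does not exist in wget. In the patch, the loop counter `int i` shadows the `struct iri *i` parameter inside the block, so the sketch walks a pointer instead. Going by the valgrind log, the invalid reads come from `str` pointing at memory the caller has already freed, so this illustrates the check rather than a fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of wget.
   Return true if `str` contains any byte >= 0x80, which is the value
   the patch writes as '\200' in octal.  */
bool
has_non_ascii (const char *str)
{
  const unsigned char *p;

  assert (str != NULL);
  for (p = (const unsigned char *) str; *p != '\0'; p++)
    if (*p >= 0x80)
      return true;
  return false;
}
```

A plain hostname such as `www.example.com` returns false, while a path containing the UTF-8 bytes of a curly quote returns true.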

Jason Conti (jconti) wrote :

I have prepared a test package at https://launchpad.net/~jconti/+archive/testing/+files/wget_1.13.4-2ubuntu2_amd64.deb with the above patch reverted ( debdiff: https://launchpad.net/~jconti/+archive/testing/+files/wget_1.13.4-2ubuntu1_1.13.4-2ubuntu2.diff.gz ), so that we might determine if that is indeed the code segment causing trouble. (Should have versioned the package with ~ppa at the end, but forgot, oh well).

There should be no harm in dropping the code, since it is just an attempt to avoid encoding the string twice (and I have been running wget without the patch without any issues today). If you cannot reproduce the issue, then it would seem this is definitely an upstream bug. I'm not really certain how best to fix it, so probably best to raise the issue with them.

Jim Salter (jrssnet) wrote :

Jason, I downloaded your deb, used dpkg-deb -x to extract the files, and ran the binary directly - no segfault. So, confirmed: the patch you reverted is the culprit.

Jason Conti (jconti) wrote :

Excellent, thanks for testing. I have forwarded the bug upstream at http://savannah.gnu.org/bugs/index.php?36823 and hopefully we have enough information to work out a fix.

Jim Salter (jrssnet) wrote :

Thank you Jason! I appreciate your help. =)

Giuseppe Scrivano (gscrivano) wrote :

Could you please point me to a web page where this bug is easier to trigger? I am going to roll out a new release of wget in the next few weeks and I would like to get this fixed upstream before then.

Thanks!

Jim Salter (jrssnet) wrote :

Giuseppe, I'm sorry, but this is a pretty odd corner case I've had trouble reproducing outside a proprietary site I can't give anybody else access to. I would like to help, though - if you can give me a binary for the new version which will run on Ubuntu Precise, I can re-break the dev copy of the site in the same way (wget only segfaulted on the "broken" version of the site, which used typographic quotes in a <script src> instead of standard quotes) and test your new version against it.
