unable to access URLs (doesn't recognize dash "-" in URL)

Bug #121467 reported by Guillaume Lecomte on 2007-06-21
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
Mozilla Firefox | Fix Released | Medium | |
firefox (Ubuntu) | | Low | Mozilla Bugs |
firefox-3.0 (Ubuntu) | | Undecided | Unassigned |

Bug Description

I can't access this page: http://gui-.deviantart.com on Ubuntu (Firefox, Opera, Epiphany, ...). On Windows it works in all browsers.

In addition, the backquote character (0x60) appears to have been let through in both the code and the comment. Is there any particular reason why this character should be allowed in DNS lookups, or is it just being let through by default?

I propose that we block it, unless there's a good reason for keeping this in, in which case, again, the reason for doing so should be documented.

Just for completeness' sake, I wrote a small program to double-check. The characters being allowed appear to be:

$ + - . 0 1 2 3 4 5 6 7 8 9 : A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ ] _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z }

removing the LDH character set and the dot leaves the following:

$ + : [ ] _ ` }

I understand that some people might be using '$', '_', and '+' in non-RFC-compliant machine names, and that we don't want to break them, and that '[', ':', and ']' are used in IPv6 addresses, and we don't want to break that either.

But I cannot think of any reason to include '`' or '}'.

Currently, the list is phrased as a blacklist, with 50 entries. That means that every character will have to be scanned over all 50 entries.

Given that I had to squint at a table of hex digits to find the above, it would surely be more self-documenting and more secure to have a whitelist, with (I propose) the following 70 entries:

'$+-.0123456789:[]_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

This would also have the advantage of _explicitly_ excluding oddities such as NULs and non-ASCII bytes (values above 127).

In addition, if a linear scan is used, and the characters are in a sensible order (eg lowercase letters before uppercase), then even performance can be slightly better than before, since there is no need to scan through the entire list in order to admit a valid character in the common case, and most searches will terminate after scanning at most 44 characters in the list (and perhaps half that on average).
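To make the proposal concrete, here is a minimal sketch of a whitelist check with a linear scan, lowercase letters first. It is not the attached patch: the names kAllowedHostChars, IsAllowedHostChar and IsValidHostNameSketch are made up for illustration, and the character order is just one plausible choice.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical whitelist (70 entries). Lowercase letters come first so that
// the common case, an all-lowercase hostname, matches early in the scan.
static const char kAllowedHostChars[] =
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-."
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "$+_:[]";

// Returns true if c is in the whitelist. The length passed to memchr
// excludes the terminating NUL, so '\0' is rejected rather than matched.
bool IsAllowedHostChar(char c) {
    return std::memchr(kAllowedHostChars, c,
                       sizeof(kAllowedHostChars) - 1) != nullptr;
}

// Returns true only if host is non-empty and every byte of host[0..len)
// is whitelisted. Bytes with the top bit set are never in the list.
bool IsValidHostNameSketch(const char* host, std::size_t len) {
    if (len == 0) {
        return false;
    }
    for (std::size_t i = 0; i < len; ++i) {
        if (!IsAllowedHostChar(host[i])) {
            return false;
        }
    }
    return true;
}
```

With this sketch, "gui-.deviantart.com" passes (it only checks characters, not label syntax), while any string containing a backquote, a close-curly-brace, a NUL or a UTF-8 byte is refused.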

do we really need to allow $ ?

Created attachment 241132
Patch to implement the changes described: not smoketested yet

The original comment should of course read netwerk/base/src/nsURLHelper.cpp

The attached patch:
* makes the list an explicit whitelist, as per comment #3
* removes the close-curly-brace and backquote characters from the allowed list, as per comment #2

It seems to work OK for both normal ASCII domain names and IDNs, and detects the characters not in the list.

NB: it has not been smoketested yet.

Following on with the whitelist idea: we probably should have two different whitelists:

For DNS names and IPv4 dotted quads:
   $ + - . 0 1 2 3 4 5 6 7 8 9 _
   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
   a b c d e f g h i j k l m n o p q r s t u v w x y z

For IPv6 literals:
   [ : ] 0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

Doing this will further reduce the number of combinations of characters which might be used for spoofing, particularly in edge cases involving exotic Unicode characters and Unicode normalization interactions. The checks should be carried out in the order given, which will make the common case the fast case.
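A rough sketch of the two-list check, tried in the order given. The function name IsPlausibleHostSketch is hypothetical, and the check only looks at character sets: it does not validate IPv6 syntax or DNS label structure.

```cpp
#include <string>

// Character sets copied from the two lists above.
static const char kDnsChars[] =
    "$+-.0123456789_"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz";
static const char kIPv6Chars[] = "[:]0123456789abcdefABCDEF";

// Returns true if every character of host belongs to the DNS/IPv4 list,
// or every character belongs to the IPv6-literal list. Mixing the two
// (for example '[' next to 'g') is rejected.
bool IsPlausibleHostSketch(const std::string& host) {
    if (host.empty()) {
        return false;
    }
    // Common case first: ordinary DNS names and dotted quads.
    if (host.find_first_not_of(kDnsChars) == std::string::npos) {
        return true;
    }
    // Fall back to the IPv6 literal character set.
    return host.find_first_not_of(kIPv6Chars) == std::string::npos;
}
```

Because each list is matched as a whole, a name cannot borrow '[' or ':' from the IPv6 list while also using letters outside a-f.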

Created attachment 241137
Patch to implement the dual-whitelist approach: not smoketested yet

This implements the dual-whitelist approach of comment #6. Note that the earlier changes of disallowing backquote and close-curly-brace still apply.

NB Not smoketested

One more comment: on the face of it, the original blacklist version of the code could have leaked DNS lookups containing characters with the top bit set, since they were not explicitly forbidden.

It's important that this should not happen (consider, for example, the possibility of leaking UTF-8 strings) -- the current version fixes this.

John Vivirito (gnomefreak) wrote :

Thank you for reporting this bug to us.
Is that the only page you can't access?
What browsers are you using on Windows to access it?

Changed in firefox:
importance: Undecided → Low
John Vivirito (gnomefreak) wrote :

I'm wondering if the reason you can't access the site is due to this line:
Mozilla/5.0 (X11; U; Linux i686;
I will look into this more tomorrow.

Alex (jelly) wrote :

I was with Guillaume on the forum. And noticed the same on my Feisty.

I tried with these web browsers, and got the same result:

Opera: V.9.21 ; Firefox V.2.0.0.4 ; Epiphany V.2.18.1 ; Konqueror V.3.5.6

On Windows these ones work (under VMware in my case):

Firefox V.2.0.0.4 ; Opera: V.9.21 ; I.E. V.6.0

But IE under Wine cannot reach the URL, as noticed by someone else on the forum.

I tried to get the page with wget and it doesn't work either. On Windows, Firefox works. Is it correct to assign the bug to the firefox package?

i--.deviantart.com has the same problem. I guess it concerns all URLs with "-." in the server address.

John Vivirito (gnomefreak) wrote :

I'm not so sure it's the browsers, to be honest. It's very odd that a website will not open with any browser. For the moment we will leave it against firefox until I can figure out what to do with it. I will look into it more on Monday or Tuesday and see what I can come up with.

marcus84 (marcrios84) wrote :

I think this may be interesting: I cannot ping any address containing '-' in the server name.
The error is 'unknown host'.

I have tried another thing:

From Mac OS X I got the IP of http://gui-.deviantart.com
without any problem, of course. The IP is 209.85.51.247 at this moment.

If I ping this IP from Gutsy I get a response,
but if I put this IP in Firefox I still cannot see the page.

I hope this is useful...

David Blanco (dablanco) wrote :

I think this is not a bug related to any browser or to the operating system. AFAIK using a "-" before the dot is not allowed by at least RFC 1035 (Domain names - implementation and specification), which says:

"The labels must follow the rules for ARPANET host names. They must
start with a letter, end with a letter or digit, and have as interior
characters only letters, digits, and hyphen. There are also some
restrictions on the length. Labels must be 63 characters or less."

It's clear to me that "-whatever-.dot.com" is an illegal host name and therefore must not be resolved. The fact is that any application that relies on C's "gethostbyname" call for name resolution (on Linux: ping, dig, nslookup, and the like...) will fail with illegal host names, which IMHO is the right behaviour.

Please someone let me know if I am wrong. Anyway, it's not a firefox issue :)

Greetings from Spain
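The label rules quoted from RFC 1035 above can be sketched as a small checker. This is only an illustration with made-up function names, not what the resolver actually runs; it also does not model the later relaxation in RFC 1123 that lets a label start with a digit, and it rejects a trailing dot.

```cpp
#include <string>

bool IsAsciiLetter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

bool IsAsciiDigit(char c) {
    return c >= '0' && c <= '9';
}

// One label: starts with a letter, ends with a letter or digit, interior
// characters are letters, digits or hyphen, and length is 1..63.
bool IsRfc1035Label(const std::string& label) {
    if (label.empty() || label.size() > 63) {
        return false;
    }
    if (!IsAsciiLetter(label[0])) {
        return false;
    }
    const char last = label[label.size() - 1];
    if (!IsAsciiLetter(last) && !IsAsciiDigit(last)) {
        return false;
    }
    for (std::string::size_type i = 0; i < label.size(); ++i) {
        const char c = label[i];
        if (!IsAsciiLetter(c) && !IsAsciiDigit(c) && c != '-') {
            return false;
        }
    }
    return true;
}

// A host name is valid here if every dot-separated label is valid.
bool IsRfc1035HostName(const std::string& host) {
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type dot = host.find('.', start);
        const std::string label =
            (dot == std::string::npos) ? host.substr(start)
                                       : host.substr(start, dot - start);
        if (!IsRfc1035Label(label)) {
            return false;
        }
        if (dot == std::string::npos) {
            return true;
        }
        start = dot + 1;
    }
}
```

Under these rules "gui-.deviantart.com" and "i--.deviantart.com" are rejected because their first label ends with a hyphen, while "www.deviantart.com" is accepted.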

So I will never be able to access these pages under Linux? Does anybody have an idea how to access them? For the moment, I'm launching a virtual Windows XP to view these pages, which is not very fast...

ATorre (aedelatorre) wrote :

It is not a bug. GNU/Linux works properly and follows the standard:

RFC DOMAIN NAMES - CONCEPTS AND FACILITIES

tools.ietf.org/html/rfc1034

"The labels must follow the rules for ARPANET host names. They must
start with a letter, end with a letter or digit, and have as interior
characters only letters, digits, and hyphen. There are also some
restrictions on the length. Labels must be 63 characters or less."

moo (zrsi30ur15) wrote :

Guillaume86: yes, you can, but you have to contact the web hosting company of the page so that they change the address to a correct one. It is their fault if they use characters that shouldn't be allowed and break compatibility (avoiding that is the goal of RFCs and standards).

Actually it's deviantart who doesn't restrict the usernames of its members
to the correct rules. I hope they can transfer my data to another
profile...


as found in discussion, this isn't a firefox bug.

Changed in firefox:
status: Incomplete → Invalid
Alexander Sack (asac) wrote :

and neither a firefox 3.0 one.

Changed in firefox-3.0:
status: New → Invalid

*** Bug 377808 has been marked as a duplicate of this bug. ***

If we're going to fix this (and we should) we need to get it into a beta to make sure no one with any traffic is using '[' or the other odd characters. Nominating for blocking.

(And if we don't switch to using a whitelist for Firefox 3, we should land the patch in bug 377808 instead to fix the blacklist.)

Moving to b4 - Michal can you take this?

Created attachment 304724
whitelist for checking host names

This patch changes blacklist to whitelist in net_IsValidHostName(). I didn't include second whitelist from patch #241137 because it isn't RFC 4007 compliant. PR_StringToNetAddr() should check it correctly.

Checking strings against a set of characters in net_FindCharNotInSet and net_FindCharInSet is highly inefficient. This is probably not a real bottleneck, because it isn't called too often. Checking against a bitmap would probably be ideal, but it would be hard to read. What about introducing ranges? Something like:

char *
net_FindCharNotInRangesAndSet(const char *iter, const char *stop, const char *ranges, const char *set);

Checking would be done against ranges first and then against set.

Set "$+-.0123456789_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" can be replaced by ranges "az09-.AZ" and set "$+_" and it should be faster.

Any opinions?
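For discussion, a possible shape for the ranges-plus-set idea. This is a sketch, not a patch: it assumes `ranges` is a NUL-terminated string of (low, high) character pairs, that the function returns `stop` when every character matches, and it returns const char* rather than char* to avoid a cast.

```cpp
#include <cstring>

// Returns a pointer to the first character in [iter, stop) that is neither
// inside one of the ranges nor in the set, or stop if there is none.
// ranges: pairs of (low, high) characters, e.g. "az09-.AZ".
// set:    extra individual characters, e.g. "$+_".
const char* net_FindCharNotInRangesAndSet(const char* iter, const char* stop,
                                          const char* ranges, const char* set) {
    for (; iter != stop; ++iter) {
        const unsigned char c = static_cast<unsigned char>(*iter);
        bool allowed = false;
        // Ranges are checked first, as proposed above.
        for (const char* r = ranges; r[0] != '\0' && r[1] != '\0'; r += 2) {
            if (c >= static_cast<unsigned char>(r[0]) &&
                c <= static_cast<unsigned char>(r[1])) {
                allowed = true;
                break;
            }
        }
        // strchr would match the terminating NUL, so exclude '\0' explicitly.
        if (!allowed && c != '\0' && std::strchr(set, c) != nullptr) {
            allowed = true;
        }
        if (!allowed) {
            return iter;
        }
    }
    return stop;
}
```

The pair "-." works as a range because '-' (0x2D) and '.' (0x2E) are adjacent in ASCII. A lowercase hostname is accepted after a single range comparison per character.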

Comment on attachment 304724
whitelist for checking host names

+ // Whitelist for DNS names (RFC 1035) with extra characters added
+ // for pragmatic reasons: this will also match IPv4 dotted quads.

What characters did you add for IPv4 dotted quads?

also, claiming that "the common characters are near the start of the list" and then starting the list with $ doesn't seem right...

Created attachment 308398
fixed comment and order of some characters in whitelist

(In reply to comment #15)
> What characters did you add for IPv4 dotted quads?

Good question. I just took the whitelist with its comment from the previous patch. Every IPv4 dotted quad can be described by characters allowed in RFC 1035. The added characters were IMO "$+_", but the comment was misleading. In the new patch I changed the comment and also changed the order of characters.

Comment on attachment 308398
fixed comment and order of some characters in whitelist

+ // willing to send to lower-level DNS logic This is more

missing a dot here

+ // the commonest characters will tend to be near the start of
+ // the list.

the most common characters in a hostname are lowercase alphanumeric letters IMO, not dots and digits.

Created attachment 308413
fixed according to hints in comment #12

Comment on attachment 308413
fixed according to hints in comment #12

you don't have to ask for review again if I had marked review+ and all you did was fix the things I mentioned

(In reply to comment #20)
> (From update of attachment 308413 [details])
> you don't have to ask for review again if I had marked review+ and all you did
> was fix the things I mentioned
>

Sorry, but I also added the keyword "checkin-needed" because I don't have checkin privileges. So I also wanted to have + for the final patch. Can somebody check it in, please?

Checking in netwerk/base/src/nsURLHelper.cpp;
/cvsroot/mozilla/netwerk/base/src/nsURLHelper.cpp,v <-- nsURLHelper.cpp
new revision: 1.71; previous revision: 1.70
done

It's not just that URL, but all URLs with a dash before a dot.

Changed in firefox:
status: Unknown → Fix Released

I had a similar problem with a web page that used - in the data portion of the url, eg: http://server/file.php?-list
The server reported it as http://server/file.php?%E2%80%93list

I then typed it in using the - key from the keypad instead of the one by the backspace and it worked just fine.
Also, after using the keypad I found I could use the normal - and it worked fine until I rebooted. After that, it went back to the %E2%80%93 until I used the keypad again. Very odd.

An addendum to my previous comment:
I went and compared the two dashes and found the unicode values for the two dashes.
–(U+2013 EN-DASH) is what the system starts off using until you use the keypad. This is invalid in URLs.
-(U+002D HYPHEN-MINUS) is what the keypad types. This is valid in URLs.
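To show where %E2%80%93 comes from: U+2013 is encoded in UTF-8 as the three bytes E2 80 93, and each byte that is not an unreserved URL character gets percent-encoded. A minimal sketch (the function name PercentEncodeSketch is made up, and real browsers apply more context-dependent rules than this):

```cpp
#include <string>

// Percent-encodes every byte except the RFC 3986 unreserved characters
// (ASCII letters, digits, '-', '.', '_', '~').
std::string PercentEncodeSketch(const std::string& in) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        const bool unreserved =
            (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        }
    }
    return out;
}
```

Passing the UTF-8 bytes of an en dash followed by "list" (written in C++ as "\xE2\x80\x93" "list") yields "%E2%80%93list", whereas "-list" with a plain hyphen-minus comes back unchanged.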

My code-reading skills are modest, so I'll ask rather than risk an invalid bug.

I was playing with some ad hoc testing of this function and I found that something like: www.host\.com fails with "hostname not found" error page.

I *think* it's because this function is used like this:

http://mxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsHostResolver.cpp

// ensure that we are working with a valid hostname before proceeding. see
// bug 304904 for details.
if (!net_IsValidHostName(nsDependentCString(host)))
    return NS_ERROR_UNKNOWN_HOST;

Should we have a different error condition for illegal/invalid hostnames?

benc, I think you're right. The "Try Again" button isn't very useful in that situation either. Can you file a new bug?

petski (petski) wrote :

Although the RFC says otherwise, it seems the Firefox developers made a fix: http://mxr.mozilla.org/firefox/source/netwerk/base/src/nsURLHelper.cpp#925

Ofir Klinger (klinger-ofir) wrote :

I don't think that it is a fix since, as mentioned here (https://bugzilla.mozilla.org/show_bug.cgi?id=196852#c5), the source of the problem is in the Linux resolver. Therefore, if Firefox doesn't bypass the resolver, this bug won't be fixed.

P.S.
I don't really know what that code does, but as far as I understand, it doesn't bypass the resolver...

petski (petski) wrote :

IMHO that's not totally true. I've tried to access the page while running a sniffer (port 53) in the background. You can see the host gets queried and the response is valid as well.

Ofir Klinger (klinger-ofir) wrote :

Is there any way of testing the fix? (Will nightly builds include the fix?)

I want to test it, so we will know if it solves the bug.

mix (mixmadmen) wrote :

Hello,
there is a fix for this, and it works on ANY website, not just Deviant Art.
It seems that DA is one of the few major communities (or the only one?) that does not follow the rules for ARPANET host names in this regard.
Deviant Art won't correct it since so many users already have names in this format, but they have a workaround. You might want to try reading their FAQ in the future:

http://help.deviantart.com/664/

Hope this helps.
PS, yes, it works on Ubuntu. Tested on Hardy without trouble.

Changed in firefox:
importance: Unknown → Medium
Enkouyami (furyhamster) wrote :

mix, the link isn't a fix at all. It just states this: "This is normal as they're not standard host naming (aka considered illegal char naming). In IE such usually works, however not in Firefox or on Linux.

To stop this we no longer allow such naming of accounts.

If you have an account with such, you may wish to create a new username. "

Why hasn't this problem been fixed yet? Firefox, google chrome, Safari, and IE allow it in Windows, so why isn't it allowed in Ubuntu?

PS, I'm using Ubuntu 10.10 Maverick Meerkat
