Comment 1 for bug 2051597

scoder (scoder) wrote : Re: [Bug 2051597] [NEW] lxml.html.tostring() incorrectly applies urlencoding to top level domain names

Hmm. Interesting request. I understand where you're coming from. However, it's not lxml that does the parsing and serialising here, but libxml2.

In general, I think the parser library cannot know whether you have already encoded your URL according to your needs, so it cannot easily decide what to do. Even correctly understanding the URL scheme and treating it accordingly would require a lot of knowledge in the serialiser, and it bears a risk of accidental double encoding, which would be difficult for users to work around.
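To illustrate the double-encoding risk, here is a minimal sketch using only Python's standard library. The `naive_escape` function is hypothetical and stands in for a serialiser that escapes blindly; it is not how libxml2 or lxml actually work.

```python
from urllib.parse import quote

def naive_escape(url):
    # Hypothetical serialiser step: percent-encode everything outside a
    # small safe set, without checking whether the input is already escaped.
    return quote(url, safe=":/")

raw = "http://example.com/käse"
once = naive_escape(raw)    # the non-ASCII character becomes %C3%A4
twice = naive_escape(once)  # the '%' itself is escaped again as %25

print(once)
print(twice)
```

The second call cannot tell an existing escape from a literal percent sign, so it corrupts a URL that the user had already encoded correctly.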

Finally, it's not even clear that the behaviour is wrong. Browsers should be perfectly capable of parsing URLs from a UTF-8 encoded HTML file and passing them correctly into an HTTP request.

Overall, I don't think lxml or libxml2 should change the URLs they get. It seems better, safer and more versatile to leave the encoding (or not) to the users.
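For users who do want control over the encoding, one option is to encode URLs themselves before putting them into the tree. The following is only a sketch under assumptions: `encode_host_and_path` is a hypothetical helper name, it uses the standard library's `idna` codec for the host and percent-encoding for the path, and it ignores user info and leaves the query and fragment untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_host_and_path(url):
    # Hypothetical helper: make the host and path ASCII-only.
    parts = urlsplit(url)
    # IDNA (punycode) form of the host name; user info is dropped here.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Keep '%' safe so that existing escapes are not encoded a second time.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))

print(encode_host_and_path("https://bücher.example/käse"))
```

With the host and path already in ASCII form, the result no longer depends on how the serialiser treats non-ASCII characters in those parts, and the choice of encoding stays with the user.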