lxml.html.tostring() incorrectly applies urlencoding to top level domain names

Bug #2051597 reported by Timo Brembeck
This bug affects 1 person
Affects: lxml · Status: Invalid · Importance: Undecided · Assigned to: Unassigned

Bug Description

When processing URLs, the `lxml.html.tostring()` method applies urlencoding to UTF-8 strings, which makes sense for the path part of the URL. For the domain part, however, it doesn't make sense: domain names can only contain ASCII characters, so the correct encoding for UTF-8 characters in domain names is punycode.

A minimal example to reproduce this issue:
```
In [1]: from lxml.html import fromstring, tostring

In [2]: element = fromstring("<a href='https://www.bafög.de'>https://www.bafög.de</a>")

In [3]: tostring(element, encoding="unicode")
Out[3]: '<a href="https://www.baf%C3%B6g.de">https://www.bafög.de</a>'
```

While the correct encoding would be `https://www.xn--bafg-7qa.de`.
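For illustration, the expected form can be produced with the Python standard library alone. This is a minimal sketch (the helper name `punycode_host` is hypothetical, not an lxml API), using `urllib.parse` and the built-in `idna` codec:

```
from urllib.parse import urlsplit, urlunsplit

def punycode_host(url: str) -> str:
    """Re-encode only the hostname of a URL as punycode (IDNA),
    leaving scheme, path, query and fragment untouched.
    Sketch only: ignores userinfo and assumes a hostname is present."""
    parts = urlsplit(url)
    # The built-in "idna" codec applies ToASCII to each DNS label.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(punycode_host("https://www.bafög.de"))  # https://www.xn--bafg-7qa.de
```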

If somebody could point me in the right direction, I would be happy to contribute a fix for this.

Python : sys.version_info(major=3, minor=11, micro=6, releaselevel='final', serial=0)
lxml.etree : (5, 1, 0, 0)
libxml used : (2, 12, 3)
libxml compiled : (2, 12, 3)
libxslt used : (1, 1, 39)
libxslt compiled : (1, 1, 39)

Revision history for this message
scoder (scoder) wrote : Re: [Bug 2051597] [NEW] lxml.html.tostring() incorrectly applies urlencoding to top level domain names

Hmm. Interesting request. I understand where you're coming from. However, it's not lxml that does the parsing and serialising here, but libxml2.

In general, I think the parser library cannot know whether you already encoded your URL according to your needs, so it cannot easily decide what to do. Even correctly understanding the URL scheme and treating it accordingly would require a lot of knowledge in the serialiser, and bears a risk of accidental double encoding, which would be difficult for users to work around.

Finally, it's not even clear that the behaviour is wrong. Browsers should be perfectly capable of parsing URLs from a UTF-8 encoded HTML file and passing them correctly into an HTTP request.

Overall, I don't think lxml or libxml2 should change the URLs they get. It seems better, safer and more versatile to leave the encoding (or not) to the users.

Revision history for this message
Timo Brembeck (timo-42) wrote :

@scoder thanks a lot for the quick response and your detailed feedback!

I agree with the point that parsing the URL would probably get very messy and would require considering a lot of edge cases, which is probably out of scope for an HTML serializer.

However, I would still consider the current behavior a bug: even if browsers can understand practically anything, whether or not it complies with the RFC, other libraries might not be as forgiving.
For example, the requests library throws an error if I try to fetch that URL:

```
In [1]: import requests

In [2]: requests.get("https://www.baf%C3%B6g.de")

...

ConnectionError: HTTPSConnectionPool(host='www.baf%c3%b6g.de', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f60c12295d0>: Failed to establish a new connection: [Errno -2] Name or service not known'))

```

So in the end, I think no encoding would be better than urlencoding. Should I open another issue on libxml2's side then to propose leaving the URLs unchanged? (Or maybe at least provide an option to turn urlencoding off?)
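As a stopgap, a URL whose host has already been percent-encoded this way can be repaired before handing it to requests. A minimal stdlib sketch (the helper name is hypothetical), assuming the percent-escapes decode as UTF-8:

```
from urllib.parse import unquote, urlsplit, urlunsplit

def fix_percent_encoded_host(url: str) -> str:
    """Undo percent-encoding in the hostname and apply IDNA/punycode
    instead, so HTTP clients can resolve it. Sketch only: ignores
    userinfo and port."""
    parts = urlsplit(url)
    # Note: urlsplit's .hostname lowercases the raw host, which is
    # harmless here because unquote() accepts lowercase hex escapes.
    host = unquote(parts.hostname).encode("idna").decode("ascii")
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

print(fix_percent_encoded_host("https://www.baf%C3%B6g.de"))
# https://www.xn--bafg-7qa.de
```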

Revision history for this message
Timo Brembeck (timo-42) wrote :

For reference, here is the ticket I created for libxml2:

https://gitlab.gnome.org/GNOME/libxml2/-/issues/674

Revision history for this message
scoder (scoder) wrote :

Closing as third-party issue (libxml2).

Changed in lxml:
status: New → Invalid
