lxml.html.tostring() incorrectly applies urlencoding to top level domain names
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| lxml | Invalid | Undecided | Unassigned | |
Bug Description
When processing URLs, the `lxml.html.
A minimal example to reproduce this issue:
```
In [1]: from lxml.html import fromstring, tostring
In [2]: element = fromstring("<a href='https:/
In [3]: tostring(element, encoding="unicode")
Out[3]: '<a href="https:/
```
While the correct encoding would be `https:/
If somebody could point me in the right direction, I would be happy to contribute a fix for this.
```
Python           : sys.version_
lxml.etree       : (5, 1, 0, 0)
libxml used      : (2, 12, 3)
libxml compiled  : (2, 12, 3)
libxslt used     : (1, 1, 39)
libxslt compiled : (1, 1, 39)
```
Hmm. Interesting request. I understand where you're coming from. However, it's not lxml that does the parsing and serialising here, but libxml2.
In general, I think the parser library cannot know whether you already encoded your URL according to your needs, so it cannot easily decide what to do. Even correctly understanding the URL scheme and treating it accordingly would require a lot of knowledge in the serialiser, and bears a risk of accidental double encoding, which would be difficult for users to work around.
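The double-encoding risk mentioned above is easy to demonstrate with the standard library. A minimal sketch (the URL here is made up for illustration):

```python
from urllib.parse import quote

# Hypothetical URL containing a space that needs percent-encoding.
url = "https://example.com/a b"

once = quote(url, safe=":/")    # first pass: the space becomes %20
twice = quote(once, safe=":/")  # second pass: %20 becomes %2520

print(once)   # https://example.com/a%20b
print(twice)  # https://example.com/a%2520b
```

If the serialiser encoded URLs itself, any URL the user had already encoded would end up mangled like `twice` above, and the original could not be recovered reliably.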
Finally, it's not even clear that the behaviour is wrong. Browsers should be perfectly capable of parsing URLs from a UTF-8 encoded HTML file and passing them correctly into an HTTP request.
Overall, I don't think lxml or libxml2 should change the URLs they get. It seems better, safer and more versatile to leave the encoding (or not) to the users.
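One way to follow that advice is to pre-encode URLs before handing them to lxml, so the serialiser has nothing left to change. A rough sketch using only the standard library (the `pre_encode` helper and the example domain are hypothetical; the helper ignores ports and userinfo for brevity):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def pre_encode(url: str) -> str:
    # Hypothetical helper: IDNA-encode the hostname and percent-encode
    # the path, leaving already-encoded sequences (%xx) untouched.
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

# Non-ASCII host and path are converted to their wire forms up front.
print(pre_encode("https://münchen.example/straße"))
# https://xn--mnchen-3ya.example/stra%C3%9Fe
```

Feeding the already-encoded result into `fromstring()`/`tostring()` sidesteps the question of what the serialiser should do with raw non-ASCII URLs.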