lxml

Bug #2051597
Comment #2

Timo Brembeck (timo-42) wrote on 2024-01-30:

@scoder thanks a lot for the quick response and your detailed feedback!

I agree that parsing the URL would probably get very messy, with a lot of edge cases to consider, and that this is probably out of scope for an HTML serializer.

However, I would still consider the current behavior a bug: browsers may understand practically anything, even input that doesn't comply with the RFC, but other libraries might not be that forgiving.
For example, the requests library raises an error if I try to fetch that URL:

```
In [1]: import requests

In [2]: requests.get("https://www.baf%C3%B6g.de")

...

ConnectionError: HTTPSConnectionPool(host='www.baf%c3%b6g.de', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f60c12295d0>: Failed to establish a new connection: [Errno -2] Name or service not known'))

```

So in the end, I think not encoding at all would be better than URL-encoding. Should I open another issue on the libxml2 side to propose leaving URLs unchanged? (Or at least to provide an option to turn URL-encoding off?)
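
To illustrate why the percent-encoded host fails to resolve, here is a minimal sketch using only the Python standard library. It contrasts percent-encoding the host name with IDNA-encoding it, which is the form DNS expects. The helper name `to_ascii_url` is hypothetical and only for illustration; it is not part of lxml, libxml2, or requests, and it assumes Python's built-in `idna` codec (IDNA 2003) is good enough for this host name.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Hypothetical helper: IDNA-encode only the host, keep the rest as-is.

    Simplification: userinfo (user:password@) is dropped if present.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # The stdlib "idna" codec converts each non-ASCII label to its
    # "xn--" (Punycode) form, which DNS resolvers can look up.
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


original = "https://www.bafög.de"

# Percent-encoding the host yields the name from the error above,
# which is not a resolvable DNS name.
percent_encoded_host = quote("www.bafög.de")
print(percent_encoded_host)

# IDNA-encoding yields an ASCII host name instead.
idna_url = to_ascii_url(original)
print(idna_url)
```

Percent-encoding is meant for the path and query parts of a URL, while internationalized host names need the IDNA form, so leaving the host untouched would at least let the consumer apply the right encoding itself.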