Comment 9 for bug 1949271

Tom Ritchford (tom.swirly) wrote :

I can also reproduce this on lxml 4.9.2, macOS 12.6.2, Python 3.11.1, with a very small example.

This took me too long to figure out. The fact that `etree.HTML(s)` sometimes returns `None` and sometimes returns a broken HTML tree is confusing.

**There's a near-trivial workaround: call `.encode()` on any `str` that you pass in.**
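
If the workaround is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch: `as_parser_input` and `parse_html` are hypothetical names, not part of lxml, and raising `ValueError` on a `None` result is an assumption about how a caller might want the failure reported.

```python
def as_parser_input(markup):
    """Return `markup` as bytes, encoding `str` input as UTF-8.

    Hypothetical helper: it applies the `.encode()` workaround so that
    etree.HTML() is never handed a `str`.
    """
    if isinstance(markup, str):
        return markup.encode("utf-8")
    return markup


def parse_html(markup):
    """Parse HTML with lxml, always passing bytes to the parser."""
    # Imported here so that as_parser_input() has no third-party dependency.
    from lxml import etree

    root = etree.HTML(as_parser_input(markup))
    if root is None:
        # Assumption: fail loudly instead of returning None to the caller.
        raise ValueError("etree.HTML() returned None")
    return root
```

With this in place, `parse_html(p2)` and `parse_html(p2.encode())` take the same code path, because both reach `etree.HTML()` as bytes.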

Here's the code to reproduce, which also shows the workaround.

    from lxml import etree

    def round_trip(s):
        html = etree.HTML(s)
        assert html is not None, 'etree.HTML(s) returned None!'

        result = etree.tostring(html, pretty_print=True, method="html")

        print('Result:')
        print(result.decode())

    def compare_both():
        p1 = '<html><head><title>ANT</title></head><body></body></html>'

        # Works
        round_trip(p1)
        round_trip('\n' + p1)

        p2 = '<html><head><title>🐜</title></head><body></body></html>'

        # The workaround!
        round_trip(p2.encode())
        round_trip(('\n' + p2).encode())

        # Fails
        round_trip(p2)  # Wrong answer: the markup is mangled
        round_trip('\n' + p2)  # etree.HTML(s) returns None

    compare_both()

Here are the results:

    Result:
    <html>
    <head><title>ANT</title></head>
    <body></body>
    </html>

    Result:
    <html>
    <head><title>ANT</title></head>
    <body></body>
    </html>

    Result:
    <html>
    <head><title>&#240;&#159;&#144;&#156;</title></head>
    <body></body>
    </html>

    Result:
    <html>
    <head><title>&#240;&#159;&#144;&#156;</title></head>
    <body></body>
    </html>

    Result:
    <html><body><p>h t m l &gt; </p></body></html>

    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "/Users/tom/synthetic/code/multi/multi/tweak_index.py", line 93, in <module>
        compare_both()
      File "/Users/tom/synthetic/code/multi/multi/tweak_index.py", line 64, in compare_both
        round_trip('\n' + p2) # Returns None
        ^^^^^^^^^^^^^^^^^^^^^
      File "/Users/tom/synthetic/code/multi/multi/tweak_index.py", line 41, in round_trip
        assert html is not None, 'etree.HTML(s) returned None!'
               ^^^^^^^^^^^^^^^^
    AssertionError: etree.HTML(s) returned None!
            0.03 real 0.02 user 0.00 sys
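
One detail in the workaround output is worth checking: the title comes back as `&#240;&#159;&#144;&#156;`, four character references rather than one. Those four numbers are the UTF-8 byte values of the ant emoji, which suggests the parser read each byte as a separate character. The sketch below confirms the arithmetic using only the standard library, then shows one way to declare the encoding explicitly. That `etree.HTMLParser(encoding="utf-8")` avoids the problem on the affected versions is an assumption and has not been verified here; `char_refs_if_misdecoded` and `parse_utf8_html` are hypothetical names.

```python
ANT = "\U0001F41C"  # the ant emoji used in p2


def char_refs_if_misdecoded(text):
    """Character references produced if each UTF-8 byte is read as one character."""
    return "".join(f"&#{byte};" for byte in text.encode("utf-8"))


def parse_utf8_html(markup):
    """Parse HTML with a parser that is told the input is UTF-8.

    Assumption: an explicit encoding keeps the parser from guessing a
    single-byte encoding for undeclared input.
    """
    from lxml import etree  # imported lazily; the check above needs only the stdlib

    parser = etree.HTMLParser(encoding="utf-8")
    data = markup.encode("utf-8") if isinstance(markup, str) else markup
    return etree.HTML(data, parser=parser)


print(list(ANT.encode("utf-8")))     # [240, 159, 144, 156]
print(char_refs_if_misdecoded(ANT))  # &#240;&#159;&#144;&#156;
print(ord(ANT))                      # 128028, the single code point
```

If the explicit encoding behaves as assumed, serializing the parsed tree should give the title as the single reference `&#128028;` instead of four.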