'<' character causes incorrect parsing

Bug #2051301 reported by Vidhu
Affects: lxml
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I've noticed different parsing behavior on two installations of lxml.

When parsing an HTML document that contains a bare '<' character, lxml removes the character and all text content after it.

For example:
```
from lxml import html
d = """<html>
  <body>
    10 < 1000
  </body>
</html>"""

print(html.tostring(html.fromstring(d)).decode())
```

Output:

```
<html>
  <body>
    10
</body></html>
```
This occurs on the following installation:

Python : sys.version_info(major=3, minor=10, micro=9, releaselevel='final', serial=0)
lxml.etree : (4, 6, 5, 0)
libxml used : (2, 9, 13)
libxml compiled : (2, 9, 13)
libxslt used : (1, 1, 35)
libxslt compiled : (1, 1, 35)
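
For anyone trying to reproduce this, the version details listed above can be gathered with a sketch like the following. `collect_versions` is a hypothetical helper name chosen for illustration; the `etree` version attributes are the ones lxml exposes. The function falls back to reporting only the Python version if lxml is not importable.

```python
import sys


def collect_versions():
    """Return version details relevant to an lxml bug report (hypothetical helper)."""
    versions = {"Python": sys.version_info}
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed in this environment; report Python only.
        return versions
    versions["lxml.etree"] = etree.LXML_VERSION
    versions["libxml used"] = etree.LIBXML_VERSION
    versions["libxml compiled"] = etree.LIBXML_COMPILED_VERSION
    versions["libxslt used"] = etree.LIBXSLT_VERSION
    versions["libxslt compiled"] = etree.LIBXSLT_COMPILED_VERSION
    return versions


# Print one "name : value" line per component.
for name, value in collect_versions().items():
    print(f"{name:<20}: {value}")
```

Comparing the "used" and "compiled" values shows whether lxml is running against a different libxml2 than the one it was built with.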

The issue is not present on the following installation, which differs only in its libxml2 and libxslt versions:

Python : sys.version_info(major=3, minor=10, micro=9, releaselevel='final', serial=0)
lxml.etree : (4, 6, 5, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

```
from lxml import html
d = """<html>
  <body>
    10 < 1000
  </body>
</html>"""

print(html.tostring(html.fromstring(d)).decode())
```

Output:

```
<html>
  <body>
    10 &lt; 1000
  </body>
</html>
```
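
A possible workaround, offered as a sketch and not as a confirmed fix: escape any '<' that cannot begin a tag before handing the markup to `html.fromstring`. The helper name `escape_stray_lt` and the regular expression are assumptions for illustration. The pattern follows the HTML tokenizer rule that '<' opens markup only when followed by an ASCII letter, '/', '!' or '?'.

```python
import re

# A '<' that is not followed by a letter, '/', '!' or '?' cannot start
# a tag, end tag, comment, doctype, or processing instruction.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(markup: str) -> str:
    """Replace bare '<' characters with '&lt;' (hypothetical helper)."""
    return _STRAY_LT.sub("&lt;", markup)


d = """<html>
  <body>
    10 < 1000
  </body>
</html>"""

cleaned = escape_stray_lt(d)
print(cleaned)
```

The cleaned string can then be passed to `html.fromstring` as in the examples above. This heuristic would also rewrite comparisons inside `<script>` or `<style>` content, so it only suits documents without inline scripts or styles.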

Tags: html
scoder (scoder) wrote:

Works for me with libxml2 2.12.3.

Changed in lxml:
status: New → Invalid