lxml

'<' character causes incorrect parsing

Bug #2051301 reported by Vidhu on 2024-01-26

This bug affects 1 person

Affects   Status    Importance   Assigned to   Milestone
lxml      Invalid   Undecided    Unassigned

Bug Description

I've noticed different parsing behavior on two installations of lxml.

When parsing an HTML document that contains a bare '<' character in text, lxml drops the character and the text content that follows it.

For example:
```
from lxml import html
d = """<html>
  <body>
    10 < 1000
  </body>
</html>"""

print(html.tostring(html.fromstring(d)).decode())
```
Output:
```
<html>
<body>
10
</body></html>
```
This occurs on:

Python : sys.version_info(major=3, minor=10, micro=9, releaselevel='final', serial=0)
lxml.etree : (4, 6, 5, 0)
libxml used : (2, 9, 13)
libxml compiled : (2, 9, 13)
libxslt used : (1, 1, 35)
libxslt compiled : (1, 1, 35)
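
For anyone comparing their own installation, here is a minimal sketch of how version details like these can be collected. The helper name `lxml_versions` is made up for illustration. The `lxml.etree` version attributes it reads are the ones lxml exposes. The `ImportError` fallback is only there so the snippet still runs where lxml is not installed.

```python
import sys

try:
    from lxml import etree
except ImportError:  # lxml is not installed in this environment
    etree = None


def lxml_versions():
    """Collect the version tuples that matter for this report (hypothetical helper)."""
    info = {"Python": tuple(sys.version_info[:3])}
    if etree is not None:
        info["lxml.etree"] = etree.LXML_VERSION
        # "used" is the library loaded at runtime, "compiled" is what lxml was built against.
        info["libxml used"] = etree.LIBXML_VERSION
        info["libxml compiled"] = etree.LIBXML_COMPILED_VERSION
        info["libxslt used"] = etree.LIBXSLT_VERSION
        info["libxslt compiled"] = etree.LIBXSLT_COMPILED_VERSION
    return info


for name, version in lxml_versions().items():
    print(f"{name} : {version}")
```

Because these are plain tuples, two installations can be compared directly, for example `(2, 9, 13) > (2, 9, 10)`.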

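The issue is not present on the following installation, which has the same lxml version but an older libxml2 (2.9.10 instead of 2.9.13):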

Python : sys.version_info(major=3, minor=10, micro=9, releaselevel='final', serial=0)
lxml.etree : (4, 6, 5, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

```
from lxml import html
d = """<html>
  <body>
    10 < 1000
  </body>
</html>"""

print(html.tostring(html.fromstring(d)).decode())
```
Output:
```
<html>
  <body>
    10 < 1000
  </body>
</html>
```
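
A possible workaround, sketched under assumptions: if the input can be preprocessed, a bare '<' can be rewritten as the entity `&lt;` before parsing, which should sidestep the parser's handling of stray '<' characters. The helper `escape_bare_lt` and its regex are hypothetical and only a heuristic. They treat a '<' as markup only when it is followed by a letter, '/', '!' or '?', so text like `a<b` would not be escaped.

```python
import re

# Heuristic (assumption): a '<' that does not start a tag, closing tag,
# comment/doctype or processing instruction is treated as literal text.
_BARE_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_bare_lt(markup):
    """Replace stray '<' characters with the &lt; entity (hypothetical helper)."""
    return _BARE_LT.sub("&lt;", markup)


d = """<html>
  <body>
    10 < 1000
  </body>
</html>"""

print(escape_bare_lt(d))
```

The escaped string can then be passed to `html.fromstring` as in the examples above.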

scoder (scoder) wrote on 2024-01-27:

Works for me with libxml2 2.12.3.

Changed in lxml:
status:	New → Invalid
