Incremental parsing cannot parse contents if data is split at certain positions

Bug #2058828 reported by Theodore Chang
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---------|--------|------------|-------------|-----------|
| lxml | Fix Released | Low | scoder | |

Bug Description

Python : sys.version_info(major=3, minor=10, micro=11, releaselevel='final', serial=0)
lxml.etree : (5, 1, 0, 0)
libxml used : (2, 10, 3)
libxml compiled : (2, 10, 3)
libxslt used : (1, 1, 37)
libxslt compiled : (1, 1, 37)
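
For anyone comparing their own setup against the versions above, here is a minimal sketch that prints the same fields. It relies only on the version tuples that `lxml.etree` exposes; the label formatting is an assumption made to mirror the report. The "used" value is the libxml2 loaded at runtime, and that is the one that matters for this bug.

```python
import sys
from lxml import etree

# Versions in the same layout as the report above.
# "used" = library loaded at runtime, "compiled" = library lxml was built against.
print("%-20s: %s" % ("Python", sys.version_info))
print("%-20s: %s" % ("lxml.etree", etree.LXML_VERSION))
print("%-20s: %s" % ("libxml used", etree.LIBXML_VERSION))
print("%-20s: %s" % ("libxml compiled", etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ("libxslt used", etree.LIBXSLT_VERSION))
print("%-20s: %s" % ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION))
```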

See also: https://github.com/rajatomar788/pywebcopy/issues/123

If the data fed to the parser is split inside the value of an `href` attribute, no events are produced at all.

See example:

Wrong:

```python
from lxml import etree

parser = etree.HTMLPullParser()
# The input is split in the middle of the href attribute value.
for data in (b'<root><a href="2011-03-13_', b'135411/">2011-03-13_135411/</a></root>',):
    parser.feed(data)
    for _, elem in parser.read_events():
        print(elem.tag)  # prints nothing
parser.close()
```

Expected:

```python
from lxml import etree

parser = etree.HTMLPullParser()
# Same input, fed as a single chunk.
for data in (b'<root><a href="2011-03-13_135411/">2011-03-13_135411/</a></root>',):
    parser.feed(data)
    for _, elem in parser.read_events():
        print(elem.tag)  # prints "a", then "root"
parser.close()
```

scoder (scoder) wrote :

Works for me with libxml2 2.12.6. The binary wheels of lxml 5.1 should be using 2.12.5; I guess that's OK as well.

Changed in lxml:
status: New → Invalid
Theodore Chang (tlcfem) wrote :

It looks like on Windows, the precompiled binary is downloaded during installation.

https://github.com/lxml/lxml/blob/82a42601e3a7f1a44aadc5a948c2b6fde3e0e407/buildlibxml.py#L34C1-L41C6

```py
# use pre-built libraries on Windows

def download_and_extract_windows_binaries(destdir):
    url = "https://api.github.com/repos/lxml/libxml2-win-binaries/releases?per_page=5"
```

Maybe consider bumping the libxml2 version in the libxml2-win-binaries repo?

scoder (scoder) wrote :

Ah, right, this seems to require libxml2 2.11+. I've added your test as a regression test for newer versions and updated the Windows libraries.

https://github.com/lxml/lxml/commit/807fd66704471434c8f1e2f9b6da66497ce43590
https://github.com/lxml/libxml2-win-binaries/commit/8e1f55e596ed14660064c7c19837f4171f5b0842
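
For installations still stuck on an older libxml2, one possible workaround is to check the runtime library version and fall back to feeding the document as a single chunk. The sketch below is not the fix from the commits above. `end_tags` and `MIN_SAFE_LIBXML` are hypothetical names, and the `(2, 11)` threshold is taken only from the "2.11+" remark in this thread.

```python
from lxml import etree

# Assumption: threshold based on the "libxml2 2.11+" remark in this thread.
MIN_SAFE_LIBXML = (2, 11)


def end_tags(chunks):
    """Return the tags of all 'end' events for the given byte chunks.

    Hypothetical helper for illustration; not part of lxml.
    """
    if etree.LIBXML_VERSION < MIN_SAFE_LIBXML:
        # Older libxml2: avoid a split inside an attribute value by
        # feeding the whole document in one piece.
        chunks = [b"".join(chunks)]

    parser = etree.HTMLPullParser()  # reports 'end' events by default
    tags = []
    for data in chunks:
        parser.feed(data)
        tags.extend(elem.tag for _, elem in parser.read_events())
    parser.close()
    # close() may finish elements that were still open, so drain once more.
    tags.extend(elem.tag for _, elem in parser.read_events())
    return tags


tags = end_tags([b'<root><a href="2011-03-13_', b'135411/">2011-03-13_135411/</a></root>'])
print(tags)
```

Joining the chunks gives up the memory benefit of incremental parsing, so this only makes sense as a stopgap until the bundled libxml2 can be upgraded.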

Changed in lxml:
importance: Undecided → Medium
milestone: none → 5.1.1
status: Invalid → Fix Committed
importance: Medium → Low
scoder (scoder)
Changed in lxml:
assignee: nobody → scoder (scoder)
status: Fix Committed → Fix Released