break tag not correctly parsed when directly followed by another tag

Bug #1095945 reported by Sushain Cherivirala
This bug affects 2 people
Affects    Status         Importance   Assigned to   Milestone
Apertium   New            Undecided    Unassigned
libxml2    Fix Released   Undecided    Unassigned
lxml       Invalid        Undecided    scoder

Bug Description

When a break tag is directly followed on the next line by another tag, lxml fails to recognize the break tag properly: the tag is disregarded, and consequently the cleaned output contains no line breaks. When the two tags are separated by text, lxml handles them properly. The same problem occurs when multiple break tags are present, as shown in test.html.

For example,

helloworld1<br />
<br />
- <em>helloworld1</em>

will be parsed correctly and will result in the output of

helloworld1
- helloworld1

However,

helloworld1<br />
<br />
<em>- helloworld1</em>

will be parsed incorrectly and will result in output of

helloworld1- helloworld1

Further examples (test.html) with test script (test.py) and my output (text.xml) are attached.

Supplementary Requested information:
Python : sys.version_info(major=3, minor=2, micro=3, releaselevel='final', serial=0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Sushain Cherivirala (sushain97) wrote :

Xinx (xinx) wrote :

Investigation showed that line breaks between HTML elements are removed by the libxml2 HTML parser (which is in turn invoked by lxml's HTML reader) if and only if there is nothing but whitespace between the elements. Hence, there is no fix that lxml itself could offer.
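
Since the whitespace between elements cannot be relied on, one possible workaround is to treat each br element itself as the line break when extracting text. The following is only a sketch, not something proposed in this report: the helper name text_with_breaks and the sample string are made up for illustration, and it assumes that stripping the leading whitespace after each br is acceptable for the use case.

```python
from lxml import etree


def text_with_breaks(html):
    """Return the text of an HTML fragment with one newline per <br>.

    Hypothetical helper: it does not depend on whether the parser kept
    the whitespace-only text between elements.
    """
    root = etree.fromstring(html, etree.HTMLParser())
    for br in root.iter("br"):
        # Discard whatever whitespace the parser kept (or dropped) after
        # the tag and record the line break explicitly in the tail.
        br.tail = "\n" + (br.tail or "").lstrip()
    return "".join(root.itertext())


# The failing example from the bug description.
sample = "helloworld1<br />\n<br />\n<em>- helloworld1</em>"
print(text_with_breaks(sample))
```

With this approach every br contributes exactly one newline to the extracted text, whichever libxml2 version did the parsing.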

scoder (scoder) wrote :

Verified with libxml2 2.7.8:

>>> print(et.tostring(et.fromstring(h, et.HTMLParser())))
<html><body><p>helloworld1<br/><br/>\n- <em>helloworld1</em></p></body></html>

Found to be fixed in libxml2 2.9.0:

>>> print(et.tostring(et.fromstring(h, et.HTMLParser())))
<html><body><p>helloworld1<br/>\n<br/>\n- <em>helloworld1</em>\n</p></body></html>
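
The two sessions above do not show how h was defined. A self-contained way to repeat the check against the locally installed libxml2 might look like the sketch below. The value of h is an assumption based on the first example in the bug description, and serialize is a hypothetical helper name.

```python
from lxml import etree

# Assumed input: the comment above does not show the definition of `h`.
h = "helloworld1<br />\n<br />\n- <em>helloworld1</em>\n"


def serialize(markup):
    """Parse with libxml2's HTML parser and serialize the resulting tree."""
    root = etree.fromstring(markup, etree.HTMLParser())
    return etree.tostring(root, encoding="unicode")


# The version matters here: the result differs between 2.7.8 and 2.9.0.
print("libxml2 used:", etree.LIBXML_VERSION)
# repr() makes any newline kept between the two break tags visible.
print(repr(serialize(h)))
```

Whether a newline shows up between the two br elements in the printed string indicates which of the two behaviours the installed libxml2 has.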

no longer affects: libxml2
Changed in lxml:
assignee: nobody → scoder (scoder)
Changed in libxml2:
status: New → Fix Released
scoder (scoder) wrote :

not a bug in lxml.

Changed in lxml:
status: New → Invalid