break tag not correctly parsed when directly followed by another tag

Bug #1095945 reported by Sushain Cherivirala on 2013-01-04
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fix Released

Bug Description

When a break tag is directly followed on the next line by another tag, lxml fails to recognize the break tag: it disregards the tag, and consequently the cleaned output contains no line breaks. When the two tags are separated by text, lxml handles them properly. The same problem occurs when multiple break tags are present, as shown in test.html.

For example,

helloworld1<br />
<br />
- <em>helloworld1</em>

will be parsed correctly and will result in the output of

- helloworld1


helloworld1<br />
<br />
<em>- helloworld1</em>

by contrast, will be parsed incorrectly and will result in the output of

helloworld1- helloworld1

Further examples (test.html), the test script, and my output (text.xml) are attached.

Supplementary Requested information:
Python : sys.version_info(major=3, minor=2, micro=3, releaselevel='final', serial=0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
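
The version block above corresponds to the version attributes that lxml.etree exposes. A minimal sketch for collecting the same information, assuming lxml is installed; the helper name version_report is hypothetical and not part of lxml:

```python
import sys
from lxml import etree


def version_report():
    """Collect interpreter and library versions as (label, value) pairs."""
    return [
        ("Python", sys.version_info),
        ("lxml.etree", etree.LXML_VERSION),
        ("libxml used", etree.LIBXML_VERSION),
        ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
        ("libxslt used", etree.LIBXSLT_VERSION),
        ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
    ]


if __name__ == "__main__":
    # Print one "label : value" line per entry, in the layout used in the report.
    for label, value in version_report():
        print("%-20s: %s" % (label, value))
```

The "used" values describe the libraries loaded at runtime and the "compiled" values the ones lxml was built against; the two can differ, which matters here because the behaviour depends on the libxml2 version.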

Sushain Cherivirala (sushain97) wrote :
Xinx (xinx) wrote :

Investigation showed that line breaks between HTML elements are removed (if and only if there is nothing but whitespace between the elements) by the libxml2 HTML parser, which in turn is invoked by lxml's HTML reader. Hence, lxml itself cannot offer a solution.

scoder (scoder) wrote :

Verified with libxml2 2.7.8:

>>> print(et.tostring(et.fromstring(h, et.HTMLParser())))
<html><body><p>helloworld1<br/><br/>\n- <em>helloworld1</em></p></body></html>

Found to be fixed in libxml2 2.9.0:

>>> print(et.tostring(et.fromstring(h, et.HTMLParser())))
<html><body><p>helloworld1<br/>\n<br/>\n- <em>helloworld1</em>\n</p></body></html>
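
The session above does not show how h was defined. The following sketch reproduces the check under the assumption that the input is the failing snippet from the bug description; the name br_tails is hypothetical, and lxml is assumed to be installed. It inspects the tail text of each br element, which is where lxml stores whitespace that follows a tag:

```python
from lxml import etree

# Assumed input: the failing snippet from the bug description,
# with the tags separated only by newlines.
HTML = "helloworld1<br />\n<br />\n<em>- helloworld1</em>"


def br_tails(html):
    """Return the tail text of every <br> element, in document order."""
    root = etree.fromstring(html, etree.HTMLParser())
    return [br.tail for br in root.iter("br")]


if __name__ == "__main__":
    print("libxml2 in use:", etree.LIBXML_VERSION)
    # A tail of None means the whitespace after that <br> was dropped;
    # a whitespace string means the parser kept the line break.
    print("br tails:", br_tails(HTML))
```

Going by the two outputs quoted above, one would expect missing tails with an affected libxml2 and newline tails with 2.9.0 or later, but the exact result depends on the libxml2 version in use.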

no longer affects: libxml2
Changed in lxml:
assignee: nobody → scoder (scoder)
Changed in libxml2:
status: New → Fix Released
scoder (scoder) wrote :

not a bug in lxml.

Changed in lxml:
status: New → Invalid