break tag not correctly parsed when directly followed by another tag
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Apertium | New | Undecided | Unassigned |
libxml2 | Fix Released | Undecided | Unassigned |
lxml | Invalid | Undecided | scoder |
Bug Description
When a break tag is directly followed on the next line by another tag, lxml fails to recognize the break tag: the tag is disregarded, and consequently the cleaned output contains no line breaks. When the two tags are separated by text, lxml handles them properly. The same problem occurs when multiple break tags are present, as shown in test.html.
For example,
helloworld1<br />
<br />
- <em>helloworld1</em>
is parsed correctly and results in the output
helloworld1
- helloworld1
However,
helloworld1<br />
<br />
<em>- helloworld1</em>
is parsed incorrectly and results in the output
helloworld1- helloworld1
Further examples (test.html) with test script (test.py) and my output (text.xml) are attached.
Supplementary Requested information:
Python : sys.version_
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Investigation showed that line breaks between HTML elements are removed by the libxml2 HTML processor (which lxml's HTML reader invokes) if and only if there is nothing but whitespace between the elements. Hence, lxml itself cannot offer a solution.
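
Because the whitespace handling happens inside libxml2, it can help to cross-check the expected text with a parser that does not go through libxml2 at all. The following is only a minimal sketch using Python's standard-library `html.parser`, not the attached test.py. The names `BreakAwareText`, `html_to_text`, `SAMPLE_WITH_TEXT` and `SAMPLE_ADJACENT` are hypothetical, and the rule "every break tag becomes exactly one newline, source line breaks are dropped" is an assumption about the intended cleaned output.

```python
from html.parser import HTMLParser

# The two fragments from the report (hypothetical variable names).
SAMPLE_WITH_TEXT = "helloworld1<br />\n<br />\n- <em>helloworld1</em>"
SAMPLE_ADJACENT = "helloworld1<br />\n<br />\n<em>- helloworld1</em>"


class BreakAwareText(HTMLParser):
    """Collect text content, emitting one newline per <br> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The default handle_startendtag forwards "<br />" to this method.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Assumption: line breaks in the source are not significant;
        # only <br> should produce a newline in the output.
        self.parts.append(data.replace("\n", ""))


def html_to_text(fragment):
    """Return the text of an HTML fragment with <br> rendered as newlines."""
    parser = BreakAwareText()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(repr(html_to_text(SAMPLE_WITH_TEXT)))
    print(repr(html_to_text(SAMPLE_ADJACENT)))
```

Under this rule both fragments yield the same text, with the two break tags preserved as two newlines, whether or not text separates the last break tag from the `<em>` element. Comparing this against the lxml output for the same input shows whether the break tags were lost on the libxml2 side.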