Questionable handling of implied end tags
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | New | Undecided | Unassigned |
Bug Description
According to https:/
"A dt element's end tag can be omitted if the dt element is immediately followed by another dt element or a dd element."
Based on that language, I would have expected this
from lxml import html
html.tostring(
to have resulted in
b'<dl><
but instead I get this:
b'<dl><
... so the path of the second dt becomes dl/dt/dt instead of dl/dt as expected.
Note that, by contrast, the handling of li elements appears to match what the spec says:
"An li element's end tag can be omitted if the li element is immediately followed by another li element or if there is no more content in the parent element."
html.tostring(
b'<ol><
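For reference, the two spec rules quoted above can be sketched with nothing but the standard library. This is only an illustration of the expected nesting, not how lxml or libxml2 parse HTML: the names `IMPLIED_END`, `PathRecorder` and `element_paths` are made up for this example, the sample markup is hypothetical (the original snippets are cut off), and the sketch covers only the dt and li rules and ignores void elements such as br.

```python
from html.parser import HTMLParser

# Open element -> start tags that implicitly close it.
# Only the two rules quoted in this report are modelled (assumption:
# everything else nests normally).
IMPLIED_END = {
    "dt": {"dt", "dd"},
    "li": {"li"},
}


class PathRecorder(HTMLParser):
    """Record the slash-separated path of every element start tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.paths = []   # one path per start tag, in document order

    def handle_starttag(self, tag, attrs):
        # If the innermost open element may omit its end tag before
        # this start tag, close it first.
        if self.stack and tag in IMPLIED_END.get(self.stack[-1], ()):
            self.stack.pop()
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))

    def handle_endtag(self, tag):
        # Close the nearest matching open element, together with
        # anything still open inside it; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def element_paths(markup):
    """Return the element paths found in markup (hypothetical helper)."""
    parser = PathRecorder()
    parser.feed(markup)
    parser.close()
    return parser.paths


# Hypothetical input: two dt elements, the first without an end tag.
print(element_paths("<dl><dt>a<dt>b</dl>"))
# The same shape with li elements.
print(element_paths("<ol><li>a<li>b</ol>"))
```

Under these rules both dt elements come out as siblings at dl/dt, which is the nesting I expected from lxml; the li case yields ol/li twice in the same way.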
I tried pre-flighting this on the mailing list, but I've been getting strange mail delivery failures for my most recent messages:
*** MAIL DELIVERY FAILURE REPORT ***
The original message was received at Sun, 08 Mar 2020 09:12:09 -0700 (PDT)
from host by mail-wr1-
<email address hidden>.
Subject: Questionable handling of implied end tags
From: <email address hidden>
Mail delivery to the following recipient has finally failed:
<email address hidden>
Last reason: 550 5.1.0
Explanation: host mxa.eu.mailgun.org [18.195.181.121] said: Recipient rejected:
Transcript of session:
... while talking to mxa.eu.mailgun.org [18.195.181.121]:
>>> RCPT TO:<email address hidden>
<<< 550 5.1.0 Recipient rejected: <email address hidden>
Seems odd that the recipient address doesn't match my own. At any rate, I'm going straight to the bug tracker. Here's the requested environment report:
Python : sys.version_
lxml.etree : (4, 4, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
That's on macOS. Also reproducible on Linux and Windows with Python 3.8.0 and lxml 4.5.0.