Malformed HTML mail wedged processing due to lxml parsing
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
KARL3 | Fix Released | Low | Chris Rossi |
Bug Description
I applied a quick fix to the parsing of HTML, based on the traceback below. Below the traceback is an example of the HTML that broke mail-in.
Some near-term and longer-term (separate ticket) work remains. Near term:
1) Add a test that feeds malformed content to mail-in.
2) Wrap the two cases of document_fromstring with an exception handler that logs and skips. If something is so bad that lxml.html can't handle it, I'm ok with tossing it out.
As an alternative to (2), have the exception handler return a body that says "<p>Content was not parsable as HTML.</p>" or something.
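The log-and-substitute idea above could be sketched roughly as follows. This is only an illustration, not the KARL code: the function name `parse_or_fallback`, the logger name, and the sample input are hypothetical, and the standard library's strict `xml.etree.ElementTree.fromstring` stands in for the lxml parser so the sketch runs without lxml installed. In mail-in, the `parse` argument would presumably be `lxml.html.document_fromstring`.

```python
import logging
import xml.etree.ElementTree as ET

log = logging.getLogger("mailin.html")  # hypothetical logger name

# Placeholder body suggested in this ticket for unparsable content.
FALLBACK_BODY = "<p>Content was not parsable as HTML.</p>"


def parse_or_fallback(text, parse=ET.fromstring, fallback=FALLBACK_BODY):
    """Parse `text`; if the parser raises, log it and parse `fallback` instead.

    `parse` is any callable that turns markup into a document. The default is
    the stdlib strict XML parser, used here only as a stand-in.
    """
    try:
        return parse(text)
    except Exception:
        # Broad on purpose for this sketch; with lxml the handler could
        # probably be narrowed to lxml.etree.LxmlError.
        log.exception("Unparsable HTML body (%d characters), using fallback",
                      len(text))
        return parse(fallback)


# Hypothetical malformed input, loosely modeled on the sample in this report:
# a bare <http://...> inside the paragraph is not a valid tag.
broken = ('<p>efforts to combat corruption <a href="http://example.invalid/">'
          'Nuhu Ribadu, <http://example.invalid/> who built a staff</p>')

doc = parse_or_fallback(broken)
print(doc.tag, doc.text)
```

The same two calls (one malformed body, one well-formed body) would also be a reasonable starting point for the test in item 1.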
Traceback
=========
$ ./cronscripts/
=======
Draining mailbox : /home/zope
=======
Dry-run : False
Pending queue : /home/zope/Maildir
=======
Processing mail-in content
=======
Maildir root: /home/zope
Pending queue: /home/zope/
ZODB URI: zeo://localhost
=======
Traceback (most recent call last):
File "bin/mailin", line 74, in <module>
osi.
File "/var/db/
MailinRunner
File "/var/db/
if self.handleMess
File "/var/db/
self.
File "/var/db/
IMailinHandl
File "/var/db/
alerts.
File "/var/db/
self.
File "/var/db/
message = alert.message
File "/var/db/
html = etree.fromstrin
File "lxml.etree.pyx", line 2435, in lxml.etree.
File "parser.pxi", line 1511, in lxml.etree.
File "parser.pxi", line 1383, in lxml.etree.
File "parser.pxi", line 892, in lxml.etree.
File "parser.pxi", line 538, in lxml.etree.
File "parser.pxi", line 624, in lxml.etree.
File "parser.pxi", line 564, in lxml.etree.
lxml.etree.
<p>In oil-rich Nigeria, Africa's most populous nation, where watchdog
groups say efforts to combat corruption are backsliding
<a href="http://
Nuhu Ribadu,
<http://
html> who built a well-trained staff of investigators at the Economic
and Financial Crimes Commission, said he fled his homeland into
self-imposed exile in England in December. Officials had sent Mr. Ribadu
away to a training course a year earlier, soon after his agency charged
a wealthy, politically connected former governor with trying to bribe
officials on his staff with huge sacks stuffed with $15 million in $100
bills. Mr. Ribadu, who was dismissed from the police force last year,
said he had received death threats and was fired upon in September by
assailants.</p>
Changed in karl3:
milestone: m18 → m19
Changed in karl3:
status: New → In Progress
If left to my own devices, in the case of extreme unparsability I would just escape everything as though it were plain text. Is that ok, or do you prefer to log and discard?
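For comparison, the escape-as-plain-text option might look something like this. It is a minimal sketch using modern Python's `html.escape`; the function name `as_plain_text_html` and the choice to wrap the result in `<pre>` are assumptions for illustration, not part of any existing fix.

```python
from html import escape


def as_plain_text_html(raw):
    """Escape all markup in `raw` so the source is displayed literally.

    The result is wrapped in <pre> so the original line breaks survive.
    """
    return "<pre>%s</pre>" % escape(raw)


print(as_plain_text_html('<p>a & b'))
```

The trade-off against log-and-discard is that the recipient still sees the message text, at the cost of raw tags showing up in the rendered body.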