Malformed HTML mail wedged processing due to lxml parsing

Bug #386065 reported by Paul Everitt
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
KARL3
Fix Released
Low
Chris Rossi

Bug Description

I applied a quick fix to the parsing of HTML, based on the traceback below. Below the traceback is an example of the HTML that broke mail-in.

Some near-term and longer-term (separate ticket) work remains. Near term:

1) Make a test that tries with malformed content.

2) Wrap the two cases of document_fromstring with an exception handler that logs and skips. If something is so bad that lxml.html can't handle it, I'm ok with tossing it out.

As an alternative to (2), have the exception handler return a body that says "<p>Content was not parsable as HTML.</p>" or something.

Traceback
=========

$ ./cronscripts/cron-process-mail.sh
==============================================================================
Draining mailbox : /home/zope
==============================================================================
Dry-run : False
Pending queue : /home/zope/Maildir

=================================================================
Processing mail-in content
=================================================================
Maildir root: /home/zope
Pending queue: /home/zope/Maildir/pending.db
ZODB URI: zeo://localhost:8884
=================================================================
Traceback (most recent call last):
 File "bin/mailin", line 74, in <module>
   osi.scripts.mailin.main()
 File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 279, in main
   MailinRunner(sys.argv)()
 File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 263, in __call__
   if self.handleMessage(message_id):
 File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 242, in handleMessage
   self.processMessage(message, info, text, attachments)
 File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 226, in processMessage
   IMailinHandler(target).handle(message, info, text, attachments)
 File "/var/db/karl3/src/karl.content/karl/content/adapters/mailin.py", line 113, in handle
   alerts.emit(entry, offline_request)
 File "/var/db/karl3/src/karl/karl/utilities/alerts.py", line 55, in emit
   self._queue_digest(context, profile, request)
 File "/var/db/karl3/src/karl/karl/utilities/alerts.py", line 65, in _queue_digest
   message = alert.message
 File "/var/db/karl3/src/karl.content/karl/content/views/adapters.py", line 272, in message
   html = etree.fromstring(body)
 File "lxml.etree.pyx", line 2435, in lxml.etree.fromstring (src/lxml/lxml.etree.c:24170)
 File "parser.pxi", line 1511, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:64170)
 File "parser.pxi", line 1383, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:63041)
 File "parser.pxi", line 892, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:59878)
 File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:56751)
 File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:57595)
 File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:56993)
lxml.etree.XMLSyntaxError: Failed to parse QName 'http:', line 20, column 7

<p>In oil-rich Nigeria, Africa's most populous nation, where watchdog
groups say efforts to combat corruption are backsliding
<a href="http://www.hrw.org/en/news/2009/06/07/nigeria-abusers-reign-midterm">http://www.hrw.org/en/news/2009/06/07/nigeria-abusers-reign-midterm</a> ,
Nuhu Ribadu,
<http://www.pbs.org/frontlineworld/stories/bribe/2009/05/lowell-bergman.
html> who built a well-trained staff of investigators at the Economic
and Financial Crimes Commission, said he fled his homeland into
self-imposed exile in England in December. Officials had sent Mr. Ribadu
away to a training course a year earlier, soon after his agency charged
a wealthy, politically connected former governor with trying to bribe
officials on his staff with huge sacks stuffed with $15 million in $100
bills. Mr. Ribadu, who was dismissed from the police force last year,
said he had received death threats and was fired upon in September by
assailants.</p>

Tags: mail
Changed in karl3:
milestone: m18 → m19
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

If left to my own devices, in the case of extreme unparsability I would just escape everything as though it were plain text. Is that ok, or do you prefer to log and discard?

Revision history for this message
Paul Everitt (paul-agendaless) wrote : Re: [Bug 386065] Re: Malformed HTML mail wedged processing due to lxml parsing
Download full text (4.9 KiB)

Sounds fine to me. Makes for very, very good visibility on the
problem. In fact, the user can do a cleanup themselves, getting it
out of the <pre> by editing the blog entry/comment.

--Paul

On Jun 17, 2009, at 4:11 PM, Chris Rossi wrote:

> If left to my own devices, in the case of extreme unparsability I
> would
> just escape everything as though it were plain text. Is that ok, or
> do
> you prefer to log and discard?
>
> --
> Malformed HTML mail wedged processing due to lxml parsing
> https://bugs.launchpad.net/bugs/386065
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Porting KARL to a new architecture: New
>
> Bug description:
>
> I applied a quick fix to the parsing of HTML, based on the traceback
> below. Below the traceback is an example of the HTML that broke
> mail-in.
>
> Some near-term and longer-term (separate ticket) work remains. Near
> term:
>
> 1) Make a test that tries with malformed content.
>
> 2) Wrap the two cases of document_fromstring with an exception
> handler that logs and skips. If something is so bad that lxml.html
> can't handle it, I'm ok with tossing it out.
>
> As an alternative to (2), have the exception handler return a body
> that says "<p>Content was not parsable as HTML.</p>" or something.
>
> Traceback
> =========
>
> $ ./cronscripts/cron-process-mail.sh
> =
> =
> =
> =
> =
> =
> =
> =
> ======================================================================
> Draining mailbox : /home/zope
> =
> =
> =
> =
> =
> =
> =
> =
> ======================================================================
> Dry-run : False
> Pending queue : /home/zope/Maildir
>
> =================================================================
> Processing mail-in content
> =================================================================
> Maildir root: /home/zope
> Pending queue: /home/zope/Maildir/pending.db
> ZODB URI: zeo://localhost:8884
> =================================================================
> Traceback (most recent call last):
> File "bin/mailin", line 74, in <module>
> osi.scripts.mailin.main()
> File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 279, in main
> MailinRunner(sys.argv)()
> File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 263, in
> __call__
> if self.handleMessage(message_id):
> File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 242, in
> handleMessage
> self.processMessage(message, info, text, attachments)
> File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 226, in
> processMessage
> IMailinHandler(target).handle(message, info, text, attachments)
> File "/var/db/karl3/src/karl.content/karl/content/adapters/
> mailin.py", line 113, in handle
> alerts.emit(entry, offline_request)
> File "/var/db/karl3/src/karl/karl/utilities/alerts.py", line 55, in
> emit
> self._queue_digest(context, profile, request)
> File "/var/db/karl3/src/karl/karl/utilities/alerts.py", line 65, in
> _queue_digest
> message = alert.message
> File "/var/db/karl3/src/karl.content/karl/content/views/
> adapters.py", line 272, in message
> html = etree.fromstrin...

Read more...

Changed in karl3:
status: New → In Progress
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

Just noticed that the code where this is happening is actually for outgoing alerts (triggered by incoming mail). Even more specifically, it's for the digest version of those alerts. So the fix will only impact alert digests. Everything else should already be working ok.

Revision history for this message
Paul Everitt (paul-agendaless) wrote :
Download full text (4.9 KiB)

I think the incoming happens to use that adapter or machinery.

On Jun 18, 2009, at 11:30 AM, Chris Rossi wrote:

> Just noticed that the code where this is happening is actually for
> outgoing alerts (triggered by incoming mail). Even more specifically,
> it's for the digest version of those alerts. So the fix will only
> impact alert digests. Everything else should already be working ok.
>
> --
> Malformed HTML mail wedged processing due to lxml parsing
> https://bugs.launchpad.net/bugs/386065
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Porting KARL to a new architecture: In Progress
>
> Bug description:
>
> I applied a quick fix to the parsing of HTML, based on the traceback
> below. Below the traceback is an example of the HTML that broke
> mail-in.
>
> Some near-term and longer-term (separate ticket) work remains. Near
> term:
>
> 1) Make a test that tries with malformed content.
>
> 2) Wrap the two cases of document_fromstring with an exception
> handler that logs and skips. If something is so bad that lxml.html
> can't handle it, I'm ok with tossing it out.
>
> As an alternative to (2), have the exception handler return a body
> that says "<p>Content was not parsable as HTML.</p>" or something.
>
> Traceback
> =========
>
> $ ./cronscripts/cron-process-mail.sh
> =
> =
> =
> =
> =
> =
> =
> =
> ======================================================================
> Draining mailbox : /home/zope
> =
> =
> =
> =
> =
> =
> =
> =
> ======================================================================
> Dry-run : False
> Pending queue : /home/zope/Maildir
>
> =================================================================
> Processing mail-in content
> =================================================================
> Maildir root: /home/zope
> Pending queue: /home/zope/Maildir/pending.db
> ZODB URI: zeo://localhost:8884
> =================================================================
> Traceback (most recent call last):
> File "bin/mailin", line 74, in <module>
> osi.scripts.mailin.main()
> File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 279, in main
> MailinRunner(sys.argv)()
> File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 263, in
> __call__
> if self.handleMessage(message_id):
> File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 242, in
> handleMessage
> self.processMessage(message, info, text, attachments)
> File "/var/db/karl3/src/osi/osi/scripts/mailin.py", line 226, in
> processMessage
> IMailinHandler(target).handle(message, info, text, attachments)
> File "/var/db/karl3/src/karl.content/karl/content/adapters/
> mailin.py", line 113, in handle
> alerts.emit(entry, offline_request)
> File "/var/db/karl3/src/karl/karl/utilities/alerts.py", line 55, in
> emit
> self._queue_digest(context, profile, request)
> File "/var/db/karl3/src/karl/karl/utilities/alerts.py", line 65, in
> _queue_digest
> message = alert.message
> File "/var/db/karl3/src/karl.content/karl/content/views/
> adapters.py", line 272, in message
> html = etree.fromstring(body)
> File "lxml...

Read more...

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote : Re: [Bug 386065] Re: Malformed HTML mail wedged processing due to lxml parsing

Incoming mail generates content, which in turn generates alerts, and
therefore ends up calling this code.

On Thu, Jun 18, 2009 at 12:03 PM, Paul Everitt <email address hidden> wrote:

>
> I think the incoming happens to use that adapter or machinery.
>
>
>

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

As far as I can tell, via testing, lxml.document_fromstring, added in Paul's previous fix, seems to be willing to parse just about anything we throw at it. I have not found a vector in which it fails. I have checked in a regression test and tested the test by reverting the code, but otherwise things seem to be good here.

Changed in karl3:
status: In Progress → Fix Committed
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Shane has this working.

Changed in karl3:
status: Fix Committed → Fix Released
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Oops, meant to say "Chris".

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.