Comment 11 for bug 34758

Matt Zimmerman (mdz) wrote : Re: [Bug 34758] Re: librarian will set type to text/html though it should be text/plain

On Mon, Jul 07, 2008 at 10:14:04AM -0000, Gavin Panella wrote:
> The debdiff case was dealt with recently in bug 229040, but it looks
> like a more general fix is necessary. The problem, afaict, is in the
> `zope.app.content_types.text_type` function, which does some really
> awful guessing:
>
> def text_type(s):
>     s = s.strip()
>     # Yuk. See if we can figure out the type by content.
>     if s.lower().startswith('<html>') or '</' in s:
>         return 'text/html'
>     elif s.startswith('<?xml'):
>         return 'text/xml'
>     else:
>         return 'text/plain'
>
> We could call `z.a.content_types.guess_content_type` with an explicit
> default, then `text_type` will never be called. Some people may rely on
> the fact that real XML and HTML files are currently detected, so a
> refined version of `text_type` may be needed instead.

I think simply replacing the first test with something a bit less liberal,
e.g.:

    s = s.strip().lower()
    if (s.startswith('<html') or s.startswith('<!doctype html') or
        s.startswith('<title') or s.startswith('<head')):

would be fine. FYI, this example code reflects what file(1) does today,
which seems reasonable enough.
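
For reference, here is a rough sketch of how the whole function might look
with that stricter test folded in. It is not the actual zope code, only the
quoted `text_type` with the first branch swapped out. The `head` variable is
my own naming, and I am assuming the XML and plain-text branches stay as they
are:

```python
def text_type(s):
    """Guess a text MIME type from content, using stricter HTML markers."""
    # Match against a stripped, lowercased copy of the content.
    head = s.strip().lower()
    # str.startswith accepts a tuple of prefixes; these are the markers
    # listed above (assumed to be enough for the common cases).
    html_prefixes = ('<html', '<!doctype html', '<title', '<head')
    if head.startswith(html_prefixes):
        return 'text/html'
    elif head.startswith('<?xml'):
        return 'text/xml'
    else:
        return 'text/plain'
```

With this, a debdiff that merely contains '</' somewhere in its body falls
through to text/plain, while documents that open with one of the listed
markers are still reported as text/html.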

A more comprehensive option might be to validate the file using an HTML
parser, which I'm sure is around somewhere in Launchpad already.
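
A very rough sketch of the parser idea, using the Python 3 standard library's
`html.parser` purely as a stand-in for whatever parser Launchpad ships. The
names `_HTMLSniffer` and `looks_like_html` are made up for illustration. Note
that this parser is lenient and does not reject malformed input, so this is
parser-assisted sniffing rather than real validation: the content counts as
HTML only if the first significant thing the parser reports is an HTML
doctype or an html/head/title start tag.

```python
from html.parser import HTMLParser


class _HTMLSniffer(HTMLParser):
    """Record whether the first significant token looks like HTML."""

    def __init__(self):
        super().__init__()
        self.verdict = None  # None until the first significant token

    def _decide(self, value):
        # Only the first decision counts; later tokens are ignored.
        if self.verdict is None:
            self.verdict = value

    def handle_decl(self, decl):
        # For '<!DOCTYPE html>' the parser passes 'DOCTYPE html'.
        self._decide(decl.lower().startswith('doctype html'))

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased.
        self._decide(tag in ('html', 'head', 'title'))

    def handle_data(self, data):
        # Non-whitespace text before any markup means "not HTML".
        if data.strip():
            self._decide(False)


def looks_like_html(s):
    sniffer = _HTMLSniffer()
    sniffer.feed(s)
    sniffer.close()  # flush any buffered trailing text
    return bool(sniffer.verdict)
```

Compared with the prefix test, this tolerates leading comments before the
<html> tag, at the cost of running a parser over the whole attachment.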

--
 - mdz