On Mon, Jul 07, 2008 at 10:14:04AM -0000, Gavin Panella wrote:
> The debdiff case was dealt with recently in bug 229040, but it looks
> like a more general fix is necessary. The problem, afaict, is in the
> `zope.app.content_types.text_type` function, which does some really
> awful guessing:
>
>     def text_type(s):
>         s = s.strip()
>         # Yuk. See if we can figure out the type by content.
>         if s.lower().startswith('<html>') or '</' in s:
>             return 'text/html'
>         elif s.startswith('<?xml'):
>             return 'text/xml'
>         else:
>             return 'text/plain'
>
> We could call `z.a.content_types.guess_content_type` with an explicit
> default, then `text_type` will never be called. Some people may rely on
> the fact that real XML and HTML files are currently detected, so a
> refined version of `text_type` may be needed instead.
I think simply replacing the first test with something a bit less liberal,
e.g.:
    s = s.strip().lower()
    if (s.startswith('<html') or s.startswith('<!doctype html') or
        s.startswith('<title') or s.startswith('<head')):
would be fine. FYI, this example code reflects what file(1) does today,
which seems reasonable enough.
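
For illustration, here is a minimal sketch of how the whole function might look with the stricter test folded in. It is not the actual Zope code: the `HTML_PREFIXES` name is made up for this example, and the XML and plain-text branches are simply carried over from the quoted version.

```python
# Hypothetical constant: the prefixes suggested above, mirroring file(1).
HTML_PREFIXES = ('<html', '<!doctype html', '<title', '<head')


def text_type(s):
    """Guess a text content type from the leading markup only."""
    stripped = s.strip()
    # Match case-insensitively, and only at the very start of the content,
    # so a stray '</' somewhere in a diff no longer triggers text/html.
    if stripped.lower().startswith(HTML_PREFIXES):
        return 'text/html'
    elif stripped.startswith('<?xml'):
        return 'text/xml'
    else:
        return 'text/plain'
```

With this version a debdiff that merely contains a closing tag somewhere in its body falls through to `text/plain`, while documents that open with real HTML or XML markup are still detected.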
A more comprehensive option might be to validate the file using an HTML
parser; I'm sure one is around somewhere in Launchpad already.
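
A rough sketch of the parser-based idea, assuming Python 3's standard `html.parser` module rather than whatever parser Launchpad actually ships. The names `_TagChecker`, `VOID_TAGS` and `looks_like_html` are invented for this example, and the check is only a heuristic (at least one start tag, no end tag without a matching opener), not real validation.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete set of elements that take no end tag.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class _TagChecker(HTMLParser):
    """Track open tags and note any end tag that has no opener."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.seen_tag = False
        self.stray_end = False

    def handle_starttag(self, tag, attrs):
        self.seen_tag = True
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in self.open_tags:
            # Pop back to the matching opener, tolerating omitted closers.
            while self.open_tags.pop() != tag:
                pass
        else:
            self.stray_end = True


def looks_like_html(s):
    """Heuristic: some start tag was seen and no end tag was unmatched."""
    checker = _TagChecker()
    checker.feed(s)
    checker.close()
    return checker.seen_tag and not checker.stray_end
```

A diff whose lines happen to contain balanced tags would still pass this check, so it would probably need to be combined with the prefix test rather than replace it.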
--
- mdz