Error parsing non-ascii content

Bug #117799 reported by Tom Haddon
Affects: loggerhead
Status: Fix Released
Importance: High
Assigned to: Robey Pointer

Bug Description

The following URL produces a 500 error on loggerhead:

http://codebrowse.launchpad.net/~ubuntu-mobile/libosso/ubuntu/revision/tfheen%40err.no-20070518073040-txvtf9d5wearg8xo?start_revid=tfheen%40err.no-20070523081627-sjk5afjy2z3khe80

The relevant portions of the log are:

  File "/srv/codebrowse.launchpad.net/turbogears/lib/python2.4/site-packages/kid-0.9.5-py2.4.egg/kid/parser.py", line 432, in feed
    raise expat.ExpatError(e)
ExpatError: Error parsing XML:
<xml>&nbsp;*&nbsp;Contact:&nbsp;Kimmo&nbsp;H?m?l?inen&nbsp;&lt;kimmo.hamalai
...
INF [20070529-21:13:05.666] turbogears.access: 82.211.81.156 - - "GET /%
 <email address hidden>?start_revid=tfheen%40err.no-20070523081627-sjk5afjy2z3khe80 HTTP/1.1" 500 791 "http://codebrowse.launchpad.net/~ubuntu-mobile/libosso/ubuntu/changes" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.3) Gecko/20070515 Ubuntu/7.10 (gutsy) Firefox/2.0.0.3"

Michael Hudson-Doyle (mwhudson) wrote :

I can certainly confirm this.

Changed in loggerhead:
importance: Undecided → High
status: Unconfirmed → Confirmed
Mikkel Høgh (mikl) wrote :
Robey Pointer (robey) wrote :

i made a local mirror of the mikl-danish branch, and that annotate works fine here. maybe removing the XML() stuff in the templates fixed this bug too?

Robey Pointer (robey) wrote :

but i can definitely reproduce the ~ubuntu-mobile branch error. it looks like one of the lines in the file is in Latin-1 instead of UTF-8.

i think my assumption that bazaar stores unicode files was wrong. file contents appear to be stored as binary data, and if that's coincidentally UTF-8, then loggerhead is happy, otherwise things blow up.

i guess we should try some heuristics here. if the file isn't UTF-8, Latin-9 is probably a good fallback.

people who use other encodings like KOI8 are going to be screwed here, and i'm not sure if there's anything we can do to help, short of adding a complex encoding-type guesser or something.
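A minimal sketch of the fallback heuristic described above, assuming file contents arrive as raw bytes. The function name `decode_file_bytes` and its `fallback` parameter are hypothetical and are not taken from the loggerhead source; the sketch is written for Python 3, while the deployment in the traceback ran Python 2.4.

```python
def decode_file_bytes(raw, fallback="iso-8859-15"):
    """Decode raw file bytes to text (hypothetical helper, not loggerhead's code).

    Try strict UTF-8 first. If the bytes are not valid UTF-8, fall back
    to Latin-9 (ISO-8859-15), replacing anything that still cannot be
    decoded so the caller always gets text back instead of an exception.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, "replace")


# A Latin-1/Latin-9 encoded name like the one in the failing file.
latin_line = b" * Contact: Kimmo H\xe4m\xe4l\xe4inen"
print(decode_file_bytes(latin_line))

# Valid UTF-8 input is decoded as UTF-8 and never reaches the fallback.
utf8_line = "Mikkel H\u00f8gh".encode("utf-8")
print(decode_file_bytes(utf8_line))
```

The limitation mentioned above still applies: bytes in another single-byte encoding such as KOI8-R will also decode under the Latin-9 fallback without raising, but they come out as the wrong characters.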

Robey Pointer (robey) wrote :

fixed in 139 on my branch.

Changed in loggerhead:
assignee: nobody → robey
status: Confirmed → Fix Committed
Barry Warsaw (barry) wrote :

In my case, all the files are ascii. They're just Mailman code in POP (Plain Ol' Python :).

Michael Hudson-Doyle (mwhudson) wrote :

Released in 1.2.

Changed in loggerhead:
status: Fix Committed → Fix Released