UnicodeDecodeError while POSTing forms with non-ascii characters.

Bug #44919 reported by Diogo Matsubara
This bug affects 7 people
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone:

Bug Description

I was unable to reproduce this bug manually; however, I wrote a pagetest that crashes exactly like OOPS-134C24.

https://chinstrap.ubuntu.com/~dsilvers/paste/filejU7yVl.html

Things to note:
HTTP_ACCEPT_LANGUAGE zh-cn
There's no HTTP_ACCEPT_CHARSET header.
field.displayname is \xb3\xc2\xbb\xaa\xbe\xfc, which is probably Chinese characters.

Recently: OOPS-980H1350, OOPS-980F724, OOPS-980F729, OOPS-980G753, OOPS-980G2610, OOPS-980B840, OOPS-981A1843, OOPS-981D131, OOPS-981F112, OOPS-1223B591, OOPS-1223C1763, OOPS-1223F7576

Revision history for this message
Stuart Bishop (stub) wrote :

'input == self._missing' is implicitly trying to convert the displayname to Unicode to do the comparison, as self._missing is Unicode. Either the invalid UTF-8 needs to be caught earlier, or the Z3 widget machinery needs to be fixed to cope with this case. I suspect the latter but haven't looked very closely.
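
A minimal sketch of the failure described above, written for Python 3, where the decode has to be spelled out explicitly (the Python 2 comparison performed it implicitly with the ASCII codec). The names `raw_displayname`, `missing`, and `implicit_compare` are illustrative assumptions, not Zope or Launchpad identifiers.

```python
# Bytes taken from the report's field.displayname value.
raw_displayname = b"\xb3\xc2\xbb\xaa\xbe\xfc"

# Stand-in for the widget's Unicode "missing" sentinel (assumption).
missing = ""


def implicit_compare(value, sentinel):
    """Mimic a Python 2 str/unicode comparison: decode as ASCII first."""
    return value.decode("ascii") == sentinel


try:
    implicit_compare(raw_displayname, missing)
    error_message = None
except UnicodeDecodeError as exc:
    # Same shape as the message recorded in the OOPS report.
    error_message = str(exc)

print(error_message)
```

The error comes from the decode step, before any comparison happens, which is why the traceback points at a line that looks like a harmless equality check.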

Changed in launchpad:
status: Unconfirmed → Confirmed
Revision history for this message
Björn Tillenius (bjornt) wrote : Re: Oops report analysis 2006-05-14

On Mon, May 15, 2006 at 07:35:06PM -0300, Diogo Matsubara wrote:
> > 1 UnicodeDecodeError: 'ascii' codec can't decode byte INSTANCE-ID in position 0: ordinal not in range(128)
> > 0% from search bots, 100% referred from local sites
> > 1 https://launchpad.net/token/d6m8h5zs8mN70rfRVZJb/+newaccount
> > OOPS-134C24
>
> Reported, not assigned:
> https://launchpad.net/products/launchpad/+bug/44919
>
> This one puzzled me. I wrote a pagetest that gives me an identical oops, but I
> don't know exactly why it crashes. Also, I couldn't reproduce it manually.
>
> https://chinstrap.ubuntu.com/~dsilvers/paste/filejU7yVl.html
>
> Any ideas?

Hmm, this one is a bit tricky to handle. Normally the form input is
converted to unicode strings, using the encoding that the resulting page
will be encoded with. This is usually utf-8; however, the client's
browser decided that the input should be encoded using utf-16. That
means that Zope can't decode the input to unicode, so it leaves it as a
normal str object, which will cause problems since a lot of code assumes
that it deals with unicode only.

I'm not quite sure how we can handle that smoothly. Maybe we should
modify Zope, so that it always succeeds decoding the string, by using
'ignore', 'replace' or something like that when decoding it, instead of
'strict', which is the default. I have to check first if there's a use
case for leaving an input undecoded, though; it could have something to
do with how file uploads are handled.

I don't think modifying the widget machinery is a feasible solution,
since it's not only the widget machinery that is affected. A lot of
code assumes that it gets a unicode string when it gets some input from
the request.
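
To make the three error handlers mentioned above concrete, here is a small sketch that decodes the bytes from the original report as UTF-8 under each mode. The helper name `decode_with` is an assumption for illustration; this is plain Python 3 `bytes.decode`, not Zope's request-decoding code.

```python
# field.displayname bytes from the original report.
RAW = b"\xb3\xc2\xbb\xaa\xbe\xfc"


def decode_with(raw, errors):
    """Decode raw as UTF-8 with the given error handler; None on failure."""
    try:
        return raw.decode("utf-8", errors=errors)
    except UnicodeDecodeError:
        return None


# 'strict' refuses the input, so the caller is left with undecoded bytes.
strict_result = decode_with(RAW, "strict")

# 'ignore' silently drops every byte that does not form valid UTF-8.
ignore_result = decode_with(RAW, "ignore")

# 'replace' substitutes U+FFFD for each undecodable byte.
replace_result = decode_with(RAW, "replace")

print(repr(strict_result), repr(ignore_result), repr(replace_result))
```

Both lenient modes always succeed, but for this input only the pair `\xc2\xbb` happens to form a valid UTF-8 sequence, so almost none of the original name survives.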

Revision history for this message
Diogo Matsubara (matsubara) wrote : Re: UnicodeDecodeError while registering a new account.

Recent OOPS-137D180
This time it was in the password widget, again with no HTTP_ACCEPT_CHARSET header.

Changed in launchpad:
assignee: nobody → launchpad-infrastructure
Revision history for this message
James Henstridge (jamesh) wrote :

For this sort of problem, I think we should be rejecting the user input with an appropriate error page.

Using 'replace' or 'ignore' mode will result in data loss, which is inconvenient for things like people's names, and might make people's accounts unusable if used on their passwords.

The other half of the problem is to see if we can do anything more to encourage web browsers to send us UTF-8 data.
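
The password concern can be illustrated with a short sketch: under lenient decoding, two different byte strings can collapse into the same text. The byte values and the helper name `lossy_decode` are made up for the example.

```python
def lossy_decode(raw, errors):
    """Decode raw as UTF-8, discarding or replacing undecodable bytes."""
    return raw.decode("utf-8", errors=errors)


# Two distinct hypothetical passwords, both sent in a non-UTF-8 encoding.
password_a = b"pass\xb3\xaa"
password_b = b"pass\xbe\xfc"

# With 'ignore', the non-ASCII tail disappears from both values.
ignored_a = lossy_decode(password_a, "ignore")
ignored_b = lossy_decode(password_b, "ignore")

# With 'replace', both tails become the same run of U+FFFD characters.
replaced_a = lossy_decode(password_a, "replace")
replaced_b = lossy_decode(password_b, "replace")

print(password_a != password_b, ignored_a == ignored_b, replaced_a == replaced_b)
```

If the stored value is derived from the lossy text, the user can no longer be sure which password was actually recorded.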

Christian Reis (kiko)
Changed in launchpad:
assignee: launchpad-infrastructure → nobody
Changed in launchpad:
assignee: nobody → bjornt
Revision history for this message
Diogo Matsubara (matsubara) wrote :

Raising importance since this is happening quite frequently lately. OOPS-317C76

Changed in launchpad:
importance: Medium → High
Revision history for this message
Björn Tillenius (bjornt) wrote :

I've added accept-charset parameters to the forms where OOPSes like this are triggered. That may or may not fix this bug; we'll see whether the OOPSes stop occurring after the fix has been rolled out.

Revision history for this message
Diogo Matsubara (matsubara) wrote :

Changing to fix committed as per Bjorn's comment. I'll re-open if it still occurs after the rollout.

Changed in launchpad:
status: Confirmed → Fix Committed
Revision history for this message
Diogo Matsubara (matsubara) wrote :

Seems to happen with IE 6.0 as per:

OOPS-350A347
OOPS-350C337

Changed in launchpad:
status: Fix Committed → Confirmed
Revision history for this message
Diogo Matsubara (matsubara) wrote :

It was decided in today's meeting that Bjorn would arrange a call with Steve to discuss a proper solution for this bug.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

A recent run of German-language spam caused a spate of these:

OOPS-501B230, OOPS-501B426, OOPS-501C1101, OOPS-501C652, OOPS-501E812, OOPS-501C252, OOPS-501B129, OOPS-501B670, OOPS-501C165, OOPS-501D328, OOPS-501A122, OOPS-501B205, OOPS-501B1220.

Revision history for this message
Diogo Matsubara (matsubara) wrote :

OOPS-874A388 is a recent occurrence.

<matsubara> flacoste: hi, did you file a bug about the badly encoded query string OOPS?
<flacoste> matsubara: no
<flacoste> matsubara: would consider as part of an existing actually, i don't think it's different
<matsubara> flacoste: which one?
<flacoste> bug 44919
<flacoste> it's the fact that when the query isn't encoded properly it's a bytes string instead of unicode
<flacoste> there might be a way to blanket those
<flacoste> into a UFD
<flacoste> wouldn't solve the issues with fields sent using POST though
<flacoste> maybe we could also blanket that at the publication level
<BjornT> flacoste: well, the question is whether UFD is the right thing there. it's the browser that is misbehaving. UFD will be just as bad as an oops for the user. the user won't understand what is wrong, and will think that launchpad is broken.
<flacoste> BjornT: right, it's just as bad for the user, but it won't be part of our OOPS report anymore :-)
<flacoste> which I think is what is annoying kiko
<flacoste> BjornT: a "proper" fix might be to try to parse those strings using the first value in HTTP_ACCEPT_ENCODING
<flacoste> since that's what it seems to be in these cases
<BjornT> flacoste: true. i would use a different exception, though (like BrokenUserAgent)
<flacoste> right
<matsubara> flacoste: what you suggest is like trying to detect what encoding they're sending to us?
<flacoste> matsubara: well not auto-detect, but guess for broken one based on their HTTP_ACCEPT_ENCODING
<flacoste> and only try the first value
<matsubara> flacoste: can't we use chardet for that?
<flacoste> matsubara: what charset are you talking about?
<matsubara> flacoste: http://chardet.feedparser.org/
<BjornT> flacoste: you can't use HTTP_ACCEPT_ENCODING. you have to use HTTP_ACCEPT_CHARSET, which most of these oops seem to lack...
<BjornT> i think replacing undecodable characters with ? would do fine.
<flacoste> BjornT: well, the two OOPSes that kiko pasted do contain a HTTP_ACCEPT_CHARSET header, but the first value is wrong in the first case and would work in the second case, so that's not bullet-proof
<flacoste> chardet might be better
<flacoste> well, using ? causes data lossage on the client
<flacoste> so i say either raise BrokenUserAgent
<flacoste> or try to detect encoding using chardet, although that might also lead to data lossage
<BjornT> flacoste: i think some data lossage is ok. if the user has a broken browser, he's probably used to not being able to use non-ascii characters...
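
One of the options discussed in the log, trying UTF-8 first and then only the first charset the browser declares, could be sketched roughly as follows. `BrokenUserAgent` is the hypothetical exception name floated in the conversation; `first_accept_charset` and `decode_field` are assumed helper names, and none of this is Launchpad code.

```python
class BrokenUserAgent(Exception):
    """Hypothetical error for clients that send undecodable form data."""


def first_accept_charset(header):
    """Return the first charset named in an Accept-Charset value, or None."""
    if not header:
        return None
    # Drop any ";q=..." parameter and surrounding whitespace.
    first = header.split(",")[0].split(";")[0].strip()
    return first or None


def decode_field(raw, accept_charset=None):
    """Decode raw as UTF-8, falling back to the first declared charset."""
    candidates = ["utf-8"]
    declared = first_accept_charset(accept_charset)
    if declared and declared != "*":
        candidates.append(declared)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    raise BrokenUserAgent("form data is not valid in any expected encoding")
```

As noted in the log, this is not bullet-proof: many of the affected requests carry no HTTP_ACCEPT_CHARSET header at all, and a single-byte charset such as ISO-8859-1 decodes any byte string without error, so a wrong first value can still let corrupt text through.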

Revision history for this message
Ursula Junque (ursinha) wrote :

Recently: OOPS-968C1869

Revision history for this message
Ursula Junque (ursinha) wrote :

Two more recent occurrences in translations: OOPS-979B2903 and OOPS-979C3111

Ursula Junque (ursinha)
description: updated
Ursula Junque (ursinha)
description: updated
Revision history for this message
Stuart Bishop (stub) wrote :

We should just raise a UFD here, possibly adding a nice message telling the user their browser is stuffed.

We should not guess, as we want to keep corrupt data out of our system. In some places, attempting to guess or munge the data might be harmless, but in other cases it could have serious implications.
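
A rough sketch of the no-guessing approach, assuming "UFD" refers to Launchpad's UnexpectedFormData error. The class below is a local stand-in and `decode_form` is a hypothetical helper, not the actual publication code.

```python
class UnexpectedFormData(Exception):
    """Local stand-in for the UFD error mentioned above (assumption)."""


def decode_form(raw_form):
    """Strictly decode every field as UTF-8, rejecting the whole form on failure."""
    decoded = {}
    for name, raw in raw_form.items():
        try:
            decoded[name] = raw.decode("utf-8")
        except UnicodeDecodeError:
            # No guessing and no lossy decoding: refuse the submission.
            raise UnexpectedFormData(
                "Field %r is not valid UTF-8; your browser may have sent "
                "the form in a different encoding." % name
            ) from None
    return decoded
```

Rejecting at the request boundary means the code further down only ever sees text, so nothing downstream needs to handle the undecoded case.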

Ursula Junque (ursinha)
description: updated
Changed in launchpad-foundations:
assignee: Björn Tillenius (bjornt) → nobody
Changed in launchpad:
importance: High → Critical
Revision history for this message
Tim Penhey (thumper) wrote :

I'm pretty sure that this particular bug is fixed. Many of the OOPSes attributed to this bug are actually other issues, some with their own bugs. I'm closing this one now.

Changed in launchpad:
status: Triaged → Fix Released