Zope 2

Default IUserPreferredCharsets' use of Zope 2's request problematic

Bug #160968 reported by Daniel Nouri on 2007-11-08

This bug report is a duplicate of: Bug #143873: getPreferredCharsets() returns iso-8859-1 and not utf-8 when HTTP_ACCEPT_CHARSET not present in request.

Affects		Status	Importance	Assigned to	Milestone
	Zope 2	Fix Released	Undecided	Unassigned

Bug Description

The IUserPreferredCharsets implementation of Zope 3 found in zope.publisher.http.HTTPCharsets has the following condition in it to check if the HTTP_ACCEPT_CHARSET header is available:

header_present = 'HTTP_ACCEPT_CHARSET' in self.request

However, Zope 2's request returns '' (the empty string) for any header that starts with 'HTTP_', even when the header was not sent; see ZPublisher.HTTPRequest.HTTPRequest.get. The containment check above is therefore always true.

Ultimately, this causes HTTPCharsets.getPreferredCharsets to return ['iso-8859-1'], where it should really return 'UTF-8'.
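
A minimal sketch of the mismatch described above, written for Python 3. FakeZope2Request is a hypothetical stand-in, not part of Zope: it only mimics the reported behaviour (get returns '' for any missing key that starts with 'HTTP_'), and it assumes the containment check is implemented on top of get.

```python
class FakeZope2Request:
    """Hypothetical stand-in that mimics the behaviour reported above."""

    _marker = object()

    def __init__(self, environ):
        self.environ = dict(environ)

    def get(self, key, default=None):
        if key in self.environ:
            return self.environ[key]
        if key.startswith('HTTP_'):
            # Header was never sent, but an empty string is returned anyway.
            return ''
        return default

    def __contains__(self, key):
        # Assumption: containment is derived from get(), as in this sketch.
        return self.get(key, self._marker) is not self._marker


# A request from a browser that sent no Accept-Charset header at all.
request = FakeZope2Request({'REQUEST_METHOD': 'POST'})

# The check used by HTTPCharsets, as quoted in the bug description.
header_present = 'HTTP_ACCEPT_CHARSET' in request
```

With this stand-in, header_present is true although the header is absent, so the caller cannot distinguish "no header" from "empty header".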

To understand this problem better, look at Products.Five.browser.decode.processInputs, which uses the negotiator to find out which charset to use when converting form variables. For browsers that do not send the 'HTTP_ACCEPT_CHARSET' header, this results in wrongly decoded form values. To reproduce this, enter Chinese characters into any Five formlib form using Internet Explorer 6.0. Since Firefox sends HTTP_ACCEPT_CHARSET, it's not a problem there.
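
The effect on form values can be illustrated with plain Python 3 codecs. This is only a sketch of the decoding step, assuming the browser submitted the form as UTF-8; it does not call processInputs itself.

```python
# Two Chinese characters (U+4E2D U+6587), written with escapes to keep the
# example independent of the source file encoding.
text = '\u4e2d\u6587'

# Assumption: the form was submitted from a UTF-8 page, so these are the
# bytes that arrive on the wire.
wire_bytes = text.encode('utf-8')

# Decoding with the negotiated iso-8859-1 never raises, because every byte
# is a valid Latin-1 character, but the result is mojibake.
wrong = wire_bytes.decode('iso-8859-1')

# Decoding with utf-8 restores the original text.
right = wire_bytes.decode('utf-8')
```

Here wrong has six characters (one per byte) while right has the original two, which is why the failure shows up as garbled form values rather than as an exception.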

Revision history for this message

Shimizukawa (shimizukawa) wrote on 2008-06-15:

Use default-zpublisher-encoding if HTTP_ACCEPT_CHARSET was not provided. (8.0 KiB, text/plain)

This problem came to light with Plone 3.

zope.formlib needs field values decoded to unicode, and Products.Five.browser.decode.processInputs provides a request.form converted to unicode. The charset is provided by IUserPreferredCharsets.getPreferredCharsets(), which decides on it from HTTP_ACCEPT_CHARSET. If the client browser (IE 6, IE 7, Safari) does not send HTTP_ACCEPT_CHARSET, getPreferredCharsets() returns iso-8859-1.

I think the default-zpublisher-encoding value should be used if HTTP_ACCEPT_CHARSET was not provided.

references::

- https://bugs.launchpad.net/zope2/+bug/143873
- http://dev.plone.org/plone/ticket/8185

Malthe Borch (mborch) wrote on 2008-06-16:

FWIW, bug 143873 (referenced) was fixed in r84616; however, this seems not to have propagated to a Zope 2 release (at least not 2.10.x).

Ole Christian Helset (ochelset) wrote on 2010-02-25:

Using Zope 2.11.5 with default-zpublisher-encoding set to utf-8, rendering content that contains a utf-8 string fails in IE and Safari, as they (at the time of writing) don't provide the Accept-Charset header.

In http.py (zope/publisher/http.py), the HTTPCharsets.getPreferredCharsets() method returns an empty list, causing a UnicodeDecodeError in Zope when a tal:content string contains a utf-8 encoded string with, for instance, Norwegian characters (ø > \xc3\xb8).

I made a simple test, just a default page template, giving it a title with such a character (e.g. Pølse):
<html>
 <head>
 <meta http-equiv="content-type" content="text/html;charset=utf-8">
 </head>
 <body>
 <tal:block content="python:repr(template.title)" /> 
 <tal:block content="python:repr(template.title.encode('latin-1'))" /> 
 <tal:block content="python:repr(template.title.encode('utf-8'))" /> 
 <tal:block content="python:title" define="title python:template.title" /> 
 <tal:block content="python:title" define="title python:template.title.encode('utf-8')" /> 
 </body>
</html>

In Firefox the output is fine:
u'P\xf8lse'
'P\xf8lse'
'P\xc3\xb8lse'
Pølse
Pølse

In IE and Safari it raises a UnicodeDecodeError.

If HTTPCharsets.getPreferredCharsets() returns ['utf-8'], it works fine in IE and Safari as well.
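
The byte values and the failure mode can be reproduced in isolation. The sketch below is Python 3, and decode_with is a hypothetical helper, not Zope code: it assumes that with no usable charset the decoding falls back to ASCII, which is how Python 2's implicit str-to-unicode coercion behaves.

```python
title = 'P\xf8lse'  # the template title, i.e. u'P\xf8lse' in Python 2 terms

latin1_bytes = title.encode('latin-1')  # b'P\xf8lse'
utf8_bytes = title.encode('utf-8')      # b'P\xc3\xb8lse'


def decode_with(charsets, data, last_resort='ascii'):
    """Hypothetical helper: try each charset, then fall back to last_resort."""
    for charset in charsets:
        try:
            return data.decode(charset)
        except UnicodeDecodeError:
            continue
    # Assumption: ASCII is the last resort when the list is empty or exhausted.
    return data.decode(last_resort)


# With ['utf-8'] the utf-8 encoded title decodes cleanly.
decoded = decode_with(['utf-8'], utf8_bytes)

# With an empty list the ASCII fallback cannot handle byte 0xc3.
try:
    decode_with([], utf8_bytes)
    empty_list_failed = False
except UnicodeDecodeError:
    empty_list_failed = True
```

Under these assumptions, an empty charset list is enough to trigger the UnicodeDecodeError, and a list containing 'utf-8' is enough to avoid it.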

My changes to http.py:
from zope.publisher.base import RequestDataGetter
+from ZPublisher import Converters

...

        # Quoting RFC 2616, $14.2: If no "*" is present in an Accept-Charset
        # field, then all character sets not explicitly mentioned get a
        # quality value of 0, except for ISO-8859-1, which gets a quality
        # value of 1 if not explicitly mentioned.
        # And quoting RFC 2616, $14.2: "If no Accept-Charset header is
        # present, the default is that any character set is acceptable."
        if not sawstar and not sawiso88591 and header_present:
-            charsets.append((1.0, 'iso-8859-1'))
+            charsets.append((1.0, Converters.default_encoding))
        # UTF-8 is **always** preferred over anything else.
        # Reason: UTF-8 is not specific and can encode the entire unicode
        # range , unlike many other encodings. Since Zope can easily use very
        # different ranges, like providing a French-Chinese dictionary, it is
        # always good to use UTF-8.
        charsets.sort(sort_charsets)
        charsets = [charset for quality, charset in charsets]
-        if sawstar and 'utf-8' not in charsets:
+        if not sawstar and 'utf-8' not in charsets: # IS THIS BAD, TO FORCE IN UTF-8???
            charsets.insert(0, 'utf-8')

The question, then, is whether it is a problem to force utf-8 here (or the default-zpublisher-encoding) when HTTP_ACCEPT_CHARSET is missing from the request.
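
To make the effect of the two changed lines easier to reason about, here is a simplified Python 3 model of the quoted logic. It is not the real zope.publisher code: preferred_charsets, its header parser, the patched flag and the plain quality sort (standing in for sort_charsets, whose definition is not quoted) are all assumptions made for illustration. Only the two conditions from the diff are modelled as written.

```python
def preferred_charsets(accept_charset, default_encoding='utf-8', patched=False):
    """Simplified model of the quoted logic (hypothetical, not Zope code).

    accept_charset is the raw header value, or None if the key is absent.
    """
    header_present = accept_charset is not None
    sawstar = sawiso88591 = False
    charsets = []
    for item in (accept_charset or '').split(','):
        name, _, param = item.strip().lower().partition(';')
        name = name.strip()
        if not name:
            continue
        quality = 1.0
        param = param.strip()
        if param.startswith('q='):
            try:
                quality = float(param[2:])
            except ValueError:
                continue
        if quality == 0.0:
            continue
        if name == '*':
            sawstar = True
            continue
        if name == 'iso-8859-1':
            sawiso88591 = True
        charsets.append((quality, name))

    # First changed line of the diff: which charset is appended.
    if not sawstar and not sawiso88591 and header_present:
        charsets.append((1.0, default_encoding if patched else 'iso-8859-1'))

    # Assumption: a stable sort by descending quality replaces sort_charsets.
    charsets.sort(key=lambda pair: -pair[0])
    names = [name for _, name in charsets]

    # Second changed line of the diff: when utf-8 is forced to the front.
    force_utf8 = (not sawstar) if patched else sawstar
    if force_utf8 and 'utf-8' not in names:
        names.insert(0, 'utf-8')
    return names


empty_header = preferred_charsets('')                    # Zope 2 style ''
missing_header = preferred_charsets(None)                # key truly absent
empty_patched = preferred_charsets('', patched=True)
missing_patched = preferred_charsets(None, patched=True)
star_patched = preferred_charsets('*', patched=True)
```

In this model the unpatched logic yields ['iso-8859-1'] for an empty header and [] for a missing one, matching the two symptoms reported in this thread, while the patched logic yields ['utf-8'] in both cases. Note that with the inverted sawstar test the model no longer inserts utf-8 for a header of '*'; whether that matters in the real method depends on code that is not quoted here.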

Tres Seaver (tseaver) wrote on 2010-05-08:

AFAICT, this bug is a duplicate of lp:143873, for which we have long since released fixed versions.