Products.PageTemplates.unicodeconflictresolver: doubtful determination of server encoding

Reported by Dieter Maurer on 2008-08-04
Affects: Zope 2
Importance: Wishlist
Assigned to: Unassigned

Bug Description

The unicode resolver registered by default, "Products.PageTemplates.unicodeconflictresolver.PreferredCharsetResolver",
looks at the request and determines from its preferred encodings the encoding used by the server.

This is very doubtful as the server probably will never base its encoding on the preferences of its (varying) clients.

I suggest adding a new configuration option that tells Zope which encoding is typically used by the server. In principle, the Python default encoding could be used for this, but apparently the Python community does not like to use this default encoding. That calls for a new option.

Andreas Jung (ajung) wrote :

As discussed internally, it is legitimate to take the preferred charsets into account. As said, there is no official API for determining the internal encoding of the backend application. This is better than ignoring it. The only solution would be to introduce an official API in the form of a utility that could be registered with a site manager and that returns the backend encoding.

Changed in zope2:
importance: Undecided → Wishlist
Dieter Maurer (d.maurer) wrote :

> As discussed internally, it is legitimate to take the preferred charsets into account.

I do not know where you get your meaning of "legitimate" from, but looking at the
client's (!) charset preferences to determine the server's (!) charset seems unreasonable
(to speak carefully).

Andreas Jung (ajung) wrote :

As written earlier: this heuristic works in reality. For the problem you had in your application: fix the handling of different encodings. If necessary, register your own resolver that looks up the backend encoding in some way. As said: there is no official API for doing this right now and therefore the resolver cannot be smarter.
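A minimal sketch of what such a custom resolver might look like, assuming the backend encoding is known and passed in by hand. The class name `FixedEncodingResolver` is hypothetical, the `resolve(context, text, expression)` signature is an assumption modelled on the resolver utility discussed in this report, and registering the object with Zope is left out entirely:

```python
class FixedEncodingResolver:
    """Hypothetical resolver that decodes with one configured encoding.

    It never consults the request, so the result is the same for
    every client.
    """

    def __init__(self, encoding):
        self.encoding = encoding

    def resolve(self, context, text, expression):
        # Text that is already unicode needs no conversion.
        if isinstance(text, str):
            return text
        # Fail loudly with UnicodeDecodeError instead of guessing.
        return text.decode(self.encoding)


resolver = FixedEncodingResolver("utf-8")
decoded = resolver.resolve(None, b"M\xc3\xbcller", "")
```

Unlike a try-each-candidate approach, this variant raises on bytes that do not match the configured encoding, which makes encoding mistakes visible rather than silently producing wrong characters.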

Changed in zope2:
status: New → Won't Fix
Dieter Maurer (d.maurer) wrote :

You will see how the so-called "heuristic" will fail in reality (our reality will just be one case).
I will be happy to point each such report to this mishandled bug report....

What the heuristic really does: it hides exceptions and causes wrong conversions instead -- as it is nonsensical to draw conclusions about the server side encoding from the client's charset preferences.

Andreas Jung (ajung) wrote :

A heuristic that makes legacy or badly written code work with the new ZPT implementation is much better than leaving it standing in the rain. A heuristic that works for most of the cases is better than having nothing. And for the remaining part where the heuristic fails: you know the options for doing something better (based on knowledge that the heuristic can't have).

Dieter Maurer (d.maurer) wrote :

Here is a detailed critique of this so-called "heuristic".

From Zope 2.10 on, Zope uses the unicode based page template implementation
of Zope 3. While Zope 3 consistently uses unicode to represent text,
this is not the case for Zope 2. As a result, Zope 2 page templates
have a high probability of having to handle text represented by the Python
"str" datatype in some encoding.
It uses a (mis-named) so called "IUnicodeEncodingConflictResolver"
utility to handle conversion of such text into unicode.

By default, a "PreferredCharsetResolver" is registered as this utility.
"PreferredCharsetResolver" uses a completely insane approach to
determine the unknown encoding used by the server: it looks
at the current request and its "Accept-Charset" header to determine
a list of candidate encodings -- and uses the first one that does
not cause a "UnicodeDecodeError".
This is insane for the following reasons:

  * The server side used encoding is (almost) completely unrelated
     to the client's accepted charsets.

     The approach therefore uses an arbitrary set of encodings
     to guess the encoding used on the server side.

  * 8-bit encodings can decode almost any "str" string -- but
     the probability that they do the right thing is small
     unless it is the true encoding used by the string at hand.

     Checking for the absence of "UnicodeDecodeError"s
     is thus an extremely weak check -- even an impractical check

  * As different clients have different "Accept-Charset",
     the decoding behaviour becomes apparently non-deterministic:
     some clients will get the correct one, others will see
     wrong characters -- an analysis nightmare, unless one knows
     about this insanity

  * The broken implementation requires access to "REQUEST" via
     acquisition. As a consequence, it does not work in
     situations without "REQUEST" -- e.g. in scripting situations.

  * "five.localsitemanager" tries hard to remove "REQUEST" from
     the acquisition context for local utilities looked up by its
     registries. The broken implementation therefore can fail
     on these utilities and on objects wrapped in their acquisition
     context.
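The first three points can be illustrated with a small sketch. It is not the code of "PreferredCharsetResolver"; it only mimics the described behaviour (try candidate encodings in order, keep the first that raises no "UnicodeDecodeError"). The function name `decode_with_candidates` and the two client preference lists are made up for illustration, and real "Accept-Charset" parsing with quality values is not modelled:

```python
def decode_with_candidates(raw, candidates):
    """Decode raw bytes with the first candidate that raises no error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Undecodable bytes or unknown codec name: try the next one.
            continue
    raise ValueError("no candidate encoding could decode the data")


# The server stores this text as UTF-8 bytes.
server_text = "M\u00fcller".encode("utf-8")

# Two hypothetical clients with different charset preferences.
client_a = ["utf-8", "iso-8859-1"]
client_b = ["iso-8859-1", "utf-8"]

result_a = decode_with_candidates(server_text, client_a)
result_b = decode_with_candidates(server_text, client_b)

print(repr(result_a))  # correct text
print(repr(result_b))  # mojibake, although no exception was raised
```

Because "iso-8859-1" assigns a character to every byte value, the second client never triggers an exception, yet it gets two wrong characters in place of the "ü". The same server data therefore renders differently depending only on the order of the client's preferences.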

Andreas Jung (ajung) wrote :

"""
 The broken implementation requires access to "REQUEST" via
    acquisition. As a consequence, it does not work in
    situations without "REQUEST" -- e.g. in scripting situations.

 * "five.localsitemanager" tries hard to remove "REQUEST" from
    the acquisition context for local utilities looked up by its
    registries. The broken implementation therefore can fail
    on these utilities and on objects wrapped in their acquisition
    context.
"""
The dependency on REQUEST has been removed on the trunk (and you got a patch
for internal usage)... so there is no further reason to complain again.

The other points remain uncommented because they were discussed and commented already
earlier.
