BrowserRequest and HTTPRequest contain a mixture of str and unicode strings

Bug #98374 reported by Björn Tillenius
Affects         Status   Importance  Assigned to  Milestone
zope.publisher  Invalid  Medium      Unassigned

Bug Description

The environment in the request contains a mixture of str and unicode strings because some HTTP headers are explicitly not converted to unicode, and sometimes form values can't be decoded with utf-8, so they are left as they are (as a str object).

This causes subtle problems here and there: since Zope3 is said to use unicode internally, no one checks whether a given string is a unicode or a str object. Not checking is IMHO the right thing to do, since otherwise we would have to add tons of checks everywhere.

One example of this causing a problem is when you want to convert the request to a string. It simply joins the environment strings together, and if any of the str strings contains non-ascii characters it will break, since that str can't be converted to unicode.
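
The failure can be reproduced outside Zope. The sketch below is not the zope.publisher code. It is written for Python 3, where bytes stands in for the Python 2 str type and str stands in for unicode, and the hypothetical helper py2_style_join imitates the implicit ASCII decoding that Python 2 performs when the two types are mixed. The environment values are made up.

```python
def py2_style_join(parts, sep="\n"):
    """Join text and byte strings the way Python 2 would.

    Hypothetical helper: Python 2 silently decodes byte strings as ASCII
    when they are combined with unicode, so the same step is done here
    explicitly.
    """
    coerced = []
    for part in parts:
        if isinstance(part, bytes):
            part = part.decode("ascii")  # raises on any byte >= 0x80
        coerced.append(part)
    return sep.join(coerced)


# Made-up environment values: one text string, one undecoded byte string.
environ_values = ["HTTP_HOST: example.org", "name: Bj\u00f6rn".encode("utf-8")]

try:
    py2_style_join(environ_values)
    join_failed = False
except UnicodeDecodeError:
    # The non-ASCII bytes in the second value cannot be decoded as ASCII.
    join_failed = True
```

As long as every byte string is pure ASCII the join succeeds, which is why the problem only shows up occasionally.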

My suggestion is that we always keep the request environment variables as unicode strings. If some header or form value can't be decoded with the default strategy, we should use iso-8859-1, which is the standard encoding if no encoding is given.
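
A minimal sketch of that fallback, assuming UTF-8 is the default strategy. The function name decode_environ_value is hypothetical and not part of zope.publisher; the code is Python 3, with bytes playing the role of the Python 2 str type.

```python
def decode_environ_value(raw, encodings=("utf-8", "iso-8859-1")):
    """Return a text string for a raw environment value.

    Hypothetical helper: each encoding is tried in order. Because
    ISO-8859-1 maps every byte to a code point, the last attempt in the
    default tuple cannot fail.
    """
    if isinstance(raw, str):
        return raw  # already text, nothing to do
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("value could not be decoded with any given encoding")


utf8_value = decode_environ_value("Bj\u00f6rn".encode("utf-8"))
latin1_value = decode_environ_value(b"Bj\xf6rn")  # not valid UTF-8
```

One consequence of this design is that the fallback never raises, so bytes in some third encoding would be turned into wrong characters silently rather than reported.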

Tags: core issue
Revision history for this message
Steve Alexander (stevea) wrote :

I am curious; in which RFC is it written that iso-8859-1 is a standard encoding for HTTP headers?

The solution you propose is an improvement on the current situation.

I'd like to propose something a little different: how about converting HTTP headers to unicode (perhaps using an ASCII codec and replacing unknown characters with '?', or perhaps using iso-8859-1), while also keeping the raw, undecoded HTTP headers accessible through an API on the request?

That way, we can add handlers to the request processing that will custom convert particular HTTP headers.
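
A rough sketch of what such an API could look like. Everything here is hypothetical: the class HeaderView and its methods get, get_raw and register_converter do not exist in zope.publisher, and the header name is made up. The code is Python 3, with bytes standing in for the Python 2 str type.

```python
class HeaderView:
    """Hypothetical view over raw request headers (name -> bytes)."""

    def __init__(self, raw_headers):
        # Header names are case-insensitive, so normalise them once.
        self._raw = {name.lower(): value for name, value in raw_headers.items()}
        self._converters = {}

    def register_converter(self, name, converter):
        """Install a custom bytes -> text converter for one header."""
        self._converters[name.lower()] = converter

    def get_raw(self, name):
        """Return the undecoded bytes exactly as received."""
        return self._raw[name.lower()]

    def get(self, name):
        """Return the header as text."""
        raw = self.get_raw(name)
        converter = self._converters.get(name.lower())
        if converter is not None:
            return converter(raw)
        # Default: ASCII codec, with '?' in place of undecodable bytes.
        return raw.decode("ascii", errors="replace").replace("\ufffd", "?")


headers = HeaderView({"X-Author": b"Bj\xf6rn"})
default_text = headers.get("X-Author")
raw_bytes = headers.get_raw("X-Author")

# A handler that knows this particular header is ISO-8859-1.
headers.register_converter("X-Author", lambda raw: raw.decode("iso-8859-1"))
converted_text = headers.get("X-Author")
```

The raw bytes are never modified, so a converter registered later can still recover the original value.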

Stephan Richter (srichter) wrote :

Changes: submitter email, edited transcript, importance (medium => critical)

Jim Fulton (jim-zope) wrote :

First, a distinction needs to be made between form variables and HTTP headers. They are different beasts.

I would not want to see HTTP headers decoded unless an RFC could be cited in support, in which case I want a code comment to that effect.

WRT form variables, I want to see a proposal before we do anything.

There may even be outstanding proposals on this topic. An idea
that was floated at one point was to include form encodings
in hidden variables. That is, when we generate a form, we should
record the encoding used in a hidden variable.
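
To make the idea concrete, here is a sketch of how a hidden encoding variable could be consumed on the server side. It is only an illustration: the field name form_encoding and both function names are hypothetical, and no such mechanism is confirmed to exist in zope.publisher. The code is Python 3 and uses only urllib.parse.unquote_to_bytes from the standard library.

```python
from urllib.parse import unquote_to_bytes

ENCODING_FIELD = b"form_encoding"  # hypothetical hidden field name


def parse_raw_pairs(query):
    """Split a URL-encoded query string into (name, value) byte pairs."""
    pairs = []
    for item in query.split("&"):
        if not item:
            continue
        name, _, value = item.partition("=")
        pairs.append((unquote_to_bytes(name.replace("+", " ")),
                      unquote_to_bytes(value.replace("+", " "))))
    return pairs


def decode_form(query, default_encoding="utf-8"):
    """Decode form values using the encoding named in the hidden field."""
    raw_pairs = parse_raw_pairs(query)
    encoding = default_encoding
    for name, value in raw_pairs:
        if name == ENCODING_FIELD:
            encoding = value.decode("ascii")
    return {name.decode(encoding): value.decode(encoding)
            for name, value in raw_pairs if name != ENCODING_FIELD}


latin1_form = decode_form("form_encoding=iso-8859-1&name=Bj%F6rn")
utf8_form = decode_form("name=Bj%C3%B6rn")
```

The hidden field is supplied by the client, so a real implementation would need to validate the encoding name; this sketch lets an unknown name raise LookupError.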

In any case, this would be too large a change for 3.3.

Jim Fulton (jim-zope) wrote :

Changes: edited transcript, importance (critical => medium)

Florent Guillaume (efge) wrote :

HTTP 1.1 specifies ISO-8859-1 for all headers, with an escaping mechanism defined in RFC 2047 (commonly seen in the Subject header of mails).

This should all be decoded to unicode.

Tres Seaver (tseaver) wrote :

RFC 2616[1] says about message headers:

  HTTP header fields, which include general-header (section 4.5),
  request-header (section 5.3), response-header (section 6.2), and
  entity-header (section 7.1) fields, follow the same generic
  format as that given in Section 3.1 of RFC 822.

RFC 822 headers must be ASCII[2].

Is there a revised HTTP 1.1 spec which allows non-ASCII headers?

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2

[2] http://www.ietf.org/rfc/rfc822.txt ; see section 3.1.1.

Florent Guillaume (efge) wrote :

RFC2616 says in section 4.2 (Message headers):

       message-header = field-name ":" [ field-value ]
       field-name = token
       field-value = *( field-content | LWS )
       field-content = <the OCTETs making up the field-value
                        and consisting of either *TEXT or combinations
                        of token, separators, and quoted-string>

In section 2.2 it defines the BNF used and says:

   The TEXT rule is only used for descriptive field contents and values
   that are not intended to be interpreted by the message parser. Words
   of *TEXT MAY contain characters from character sets other than ISO-
   8859-1 [22] only when encoded according to the rules of RFC 2047
   [14].

       TEXT = <any OCTET except CTLs,
                        but including LWS>
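
Under that reading of the TEXT rule, a descriptive field value could be decoded roughly as follows. This is a sketch rather than zope.publisher code: the function name decode_text_field is hypothetical, and it assumes the raw octets have already been turned into a Python 3 str by decoding them as ISO-8859-1. The RFC 2047 parsing is done by email.header.decode_header from the standard library.

```python
from email.header import decode_header


def decode_text_field(value):
    """Decode RFC 2047 encoded words in a header field value.

    Hypothetical helper. decode_header returns (part, charset) pairs;
    parts without an encoded word carry no charset and are treated as
    ISO-8859-1 here.
    """
    pieces = []
    for part, charset in decode_header(value):
        if isinstance(part, bytes):
            part = part.decode(charset or "iso-8859-1")
        pieces.append(part)
    return "".join(pieces)


encoded_word = decode_text_field("=?utf-8?q?Bj=C3=B6rn?=")
plain_value = decode_text_field("text/html")
```

Values without encoded words pass through unchanged, so the helper is safe to apply to fields that only ever contain tokens.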

Changed in zope3:
status: New → Confirmed
Tres Seaver (tseaver)
affects: zope3 → zope.publisher
Revision history for this message
Colin Watson (cjwatson) wrote :

The zope.publisher project on Launchpad has been archived at the request of the Zope developers (see https://answers.launchpad.net/launchpad/+question/683589 and https://answers.launchpad.net/launchpad/+question/685285). If this bug is still relevant, please refile it at https://github.com/zopefoundation/zope.publisher.

Changed in zope.publisher:
status: Confirmed → Invalid