zope.publisher

BrowserRequest and HTTPRequest contain a mixture of str and unicode strings

Bug #98374 reported by Björn Tillenius on 2006-06-14

Affects          Status    Importance  Assigned to  Milestone
zope.publisher   Invalid   Medium      Unassigned

Bug Description

The environment in the request contains a mixture of str and unicode strings because some HTTP headers are explicitly not converted to unicode, and sometimes form values can't be decoded with utf-8, so they are left as they are (as a str object).

This causes subtle problems here and there. Because Zope3 is said to use unicode internally, no one bothers to check whether a string is a unicode or a str string. This is IMHO the right thing to do, since otherwise we would have to add tons of checks everywhere.

One example of this causing a problem is converting the request to a string. The conversion simply joins the environment strings together, and if any of the str strings contains non-ASCII characters it breaks, since that string can't be implicitly converted to unicode.

My suggestion is that we always keep the request environment variables as unicode strings. If some header or form value can't be decoded with the default strategy, we should use iso-8859-1, which is the standard encoding if no encoding is given.
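The fallback suggested above could look roughly like the following. This is only a sketch, not zope.publisher code: the function name `decode_with_fallback` is hypothetical, and it is written for Python 3, where the str/unicode split discussed here corresponds to bytes/str.

```python
def decode_with_fallback(raw: bytes, default: str = "utf-8") -> str:
    """Decode a raw request value, falling back to iso-8859-1.

    Hypothetical helper illustrating the suggestion in this report.
    """
    try:
        # Default strategy: try the default encoding first.
        return raw.decode(default)
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1")


utf8_value = decode_with_fallback(b"caf\xc3\xa9")   # valid UTF-8
latin1_value = decode_with_fallback(b"caf\xe9")     # not valid UTF-8
```

Because iso-8859-1 never raises, every value ends up as a text string, at the cost of possibly mis-decoding bytes that were sent in some third encoding.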


Steve Alexander (stevea) wrote on 2006-06-15:

I am curious; in which RFC is it written that iso-8859-1 is a standard encoding for HTTP headers?

The solution you propose is an improvement on the current situation.

I'd like to propose something a little different: how about converting HTTP headers to unicode (perhaps using an ASCII codec and replacing unknown characters with '?', perhaps using iso-8859-1), but also keeping the unencoded HTTP headers accessible through an API on the request.

That way, we can add handlers to the request processing that will custom convert particular HTTP headers.
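A minimal sketch of that idea, assuming Python 3 and a hypothetical `DecodedHeaders` class (none of these names exist in zope.publisher): decoded values are served by default, the raw bytes stay reachable, and per-header handlers can override the conversion.

```python
class DecodedHeaders:
    """Hypothetical wrapper: decoded headers plus access to the raw bytes."""

    def __init__(self, raw_headers, handlers=None):
        self._raw = dict(raw_headers)            # header name -> raw bytes
        self._handlers = dict(handlers or {})    # header name -> callable(bytes) -> str

    def get_raw(self, name):
        # The undecoded value, exactly as received.
        return self._raw[name]

    def get(self, name):
        raw = self._raw[name]
        handler = self._handlers.get(name)
        if handler is not None:
            # Custom conversion registered for this particular header.
            return handler(raw)
        # Default: ASCII with replacement. Note that Python substitutes
        # U+FFFD for undecodable bytes here, not a literal '?'.
        return raw.decode("ascii", errors="replace")


headers = DecodedHeaders(
    {"X-Name": b"caf\xe9", "X-Other": b"caf\xe9"},
    handlers={"X-Name": lambda raw: raw.decode("iso-8859-1")},
)
```

Here `headers.get("X-Name")` goes through the custom handler, while `headers.get("X-Other")` uses the lossy default and `headers.get_raw("X-Other")` still returns the original bytes.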

Stephan Richter (srichter) wrote on 2006-06-16:

Changes: submitter email, edited transcript, importance (medium => critical)

Jim Fulton (jim-zope) wrote on 2006-06-16:

First, a distinction needs to be made between form variables and HTTP headers. They are different beasts.

I would not want to see HTTP headers decoded unless an RFC could be cited in support, in which case I want a code comment to that effect.

WRT form variables, I want to see a proposal before we do anything.

There may even be outstanding proposals on this topic. An idea
that was floated at one point was to include form encodings
in hidden variables. That is, when we generate a form, we should
record the encoding used in a hidden variable.
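As a rough illustration of the hidden-variable idea, assuming Python 3: the field name `_form_encoding` and the function `decode_form` are made up for this sketch and do not correspond to any existing proposal or API.

```python
import codecs

ENCODING_FIELD = "_form_encoding"  # hypothetical hidden field name


def decode_form(raw_fields, default="utf-8"):
    """Decode submitted form values using the encoding recorded in the form.

    raw_fields maps field names to raw bytes as submitted.
    """
    declared = raw_fields.get(ENCODING_FIELD, b"").decode("ascii", errors="ignore").strip()
    encoding = default
    if declared:
        try:
            codecs.lookup(declared)   # accept only encodings Python knows
            encoding = declared
        except LookupError:
            pass                      # unknown name: keep the default
    return {
        name: value.decode(encoding)
        for name, value in raw_fields.items()
        if name != ENCODING_FIELD
    }


decoded = decode_form({"_form_encoding": b"iso-8859-1", "name": b"Bj\xf6rn"})
```

A form generated with `<input type="hidden" name="_form_encoding" value="iso-8859-1">` would then be decoded with the encoding it was rendered in, rather than by guessing.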

In any case, this would be too large a change for 3.3.

Jim Fulton (jim-zope) wrote on 2006-06-16:

Changes: edited transcript, importance (critical => medium)

Florent Guillaume (efge) wrote on 2006-06-16:

HTTP 1.1 specifies ISO-8859-1 for all headers, with an escaping mechanism defined in RFC 2047 (commonly seen in the Subject header of emails).

This should all be decoded to unicode.

Tres Seaver (tseaver) wrote on 2006-06-16:

RFC 2616[1] says about message headers:

  HTTP header fields, which include general-header (section 4.5),
  request-header (section 5.3), response-header (section 6.2), and
  entity-header (section 7.1) fields, follow the same generic
  format as that given in Section 3.1 of RFC 822.

RFC 822 headers must be ASCII[2].

Is there a revised HTTP 1.1 spec which allows non-ASCII headers?

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2

[2] http://www.ietf.org/rfc/rfc822.txt ; see section 3.1.1.

Florent Guillaume (efge) wrote on 2006-06-16:

RFC2616 says in section 4.2 (Message headers):

       message-header = field-name ":" [ field-value ]
       field-name = token
       field-value = *( field-content | LWS )
       field-content = <the OCTETs making up the field-value
                        and consisting of either *TEXT or combinations
                        of token, separators, and quoted-string>

In section 2.2 it defines the BNF used and says:

   The TEXT rule is only used for descriptive field contents and values
   that are not intended to be interpreted by the message parser. Words
   of *TEXT MAY contain characters from character sets other than ISO-
   8859-1 [22] only when encoded according to the rules of RFC 2047
   [14].

       TEXT = <any OCTET except CTLs,
               but including LWS>
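For illustration, RFC 2047 encoded words in a TEXT field value can be decoded with Python's standard `email.header.decode_header`. This is a sketch under assumptions: Python 3, a hypothetical helper name `decode_text_field`, and a header value that has already been read as ISO-8859-1 text.

```python
from email.header import decode_header


def decode_text_field(value: str) -> str:
    """Decode RFC 2047 encoded words in a header field value (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words carry their own charset; anything without one
            # is treated as ISO-8859-1, the default the RFC text names.
            parts.append(chunk.decode(charset or "iso-8859-1"))
        else:
            # decode_header returns plain str when no encoded word is present.
            parts.append(chunk)
    return "".join(parts)


decoded_word = decode_text_field("=?utf-8?q?caf=C3=A9?=")
plain_value = decode_text_field("plain value")
```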

Christian Theune (ctheune) on 2007-08-12

Changed in zope3:
status:	New → Confirmed

Tres Seaver (tseaver) on 2010-04-12

affects:

zope3 → zope.publisher

Colin Watson (cjwatson) wrote on 2019-10-23:

The zope.publisher project on Launchpad has been archived at the request of the Zope developers (see https://answers.launchpad.net/launchpad/+question/683589 and https://answers.launchpad.net/launchpad/+question/685285). If this bug is still relevant, please refile it at https://github.com/zopefoundation/zope.publisher.

Changed in zope.publisher:
status:	Confirmed → Invalid
