Apache2 defaults to the wrong character set, it should be UTF-8

Bug #1258546 reported by Lars Noodén
This bug affects 1 person
Affects: apache2 (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Apache2 mistakenly defaults to windows-1252 instead of UTF-8. The rest of the system now uses UTF-8, or at worst ISO-8859. Apache2 should default to a standard character set such as UTF-8, which is what the rest of the system uses.

$ set | grep -i utf
LANG=en_US.UTF-8

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: apache2 2.4.6-2ubuntu4
ProcVersionSignature: Ubuntu 3.12.0-5.13-generic 3.12.2
Uname: Linux 3.12.0-5-generic x86_64
Apache2ConfdDirListing: False
ApportVersion: 2.12.7-0ubuntu1
Architecture: amd64
CurrentDesktop: LXDE
Date: Fri Dec 6 16:59:29 2013
InstallationDate: Installed on 2013-11-19 (17 days ago)
InstallationMedia: Lubuntu 14.04 "Trusty Tahr" - Alpha amd64+mac (20131118)
SourcePackage: apache2
UpgradeStatus: No upgrade log present (probably fresh install)
error.log: Error: [Errno 13] Permission denied: '/var/log/apache2/error.log'
modified.conffile..etc.apache2.mods.available.mime.conf: [modified]
modified.conffile..etc.apache2.sites.available.000.default.conf: [modified]
mtime.conffile..etc.apache2.mods.available.mime.conf: 2013-12-06T14:35:32.967408
mtime.conffile..etc.apache2.sites.available.000.default.conf: 2013-12-06T14:40:35.305416

Robie Basak (racb) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better.

I have checked both Precise and Trusty, and can find no "windows-1252" default that you refer to. I used "wget -S" to see the headers returned by the Apache server, and it did not specify a character set.

Could you please provide exact steps to reproduce this bug on a freshly installed Ubuntu Server system, and detail what you are expecting and what happens on your system instead? Please use "wget -S" or similar to demonstrate that the problem is actually with Apache, and not with a client, or the web page source or similar.

Without a test case, there isn't enough information here for a developer to confirm this issue is a bug, or to begin working on it, so I am marking this bug Incomplete for now.

If you can provide exact steps so that a developer can reproduce the original problem, then please add them to this bug and change the status back to New.

Changed in apache2 (Ubuntu):
status: New → Incomplete
Lars Noodén (larsnooden) wrote :

I can do the Server system, too, but right now the steps I have followed to get the problem are:

1. install Ubuntu 12.04 desktop, or Lubuntu 14.04devel desktop (it occurs on both)
2. install Apache2, leaving default configuration settings
3. load an html page from the server in a browser (in 12.04 or 14.04devel)
4. check page info regarding Encoding

Adding AddDefaultCharset utf-8 to the configuration file makes the problem go away.
But could this be a problem with the browser anyway?

$ wget -S http://xx.yy.zz.aa
--2013-12-09 14:38:34-- http://xx.yy.zz.aa/
Connecting to xx.yy.zz.aa:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Mon, 09 Dec 2013 12:38:34 GMT
  Server: Apache/2.2.22 (Ubuntu)
  Last-Modified: Sat, 07 Dec 2013 14:39:28 GMT
  ETag: "222742-b1-4ecf2bae66f2c"
  Accept-Ranges: bytes
  Content-Length: 177
  Vary: Accept-Encoding
  Keep-Alive: timeout=5, max=100
  Connection: Keep-Alive
  Content-Type: text/html
Length: 177 [text/html]

$ wget -S http://xx.yy.zz.bb
--2013-12-09 14:39:46-- http://xx.yy.zz.bb/
Connecting to xx.yy.zz.bb:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Mon, 09 Dec 2013 12:39:46 GMT
  Server: Apache/2.4.6 (Ubuntu)
  Last-Modified: Mon, 25 Nov 2013 16:12:19 GMT
  ETag: "b1-4ec02a0e06c9c"
  Accept-Ranges: bytes
  Content-Length: 177
  Vary: Accept-Encoding
  Keep-Alive: timeout=5, max=100
  Connection: Keep-Alive
  Content-Type: text/html
Length: 177 [text/html]

Lars Noodén (larsnooden) wrote :

The one browser is Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0
HTTP_ACCEPT Headers : text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 gzip, deflate en,en-us;q=0.7,sv;q=0.3

The other is: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:25.0) Gecko/20100101 Firefox/25.0
HTTP_ACCEPT Headers : text/html, */* gzip, deflate en-US,en;q=0.5

Lars Noodén (larsnooden) wrote :

I've done a fresh installation from the ubuntu-12.04.3-server-i386.iso image and installed Apache2. The Firefox web browser still shows that the pages being served are encoded in "windows-1252" instead of UTF-8, which is what the locale is set to, or ISO-8859 which would be the old standard.

Robie Basak (racb) wrote :

I believe browsers typically try to guess. If Apache serves a page that doesn't have any non-ASCII characters in it, then browsers can guess, and "windows-1252" would still be correct, since ASCII is a strict subset of that charset.
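The subset point can be checked directly. The following is only an illustrative sketch, not anything Apache or Firefox runs; the function name `decodes_identically` and the two sample pages are made up for this example. It compares how the same bytes decode as UTF-8 and as windows-1252 (Python's `cp1252` codec).

```python
def decodes_identically(data: bytes) -> bool:
    """Return True if data decodes to the same text as UTF-8 and as windows-1252."""
    try:
        return data.decode("utf-8") == data.decode("cp1252")
    except UnicodeDecodeError:
        # The bytes are not valid in at least one of the two encodings.
        return False


# Hypothetical sample pages, both stored as UTF-8 bytes.
ascii_page = "<html><body>It works!</body></html>".encode("utf-8")
utf8_page = "<html><body>Schrödinger</body></html>".encode("utf-8")

print(decodes_identically(ascii_page))  # pure ASCII: both decodings agree
print(decodes_identically(utf8_page))   # contains ö: the decodings differ
```

For a pure-ASCII page, a browser's guess of windows-1252 cannot be told apart from UTF-8. The difference only shows once the page contains a non-ASCII byte.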

What happens if you serve a UTF-8 encoded file? What does the browser do then?

If you want Apache to assume that everything in /var/www is UTF-8 by default, and explicitly set that in every response, then I can understand such a request, but I think it needs to be coordinated with the Debian packaging, perhaps also including upstream's view on a suitable default.

Lars Noodén (larsnooden) wrote :

If I serve a UTF-8 encoded file *AND* set the default myself in Apache, then everything is fine. If the default encoding is left alone, Apache serves it up as "windows-1252" and then UTF-8 encoded letters come out as garbage like this: åäöÅÄÖéÉ
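As an illustration of what this kind of garbage typically looks like, the sketch below encodes a few of those letters as UTF-8 and then decodes the bytes as windows-1252, which is what a client does when it assumes the wrong charset. The helper name `misread_as_cp1252` is hypothetical, and this only reproduces the symptom; it says nothing about which component made the wrong assumption.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as windows-1252."""
    return text.encode("utf-8").decode("cp1252")


# Each of these letters is two bytes in UTF-8, so each becomes two characters.
print(misread_as_cp1252("åäöé"))  # prints: Ã¥Ã¤Ã¶Ã©
```

Plain ASCII text passes through unchanged, which is why the problem only appears on pages with non-English letters.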

As seen from the browser HTTP_ACCEPT Headers, it seems to be the web server making the choice.

Apache has a default encoding. It should be a standard one, UTF-8 or ISO-8859; having the non-standard windows-1252 in the default configuration just makes a mess. It's easy to fix by adding AddDefaultCharset to the configuration. However, it would be great if Apache worked with non-English languages out of the box, especially when the locale is set accordingly.

Robie Basak (racb) wrote :

> If the default encoding is left alone, Apache serves it up as "windows-1252" and then UTF-8 encoded letters come out as garbage like this: åäöÅÄÖéÉ

I do not see this behaviour:

root@trusty:/var/www# xxd test.txt
0000000: 5363 6872 c3b6 6469 6e67 6572 2773 2043 Schr..dinger's C
0000010: 6174 0a at.
root@trusty:/var/www# wget -S -O/dev/null http://localhost/test.txt
--2013-12-09 15:26:28-- http://localhost/test.txt
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Mon, 09 Dec 2013 15:26:28 GMT
  Server: Apache/2.4.6 (Ubuntu)
  Last-Modified: Mon, 09 Dec 2013 12:19:37 GMT
  ETag: "13-4ed1902654840"
  Accept-Ranges: bytes
  Content-Length: 19
  Keep-Alive: timeout=5, max=100
  Connection: Keep-Alive
  Content-Type: text/plain
Length: 19 [text/plain]
Saving to: ‘/dev/null’

100%[=============================================================================>] 19 --.-K/s in 0s

2013-12-09 15:26:28 (1.52 MB/s) - ‘/dev/null’ saved [19/19]

root@trusty:/var/www#

Here, Apache is just not setting an encoding. It never claims "windows-1252".
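The xxd dump above can be cross-checked against the response headers. This sketch copies the hex bytes from the dump, decodes them as UTF-8, and compares the byte count with the `Content-Length` that Apache reported. The variable names are chosen for this example only.

```python
# Bytes of test.txt, copied from the xxd output above.
raw = bytes.fromhex("5363 6872 c3b6 6469 6e67 6572 2773 2043 6174 0a")

# Decoding succeeds, so the file is valid UTF-8; c3 b6 is the letter ö.
text = raw.decode("utf-8")

print(len(raw))    # 19, matching Content-Length: 19
print(repr(text))  # "Schrödinger's Cat\n"
```

The file on disk is UTF-8, and Apache passes its bytes through unchanged while leaving the charset out of the `Content-Type` header.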

> Apache has a default encoding.

As you can see from the headers, this does not appear to be true. I can understand that perhaps it does in other circumstances that I haven't been able to test. If this is true, please can you provide steps to reproduce?

> It's easy to fix by adding AddDefaultCharset to the configuration. However, it would be great if Apache worked with non-English languages out of the box, especially when the locale is set accordingly.

I appreciate that there is a case for providing a default AddDefaultCharset that matches the system locale, but unfortunately it's not simple, since the system locale may not match the encoding of the files you expect to serve from /var/www. This is a tricky issue, and one I think would be better addressed in Debian or upstream than by Ubuntu diverging from Debian and upstream on this.
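To make the mismatch concrete: the encoding is a property of each file's bytes, not of the locale on the machine serving them. The sketch below is a hypothetical check (the helper `is_valid_utf8` and the sample data are assumptions for illustration) showing that the same word saved as ISO-8859-1 is not valid UTF-8. A server-wide `charset=utf-8` label would therefore be wrong for such a file.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data can be decoded as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same word stored in two different encodings (hypothetical file contents).
utf8_bytes = "Schrödinger".encode("utf-8")
latin1_bytes = "Schrödinger".encode("iso-8859-1")

print(is_valid_utf8(utf8_bytes))    # stored as UTF-8
print(is_valid_utf8(latin1_bytes))  # stored as ISO-8859-1: lone byte 0xf6
```

A check like this can flag files that are definitely not UTF-8, but passing it does not prove the author intended UTF-8, so it is a sanity check rather than a detector.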

Lars Noodén (larsnooden) wrote :

If wget is not seeing the wrong encoding then it may be a problem with Firefox instead.

However, the steps to reproduce are

1. install Ubuntu 12.04 desktop, or Lubuntu 14.04devel desktop (it occurs on both)
2. install Apache2, leaving default configuration settings
3. load an html page from the server in Firefox (in 12.04 or 14.04devel)
4. check page info regarding Encoding with ctrl-i

Robie Basak (racb) wrote :

Sorry, your test case involving Firefox isn't sufficient to determine validity of a bug in Apache. What is Apache actually sending to Firefox in your case?

Lars Noodén (larsnooden) wrote :

It looks like the problem is Firefox then. If no default is set, Apache sends wget 'Content-Type: text/html'. If the default is set to utf-8, it sends wget 'Content-Type: text/html; charset=utf-8'.
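The difference between those two header values can be checked programmatically. This is a minimal sketch using Python's standard `email.message.Message` to parse a `Content-Type` value; the helper name `charset_from_content_type` is made up for this example.

```python
from email.message import Message


def charset_from_content_type(value: str):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset in lowercase, or None.
    return msg.get_content_charset()


print(charset_from_content_type("text/html"))                 # None
print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
```

A result of `None` means the server declared no charset at all, and the client is left to guess, which is the case seen in the wget output earlier in this report.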

Launchpad Janitor (janitor) wrote :

[Expired for apache2 (Ubuntu) because there has been no activity for 60 days.]

Changed in apache2 (Ubuntu):
status: Incomplete → Expired