Bug #612843 reported by Jean Jordaan on 2010-08-03
The docs for lxml.html.tostring state:

  if include_meta_content_type is true this will create a ``<meta http-equiv="Content-Type" ...>`` tag in the head

It does not:

In [81]: new_doc = E.HTML(E.HEAD('title'), E.BODY(u'content ë ç ¥'))

In [85]: tostring(new_doc, encoding='utf-8', include_meta_content_type=True)
Out[85]: '<html><head>title</head><body>content \xc3\x83\xc2\xab \xc3\x83\xc2\xa7 \xc3\x82\xc2\xa5</body></html>'

To get the meta tag, I have to create it explicitly:

In [87]: new_doc = E.HTML(E.HEAD(E.META({'http-equiv':"Content-Type", 'content':"text/html; charset=utf-8"}),'title'), E.BODY(u'content ë ç ¥'))

Now tostring works the same, with or without include_meta_content_type:

In [90]: tostring(new_doc, include_meta_content_type=True, encoding='utf-8')
Out[90]: '<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type">title</head><body>content \xc3\x83\xc2\xab \xc3\x83\xc2\xa7 \xc3\x82\xc2\xa5</body></html>'

In [91]: tostring(new_doc, encoding='utf-8')
Out[91]: '<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type">title</head><body>content \xc3\x83\xc2\xab \xc3\x83\xc2\xa7 \xc3\x82\xc2\xa5</body></html>'

Is this the proper way to create HTML with encoding specified using lxml?

Python : (2, 6, 5, 'final', 0)
lxml.etree : (2, 2, 4, 0)
libxml used : (2, 7, 6)
libxml compiled : (2, 7, 6)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

scoder (scoder) wrote :

I agree that this is a bit quirky. Basically, tostring() simply runs some string post-processing after serialisation and tries to strip the tag that way; it never adds one. The original intention was to deal with the <meta> tag that libxml2 explicitly generates in some cases. Apparently it does not generate one in this case.

Looks like this feature needs a proper redesign at some point...
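
For illustration, here is a minimal sketch of that kind of string post-processing, using only the standard library. The pattern and the names strip_meta_content_type and postprocess are assumptions made up for this example, not lxml's actual code. It shows why a flag implemented this way can only keep or drop a tag that the serialiser already emitted, and why a pattern that expects a particular attribute order would leave a hand-built tag untouched, which would be consistent with the output in the report.

```python
import re

# Hypothetical pattern: matches a <meta> tag only when http-equiv is
# its first attribute, as a serialiser-generated tag might be written.
_META_CONTENT_TYPE = re.compile(r'<meta http-equiv="Content-Type"[^>]*>')


def strip_meta_content_type(html):
    """Remove matching <meta http-equiv="Content-Type" ...> tags."""
    return _META_CONTENT_TYPE.sub('', html)


def postprocess(html, include_meta_content_type=False):
    # The flag only decides whether an existing tag is stripped;
    # nothing here ever inserts a tag that is missing.
    if include_meta_content_type:
        return html
    return strip_meta_content_type(html)


# A tag in the form the pattern expects.
generated = ('<html><head><meta http-equiv="Content-Type" '
             'content="text/html; charset=utf-8"><title>t</title></head></html>')
# The same tag with its attributes in the other order.
reordered = ('<html><head><meta content="text/html; charset=utf-8" '
             'http-equiv="Content-Type"></head></html>')
# A document with no meta tag at all.
no_meta = '<html><head></head></html>'

print(postprocess(generated))        # tag stripped
print(postprocess(reordered))        # tag survives: attribute order differs
print(postprocess(no_meta, True))    # still no tag: nothing adds one
```

Until the feature is redesigned, creating the <meta> element explicitly, as in the report, appears to be the dependable approach.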

