lxml

Doc: In lxml.html.tostring() encoding "unicode" for Python 3

Bug #1284809 reported by Bug Reporter on 2014-02-25

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
lxml	Fix Released	Low	scoder	lxml 3.3

Bug Description

In Python 3, the unicode type was renamed to str.

In a few places, the examples still refer to unicode instead of str:

"
        The ``encoding`` argument controls the output encoding (defaults to
        ASCII, with &#...; character references for any characters outside
        of ASCII). Note that you can pass the name ``'unicode'`` as
        ``encoding`` argument to serialise to a unicode string.
"

Python : sys.version_info(major=3, minor=3, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 2, 4, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)



scoder (scoder) wrote on 2014-02-25:

"unicode" is the correct name here. "str" would be ambiguous.


Bug Reporter (bugzilla-mail-box) wrote on 2014-02-25:

> "unicode" is the correct name here

There is no unicode() in Python 3


scoder (scoder) wrote on 2014-02-26:

>> "unicode" is the correct name here
> There is no unicode() in Python 3

But there is Unicode in Python 3.


Bug Reporter (bugzilla-mail-box) wrote on 2014-02-26:

> But there is Unicode in Python 3.

And it is called str()

This text is from lxml.html.tostring.__doc__
It uses unicode() like it works, but it doesn't work in Python 3, because there is no unicode() since it was replaced by str()
Therefore, this example is misleading

"
Example::

>>> from lxml import html
>>> root = html.fragment_fromstring('<p>Hello<br>world!</p>')

>>> html.tostring(root)
b'<p>Hello<br>world!</p>'
>>> html.tostring(root, method='html')
b'<p>Hello<br>world!</p>'

>>> html.tostring(root, method='xml')
b'<p>Hello<br/>world!</p>'

>>> html.tostring(root, method='text')
b'Helloworld!'

>>> html.tostring(root, method='text', encoding=unicode)
'Helloworld!'

>>> root = html.fragment_fromstring('<div><p>Hello<br>world!</p>TAIL</div>')
>>> html.tostring(root[0], method='text', encoding=unicode)
'Helloworld!TAIL'

>>> html.tostring(root[0], method='text', encoding=unicode, with_tail=False)
'Helloworld!'

>>> doc = html.document_fromstring('<p>Hello<br>world!</p>')
>>> html.tostring(doc, method='html', encoding=unicode)
'<html><body><p>Hello<br>world!</p></body></html>'

>>> print(html.tostring(doc, method='html', encoding=unicode,
...          doctype='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"'
...                  ' "http://www.w3.org/TR/html4/strict.dtd">'))
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><body><p>Hello<br>world!</p></body></html>
"

> "str" would be ambiguous.

Ambiguous with what?
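
For readers following along, here is a minimal sketch of why the bare name in the quoted docstring example fails on Python 3. It uses only the standard library; the variable names are illustrative and not part of lxml.

```python
import builtins

# Python 3 has no builtin named "unicode"; text strings are plain str.
has_unicode_builtin = hasattr(builtins, "unicode")

try:
    encoding = unicode  # the bare name used in the quoted docstring example
except NameError as exc:
    error_message = str(exc)
else:
    error_message = ""  # not expected on Python 3

print(has_unicode_builtin)
print(error_message)
```

So on Python 3 a call written as encoding=unicode raises NameError before the serialiser is ever invoked, which is the problem with the example as quoted.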


scoder (scoder) wrote on 2014-02-26:

>> But there is Unicode in Python 3.
> And it is called str()

Unicode is actually called Unicode.

http://www.unicode.org/

The Python 2.x *type* "unicode" was renamed to "str" in Py3.

> This text is from lxml.html.tostring.__doc__
> It uses unicode() like it works, but it doesn't work in Python 3, because there is no unicode() since it was replaced by str()
> Therefore, this example is misleading
>
> >>> html.tostring(root, method='text', encoding=unicode)
> 'Helloworld!'

Ah, right. That wasn't updated. Thanks for bringing it up. It should read
encoding="unicode".

https://github.com/lxml/lxml/commit/477fa0b36c5ecd6c26d0ea5190f518ad2f7b196f

>> "str" would be ambiguous.
> Ambiguous with what ?

If I were to read encoding="str" somewhere, I'd be puzzled what it might
mean. Even encoding="unicode" isn't ideal, because Unicode is not an
encoding. But practicality beats purity here, and it's certainly more
obvious than encoding="str".
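
A minimal sketch of the difference described here, not taken from the lxml docs: serialising without an encoding argument yields ASCII bytes with character references, while encoding="unicode" yields a text string. It assumes etree.tostring() rather than lxml.html.tostring(), and falls back to the standard library's xml.etree.ElementTree (whose tostring() accepts the same "unicode" name) when lxml is not installed. The sample markup and variable names are illustrative.

```python
# Prefer lxml; fall back to the stdlib ElementTree, whose tostring()
# accepts the same encoding="unicode" name.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

root = etree.fromstring("<p>абв</p>")

# Default: bytes, ASCII only, non-ASCII characters become &#...; references.
as_bytes = etree.tostring(root)

# encoding="unicode": a text string (str on Python 3), no character references.
as_text = etree.tostring(root, encoding="unicode")

print(type(as_bytes).__name__, as_bytes)
print(type(as_text).__name__, as_text)
```

The string name "unicode" works unchanged on both Python 2 and Python 3, which is the practical argument for it over passing a type object.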

Changed in lxml:
importance:	Undecided → Low
status:	New → Fix Committed


Bug Reporter (bugzilla-mail-box) wrote on 2014-02-26:

> But practicality beats purity here, and it's certainly more
> obvious than encoding="str".

Now, I understand.

I tried

>>> lxml.html.tostring(tag, encoding=str)
'абв'
>>>

and it worked well.


scoder (scoder) wrote on 2014-02-26:

> I tried encoding=str and it worked well.

Yes, and the lack of readability there is exactly the reason why the usage
pattern changed to encoding="unicode" (i.e. a string name) back when Py3
came up. The fact that the old way still works is purely for backwards
compatibility in Py2.


scoder (scoder) wrote on 2014-02-27:

Docs updated in lxml 3.3.2.

Changed in lxml:
assignee:	nobody → scoder (scoder)
status:	Fix Committed → Fix Released
milestone:	none → 3.3
