Doc: In lxml.html.tostring() encoding "unicode" for Python 3

Bug #1284809 reported by Bug Reporter
This bug affects 1 person
Affects: lxml
Status: Fix Released
Importance: Low
Assigned to: scoder
Milestone: 3.3

Bug Description

In Python 3, the unicode type was renamed to str.

In a few places, the examples still refer to unicode instead of str.

"
        The ``encoding`` argument controls the output encoding (defaults to
        ASCII, with &#...; character references for any characters outside
        of ASCII). Note that you can pass the name ``'unicode'`` as
        ``encoding`` argument to serialise to a unicode string.
"

Python : sys.version_info(major=3, minor=3, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 2, 4, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

Tags: docs
Revision history for this message
scoder (scoder) wrote :

"unicode" is the correct name here. "str" would be ambiguous.

Bug Reporter (bugzilla-mail-box) wrote :

> "unicode" is the correct name here

There is no unicode() in Python 3

scoder (scoder) wrote : Re: [Bug 1284809] Re: Doc: In lxml.html.tostring() encoding "unicode" for Python 3

>> "unicode" is the correct name here
> There is no unicode() in Python 3

But there is Unicode in Python 3.

Bug Reporter (bugzilla-mail-box) wrote :

> But there is Unicode in Python 3.

And it is called str()

This text is from lxml.html.tostring.__doc__
It uses unicode() like it works, but it doesn't work in Python 3, because there is no unicode() since it was replaced by str()
Therefore, this example is misleading

"
    Example::

        >>> from lxml import html
        >>> root = html.fragment_fromstring('<p>Hello<br>world!</p>')

        >>> html.tostring(root)
        b'<p>Hello<br>world!</p>'
        >>> html.tostring(root, method='html')
        b'<p>Hello<br>world!</p>'

        >>> html.tostring(root, method='xml')
        b'<p>Hello<br/>world!</p>'

        >>> html.tostring(root, method='text')
        b'Helloworld!'

        >>> html.tostring(root, method='text', encoding=unicode)
        'Helloworld!'

        >>> root = html.fragment_fromstring('<div><p>Hello<br>world!</p>TAIL</div>')
        >>> html.tostring(root[0], method='text', encoding=unicode)
        'Helloworld!TAIL'

        >>> html.tostring(root[0], method='text', encoding=unicode, with_tail=False)
        'Helloworld!'

        >>> doc = html.document_fromstring('<p>Hello<br>world!</p>')
        >>> html.tostring(doc, method='html', encoding=unicode)
        '<html><body><p>Hello<br>world!</p></body></html>'

        >>> print(html.tostring(doc, method='html', encoding=unicode,
        ... doctype='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"'
        ... ' "http://www.w3.org/TR/html4/strict.dtd">'))
        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
        <html><body><p>Hello<br>world!</p></body></html>

"

> "str" would be ambiguous.

Ambiguous with what ?

scoder (scoder) wrote :

>> But there is Unicode in Python 3.
> And it is called str()

Unicode is actually called Unicode.

http://www.unicode.org/

The Python 2.x *type* "unicode" was renamed to "str" in Py3.

> This text is from lxml.html.tostring.__doc__
> It uses unicode() like it works, but it doesn't work in Python 3, because there is no unicode() since it was replaced by str()
> Therefore, this example is misleading
>
> >>> html.tostring(root, method='text', encoding=unicode)
> 'Helloworld!'

Ah, right. That wasn't updated. Thanks for bringing it up. It should read
encoding="unicode".

https://github.com/lxml/lxml/commit/477fa0b36c5ecd6c26d0ea5190f518ad2f7b196f

>> "str" would be ambiguous.
> Ambiguous with what ?

If I were to read encoding="str" somewhere, I'd be puzzled what it might
mean. Even encoding="unicode" isn't ideal, because Unicode is not an
encoding. But practicality beats purity here, and it's certainly more
obvious than encoding="str".

Changed in lxml:
importance: Undecided → Low
status: New → Fix Committed
Bug Reporter (bugzilla-mail-box) wrote :

> But practicality beats purity here, and it's certainly more
> obvious than encoding="str".

Now, I understand.

I tried

>>> lxml.html.tostring(tag, encoding=str)
'<b>абв</b>'
>>>

and it worked well.

scoder (scoder) wrote :

> I tried encoding=str and it worked well.

Yes, and the lack of readability there is exactly the reason why the usage
pattern changed to encoding="unicode" (i.e. a string name) back when Py3
came up. The fact that the old way still works is purely for backwards
compatibility in Py2.
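
As a small sketch of the point above (again assuming lxml is installed and Python 3 is in use; the names via_name and via_type are only illustrative), the older type-based spelling and the string name can be compared directly. Both are expected to yield the same str, with the string name being the more readable form.

```python
from lxml import html  # third-party package, assumed to be installed

tag = html.fragment_fromstring('<b>абв</b>')

# Recommended spelling: the string name "unicode".
via_name = html.tostring(tag, encoding="unicode")

# Older spelling: passing the text type itself (str in Python 3).
via_type = html.tostring(tag, encoding=str)

# Both calls return text rather than bytes, and the results match.
print(via_name == via_type)
print(type(via_name).__name__)
```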

scoder (scoder) wrote :

Docs updated in lxml 3.3.2.

Changed in lxml:
assignee: nobody → scoder (scoder)
status: Fix Committed → Fix Released
milestone: none → 3.3