Incorrect tokenization for lxml.html.Classes

Bug #1934687 reported by danny0838
This bug affects 1 person
Affects  Status  Importance  Assigned to  Milestone
lxml     New     Undecided   Unassigned

Bug Description

demo:

    import lxml.html
    list(lxml.html.fromstring('<p class="中\u3000文">test</p>').classes)

expected:

    ['中\u3000文']

actual:

    ['中', '文']

According to the HTML spec, class names are separated by ASCII whitespace only, which is defined as U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, or U+0020 SPACE. Other Unicode whitespace, such as U+3000 IDEOGRAPHIC SPACE (the fullwidth space, "\u3000" in Python), should not be treated as a class separator.

ref: https://html.spec.whatwg.org/multipage/dom.html#global-attributes:classes-2
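For illustration, here is a minimal sketch of splitting a class attribute on ASCII whitespace only. The helper name `split_classes` is hypothetical and not part of lxml. The sketch contrasts it with argument-less `str.split()`, which splits on all Unicode whitespace and would produce the output reported above; whether lxml uses exactly that call internally is an assumption here.

```python
import re

# ASCII whitespace as listed in the HTML spec: TAB, LF, FF, CR, SPACE.
_ASCII_WHITESPACE = re.compile(r"[\t\n\f\r ]+")


def split_classes(value):
    """Split a class attribute value on ASCII whitespace only.

    Hypothetical helper for illustration; not an lxml API.
    """
    # re.split can yield empty strings at the edges, so filter them out.
    return [token for token in _ASCII_WHITESPACE.split(value) if token]


value = "中\u3000文"  # U+3000 IDEOGRAPHIC SPACE between the two characters

# str.split() with no arguments treats U+3000 as a separator.
print(value.split())

# The spec-style split keeps U+3000 inside a single class name.
print(split_classes(value))
```

Ordinary ASCII-separated values such as `"a b\tc"` still split into three class names with this helper; only the handling of non-ASCII whitespace differs.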

Tags: html
danny0838 (danny0838)
description: updated