Incorrect tokenization for lxml.html.Classes
Bug #1934687 reported by
danny0838
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml |
New
|
Undecided
|
Unassigned |
Bug Description
demo:
list(
expected:
['中 文']
actual:
['中', '文']
According to HTML spec., classes should be separated by ASCII whitespaces (which is defined as U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, or U+0020 SPACE) only. Other unicode spaces, such as U+3000 (fullwidth whitespace or " "), should not be considered as a class separator.
ref: https:/