lxml

Incorrect tokenization for lxml.html.Classes

Bug #1934687 reported by danny0838 on 2021-07-05

This bug affects 1 person

Affects   Status   Importance   Assigned to   Milestone
lxml      New      Undecided    Unassigned    -

Bug Description

demo:

import lxml.html
list(lxml.html.fromstring('<p class="中　文">test</p>').classes)

expected:

['中　文']

actual:

['中', '文']

According to the HTML spec, class names are separated only by ASCII whitespace, defined as U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, or U+0020 SPACE. Other Unicode space characters, such as U+3000 IDEOGRAPHIC SPACE ("　"), should not be treated as class separators.

ref: https://html.spec.whatwg.org/multipage/dom.html#global-attributes:classes-2
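
For illustration, here is a minimal sketch of the difference between splitting on any Unicode whitespace and splitting on ASCII whitespace only. The observed output is consistent with Python's no-argument str.split(), which also splits on U+3000; that lxml tokenizes this way internally is an assumption, not something confirmed in this report. The helper name split_classes is hypothetical and not part of lxml.

```python
import re

# ASCII whitespace as listed in the report: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = re.compile(r"[\t\n\f\r ]+")


def split_classes(value):
    """Hypothetical helper: split a class attribute on ASCII whitespace only."""
    return [token for token in ASCII_WHITESPACE.split(value) if token]


# The class attribute from the demo, with U+3000 between the two characters.
value = "中\u3000文"

# No-argument str.split() treats U+3000 as a separator.
unicode_split = value.split()

# Splitting on ASCII whitespace keeps U+3000 inside a single token.
ascii_split = split_classes(value)

print(unicode_split)
print(ascii_split)
```

A tokenizer along these lines would return the single class expected in the demo, while still separating classes written with ordinary spaces, tabs, or newlines.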

danny0838 (danny0838) on 2021-07-05: description updated