According to the HTML spec, classes should be separated only by ASCII whitespace (defined as U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, or U+0020 SPACE). Other Unicode spaces, such as U+3000 (the full-width space "　"), should not be treated as class separators.
demo:
import lxml.html
list(lxml.html.fromstring('<p class="中\u3000文">test</p>').classes)
expected:
['中\u3000文']
actual:
['中', '文']
ref: https://html.spec.whatwg.org/multipage/dom.html#global-attributes:classes-2
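
For illustration, here is a minimal sketch of splitting a class attribute value on ASCII whitespace only, as the spec describes. The helper name `split_classes` is hypothetical and not part of lxml. The observed output is consistent with splitting on all Unicode whitespace, which is what Python's argument-less `str.split()` does; whether lxml does exactly that internally is an assumption.

```python
import re

# The five ASCII whitespace code points named in the HTML spec:
# U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, U+0020 SPACE.
_ASCII_WHITESPACE = re.compile(r"[\t\n\f\r ]+")


def split_classes(value):
    """Split a class attribute value on ASCII whitespace only (hypothetical helper)."""
    # Drop the empty strings produced by leading or trailing whitespace.
    return [token for token in _ASCII_WHITESPACE.split(value) if token]


# U+3000 is not ASCII whitespace, so the value stays a single class.
print(split_classes("中\u3000文"))

# Ordinary ASCII separators still split as usual.
print(split_classes(" a b\tc\n"))

# By contrast, str.split() with no arguments treats U+3000 as whitespace.
print("中\u3000文".split())
```

The first call returns a single class, while the last line reproduces the two-item split shown under "actual" above.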