Activity log for bug #1842387

Date Who What changed Old value New value Message
2019-09-03 07:50:13 Thomas ten Cate bug added bug
2019-09-03 07:51:30 Thomas ten Cate description

Old value:

The documentation on https://lxml.de/api/lxml.etree.XMLParser-class.html only says:

> encoding - override the document encoding

It doesn't specify what encodings are valid, so it's reasonable to assume that it's the [list supported by Python](https://docs.python.org/3/library/codecs.html#standard-encodings).

However, experimentation reveals that that's not the case:

```
>>> lxml.etree.XMLParser(encoding='utf_8_sig')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/parser.pxi", line 1518, in lxml.etree.XMLParser.__init__ (src/lxml/etree.c:117909)
  File "src/lxml/parser.pxi", line 821, in lxml.etree._BaseParser.__init__ (src/lxml/etree.c:110544)
LookupError: unknown encoding: 'b'utf_8_sig''
```

From what I can tell, the encoding must be one of the ones reported by `iconv -l`, which is surprising to Python developers, so it should at least be documented. If there's some way to accept Python encodings here too, then that would of course be even better.

Experimentally I found that `utf-8` has the effect I intended (ignoring any UTF-8 encoded BOM), but `utf8` without a hyphen fails:

```
>>> lxml.etree.parse(io.BytesIO(b'\xef\xbb\xbf<abc/>'), parser=lxml.etree.XMLParser(encoding='utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3444, in lxml.etree.parse (src/lxml/etree.c:83185)
  File "src/lxml/parser.pxi", line 1851, in lxml.etree._parseDocument (src/lxml/etree.c:120981)
  File "src/lxml/parser.pxi", line 1871, in lxml.etree._parseMemoryDocument (src/lxml/etree.c:121250)
  File "src/lxml/parser.pxi", line 1759, in lxml.etree._parseDoc (src/lxml/etree.c:119926)
  File "src/lxml/parser.pxi", line 1125, in lxml.etree._BaseParser._parseDoc (src/lxml/etree.c:114173)
  File "src/lxml/parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/etree.c:107738)
  File "src/lxml/parser.pxi", line 709, in lxml.etree._handleParseResult (src/lxml/etree.c:109447)
  File "src/lxml/parser.pxi", line 638, in lxml.etree._raiseParseError (src/lxml/etree.c:108301)
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
```

I'm not sure what the deal is there, and it may be a separate issue.

New value:

The documentation on https://lxml.de/api/lxml.etree.XMLParser-class.html only says:

> encoding - override the document encoding

It doesn't specify what encodings are valid, so it's reasonable to assume that it's the [list supported by Python](https://docs.python.org/3/library/codecs.html#standard-encodings).

However, experimentation reveals that that's not the case:

```
>>> lxml.etree.XMLParser(encoding='utf_8_sig')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/parser.pxi", line 1518, in lxml.etree.XMLParser.__init__ (src/lxml/etree.c:117909)
  File "src/lxml/parser.pxi", line 821, in lxml.etree._BaseParser.__init__ (src/lxml/etree.c:110544)
LookupError: unknown encoding: 'b'utf_8_sig''
```

From what I can tell, the encoding must be one of the ones reported by `iconv -l`, which is surprising to Python developers, so it should at least be documented. If there's some way to accept Python encodings here too, then that would of course be even better.

Experimentally I found that `utf-8` has the effect I intended (ignoring any UTF-8 encoded BOM), but `utf8` without a hyphen fails:

```
>>> lxml.etree.parse(io.BytesIO(b'\xef\xbb\xbf<abc/>'), parser=lxml.etree.XMLParser(encoding='utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3444, in lxml.etree.parse (src/lxml/etree.c:83185)
  File "src/lxml/parser.pxi", line 1851, in lxml.etree._parseDocument (src/lxml/etree.c:120981)
  File "src/lxml/parser.pxi", line 1871, in lxml.etree._parseMemoryDocument (src/lxml/etree.c:121250)
  File "src/lxml/parser.pxi", line 1759, in lxml.etree._parseDoc (src/lxml/etree.c:119926)
  File "src/lxml/parser.pxi", line 1125, in lxml.etree._BaseParser._parseDoc (src/lxml/etree.c:114173)
  File "src/lxml/parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/etree.c:107738)
  File "src/lxml/parser.pxi", line 709, in lxml.etree._handleParseResult (src/lxml/etree.c:109447)
  File "src/lxml/parser.pxi", line 638, in lxml.etree._raiseParseError (src/lxml/etree.c:108301)
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
```

I'm not sure what the deal is there, and it may be a separate issue.

Version info:

Python : sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
lxml.etree : (4, 1, 1, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 32)
2019-09-03 07:52:19 Thomas ten Cate description

Old value:

The documentation on https://lxml.de/api/lxml.etree.XMLParser-class.html only says:

> encoding - override the document encoding

It doesn't specify what encodings are valid, so it's reasonable to assume that it's the [list supported by Python](https://docs.python.org/3/library/codecs.html#standard-encodings).

However, experimentation reveals that that's not the case:

```
>>> lxml.etree.XMLParser(encoding='utf_8_sig')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/parser.pxi", line 1518, in lxml.etree.XMLParser.__init__ (src/lxml/etree.c:117909)
  File "src/lxml/parser.pxi", line 821, in lxml.etree._BaseParser.__init__ (src/lxml/etree.c:110544)
LookupError: unknown encoding: 'b'utf_8_sig''
```

From what I can tell, the encoding must be one of the ones reported by `iconv -l`, which is surprising to Python developers, so it should at least be documented. If there's some way to accept Python encodings here too, then that would of course be even better.

Experimentally I found that `utf-8` has the effect I intended (ignoring any UTF-8 encoded BOM), but `utf8` without a hyphen fails:

```
>>> lxml.etree.parse(io.BytesIO(b'\xef\xbb\xbf<abc/>'), parser=lxml.etree.XMLParser(encoding='utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3444, in lxml.etree.parse (src/lxml/etree.c:83185)
  File "src/lxml/parser.pxi", line 1851, in lxml.etree._parseDocument (src/lxml/etree.c:120981)
  File "src/lxml/parser.pxi", line 1871, in lxml.etree._parseMemoryDocument (src/lxml/etree.c:121250)
  File "src/lxml/parser.pxi", line 1759, in lxml.etree._parseDoc (src/lxml/etree.c:119926)
  File "src/lxml/parser.pxi", line 1125, in lxml.etree._BaseParser._parseDoc (src/lxml/etree.c:114173)
  File "src/lxml/parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/etree.c:107738)
  File "src/lxml/parser.pxi", line 709, in lxml.etree._handleParseResult (src/lxml/etree.c:109447)
  File "src/lxml/parser.pxi", line 638, in lxml.etree._raiseParseError (src/lxml/etree.c:108301)
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
```

I'm not sure what the deal is there, and it may be a separate issue.

Version info:

Python : sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
lxml.etree : (4, 1, 1, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 32)

New value:

The documentation on https://lxml.de/api/lxml.etree.XMLParser-class.html only says:

> encoding - override the document encoding

It doesn't specify what encodings are valid, so it's reasonable to assume that it's the [list supported by Python](https://docs.python.org/3/library/codecs.html#standard-encodings).

However, experimentation reveals that that's not the case:

>>> lxml.etree.XMLParser(encoding='utf_8_sig')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/parser.pxi", line 1518, in lxml.etree.XMLParser.__init__ (src/lxml/etree.c:117909)
  File "src/lxml/parser.pxi", line 821, in lxml.etree._BaseParser.__init__ (src/lxml/etree.c:110544)
LookupError: unknown encoding: 'b'utf_8_sig''

From what I can tell, the encoding must be one of the ones reported by `iconv -l`, which is surprising to Python developers, so it should at least be documented. If there's some way to accept Python encodings here too, then that would of course be even better.

Experimentally I found that `utf-8` has the effect I intended (ignoring any UTF-8 encoded BOM), but `utf8` without a hyphen fails:

>>> lxml.etree.parse(io.BytesIO(b'\xef\xbb\xbf<abc/>'), parser=lxml.etree.XMLParser(encoding='utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3444, in lxml.etree.parse (src/lxml/etree.c:83185)
  File "src/lxml/parser.pxi", line 1851, in lxml.etree._parseDocument (src/lxml/etree.c:120981)
  File "src/lxml/parser.pxi", line 1871, in lxml.etree._parseMemoryDocument (src/lxml/etree.c:121250)
  File "src/lxml/parser.pxi", line 1759, in lxml.etree._parseDoc (src/lxml/etree.c:119926)
  File "src/lxml/parser.pxi", line 1125, in lxml.etree._BaseParser._parseDoc (src/lxml/etree.c:114173)
  File "src/lxml/parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/etree.c:107738)
  File "src/lxml/parser.pxi", line 709, in lxml.etree._handleParseResult (src/lxml/etree.c:109447)
  File "src/lxml/parser.pxi", line 638, in lxml.etree._raiseParseError (src/lxml/etree.c:108301)
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

I'm not sure what the deal is there, and it may be a separate issue.

Version info:

Python : sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
lxml.etree : (4, 1, 1, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 8)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 32)
2023-11-03 10:28:13 scoder lxml: importance Undecided Low
2023-11-03 10:28:13 scoder lxml: status New Fix Committed
2023-11-03 10:28:13 scoder lxml: milestone 5.0
2023-12-30 09:17:59 scoder lxml: status Fix Committed Fix Released
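
Note on the report above: the description observes that `utf-8` was accepted while the Python-style names `utf8` and `utf_8_sig` were not. The following is a minimal, stdlib-only sketch of one possible caller-side workaround on affected versions, not the fix that shipped in lxml 5.0. The helper names `normalize_encoding_name` and `strip_utf8_bom` are hypothetical, and it is an assumption that the parser accepts the canonical name Python returns for any given codec; the report only confirms this for `utf-8`.

```python
import codecs


def normalize_encoding_name(name: str) -> str:
    """Map a Python codec alias to Python's canonical codec name.

    For example, 'utf8' and 'UTF-8' both map to 'utf-8'.
    Raises LookupError if Python does not know the codec.
    Assumption: the canonical name is one the XML parser accepts.
    """
    return codecs.lookup(name).name


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    # The alias from the report that failed, mapped to the spelling that worked.
    print(normalize_encoding_name("utf8"))
    # The BOM-prefixed document from the report, with the BOM removed up front.
    print(strip_utf8_bom(b"\xef\xbb\xbf<abc/>"))
```

A caller could then pass the normalized name as the `encoding` argument of `lxml.etree.XMLParser` and feed the BOM-stripped bytes to the parser, so that the result no longer depends on how the encoding name is spelled.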