position in lxml error message seems wrong

Bug #1458175 reported by Steven Samuel Cole on 2015-05-23
This bug affects 1 person
Affects: lxml
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

submitted at : https://bugs.launchpad.net/lxml/+filebug
URL : https://bugs.launchpad.net/lxml/+bug/1458175

Summary : error position in lxml exception message seems wrong
Further information:

Environment: virtual environment on Mac OS X 10.8
Output from bug reporting guidelines script:
Python : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 4, 2, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Problem: When extra content after the root XML element is passed to lxml for parsing, lxml correctly reports "Extra content at the end of the document", but the column number in the error message seems wrong if the root element has attributes.

Expected behavior: The same as xmllint (which uses the same underlying libxml version, 2.7.8), which indicates the correct position of the error:

# verify version
(venv)host:~ user$ xmllint --version
xmllint: using libxml version 20708
   compiled with: Threads Tree Output Push Reader Patterns Writer
   SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer
   XInclude ISO8859X Unicode Regexps Automata Expr Schemas Schematron
   Modules Debug Zlib

# valid XML (for self-test):
(venv)host:~ user$ echo "<root/>" | xmllint -
<?xml version="1.0"?>
<root/>

# NOTE: This page (https://bugs.launchpad.net/lxml/+filebug) doesn't seem to support any markup
# and I don't know what this report looks like in the end; the ^ do point at the correct position

# invalid xml (extra content):
(venv)host:~ user$ echo "<root/> extra content" | xmllint -
-:1: parser error : Extra content at the end of the document
<root/> extra content
        ^

# invalid xml (extra content) with attribute:
(venv)host:~ user$ echo "<root attr01=\"value01\"/> extra content" | xmllint -
-:1: parser error : Extra content at the end of the document
<root attr01="value01"/> extra content
                         ^

Actual behavior: Demonstrated by this script:

#!/usr/bin/env python

from lxml import etree

test_xml_list = ["<root/>", "<root/> extra content", "<root attr01=\"value01\"/> extra content"]

for test_xml in test_xml_list:
    print 'parse "%s":' % test_xml
    try:
        etree.fromstring(test_xml)
    except etree.XMLSyntaxError as e:
        print e
        print 'test_xml[:e.position[1]]:', test_xml[:e.position[1]]
    print

Output:

(venv)host:~ user$ ./lxml_test.py
parse "<root/>":

parse "<root/> extra content":
Extra content at the end of the document, line 1, column 9
test_xml[:e.position[1]]: <root/> e

parse "<root attr01="value01"/> extra content":
Extra content at the end of the document, line 1, column 16
test_xml[:e.position[1]]: <root attr01="va

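The column information in the error message is correct for the first failing case (column 9 is the "e" of "extra"), but wrong for the second: lxml reports column 16, which falls inside the attribute value, whereas xmllint points at column 26.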
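
To make the discrepancy easier to see, here is a minimal sketch that draws an xmllint-style caret under a given 1-based column. It is not part of lxml or libxml2; caret_line is a hypothetical helper written only for this report, and the reported column is hard-coded from the output above, so the sketch runs without lxml.

```python
def caret_line(text, column):
    """Return text followed by a line with a caret under the 1-based column."""
    if column < 1:
        raise ValueError("column must be >= 1")
    return text + "\n" + " " * (column - 1) + "^"


xml = '<root attr01="value01"/> extra content'

# Column taken from the lxml error message quoted above.
reported = 16

# 1-based column of the "e" in "extra", which is where xmllint puts its caret.
expected = xml.index(" extra") + 2

print("lxml:")
print(caret_line(xml, reported))
print("xmllint:")
print(caret_line(xml, expected))
```

The first caret lands inside the attribute value, the second on the start of the extra content, which is the 10-column difference described above.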

description: updated
scoder (scoder) wrote :

These error messages come from the parser in libxml2. There isn't much that lxml could do about them.

Changed in lxml:
status: New → Invalid