XMLSchema() uses network lookup
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Confirmed | Medium | Unassigned |
Bug Description
Follow up of my mail http://
I wonder why etree.XMLSchema
$ gdb python
GNU gdb (GDB) 7.5.91.
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://
Reading symbols from /usr/bin/
done.
(gdb) break socket
Breakpoint 1 at 0x4170e0
(gdb) run
Starting program: /usr/bin/python2.7
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_
Python 2.7.4 (default, Apr 19 2013, 18:28:01)
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> etree.LXML_VERSION
(3, 1, 0, 0)
>>> etree.LIBXML_
(2, 9, 0)
>>> etree.XMLSchema
Breakpoint 1, socket () at ../sysdeps/
81 ../sysdeps/
(gdb) bt
#0 socket () at ../sysdeps/
#1 0x00007ffff59b0755 in have_ipv6 () at ../../nanohttp.
#2 0x00007ffff59b0af8 in xmlNanoHTTPConn
#3 0x00007ffff59b2069 in xmlNanoHTTPMeth
method=
at ../../nanohttp.
#4 0x00007ffff59b2403 in xmlNanoHTTPMeth
input=
#5 0x00007ffff5973137 in __xmlParserInpu
enc=
#6 0x00007ffff5947380 in xmlNewInputFrom
filename=
#7 0x00007ffff5975605 in xmlDefaultExter
ctxt=0xa84990) at ../../xmlIO.c:4044
#8 0x00007ffff6133e81 in ?? () from /usr/lib/
#9 0x00007ffff597546f in xmlLoadExternal
at ../../xmlIO.c:4100
#10 0x00007ffff5961c60 in xmlCtxtReadFile
filename=
options=
#11 0x00007ffff59e2f2a in xmlSchemaAddSch
schemaLocat
schemaBuffe
sourceTarge
importNames
#12 0x00007ffff59ebb9a in xmlSchemaParseI
#13 xmlSchemaParseS
at ../../xmlschema
#14 xmlSchemaParseN
at ../../xmlschema
#15 0x00007ffff59eec21 in xmlSchemaParse_
#16 0x00007ffff618f1f4 in ?? () from /usr/lib/
#17 0x00000000004b1a1e in type_call.25713 (type=0x7ffff63
at ../Objects/
#18 0x000000000047c19d in PyObject_Call (kw={'file': 'premis.xsd'}, arg=<optimized out>, func=<type at remote 0x7ffff63fc420>)
at ../Objects/
Python : sys.version_
lxml.etree : (3, 1, 0, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 27)
libxslt compiled : (1, 1, 27)
Hmm, interesting. Thanks for bringing this up. Changing the default behaviour will likely break user code (so can't be done in a bug-fix release), but I agree that this is unexpected given lxml's intention to play safe by default (and in any case, there's currently no way at all to switch off network access here).
It might work to set the expected parser options on the internal parser context that the schema parser uses:
http://xmlsoft.org/html/libxml-xmlschemas.html#xmlSchemaValidCtxtGetParserCtxt
Want to give it a try? If that fails, the next best solution would be a hard switch-off in the resolver that the schema parser calls in lxml (as can be seen in your stack trace), but that's certainly a lot more invasive.
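Until such a switch exists, a user-side check can at least show where the network lookup in the stack trace would come from. The following is a minimal sketch, not the fix discussed above: it uses only the standard library to list `schemaLocation` values on `xs:import`, `xs:include` and `xs:redefine` elements that point at a network URL. The helper name `find_remote_schema_locations` and the set of schemes treated as "network" are assumptions made for illustration; neither is part of lxml or libxml2.

```python
import io
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XS_NS = "http://www.w3.org/2001/XMLSchema"
# Elements whose schemaLocation attribute makes a schema parser load another document.
LOADING_TAGS = {"{%s}%s" % (XS_NS, name) for name in ("import", "include", "redefine")}
# Assumption: these URL schemes are the ones considered network access here.
NETWORK_SCHEMES = {"http", "https", "ftp"}


def find_remote_schema_locations(source):
    """Return schemaLocation values in `source` that point at a network URL.

    `source` is a filename or file object accepted by ElementTree.parse().
    Hypothetical helper for illustration; it is not part of lxml.
    """
    remote = []
    for element in ET.parse(source).iter():
        if element.tag not in LOADING_TAGS:
            continue
        location = element.get("schemaLocation")
        if location and urlparse(location).scheme.lower() in NETWORK_SCHEMES:
            remote.append(location)
    return remote


if __name__ == "__main__":
    # Made-up schema for demonstration; example.com is a placeholder host.
    demo = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:import namespace="urn:example:a" schemaLocation="http://example.com/a.xsd"/>
      <xs:include schemaLocation="local.xsd"/>
    </xs:schema>"""
    print(find_remote_schema_locations(io.BytesIO(demo)))
```

Run against the schema file from the report (`premis.xsd` in the trace), any entries returned are locations a schema parser may try to fetch. The check only inspects the top-level document and does not follow local includes, so it reports less than the parser itself would resolve.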