XMLSchema() uses network lookup

Bug #1234114 reported by Christian Heimes
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
lxml | Confirmed | Medium | Unassigned |

Bug Description

Follow-up to my mail http://permalink.gmane.org/gmane.comp.python.lxml.devel/6940

I wonder why etree.XMLSchema(file="premis.xsd") does a network lookup. As far as I know, lxml doesn't allow network lookups by default. Despite that default, the schema validator tries to download the XSD for xlink from a remote resource.

$ gdb python
GNU gdb (GDB) 7.5.91.20130417-cvs-ubuntu
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/python2.7...Reading symbols from /usr/lib/debug/usr/bin/python2.7...done.
done.
(gdb) break socket
Breakpoint 1 at 0x4170e0
(gdb) run
Starting program: /usr/bin/python2.7
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Python 2.7.4 (default, Apr 19 2013, 18:28:01)
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> etree.LXML_VERSION
(3, 1, 0, 0)
>>> etree.LIBXML_VERSION
(2, 9, 0)
>>> etree.XMLSchema(file="premis.xsd")

Breakpoint 1, socket () at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0 socket () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007ffff59b0755 in have_ipv6 () at ../../nanohttp.c:196
#2 0x00007ffff59b0af8 in xmlNanoHTTPConnectHost (host=host@entry=0xb19190 "www.loc.gov", port=80) at ../../nanohttp.c:1057
#3 0x00007ffff59b2069 in xmlNanoHTTPMethodRedir__internal_alias (URL=0xb184c0 "http://www.loc.gov/standards/xlink/xlink.xsd",
    method=0x7ffff5a345f0 "GET", input=0x0, contentType=0x0, redir=redir@entry=0x0, headers=0x0, ilen=0)
    at ../../nanohttp.c:1385
#4 0x00007ffff59b2403 in xmlNanoHTTPMethod__internal_alias (URL=<optimized out>, method=<optimized out>,
    input=<optimized out>, contentType=<optimized out>, headers=<optimized out>, ilen=<optimized out>) at ../../nanohttp.c:1594
#5 0x00007ffff5973137 in __xmlParserInputBufferCreateFilename (URI=0xb184c0 "http://www.loc.gov/standards/xlink/xlink.xsd",
    enc=XML_CHAR_ENCODING_NONE) at ../../xmlIO.c:2633
#6 0x00007ffff5947380 in xmlNewInputFromFile__internal_alias (ctxt=ctxt@entry=0xa84990,
    filename=filename@entry=0xb184c0 "http://www.loc.gov/standards/xlink/xlink.xsd") at ../../parserInternals.c:1511
#7 0x00007ffff5975605 in xmlDefaultExternalEntityLoader (URL=0xb1bf60 "http://www.loc.gov/standards/xlink/xlink.xsd", ID=0x0,
    ctxt=0xa84990) at ../../xmlIO.c:4044
#8 0x00007ffff6133e81 in ?? () from /usr/lib/python2.7/dist-packages/lxml/etree.so
#9 0x00007ffff597546f in xmlLoadExternalEntity__internal_alias (URL=<optimized out>, ID=0x0, ctxt=0xa84990)
    at ../../xmlIO.c:4100
#10 0x00007ffff5961c60 in xmlCtxtReadFile__internal_alias (ctxt=0xa84990,
    filename=filename@entry=0x9b0811 "http://www.loc.gov/standards/xlink/xlink.xsd", encoding=encoding@entry=0x0,
    options=options@entry=2) at ../../parser.c:15396
#11 0x00007ffff59e2f2a in xmlSchemaAddSchemaDoc (pctxt=pctxt@entry=0x959d70, type=type@entry=1,
    schemaLocation=0x9b0811 "http://www.loc.gov/standards/xlink/xlink.xsd", schemaDoc=schemaDoc@entry=0x0,
    schemaBuffer=schemaBuffer@entry=0x0, schemaBufferLen=schemaBufferLen@entry=0, invokingNode=invokingNode@entry=0xa8fbd0,
    sourceTargetNamespace=sourceTargetNamespace@entry=0x9b0640 "info:lc/xmlns/premis-v2",
    importNamespace=0x9b0623 "http://www.w3.org/1999/xlink", bucket=bucket@entry=0x7fffffffd6e8) at ../../xmlschemas.c:10547
#12 0x00007ffff59ebb9a in xmlSchemaParseImport (node=0xa8fbd0, schema=0xa84850, pctxt=0x959d70) at ../../xmlschemas.c:10823
#13 xmlSchemaParseSchemaTopLevel (nodes=<optimized out>, schema=<optimized out>, ctxt=<optimized out>)
    at ../../xmlschemas.c:9770
#14 xmlSchemaParseNewDocWithContext (pctxt=pctxt@entry=0x959d70, schema=schema@entry=0xa84850, bucket=<optimized out>)
    at ../../xmlschemas.c:10142
#15 0x00007ffff59eec21 in xmlSchemaParse__internal_alias (ctxt=0x959d70) at ../../xmlschemas.c:21355
#16 0x00007ffff618f1f4 in ?? () from /usr/lib/python2.7/dist-packages/lxml/etree.so
#17 0x00000000004b1a1e in type_call.25713 (type=0x7ffff63fc420, args=(), kwds={'file': 'premis.xsd'})
    at ../Objects/typeobject.c:741
#18 0x000000000047c19d in PyObject_Call (kw={'file': 'premis.xsd'}, arg=<optimized out>, func=<type at remote 0x7ffff63fc420>)
    at ../Objects/abstract.c:2529

Python : sys.version_info(major=2, minor=7, micro=4, releaselevel='final', serial=0)
lxml.etree : (3, 1, 0, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 27)
libxslt compiled : (1, 1, 27)

Christian Heimes (heimes) wrote :

scoder (scoder) wrote :

Hmm, interesting. Thanks for bringing this up. Changing the default behaviour will likely break user code (so can't be done in a bug-fix release), but I agree that this is unexpected given lxml's intention to play safe by default (and in any case, there's currently no way at all to switch off network access here).

It might work to set the expected parser options on the internal parser context that the schema parser uses:

http://xmlsoft.org/html/libxml-xmlschemas.html#xmlSchemaValidCtxtGetParserCtxt

Want to give it a try? If that fails, the next best solution would be a hard switch-off in the resolver that the schema parser calls in lxml (as can be seen in your stack trace), but that's certainly a lot more invasive.

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed

Christian Heimes (heimes) wrote :

It's too late. The imports are resolved when the schema is parsed with xmlschema.xmlSchemaParse() in XMLSchema.__init__(). The code has a comment that explains what is going on:

                # calling xmlSchemaParse on a schema with imports or
                # includes will cause libxml2 to create an internal
                # context for parsing, so push an implied context to route
                # resolve requests to the document's parser
                __GLOBAL_PARSER_CONTEXT.pushImpliedContextFromParser(doc._parser)
                self._c_schema = xmlschema.xmlSchemaParse(parser_ctxt)
                __GLOBAL_PARSER_CONTEXT.popImpliedContext()

You have to disable network access right within the implied context. I have attached a script with a minimal test case.

scoder (scoder) wrote :

Thanks for the test case. Interestingly, it parses the schema file in a different way than your original example did, and there is a substantial difference between the two. If you pass in a tree, it remembers its original parser and can reuse its configuration. In your original example, there is only a file path, so there is no additional parser configuration. I wonder what the expected configuration is in both cases, and how to allow users to change it when a plain file path (or file object) is passed. I guess it would use the default parser, but that's a bit far removed from the perspective of a user whose code has just been broken by disabling network access...

scoder (scoder) wrote :

BTW, I do not consider this a security concern or critical issue. It's not a common use case to run validations with schemas that come themselves from untrusted sources, and external imports should always be covered by catalogues (otherwise, that's a configuration problem on the user side). So it's more of an inconvenience and it would help to have an error message that makes users aware of the problem.
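
The catalogue approach mentioned above could look roughly like the following. This is only a sketch under stated assumptions: it assumes a local copy of the imported schema exists, and that libxml2 reads the catalogue named in the XML_CATALOG_FILES environment variable when it first initialises its catalogue support. The helper name write_catalog and all file paths are hypothetical and not part of lxml.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace of the OASIS XML Catalogs format.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def write_catalog(mapping, catalog_path):
    """Write an XML catalogue that maps remote schema URLs to local files.

    `mapping` is a dict of {remote_url: local_file_path}. This helper is a
    hypothetical illustration, not an lxml or libxml2 API.
    """
    # Serialise the catalogue namespace as the default namespace.
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element("{%s}catalog" % CATALOG_NS)
    for remote_url, local_path in mapping.items():
        # A <uri> entry redirects lookups of `name` to the local `uri`.
        ET.SubElement(
            root,
            "{%s}uri" % CATALOG_NS,
            {"name": remote_url, "uri": Path(local_path).resolve().as_uri()},
        )
    ET.ElementTree(root).write(
        catalog_path, encoding="utf-8", xml_declaration=True
    )
    return catalog_path
```

To try it, one would call write_catalog({"http://www.loc.gov/standards/xlink/xlink.xsd": "schemas/xlink.xsd"}, "catalog.xml") and export XML_CATALOG_FILES=catalog.xml before starting Python. Whether the import is then resolved locally should be verified, for example with the gdb breakpoint on socket shown in the bug description.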

Christian Heimes (heimes) wrote :

I agree, it is neither serious nor a security issue. We hadn't noticed the issue before because it used to work all the time.

About the test case:
Yes, it's slightly different because it was more convenient to put the XSD into the same Python file. The outcome is the same for an etree, file=BytesIO() and file=filename.xsd. In all three cases the XSD is loaded from a remote resource despite the default setting no_network=True.

Jon (jon-work) wrote :

While I agree that "external imports should always be covered by catalogues (otherwise, that's a configuration problem on the user side)", it's far too easy for the user to get it wrong.

If the user gets this wrong then it will appear to work, with no notification to the programmer/tester that it's making an HTTP request. But if the Internet goes down then the program won't work. And worse, an attacker who can MitM network traffic can replace the schema that was requested via HTTP with an XML schema of their choice. An attacker could do the "quadratic blowup" attack or a straight DoS. Depending on the schema in question, the attacker may be able to make more subtle changes that cause the program to fail in an attacker-chosen way, e.g. changing the default value of some security-relevant attribute.

Could lxml please add an option to disable network access and just fail if an HTTP URL is requested?
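
Until such an option exists, the silent remote fetch described here can at least be made visible with a pre-flight check. The following is a minimal sketch using only the Python standard library, independent of lxml. The function name find_remote_schema_locations is hypothetical, and the check inspects a single schema document only: it does not follow imports transitively and does not resolve relative locations.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# XSD elements that can pull in another schema document.
REFERENCING_TAGS = {
    "{%s}%s" % (XSD_NS, name) for name in ("import", "include", "redefine")
}

# URL schemes treated as "remote" for this check (an assumption).
REMOTE_SCHEMES = ("http", "https", "ftp")


def find_remote_schema_locations(xsd_text):
    """Return the schemaLocation values in `xsd_text` that use a remote URL."""
    root = ET.fromstring(xsd_text)
    remote = []
    for element in root.iter():
        if element.tag in REFERENCING_TAGS:
            location = element.get("schemaLocation")
            if location and urlparse(location).scheme in REMOTE_SCHEMES:
                remote.append(location)
    return remote
```

A test suite could call this on each schema it ships and fail if the returned list is not empty, which turns the "appears to work" case into an explicit error.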

Michael Clerx (michaelclerx) wrote :

Just ran into this issue as well; it's very awkward as I can't rely on lxml to pass/fail tests reliably without control over network access. Please add an option and disable network access by default!

scoder (scoder) wrote :

PR welcome.

Steven Kalt (kalt.steven) wrote :

I took a stab at fixing the bug and got as far as integrating Christian Heimes' test from #3 into src/lxml/test/test_xmlschema.py. The result is at https://github.com/SKalt/lxml/tree/xmlschema-uses-network-bug-1234114, in case anyone wants to grab that amount of work. I'm throwing that work out there and throwing in the towel after:
  - trying to use the simple uwsgi server from src/lxml/test/dummy_server.py to serve an example included .xsd. Localhost and 0.0.0.0 registered no hits on a HTTPRequestCollector. The requests may have been ignored, possibly by _avoid_hosts in src/lxml/html/clean.py, but my bet is on an unknown within libxml or the finer points of networking.
  - hail-mary-ing passage of an optional parser arg into src/lxml/xmlschema.pxi and using the test which I've seen fit to plagiarize/push.

Hope this saves a future debugger some time.
