I compared your patches with my patch for Python's standard library. Your approach seems to work well but I'm not sure if it intercepts all possible attacks. When you overwrite DefaultHandler you don't get any error messages when an entity occurs in the XML document.
I like to suggest a different approach. You can either reject all XML documents with an inline DTD (SOAP does it) or you can disallow entity declarations:
class ProtectedExpatParser(ExpatParser):
def __init__(self, forbid_dtd=True, forbid_entities=True, *args, **kwargs):
# Python 2.x old style class ExpatParser.__init__(self, *args, **kwargs) self.forbid_dtd = forbid_dtd self.forbid_entities = forbid_entities
def reset(self): ExpatParser.reset(self)
if self.forbid_dtd: self._parser.StartDoctypeDeclHandler = self.start_doctype_decl
if self.forbid_entities: self._parser.EntityDeclHandler = self.entity_decl self._parser.UnparsedEntityDeclHandler = self.unparsed_entity_decl
With this parser errors and attacks aren't ignored silently.
Your fix for LXML may contain a flaw, too. LXML used to have an issue when parser is shared by multiple threads. I don't know if your are using threads nor if LXML still has a problem with shared parser instances. At work we are using a thread local storage to have a parser context for each thread.
I compared your patches with my patch for Python's standard library. Your approach seems to work well but I'm not sure if it intercepts all possible attacks. When you overwrite DefaultHandler you don't get any error messages when an entity occurs in the XML document.
I like to suggest a different approach. You can either reject all XML documents with an inline DTD (SOAP does it) or you can disallow entity declarations:
class ProtectedExpatP arser(ExpatPars er): entities= True,
*args, **kwargs):
ExpatParser. __init_ _(self, *args, **kwargs)
self.forbid_ dtd = forbid_dtd
self.forbid_ entities = forbid_entities
def __init__(self, forbid_dtd=True, forbid_
# Python 2.x old style class
def start_doctype_ decl(self, name, sysid, pubid, has_internal_ subset) :
raise ValueError("Inline DTD forbidden")
def entity_decl(self, entityName, is_parameter_ entity, value, base, systemId, publicId, notationName): "<!ENTITY> forbidden")
raise ValueError(
def unparsed_ entity_ decl(self, name, base, sysid, pubid, notation_name): "<!ENTITY> forbidden")
# expat 1.2
raise ValueError(
def reset(self):
ExpatParser. reset(self)
self. _parser. StartDoctypeDec lHandler = self.start_ doctype_ decl entities:
self. _parser. EntityDeclHandl er = self.entity_decl
self. _parser. UnparsedEntityD eclHandler = self.unparsed_ entity_ decl
if self.forbid_dtd:
if self.forbid_
With this parser errors and attacks aren't ignored silently.
Your fix for LXML may contain a flaw, too. LXML used to have an issue when parser is shared by multiple threads. I don't know if your are using threads nor if LXML still has a problem with shared parser instances. At work we are using a thread local storage to have a parser context for each thread.