threading problem when using target parser
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Invalid | Undecided | Unassigned |
Bug Description
I think I found an issue when running multiple parsers in parallel: threads do not speed up parsing. The attached code runs the following:
one thread does 100 tasks
two threads run 50 tasks each
The two-thread run should take about the same time as the single-thread run.
Running the attached code gives:
Python : sys.version_
lxml.etree : (3, 5, 0, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
Testing one thread (100 job per thread)
Time: 1.10558
Testing two threads (50 jobs per thread)
Time: 1.79999
On a different machine it gives:
Python : sys.version_
lxml.etree : (3, 3, 3, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
Testing one thread (100 job per thread)
Time: 0.69747
Testing two threads (50 jobs per thread)
Time: 1.91676
According to the FAQ (http://
If I uncomment the following line in the code:
##with lock: ## TODOTODO uncommenting this makes it run in the same time as expected.
then the two-thread run takes the same time as the single-thread run, as expected.
Sorry if I have misunderstood how the parser is supposed to work, but I can't find any other solution at the moment.
Using a Python class as parser target means that the parser needs to call into Python for each parse event, which requires serialisation in order to acquire the GIL (to execute that Python code). That's why you do not see any speedup.
If you remove the target and let the parser build a normal tree, it should be much faster in parallel.
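To compare the two cases yourself, something along these lines may help. It is only a sketch, not the attached script: `CountingTarget`, the `XML` test document, the helper names and the job counts are all made up for illustration. One job parses with a Python target class (a Python callback per parse event), the other lets the parser build a normal tree.

```python
import threading
import time

from lxml import etree


class CountingTarget:
    """Hypothetical parser target: every parse event calls back into Python."""

    def __init__(self):
        self.count = 0

    def start(self, tag, attrib):
        self.count += 1

    def end(self, tag):
        pass

    def data(self, data):
        pass

    def close(self):
        # The value returned here is what etree.fromstring() returns.
        count, self.count = self.count, 0
        return count


# Made-up test document: 5,000 <item> elements under one root.
XML = b"<root>" + b"<item a='1'>text</item>" * 5000 + b"</root>"


def parse_with_target():
    # A fresh parser per call, so no parser is shared between threads.
    parser = etree.XMLParser(target=CountingTarget())
    return etree.fromstring(XML, parser)


def parse_to_tree():
    # No target: the parser builds a normal tree.
    root = etree.fromstring(XML)
    return len(root) + 1  # child elements plus the root itself


def timed(job, n_threads, jobs_per_thread):
    """Run job() jobs_per_thread times in each of n_threads threads."""

    def worker():
        for _ in range(jobs_per_thread):
            job()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    started = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - started


if __name__ == "__main__":
    for job in (parse_with_target, parse_to_tree):
        one = timed(job, 1, 20)
        two = timed(job, 2, 10)
        print(f"{job.__name__}: 1x20 jobs {one:.3f}s, 2x10 jobs {two:.3f}s")
```

The absolute numbers depend on the machine and on the lxml and libxml2 versions, so treat them as a rough indication only.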