lxml

threading problem when using target parser

Bug #1707941 reported by Daniel on 2017-08-01

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
lxml	Invalid	Undecided	Unassigned

Bug Description

I think I found an issue when running multiple parsers in parallel: threads do not speed up parsing. The attached code runs the following:

one thread does 100 tasks
two threads run 50 tasks each

The two-thread run should take around the same time as the one-thread run.

Running the attached code gives:

Python : sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 5, 0, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
Testing one thread (100 job per thread)
Time: 1.10558
Testing two threads (50 jobs per thread)
Time: 1.79999

On a different machine it gives:
Python : sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
lxml.etree : (3, 3, 3, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
Testing one thread (100 job per thread)
Time: 0.69747
Testing two threads (50 jobs per thread)
Time: 1.91676
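
As a quick check on the figures reported above, the relative slowdown of the two-thread run can be computed directly. This is only a small sketch; the dictionary labels and the `slowdown` helper are illustrative names, and the timings are copied from the two outputs.

```python
# Timings in seconds, copied from the two reports above:
# (one thread with 100 jobs, two threads with 50 jobs each).
TIMINGS = {
    "Python 3.5.2 / lxml 3.5.0": (1.10558, 1.79999),
    "Python 2.7.6 / lxml 3.3.3": (0.69747, 1.91676),
}


def slowdown(one_thread, two_threads):
    """Ratio of the two-thread wall time to the one-thread wall time."""
    return two_threads / one_thread


for label, (one, two) in TIMINGS.items():
    ratio = slowdown(one, two)
    print(f"{label}: two threads take {ratio:.2f}x the single-thread time")
```

With these numbers the two-thread run takes roughly 1.6 times as long on the first machine and roughly 2.7 times as long on the second.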

According to the FAQ (http://lxml.de/FAQ.html#can-i-use-threads-to-concurrently-access-the-lxml-api), two threads should work faster than one thread. Additionally, you can see that each thread creates its own parser object.

If the following line in the code is uncommented:
##with lock: ## TODOTODO uncommenting this makes it run in the same time as expected.

then the two-thread run takes the same time as the one-thread run, as expected.

Sorry if I have misunderstood how the parser is supposed to work, but I can't find any other solution at the moment.

Daniel (pudding2) wrote on 2017-08-01:

example code (1.6 KiB, text/x-python)

scoder (scoder) wrote on 2017-08-02:

Using a Python class as parser target means that the parser needs to call into Python for each parse event, which requires serialisation in order to acquire the GIL (to execute that Python code). That's why you do not see any speedup.
If you remove the target and let the parser build a normal tree, it should be much faster in parallel.
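
The difference described here can be sketched roughly as follows. This is not the attached example code (which is not shown in the report); `DATA`, `ParagraphCounter`, `parse_with_target`, `parse_to_tree` and `run_in_threads` are hypothetical names, and the input document is made up. The first worker uses a Python class as parser target, so every parse event calls back into Python; the second lets lxml build a normal tree.

```python
import threading
import time

from lxml import etree

# Hypothetical input; the document used in the original attachment is not shown.
DATA = "<html><body>" + "<p>hello</p>" * 1000 + "</body></html>"


class ParagraphCounter:
    """Parser target written in Python: each callback needs the GIL."""

    def __init__(self):
        self.count = 0

    def start(self, tag, attrib):
        if tag == "p":
            self.count += 1

    def close(self):
        # Return the result and reset so the parser can be reused.
        count, self.count = self.count, 0
        return count


def parse_with_target(jobs):
    parser = etree.HTMLParser(target=ParagraphCounter())  # one parser per thread
    return [etree.HTML(DATA, parser=parser) for _ in range(jobs)]


def parse_to_tree(jobs):
    parser = etree.HTMLParser()  # no target: lxml builds a normal tree
    return [sum(1 for _ in etree.HTML(DATA, parser=parser).iter("p"))
            for _ in range(jobs)]


def run_in_threads(worker, n_threads, jobs_per_thread):
    """Run worker(jobs_per_thread) in n_threads threads; return (seconds, results)."""
    results = [None] * n_threads

    def task(index):
        results[index] = worker(jobs_per_thread)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(n_threads)]
    started = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - started, results


if __name__ == "__main__":
    for worker in (parse_with_target, parse_to_tree):
        for n_threads in (1, 2):
            elapsed, _ = run_in_threads(worker, n_threads, 100 // n_threads)
            print(f"{worker.__name__}, {n_threads} thread(s): {elapsed:.3f}s")
```

The timings printed will vary by machine and by Python and lxml version, so no expected output is given.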

Changed in lxml:
status:	New → Invalid

Daniel (pudding2) wrote on 2017-08-02:

Firstly, thanks for your response. You are correct: I tried removing the target and it does improve the speed in a threaded environment.

But I guess my question was not why the threaded version isn't faster, but why it's much slower (by a factor of 2).

As you would guess, putting "with lock:" before the "lxml.etree.HTML(DATA,parser=parser)" line means that the threaded version (running more than one thread) speeds up to match the single-threaded version.

Is acquiring the GIL with more than one thread so expensive that the threads end up waiting on each other?

scoder (scoder) wrote on 2017-08-02:

Both threads basically block each other here by trying to acquire the interpreter lock (the GIL) all the time. Waiting for the lock takes time, and handling the GIL itself also takes time. In addition to not giving any benefit at all, the locking/switching/unlocking overhead is so large that it considerably slows down the program execution.

Note that Py3.5 already has better overall GIL locking performance in your example, but the overhead is still large.

When you serialise both parser runs with your explicit lock, the locking overhead during parsing goes away entirely and you end up with the bare single thread parser performance for each of the two runs.
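
A minimal sketch of that serialisation, assuming a setup similar to the one described in the report: the `with lock:` line and the `etree.HTML(DATA, parser=parser)` call come from the discussion, while `DATA`, `TagCounter`, `worker` and `run` are hypothetical names chosen for illustration.

```python
import threading

from lxml import etree

# Hypothetical input; the original attachment's document is not shown.
DATA = "<html><body>" + "<p>x</p>" * 200 + "</body></html>"

# One explicit lock shared by all threads.
lock = threading.Lock()


class TagCounter:
    """Python parser target; counts <p> start events."""

    def __init__(self):
        self.tags = 0

    def start(self, tag, attrib):
        if tag == "p":
            self.tags += 1

    def close(self):
        tags, self.tags = self.tags, 0
        return tags


def worker(jobs, results, use_lock):
    parser = etree.HTMLParser(target=TagCounter())  # one parser per thread
    for _ in range(jobs):
        if use_lock:
            # Only one thread parses at a time, so the threads do not
            # compete for the GIL on every parse event.
            with lock:
                results.append(etree.HTML(DATA, parser=parser))
        else:
            results.append(etree.HTML(DATA, parser=parser))


def run(n_threads, jobs_per_thread, use_lock):
    results = []
    threads = [
        threading.Thread(target=worker, args=(jobs_per_thread, results, use_lock))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling `run(2, 50, use_lock=True)` corresponds to the two-thread case with the explicit lock; the parse results are the same either way, only the contention differs.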

Daniel (pudding2) wrote on 2017-08-03:

Okay, well, thank you. I'll close this then.
