document_fromstring

Bug #412931 reported by commissar
This bug affects 1 person
Affects   Status        Importance  Assigned to  Milestone
libxml2   Fix Released  Critical
lxml      Invalid       Medium      Unassigned

Bug Description

I use lxml to parse the contents of the URL http://www.dtzww.cn/files/article/fulltext/23/23208.html. The call blocks forever and does not raise an exception. CPU utilization stays at 100%.

My environment is lxml 2.2.2, Ubuntu 8.04 amd64 server, Python 2.5.2. My code is as follows:

import lxml.html as htmltool
import urllib

url = "http://www.dtzww.cn/files/article/fulltext/23/23208.html"
f = urllib.urlopen(url)
data = f.read()

doc = htmltool.document_fromstring(data)  # <--- blocks here

The following code, however, works:

====sample 1:
import lxml.html as htmltool
import urllib

url = "http://www.dtzww.cn/files/article/fulltext/23/23208.html"
doc = htmltool.parse(url)  # <--- ok

====My question:
Why does document_fromstring not work? And are lxml.html.parse and lxml.html.document_fromstring not meant to be used in the same way?

Thank you

scoder (scoder) wrote :

Reproducible with libxml2 2.7.3. This seems to be a problem in libxml2 rather than lxml. I'll ask over there.
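On the second question, the two functions take different inputs and return different objects, which the sketch below illustrates. It uses a small inline HTML string as a stand-in (an assumption for illustration, not the page from the report) and reads it through io.BytesIO instead of a URL. It does not reproduce the hang, and nothing in this report establishes why only one of the two calls was affected.

```python
import io

import lxml.html

# Inline sample markup: an assumption for illustration, not the reported page.
data = b"<html><head><title>Example</title></head><body><p>Hello</p></body></html>"

# document_fromstring() takes the markup itself (a string or bytes)
# and returns the root <html> element directly.
root_from_string = lxml.html.document_fromstring(data)

# parse() takes a file name, a URL, or a file-like object and returns
# an ElementTree; getroot() yields the root element.
tree = lxml.html.parse(io.BytesIO(data))
root_from_parse = tree.getroot()

print("document_fromstring root: %s" % root_from_string.tag)
print("parse root: %s" % root_from_parse.tag)
```

So htmltool.parse(url) in the report gives back a tree object, and the element comparable to the result of document_fromstring(data) would be doc.getroot().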

Changed in lxml:
assignee: nobody → Stefan Behnel (scoder)
importance: Undecided → Medium
status: New → Confirmed
commissar (commissarster) wrote :

OK, thanks.

scoder (scoder) wrote :

This has been fixed in libxml2, likely to be released in 2.7.4.

http://bugzilla.gnome.org/show_bug.cgi?id=592430
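To see which libxml2 a given lxml installation actually runs against, something like the following could be used. It is only a sketch: the threshold 2.7.4 is an assumption taken from the "likely to be released in 2.7.4" wording above, and the helper name libxml2_has_assumed_fix is made up for this example.

```python
from lxml import etree

# Assumption: the fix ships in libxml2 2.7.4, as expected in this comment.
ASSUMED_FIXED_VERSION = (2, 7, 4)


def libxml2_has_assumed_fix(runtime_version=None):
    """Return True if the given libxml2 version is at least ASSUMED_FIXED_VERSION.

    Hypothetical helper; defaults to the libxml2 version lxml is running with.
    """
    if runtime_version is None:
        # etree.LIBXML_VERSION is a tuple such as (2, 7, 3).
        runtime_version = etree.LIBXML_VERSION
    return tuple(runtime_version) >= ASSUMED_FIXED_VERSION


print("libxml2 in use: %s" % ".".join(str(n) for n in etree.LIBXML_VERSION))
print("at or past assumed fix version: %s" % libxml2_has_assumed_fix())
```

etree.LIBXML_VERSION reports the library loaded at runtime, while etree.LIBXML_COMPILED_VERSION reports the one lxml was built against; the runtime value is the one that matters for this bug.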

scoder (scoder) wrote :

Marking this as "invalid", as it can't be solved in lxml itself.

Changed in lxml:
assignee: Stefan Behnel (scoder) → nobody
status: Confirmed → Invalid
Changed in libxml2:
status: Unknown → Fix Released
Changed in libxml2:
importance: Unknown → Critical