lxml

lxml pypy make_links_absolute error

Bug #1308241 reported by Viacheslav Biriukov on 2014-04-15

This bug affects 2 people

Affects  Status     Importance  Assigned to  Milestone
lxml     Confirmed  Low         scoder

Bug Description

Code:

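    import lxml.html

    # page_body, page_encoding and link are supplied by the surrounding code (not shown)
    parser = lxml.html.HTMLParser(encoding=page_encoding)
    lxml_html = lxml.html.fromstring(page_body, parser=parser)
    lxml_html.make_links_absolute(link)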

Trace:

...
    lxml_html.make_links_absolute(link)
  File "/home/site-packages/lxml/html/__init__.py", line 340, in make_links_absolute
    self.rewrite_links(link_repl)
  File "/home/site-packages/lxml/html/__init__.py", line 468, in rewrite_links
    for el, attrib, link, pos in self.iterlinks():
  File "/home/site-packages/lxml/html/__init__.py", line 381, in iterlinks
    for el in self.iter(etree.Element):
  File "lxml.etree.pyx", line 2752, in lxml.etree.ElementDepthFirstIterator.__next__ (src/lxml/lxml.etree.c:65185)
  File "lxml.etree.pyx", line 1537, in lxml.etree._elementFactory (src/lxml/lxml.etree.c:50951)
AttributeError: '_Document' object has no attribute '_init'

Versions:

pypy: PyPy 2.2.1 with GCC 4.6.3

ii libxml2 2.7.8.dfsg-5.1ubuntu4.6 GNOME XML library
ii libxml2-dev 2.7.8.dfsg-5.1ubuntu4.6 Development files for the GNOME XML library
ii libxml2-utils 2.7.8.dfsg-5.1ubuntu4.6 XML utilities
ii libxslt1-dev 1.1.26-8ubuntu1.3 XSLT 1.0 processing library - development kit
ii libxslt1.1 1.1.26-8ubuntu1.3 XSLT 1.0 processing library - runtime library

$ pypy ./get_versions.py (your code)
Python : (major=2, minor=7, micro=3, releaselevel='final', serial=42)
lxml.etree : (3, 3, 4, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
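
The get_versions.py script itself is not part of this report. A minimal sketch that prints output of the same shape is given here, assuming the standard version attributes exposed by lxml.etree; the helper name collect_versions is made up for illustration, and the fallback for a missing lxml install is an addition so the script runs anywhere.

```python
import sys


def collect_versions():
    """Return (label, version) pairs for Python and, if importable, lxml."""
    rows = [('Python', sys.version_info)]
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed; report only the interpreter version.
        return rows
    rows += [
        ('lxml.etree', etree.LXML_VERSION),
        ('libxml used', etree.LIBXML_VERSION),
        ('libxml compiled', etree.LIBXML_COMPILED_VERSION),
        ('libxslt used', etree.LIBXSLT_VERSION),
        ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION),
    ]
    return rows


for label, version in collect_versions():
    print('%-16s: %s' % (label, version))
```

Comparing the "used" and "compiled" rows shows whether lxml runs against the same libxml2/libxslt versions it was built with; in this report they match.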

Viacheslav Biriukov (v-v-biriukov) on 2014-04-15

description:	updated
summary:	- lxml pypy lxml_html.make_links_absolute error
	+ lxml pypy make_links_absolute error

scoder (scoder) wrote on 2014-04-16:

That's a very unexpected error. What's the page you are parsing?

Changed in lxml:
status:	New → Triaged

Viacheslav Biriukov (v-v-biriukov) wrote on 2014-04-16:

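My code:

    import urlparse  # Python 2 module; the report is against Python 2.7.3 on PyPy

    import lxml.html
    import requests

    url = 'https://en.wikipedia.org/wiki/The_Matrix'
    parsed_uri = urlparse.urlparse(url)
    # Base URL for link resolution: scheme and host only
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

    print url
    print domain

    r = requests.get(url)

    # Parse the page, then rewrite its links against the base URL
    lxml_html = lxml.html.fromstring(r.text)
    lxml_html.make_links_absolute(domain)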
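
For context on what the failing call is meant to do: a minimal stdlib-only sketch of the per-link resolution, assuming make_links_absolute is equivalent to joining each link with the base URL via urljoin. The sample hrefs are hypothetical, not taken from the page. The sketch does not use lxml, so it illustrates the intended result rather than reproducing the error.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2, as in the report

url = 'https://en.wikipedia.org/wiki/The_Matrix'
parsed_uri = urlparse(url)
domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

# Hypothetical hrefs of the kinds a page like this might contain.
sample_links = [
    '/wiki/Keanu_Reeves',            # path-absolute link
    '//upload.wikimedia.org/x.png',  # protocol-relative link
    'https://example.org/a',         # already absolute
    '#cite_note-1',                  # fragment-only link
]

# Resolve every link against the base URL.
resolved = [urljoin(domain, href) for href in sample_links]

for href, absolute in zip(sample_links, resolved):
    print('%-30s -> %s' % (href, absolute))
```

Because the base here is the site root rather than the page URL, the fragment-only link resolves against the root and not against the article itself; passing the full page URL as the base would keep such links on the page.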

scoder (scoder) wrote on 2014-04-25:

Thanks, I can reproduce it.

Changed in lxml:
assignee:	nobody → scoder (scoder)
importance:	Undecided → Low
status:	Triaged → Confirmed

scoder (scoder) wrote on 2014-04-25:

However, "reproduce" doesn't mean I can (easily) fix it. It looks more like PyPy is misbehaving here, so it would be better if a PyPy developer could look into this.
