lxml pypy make_links_absolute error

Bug #1308241 reported by Viacheslav Biriukov on 2014-04-15
This bug affects 2 people
Affects: lxml
Status: Confirmed
Importance: Low
Assigned to: scoder

Bug Description

Code:

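    import lxml.html

    # page_body, page_encoding and link are defined elsewhere (not shown in the report)
    parser = lxml.html.HTMLParser(encoding=page_encoding)
    lxml_html = lxml.html.fromstring(page_body, parser=parser)
    lxml_html.make_links_absolute(link)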

Trace:

...
   lxml_html.make_links_absolute(link)
  File "/home/site-packages/lxml/html/__init__.py", line 340, in make_links_absolute
    self.rewrite_links(link_repl)
  File "/home/site-packages/lxml/html/__init__.py", line 468, in rewrite_links
    for el, attrib, link, pos in self.iterlinks():
  File "/home/site-packages/lxml/html/__init__.py", line 381, in iterlinks
    for el in self.iter(etree.Element):
  File "lxml.etree.pyx", line 2752, in lxml.etree.ElementDepthFirstIterator.__next__ (src/lxml/lxml.etree.c:65185)
  File "lxml.etree.pyx", line 1537, in lxml.etree._elementFactory (src/lxml/lxml.etree.c:50951)
AttributeError: '_Document' object has no attribute '_init'

Versions:

pypy: PyPy 2.2.1 with GCC 4.6.3

ii libxml2 2.7.8.dfsg-5.1ubuntu4.6 GNOME XML library
ii libxml2-dev 2.7.8.dfsg-5.1ubuntu4.6 Development files for the GNOME XML library
ii libxml2-utils 2.7.8.dfsg-5.1ubuntu4.6 XML utilities
ii libxslt1-dev 1.1.26-8ubuntu1.3 XSLT 1.0 processing library - development kit
ii libxslt1.1 1.1.26-8ubuntu1.3 XSLT 1.0 processing library - runtime library

$ pypy ./get_versions.py (your code)
Python : (major=2, minor=7, micro=3, releaselevel='final', serial=42)
lxml.etree : (3, 3, 4, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

description: updated
summary: - lxml pypy lxml_html.make_links_absolute error
+ lxml pypy make_links_absolute error
scoder (scoder) wrote :

That's a very unexpected error. What's the page you are parsing?

Changed in lxml:
status: New → Triaged

My code:

    import urlparse

    import lxml.html
    import requests

    url = 'https://en.wikipedia.org/wiki/The_Matrix'

    # Build the base URL (scheme + host) used to make links absolute
    parsed_uri = urlparse.urlparse(url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

    print url
    print domain

    r = requests.get(url)

    # The AttributeError is raised inside make_links_absolute()
    lxml_html = lxml.html.fromstring(r.text)
    lxml_html.make_links_absolute(domain)
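
For reference, the base URL this script computes, and the kind of resolution make_links_absolute() is expected to perform on each link, can be sketched with the standard library alone. This is only an illustration: `urljoin` stands in for the per-link rewriting, the hrefs in `sample_links` are hypothetical and not taken from the page, and the snippet does not touch lxml, so it cannot trigger the error above.

```python
try:
    from urlparse import urlparse, urljoin  # Python 2 / PyPy 2.x
except ImportError:
    from urllib.parse import urlparse, urljoin  # Python 3

url = 'https://en.wikipedia.org/wiki/The_Matrix'

# Same base-URL computation as in the reproducer above
parsed_uri = urlparse(url)
domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

# Hypothetical hrefs of the kinds commonly found in a page
sample_links = [
    '/wiki/Keanu_Reeves',            # root-relative
    '#Plot',                         # fragment only
    '//upload.wikimedia.org/a.png',  # protocol-relative
]

# Resolve each href against the base URL
resolved = [urljoin(domain, href) for href in sample_links]

print(domain)
for href, absolute in zip(sample_links, resolved):
    print('%s -> %s' % (href, absolute))
```

Because the base passed in is the site root rather than the page URL, a fragment-only link such as `#Plot` resolves against the root and not against the article.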

scoder (scoder) wrote :

Thanks, I can reproduce it.

Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Undecided → Low
status: Triaged → Confirmed
scoder (scoder) wrote :

However, "reproduce" doesn't mean I can (easily) fix it. It looks more like PyPy is misbehaving here, so it would be better if a PyPy developer could look into this.
