lxml pypy make_links_absolute error

Bug #1308241 reported by Viacheslav Biriukov
This bug affects 2 people
Affects  Status     Importance  Assigned to  Milestone
lxml     Confirmed  Low         scoder

Bug Description

Code:

    parser = lxml.html.HTMLParser(encoding=page_encoding)
    lxml_html = lxml.html.fromstring(page_body, parser=parser)
    lxml_html.make_links_absolute(link)

Trace:

...
   lxml_html.make_links_absolute(link)
  File "/home/site-packages/lxml/html/__init__.py", line 340, in make_links_absolute
    self.rewrite_links(link_repl)
  File "/home/site-packages/lxml/html/__init__.py", line 468, in rewrite_links
    for el, attrib, link, pos in self.iterlinks():
  File "/home/site-packages/lxml/html/__init__.py", line 381, in iterlinks
    for el in self.iter(etree.Element):
  File "lxml.etree.pyx", line 2752, in lxml.etree.ElementDepthFirstIterator.__next__ (src/lxml/lxml.etree.c:65185)
  File "lxml.etree.pyx", line 1537, in lxml.etree._elementFactory (src/lxml/lxml.etree.c:50951)
AttributeError: '_Document' object has no attribute '_init'

Versions:

pypy: PyPy 2.2.1 with GCC 4.6.3

ii libxml2 2.7.8.dfsg-5.1ubuntu4.6 GNOME XML library
ii libxml2-dev 2.7.8.dfsg-5.1ubuntu4.6 Development files for the GNOME XML library
ii libxml2-utils 2.7.8.dfsg-5.1ubuntu4.6 XML utilities
ii libxslt1-dev 1.1.26-8ubuntu1.3 XSLT 1.0 processing library - development kit
ii libxslt1.1 1.1.26-8ubuntu1.3 XSLT 1.0 processing library - runtime library

$ pypy ./get_versions.py (your code)
Python : (major=2, minor=7, micro=3, releaselevel='final', serial=42)
lxml.etree : (3, 3, 4, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
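
The contents of get_versions.py are not shown in this report. A minimal sketch of such a script, assuming it resembles the version-reporting snippet from the lxml FAQ, could look like the following. The function name collect_versions and the "not installed" fallback are illustrative additions; the etree version attributes are the ones lxml documents.

```python
import sys


def collect_versions():
    """Return (label, value) pairs in the same order as the listing above."""
    rows = [("Python", sys.version_info)]
    try:
        from lxml import etree
    except ImportError:
        # Fallback added for illustration so the sketch runs without lxml.
        rows.append(("lxml.etree", "not installed"))
        return rows
    rows.extend([
        ("lxml.etree", etree.LXML_VERSION),
        ("libxml used", etree.LIBXML_VERSION),
        ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
        ("libxslt used", etree.LIBXSLT_VERSION),
        ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
    ])
    return rows


for label, value in collect_versions():
    print("%-16s: %s" % (label, value))
```

The "used" values describe the libraries loaded at runtime and the "compiled" values the ones lxml was built against, so a mismatch between the two is worth mentioning in a report like this one.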

description: updated
summary: - lxml pypy lxml_html.make_links_absolute error
+ lxml pypy make_links_absolute error
scoder (scoder) wrote :

That's a very unexpected error. What's the page you are parsing?

Changed in lxml:
status: New → Triaged
Viacheslav Biriukov (v-v-biriukov) wrote :

My code:

    import urlparse

    import lxml.html
    import requests

    url = 'https://en.wikipedia.org/wiki/The_Matrix'

    # Reduce the page URL to its scheme and host, used as the base for links.
    parsed_uri = urlparse.urlparse(url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

    print url
    print domain

    r = requests.get(url)

    # The AttributeError from the trace is raised by the last call.
    lxml_html = lxml.html.fromstring(r.text)
    lxml_html.make_links_absolute(domain)
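
For reference, here is a sketch of the result the script above expects, written with the standard library only so it does not touch the failing lxml code path. It assumes that making links absolute amounts to resolving each link against the base URL with urljoin; the helper names site_root and absolutize and the sample links are hypothetical.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2, as used in the report


def site_root(url):
    """Reduce a page URL to 'scheme://host/', like `domain` in the script above."""
    parts = urlparse(url)
    return "{0}://{1}/".format(parts.scheme, parts.netloc)


def absolutize(links, base_url):
    """Resolve each link against base_url; absolute links pass through unchanged."""
    return [urljoin(base_url, link) for link in links]


base = site_root("https://en.wikipedia.org/wiki/The_Matrix")
sample_links = ["/wiki/Keanu_Reeves", "https://example.org/page"]  # made-up inputs
for resolved in absolutize(sample_links, base):
    print(resolved)
```

Because this avoids lxml entirely, it can serve as a comparison point when checking the same page under CPython and PyPy.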

scoder (scoder) wrote :

Thanks, I can reproduce it.

Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Undecided → Low
status: Triaged → Confirmed
scoder (scoder) wrote :

However, "reproduce" doesn't mean I can (easily) fix it. It looks more like PyPy is misbehaving here, so it would be better if a PyPy developer could look into this.
