Mojibake Unicode data in lxml.html.fromstring function

Bug #1338546 reported by vahid kharazi
This bug affects 1 person
Affects: lxml
Status: Triaged
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I have Python code that fetches many sites, and on one site I get mojibake.
This is the fetching part:

            content = self.urllib2.urlopen(self.url).read()
            content = unicodedata.normalize('NFKD', unicode(content, 'utf-8')).encode(
                'utf-8',
                'ignore'
            )
            print "-------------------"
            print content # correct unicode
            self.page = fromstring(content)
            print page # mojibake characters

The site is: http://www.eghtesadonline.com/fa/content/51805/%D8%A7%D9%86%D8%AA%D9%82%D8%A7%D8%AF-%D8%B4%D8%AF%DB%8C%D8%AF-%D8%A8%D9%87-%D8%A7%D9%81%D8%B2%D8%A7%DB%8C%D8%B4-%D8%AD%D9%82-%D8%A8%DB%8C%D9%85%D9%87-%D8%B4%D8%AE%D8%B5-%D8%AB%D8%A7%D9%84%D8%AB

When parsing with Beautiful Soup (bs4), my code works correctly.

scoder (scoder) wrote:

If you already know it's utf-8 encoded, either use an HTMLParser(encoding='utf8') or pass the decoded Unicode string into the parser instead of encoding it back into UTF-8 first.
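
A minimal sketch of the two options described in this comment, assuming the page really is UTF-8. The sample markup and the variable names (sample, raw, doc_from_bytes, doc_from_text) are made up for illustration and are not taken from the reporter's site:

```python
from lxml import html

# Hypothetical stand-in for the fetched page: a short Persian word
# ("salam") wrapped in minimal HTML and encoded as UTF-8 bytes.
sample = u"\u0633\u0644\u0627\u0645"
raw = (u"<html><body><p>" + sample + u"</p></body></html>").encode("utf-8")

# Option 1: keep the bytes, but tell the parser which encoding they use.
parser = html.HTMLParser(encoding="utf-8")
doc_from_bytes = html.fromstring(raw, parser=parser)

# Option 2: decode to a Unicode string first and parse that directly,
# without encoding it back to UTF-8.
doc_from_text = html.fromstring(raw.decode("utf-8"))

# Both documents should now contain the original text unchanged.
text_from_bytes = doc_from_bytes.text_content()
text_from_text = doc_from_text.text_content()
```

Either way the parser no longer has to guess the encoding of the input, which is the step that can go wrong when undeclared UTF-8 bytes are passed in.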

scoder (scoder)
Changed in lxml:
status: New → Triaged