lxml

Mojibake Unicode data in lxml.html.fromstring function

Bug #1338546 reported by vahid kharazi on 2014-07-07

This bug report is a duplicate of: Bug #898072: lxml.html.parse treats encoding as Latin1 in Python 3 when reading from unicode file-objects directly.

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	lxml	Triaged	Undecided	Unassigned

Bug Description

I have Python code that fetches many sites, and on one site I get a mojibake problem.
This is the fetching part:

            content = self.urllib2.urlopen(self.url).read()
            content = unicodedata.normalize('NFKD', unicode(content, 'utf-8')).encode(
                'utf-8',
                'ignore'
            )
            print "-------------------"
            print content  # displays correctly
            self.page = fromstring(content)
            print page  # mojibake characters

The site is: http://www.eghtesadonline.com/fa/content/51805/%D8%A7%D9%86%D8%AA%D9%82%D8%A7%D8%AF-%D8%B4%D8%AF%DB%8C%D8%AF-%D8%A8%D9%87-%D8%A7%D9%81%D8%B2%D8%A7%DB%8C%D8%B4-%D8%AD%D9%82-%D8%A8%DB%8C%D9%85%D9%87-%D8%B4%D8%AE%D8%B5-%D8%AB%D8%A7%D9%84%D8%AB

When parsing with Beautiful Soup (bs4), my code works correctly.

scoder (scoder) wrote on 2014-07-07:

If you already know it's utf-8 encoded, either use an HTMLParser(encoding='utf8') or pass the decoded Unicode string into the parser instead of encoding it back into UTF-8 first.
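
The two options suggested above might look roughly like the following. This is only a sketch written for Python 3, not the reporter's actual code: the sample markup and the variable names are made up for illustration, and the bytes are assumed to be UTF-8 with no charset declaration in the document itself.

```python
from lxml import html

# Hypothetical stand-in for the bytes returned by urlopen(...).read();
# assumed to be UTF-8 with no <meta charset> in the markup.
raw = "<html><body><p>اقتصاد</p></body></html>".encode("utf-8")

# Option 1: keep the bytes, but tell the parser which encoding they use.
utf8_parser = html.HTMLParser(encoding="utf-8")
page_from_bytes = html.fromstring(raw, parser=utf8_parser)

# Option 2: decode to a Unicode string first and parse that,
# instead of encoding it back into UTF-8 bytes.
page_from_text = html.fromstring(raw.decode("utf-8"))

text_from_bytes = page_from_bytes.findtext(".//p")
text_from_text = page_from_text.findtext(".//p")
print(text_from_bytes)
print(text_from_text)
```

In both cases the parser no longer has to guess the encoding of the input, which appears to be where the mojibake in the report comes from.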

scoder (scoder) on 2014-12-06

Changed in lxml:
status:	New → Triaged
