Mojibake Unicode data in lxml.html.fromstring function
Bug #1338546 reported by
vahid kharazi
This bug report is a duplicate of:
Bug #898072: lxml.html.parse treats encoding as Latin1 in Python 3 when reading from unicode file-objects directly.
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Triaged | Undecided | Unassigned |
Bug Description
I have Python code that fetches many sites. On one site I get a mojibake problem.
This is my fetching part:
content = self.urllib2.
content = unicodedata.
)
print "------
print content # correct unicode
print page # mojibake characters
When I parse the same content with Beautiful Soup (bs4), my code works correctly.
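For readers unfamiliar with the symptom, here is a minimal standard-library sketch of one common way mojibake arises: UTF-8 bytes read as Latin-1, the misreading named in the duplicate bug's title. Whether that is exactly what happens in this report is an assumption. The sample string is hypothetical, and the sketch is written for Python 3, whereas the report uses Python 2.

```python
# Hypothetical sample text; any non-ASCII string shows the same effect.
original = "سلام"

# The bytes a server would send for a UTF-8 encoded page.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as Latin-1 maps every byte to its own character,
# which produces the garbled text.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible as long as no byte was lost along the way.
recovered = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(recovered == original)
```

If the garbled output in the parsed page has the same shape as `garbled` here, the parser is most likely guessing the wrong encoding rather than the download being corrupt.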
Changed in lxml: status: New → Triaged
If you already know the content is UTF-8 encoded, either use an HTMLParser(encoding='utf8') or pass the decoded Unicode string into the parser instead of encoding it back into UTF-8 first.
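The two options in this comment could look roughly like the sketch below. It assumes Python 3 with lxml installed. The sample markup and the variable names are hypothetical and stand in for the fetched page, since the reporter's fetching code is not shown in full.

```python
import lxml.html

# Hypothetical stand-in for the fetched page: UTF-8 bytes with no
# <meta charset> declaration, so the parser has nothing to go on.
raw = "<html><body><p>سلام دنیا</p></body></html>".encode("utf-8")

# Option 1: pass the raw bytes and tell the parser the encoding.
parser = lxml.html.HTMLParser(encoding="utf-8")
page_from_bytes = lxml.html.fromstring(raw, parser=parser)

# Option 2: decode first, then pass the Unicode string as-is
# (do not encode it back to bytes before parsing).
page_from_text = lxml.html.fromstring(raw.decode("utf-8"))

text_from_bytes = page_from_bytes.text_content()
text_from_text = page_from_text.text_content()
print(text_from_bytes == text_from_text)
```

Option 1 is probably the better fit when the bytes come straight from the network, because it avoids a decode and re-encode round trip. Option 2 is simpler if the content has already been decoded elsewhere in the program.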