GBK encoding problems
Bug #1263000 reported by scj
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
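When I use Beautiful Soup to process an HTML file, I get an error: "UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 161: illegal multibyte sequence". Even after changing the encoding to gb18030, it doesn't work. Can you help me solve it? I use Python 3.3.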
Changed in beautifulsoup:
status: New → Invalid
What is the file you are using, and what code are you using to process it?
I can make a guess at the answer: "\xa0" is a Latin-1 byte sequence. If a
gb18030 document contains "\xa0", then it is not really a gb18030
document--it has no encoding at all. You will not be able to convert it to
Unicode without removing \xa0 and similar characters, or replacing them
with their gb18030 equivalents.
The detwingle() method will fix the problem of Latin-1 byte sequences
embedded in UTF-8, but I don't think it will work for gb18030.
Leonard
On Fri, Dec 20, 2013 at 2:29 AM, scj <email address hidden> wrote:
> Public bug reported:
>
> When I use Beautiful Soup to process an HTML file, I get an error:
> "UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in
> position 161: illegal multibyte sequence". Even after changing the
> encoding to gb18030, it doesn't work. Can you help me solve it? I use
> Python 3.3.
>
> ** Affects: beautifulsoup
> Importance: Undecided
> Status: New
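
The reply's suggestion to remove or replace "\xa0" could be sketched roughly as follows. This is only an illustration under an assumption the thread does not confirm: that the error is raised when already-decoded text containing U+00A0 (NO-BREAK SPACE) is encoded as GBK, for example when printing to a GBK console. The sample string and variable names are hypothetical, and only the Python standard library is used.

```python
# Sketch only: assumes the failure happens while ENCODING text to GBK.
text = "price:\xa0100"  # hypothetical text extracted from a parsed page

# Reproduce the reported failure: the gbk codec has no mapping for U+00A0.
try:
    text.encode("gbk")
    gbk_failed = False
except UnicodeEncodeError:
    gbk_failed = True

# Option 1: replace the no-break space with an ordinary space, then encode.
cleaned_bytes = text.replace("\xa0", " ").encode("gbk")

# Option 2: let the codec substitute "?" for anything GBK cannot represent.
lossy_bytes = text.encode("gbk", errors="replace")

# Option 3: encode as gb18030, which can represent every Unicode code point.
# This only helps if the destination (file, console) really expects gb18030.
gb18030_bytes = text.encode("gb18030")

print(gbk_failed)
print(cleaned_bytes)
print(lossy_bytes)
print(gb18030_bytes.decode("gb18030") == text)
```

Option 1 keeps the output readable, while option 2 is a blunt fallback that loses information. If the bytes go to a console whose encoding is fixed at GBK, option 3 will probably not help, which may be why switching to gb18030 made no difference for the reporter.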