Comment 1 for bug 1838877

Leonard Richardson (leonardr) wrote :

It looks like there are three problems here.

1. The TypeError. This is in Beautiful Soup code and easy to fix.

2. The lxml parser doesn't deal well with Unicode documents. It's rejecting your markup, with this exception:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 79: unexpected end of data

But you don't get any visibility into that exception. I fixed this by propagating the exception upwards so you can see it.

3. By encoding the data as UTF-8, you can get lxml to accept the markup without raising an exception. But whatever problem lxml is having with this particular document doesn't go away, and lxml still can't handle the document. It ignores the entire thing, because of whatever problem it perceives in the DOCTYPE, and you're left with an empty BeautifulSoup object.
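
For anyone puzzled by the message in problem 2: "unexpected end of data" is what Python's UTF-8 codec reports when the input stops partway through a multi-byte character, and 0xe6 is the first byte of a three-byte sequence. The sketch below reproduces only that message using the standard library. The sample character is an arbitrary choice, and this is not a claim about what lxml does internally with the document from the report.

```python
# Illustration only: reproduce the codec message with plain Python.
# The character below is an arbitrary example whose UTF-8 encoding
# starts with byte 0xe6; it is not taken from the reported document.
text = "\u6587"
encoded = text.encode("utf-8")   # three bytes: e6 96 87
truncated = encoded[:1]          # keep only the lead byte 0xe6

try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason          # the codec's explanation
    start = exc.start            # offset of the offending byte
    print(exc)
```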

The fixes to 1 and 2 are in revision 526. To actually parse the document I recommend using html5lib as the parser instead of lxml.
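
Switching parsers is a one-argument change. Here is a minimal sketch, assuming html5lib is installed alongside Beautiful Soup. The sample markup and the parse_with_fallback helper are hypothetical, and the fallback to the built-in "html.parser" is only there so the sketch still runs where html5lib is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup; the document from the report is not reproduced here.
MARKUP = "<!DOCTYPE html><html><body><p>\u6587\u5b57</p></body></html>"


def parse_with_fallback(markup, parsers=("html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is not available.
            continue
    raise FeatureNotFound("none of the requested parsers is installed")


soup, used = parse_with_fallback(MARKUP)
paragraph = soup.p.get_text()
print(used, paragraph)
```

If html5lib is available, `used` will be "html5lib". Otherwise you get the built-in parser, which may treat unusual markup differently.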