After #&#@! is encountered, all tags are closed and the rest of the page is not parsed !!
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
Hi, I wanted to extract user reviews for all of the top 250 IMDb movies. I used BeautifulSoup to parse the content. However, in this URL "https:/
Thanks for filing an issue about this.
Issues of this type are frequently due to differences between parsers. I couldn't tell from your screenshot which parser you were using, so I reproduced some of the markup from your screenshot and used the 'diagnose' module to show how different parsers handle it.
Here's my source code:
---
markup = """
<a class="title" href="/review/rs0346205/?ref_=tt_urv">
A One line summary is too much for this overrated piece of #&#@!
</a>
<div class="display-name-date">
</div>
"""
from bs4 import diagnose
diagnose.diagnose(markup)
---
Here's the output of the script:
Python version 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609]
Found lxml version 4.2.3.0
Found html5lib version 0.999
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<a class="title" href="/review/rs0346205/?ref_=tt_urv">
A One line summary is too much for this overrated piece of #&#@!
</a>
<div class="display-name-date">
</div>
-------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
<head>
</head>
<body>
<a class="title" href="/review/rs0346205/?ref_=tt_urv">
A One line summary is too much for this overrated piece of #&#@!
</a>
<div class="display-name-date">
</div>
</body>
</html>
-------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
<body>
<a class="title" href="/review/rs0346205/?ref_=tt_urv">
A One line summary is too much for this overrated piece of #&#@!
</a>
<div class="display-name-date">
</div>
</body>
</html>
-------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<a class="title" href="/review/rs0346205/?ref_=tt_urv">
A One line summary is too much for this overrated piece of #&#@!
</a>
-------
My guess is that you were probably using the lxml XML parser, since that's the only parser that shows the problem you're describing: it drops everything after the <a> tag, including the <div>. Ultimately it's lxml's decision how to handle bad markup, so if this is really a problem it needs to be fixed inside lxml--there's not much I can do about it. The good news for you is that switching to another parser -- especially if you were using one designed to parse XML instead of HTML -- should solve the problem.