After #&#@! is encountered, all tags are closed and the rest of the page is not parsed !!

Bug #1793722 reported by Aditya Pal
This bug affects 1 person
Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Invalid  Undecided   Unassigned

Bug Description

Hi, I wanted to extract the user reviews of all 250 top IMDb movies, and I used BeautifulSoup to parse the content. However, at this URL "https://www.imdb.com/title/tt0110912/reviews/_ajax?ref_=undefined&paginationKey=gislhsnr5arpptolkipssfuman2h7xk7etbbostz5vmg5n7gjy5iwurqrqv6gdmnjxgby5w6gqwvk" there is a review with the title "A One line summary is too much for this overrated piece of #&#@!". After reaching this review title, Beautiful Soup seems to break and all tags are closed immediately. If only this one review were skipped, I would have no problem; however, I need the paginationKey for the next page, which appears after all the reviews. I have attached pictures for ease of identification.

Aditya Pal (aditya-pal-science) wrote :
Leonard Richardson (leonardr) wrote :

Thanks for filing an issue about this.

Issues of this type are frequently due to differences between parsers. I couldn't tell from your screenshot which parser you were using, so I reproduced some of the markup from your screenshot and used the 'diagnose' module to show how different parsers handle the markup.

Here's my source code:

---

markup = """
<a class="title" href="/review/rs0346205/?ref_=tt_urv"> A One line summary is too much for this overrated piece of #&amp;#@!
</a>

<div class="display-name-date">
</div>
"""

# Run the markup through each installed parser and print what each one produces.
from bs4 import diagnose
diagnose.diagnose(markup)

---

Here's the output of the script:

Python version 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609]
Found lxml version 4.2.3.0
Found html5lib version 0.999

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<a class="title" href="/review/rs0346205/?ref_=tt_urv">
 A One line summary is too much for this overrated piece of #&amp;#@!
</a>
<div class="display-name-date">
</div>

--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
 </head>
 <body>
  <a class="title" href="/review/rs0346205/?ref_=tt_urv">
   A One line summary is too much for this overrated piece of #&amp;#@!
  </a>
  <div class="display-name-date">
  </div>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <body>
  <a class="title" href="/review/rs0346205/?ref_=tt_urv">
   A One line summary is too much for this overrated piece of #&amp;#@!
  </a>
  <div class="display-name-date">
  </div>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<a class="title" href="/review/rs0346205/?ref_=tt_urv">
 A One line summary is too much for this overrated piece of #&amp;#@!
</a>
--------------------------------------------------------------------------------

My guess is that you were probably using the lxml XML parser, since that's the only parser that has the problem you're describing: in its output, the <div> that follows the <a> tag is gone. Ultimately it's lxml's decision how to handle bad markup, so if this is really a problem it needs to be fixed inside lxml--there's not much I can do about it. The good news for you is that switching to another parser -- especially if you were using one designed to parse XML instead of HTML -- should solve the problem.
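
A minimal sketch of naming the parser explicitly, assuming the same markup as in the script above and that Beautiful Soup 4 is installed. The variable names (`title_text`, `div_survived`) are illustrative only; "html.parser" is the parser built into Python, and "html5lib" or "lxml" could be passed in its place if they are installed.

```python
from bs4 import BeautifulSoup

# Same markup as in the diagnose script above.
markup = """
<a class="title" href="/review/rs0346205/?ref_=tt_urv"> A One line summary is too much for this overrated piece of #&amp;#@!
</a>

<div class="display-name-date">
</div>
"""

# Name an HTML parser explicitly instead of relying on whichever
# parser Beautiful Soup would pick by default.
soup = BeautifulSoup(markup, "html.parser")

# The review title, with the &amp; entity decoded to a plain "&".
title = soup.find("a", class_="title")
title_text = title.get_text(strip=True)

# The element that follows the title should still be in the tree.
following_div = soup.find("div", class_="display-name-date")
div_survived = following_div is not None

print(title_text)
print(div_survived)
```

If the tag after the review title is missing when the same check is run against the real page, that would point at the parser named in the constructor rather than at the review text itself.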

Changed in beautifulsoup:
status: New → Invalid
Aditya Pal (aditya-pal-science) wrote :

Thanks a lot. I am pretty sure I had not used the lxml-xml parser (I had used the html parser). Anyways, thanks a lot for your effort and cheers !!!
