Beautiful Soup

After #&#@! is encountered, all tags are closed and the rest of the page is not parsed !!

Bug #1793722 reported by Aditya Pal on 2018-09-21

This bug affects 1 person

Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Invalid  Undecided   Unassigned

Bug Description

Hi, I wanted to extract the user reviews of all 250 top IMDb movies, and I used BeautifulSoup to parse the content. However, at the URL "https://www.imdb.com/title/tt0110912/reviews/_ajax?ref_=undefined&paginationKey=gislhsnr5arpptolkipssfuman2h7xk7etbbostz5vmg5n7gjy5iwurqrqv6gdmnjxgby5w6gqwvk" there is a review titled "A One line summary is too much for this overrated piece of #&#@!". After reaching this review title, Beautiful Soup seems to break and all tags are closed immediately. Skipping this one review would be no problem, but I need the paginationKey for the next page, which appears after all the reviews. I have attached pictures for ease of identification.

Aditya Pal (aditya-pal-science) wrote on 2018-09-21:

source code vs BeautifulSoup code (297.3 KiB, image/png)

Leonard Richardson (leonardr) wrote on 2018-12-24:

Thanks for filing an issue about this.

Issues of this type are frequently due to differences between parsers. I couldn't tell from your screenshot which parser you were using, so I reproduced some of the markup from your screenshot and used the 'diagnose' module to show how different parsers handle it.

Here's my source code:

---

markup = """
<a class="title" href="/review/rs0346205/?ref_=tt_urv"> A One line summary is too much for this overrated piece of #&#@!
</a>

<div class="display-name-date">
</div>
"""

from bs4 import diagnose
diagnose.diagnose(markup)

---

Here's the output of the script:

Python version 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609]
Found lxml version 4.2.3.0
Found html5lib version 0.999

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<a class="title" href="/review/rs0346205/?ref_=tt_urv">
 A One line summary is too much for this overrated piece of #&#@!
</a>
<div class="display-name-date">
</div>

--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
 </head>
 <body>
  <a class="title" href="/review/rs0346205/?ref_=tt_urv">
   A One line summary is too much for this overrated piece of #&#@!
  </a>
  <div class="display-name-date">
  </div>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <body>
  <a class="title" href="/review/rs0346205/?ref_=tt_urv">
   A One line summary is too much for this overrated piece of #&#@!
  </a>
  <div class="display-name-date">
  </div>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<a class="title" href="/review/rs0346205/?ref_=tt_urv">
 A One line summary is too much for this overrated piece of #&#@!
</a>
--------------------------------------------------------------------------------

My guess is that you were probably using the lxml XML parser, since that's the only parser that has the problem you're describing. Ultimately it's lxml's decision how to handle bad markup, so if this is really a problem it needs to be fixed inside lxml; there's not much I can do about it. The good news for you is that switching to another parser (especially if you were using one designed to parse XML instead of HTML) should solve the problem.
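
To illustrate why an XML parser and an HTML parser can disagree about this markup, here is a minimal sketch that uses only Python's standard library rather than Beautiful Soup or lxml. It assumes Python 3, and TagCollector, html_start_tags, and xml_is_well_formed are hypothetical helper names introduced for this example. The snippet is wrapped in an assumed <root> element so that the XML check isolates the bare "&"; without that wrapper, the second top-level element alone would already make the snippet ill-formed XML. Note that xml.etree.ElementTree is a strict parser that raises an error, which is not the same as the lxml behavior shown in the output above.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Markup adapted from the report: a review title containing a bare "&".
MARKUP = (
    '<a class="title" href="/review/rs0346205/?ref_=tt_urv"> '
    'A One line summary is too much for this overrated piece of #&#@!\n'
    '</a>\n'
    '<div class="display-name-date">\n'
    '</div>\n'
)


class TagCollector(HTMLParser):
    """Hypothetical helper: records the name of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(markup):
    """Return the start tags found by the lenient stdlib HTML parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def xml_is_well_formed(markup):
    """Return True if the snippet parses as XML inside a single root."""
    try:
        # The <root> wrapper is an assumption of this sketch, so that
        # only the character data inside the elements is being tested.
        ET.fromstring("<root>" + markup + "</root>")
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # The HTML parser treats the stray "&#@!" as ordinary text.
    print(html_start_tags(MARKUP))
    # A strict XML parser rejects the bare "&" ...
    print(xml_is_well_formed(MARKUP))
    # ... but accepts the same text once the "&" is escaped.
    print(xml_is_well_formed(MARKUP.replace("&", "&amp;")))
```

In Beautiful Soup itself, the parser is selected by the second argument to the constructor, for example BeautifulSoup(markup, "html5lib"), so naming an HTML parser explicitly makes the choice visible in the code.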
