Beautiful Soup

title after </head> breaks html5lib parser in BS4

Bug #1450884 reported by Geoffrey Sneddon on 2015-05-01

This bug report is a duplicate of: Bug #1430633: Document with <head><head> crashes BS4.

This bug affects 1 person

| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Beautiful Soup | New | Undecided | Unassigned | |

Bug Description

Given the following input:

```
</head>
<title>foo</title>
Hello, World!
```

```
In [2]: from bs4 import BeautifulSoup
In [3]: d = BeautifulSoup(fp, "html5lib")
In [4]: d.find_all("p")
Out[4]: []
```

This is not what I expect, and it is inconsistent with html5lib itself (where, with the etree treebuilder, `.//{http://www.w3.org/1999/xhtml}p` will find the p element). If one omits the `</head>`, then BS4 finds it too.
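For anyone wanting to check the two behaviours side by side, here is a minimal sketch of the comparison described above. The helper names `count_p` and `compare` are made up for illustration and are not part of either library. `compare` assumes both `html5lib` and `bs4` are installed, and that the markup passed in actually contains a `<p>` tag, which the input listing above does not show.

```python
import xml.etree.ElementTree as ET

# Namespace that html5lib puts on HTML elements by default.
XHTML = "{http://www.w3.org/1999/xhtml}"


def count_p(root: ET.Element) -> int:
    """Count p elements under an etree root, using the namespaced path from the report."""
    return len(root.findall(".//" + XHTML + "p"))


def compare(markup: str):
    """Return (p count via html5lib's etree tree, p count via BS4 with html5lib).

    Hypothetical helper: the imports are local so that count_p stays usable
    when html5lib or bs4 is not installed.
    """
    import html5lib
    from bs4 import BeautifulSoup

    # With the etree treebuilder, html5lib.parse returns the root <html> element.
    etree_root = html5lib.parse(markup, treebuilder="etree")
    soup = BeautifulSoup(markup, "html5lib")
    return count_p(etree_root), len(soup.find_all("p"))
```

Calling `compare()` on the problem markup, and again on the same markup without the leading `</head>`, should show whether the two counts diverge on a given pair of library versions.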

```
In [8]: diagnose.diagnose(fp)
Diagnostic running on Beautiful Soup 4.3.2
Python version 2.7.9 (9c4588d731b7fe0b08669bd732c2b676cb0a8233, Mar 31 2015, 07:51:42)
[PyPy 2.5.1 with GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
I noticed that lxml is not installed. Installing it may help.
Found html5lib version 0.99999

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<title>
foo
</title>

Hello, World!

--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
<head>
 <title>
 foo
 </title>
</head>
<body>
 
 Hello, World!
 
</body>
</html>
--------------------------------------------------------------------------------
```

Geoffrey Sneddon (geoffers) wrote on 2015-05-11:

Note that I suspect html5lib's testsuite would catch issues like this (try hacking html5lib/test/support.py to include the BS builder, and make sure the builder defines a test serializer). It should probably be made easier to run the testsuite against out-of-tree treebuilders, though.
