title after </head> breaks html5lib parser in BS4
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | New | Undecided | Unassigned |
Bug Description
Given the following input:
```
</head>
<title>foo</title>
<p>Hello, World!
```
```
In [2]: from bs4 import BeautifulSoup
In [3]: d = BeautifulSoup(fp, "html5lib")
In [4]: d.find_all("p")
Out[4]: []
```
This is not what I expect, and it is inconsistent with html5lib itself (where with etree `.//{http://
```
In [8]: diagnose.
Diagnostic running on Beautiful Soup 4.3.2
Python version 2.7.9 (9c4588d731b7fe
[PyPy 2.5.1 with GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
I noticed that lxml is not installed. Installing it may help.
Found html5lib version 0.99999
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<title>
foo
</title>
<p>
Hello, World!
</p>
-------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
<head>
<title>
foo
</title>
</head>
<body>
<p>
Hello, World!
</p>
</body>
</html>
-------
```
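For anyone trying to reproduce this, the comparison can be scripted along the following lines. This is only a rough sketch: the `count_paragraphs` helper and the `MARKUP` constant are names made up for illustration, and it assumes `bs4` and `html5lib` are importable (anything missing is skipped). It counts the `<p>` elements each parser reports for the input above, using html5lib's own etree treebuilder as the reference.

```python
# Illustrative sketch; count_paragraphs and MARKUP are hypothetical names.
MARKUP = "</head>\n<title>foo</title>\n<p>Hello, World!\n"


def count_paragraphs(markup):
    """Return {parser label: number of <p> elements found in markup}."""
    counts = {}

    try:
        from bs4 import BeautifulSoup
    except ImportError:
        BeautifulSoup = None

    if BeautifulSoup is not None:
        for name in ("html.parser", "html5lib"):
            try:
                soup = BeautifulSoup(markup, name)
            except ValueError:
                # bs4 raises FeatureNotFound (a ValueError subclass)
                # when the requested parser is not installed.
                continue
            counts["bs4 + " + name] = len(soup.find_all("p"))

    try:
        import html5lib
    except ImportError:
        html5lib = None

    if html5lib is not None:
        # Parse with html5lib's own etree treebuilder. Turning off
        # namespaceHTMLElements keeps tag names un-namespaced, so a
        # plain ".//p" path is enough to find paragraphs.
        tree = html5lib.parse(
            markup, treebuilder="etree", namespaceHTMLElements=False
        )
        counts["html5lib (etree)"] = len(tree.findall(".//p"))

    return counts


for label, n in sorted(count_paragraphs(MARKUP).items()):
    print(label, n)
```

If the `bs4 + html5lib` count is lower than the `html5lib (etree)` count, the Beautiful Soup builder is dropping the paragraph, which is the behaviour reported here.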
Note that I suspect html5lib's test suite would catch issues like this (try hacking html5lib/test/support.py to include the BS builder, and make sure the builder defines a test serializer). It should probably be made easier to run the test suite against out-of-tree treebuilders, though.