title after </head> breaks html5lib parser in BS4

Bug #1450884 reported by Geoffrey Sneddon
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Beautiful Soup
New
Undecided
Unassigned

Bug Description

Given the following input:

```
</head>
<title>foo</title>
<p>Hello, World!
```

In [2]: from bs4 import BeautifulSoup
In [3]: d = BeautifulSoup(fp, "html5lib")
In [4]: d.find_all("p")
Out[4]: []

This is not what I expect, and inconsistent with html5lib itself (where with etree `.//{http://www.w3.org/1999/xhtml}p` will find the p element). If one omits the `</head>`, then BS4 finds it.

```
In [8]: diagnose.diagnose(fp)
Diagnostic running on Beautiful Soup 4.3.2
Python version 2.7.9 (9c4588d731b7fe0b08669bd732c2b676cb0a8233, Mar 31 2015, 07:51:42)
[PyPy 2.5.1 with GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
I noticed that lxml is not installed. Installing it may help.
Found html5lib version 0.99999

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<title>
 foo
</title>
<p>
 Hello, World!
</p>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
  <title>
   foo
  </title>
 </head>
 <body>
  <p>
   Hello, World!
  </p>
 </body>
</html>
--------------------------------------------------------------------------------
```

Revision history for this message
Geoffrey Sneddon (geoffers) wrote :

Note that I suspect html5lib's testsuite would catch issues like this (try hacking html5lib/test/support.py to include the BS builder, and make sure the builder defines a test serializer). It should probably be made easier to run the testsuite against out-of-tree treebuilders, though.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.