Consistently handle whitespace -- either collapse it or don't
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
Run the following code:
------
#!/usr/bin/python3
from bs4 import BeautifulSoup
soup = BeautifulSoup(
print(repr(
------
The output is
'<html>
while I expect it to be
'<html>
(One newline is missing from the generated output.)
The same happens with the html.parser parser, but not with html5lib, which outputs the expected string.
If I wrap the newlines in a <pre> tag:
------
#!/usr/bin/python3
from bs4 import BeautifulSoup
soup = BeautifulSoup(
print(repr(
------
it works correctly, outputting
'<html>
and for html.parser too, but **not** for html5lib; with html5lib it outputs
<html><
similarly losing a newline.
(The rest of this bug report was tested for lxml only.)
The same seems to happen whenever any two tags, whether opening or closing, are separated by two or more newlines or spaces with no non-whitespace characters between them. As soon as the gap contains any other character, all newlines and spaces are preserved.
Some additional examples (I show input strings only):
"<html>
"<html><body> <b></b>
"<html><body> \n<b></
"<html><body> \n\n \n<b></
"<html><body> \n\n \na<b><
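The pattern in these examples can be modelled as a small rule. This is only a sketch of the behaviour described in this report, not Beautiful Soup's actual code: the function name `collapse_if_whitespace_only` is made up for illustration, and the choice of which character survives (a newline if the run contains one, otherwise a single space) is an assumption.

```python
ASCII_WHITESPACE = " \t\n\r\f"


def collapse_if_whitespace_only(text: str) -> str:
    """Collapse a text node only when it consists entirely of whitespace.

    Hypothetical model of the observed behaviour, not library code.
    """
    if text and text.strip(ASCII_WHITESPACE) == "":
        # Assumption: keep one newline if the run had any, else one space.
        return "\n" if "\n" in text else " "
    # Any non-whitespace character means the string is left untouched.
    return text


# Text found between <body> and <b> in the examples above.
for sample in [" ", " \n", " \n\n \n", " \n\n \na"]:
    print(repr(sample), "->", repr(collapse_if_whitespace_only(sample)))
```

Under this model the last sample keeps all of its spaces and newlines because of the trailing `a`, which matches what I see.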
I'm using Python 3.4.3 under Kubuntu 15.10, beautifulsoup4 4.4.1, lxml 3.6.0, libxml2 2.9.2+zdfsg1-
Beautiful Soup is designed to collapse all whitespace except where whitespace is significant (such as within a <pre> or <textarea> tag). You've identified some places where whitespace doesn't get collapsed, and my initial reaction would be to make sure it does get collapsed.
In addition, you're asking for a mode where whitespace _doesn't_ get collapsed, where every tag is treated the way we currently treat <pre> and <textarea>. I don't get a lot of requests for this but it seems reasonable.
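To make the two behaviours concrete, here is a rough sketch built directly on the standard library's `html.parser`, not on Beautiful Soup. It collapses whitespace-only text everywhere except inside `<pre>` and `<textarea>`. The `preserve_all` flag is hypothetical and stands in for the requested "never collapse" mode; nothing here reflects how Beautiful Soup implements or would implement either behaviour.

```python
from html.parser import HTMLParser

PRESERVE_TAGS = {"pre", "textarea"}  # tags where whitespace is significant


class WhitespaceSketch(HTMLParser):
    """Re-emit markup, collapsing whitespace-only text outside PRESERVE_TAGS."""

    def __init__(self, preserve_all: bool = False):
        super().__init__(convert_charrefs=True)
        self.preserve_all = preserve_all  # hypothetical "never collapse" mode
        self.depth = 0  # number of PRESERVE_TAGS currently open
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in PRESERVE_TAGS:
            self.depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in PRESERVE_TAGS and self.depth:
            self.depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        collapsible = not (self.preserve_all or self.depth)
        if collapsible and data.strip(" \t\n\r\f") == "":
            # Assumption: one newline survives if present, else one space.
            data = "\n" if "\n" in data else " "
        self.out.append(data)


def render(markup: str, preserve_all: bool = False) -> str:
    parser = WhitespaceSketch(preserve_all)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


markup = "<html><body>\n\n<b></b><pre>\n\n</pre></body></html>"
print(repr(render(markup)))                     # collapsing mode
print(repr(render(markup, preserve_all=True)))  # hypothetical preserving mode
```

In collapsing mode only the two newlines before `<b>` shrink to one, while the pair inside `<pre>` survives. With `preserve_all=True` the markup round-trips unchanged.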