Consistently handle whitespace -- either collapse it or don't

Bug #1589227 reported by Petr
This bug affects 1 person
Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (not set)

Bug Description

Run the following code:

------
#!/usr/bin/python3

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body>\n\n</body></html>", "lxml")
print(repr(str(soup)))
------

The output is

'<html><body>\n</body></html>'

while I expect it to be

'<html><body>\n\n</body></html>'

(One newline is missing in generated output.)

The same happens with the html.parser parser, but not with html5lib, which outputs the expected string.

If I wrap the newlines in a <pre> tag:

------
#!/usr/bin/python3

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><pre>\n\n</pre></body></html>", "lxml")
print(repr(str(soup)))
------

it works correctly, outputting

'<html><body><pre>\n\n</pre></body></html>'

and for html.parser too, but **not** for html5lib; with html5lib it outputs

'<html><head></head><body><pre>\n</pre></body></html>'

similarly losing a newline.

(The rest of this bug report was tested for lxml only.)

The same seems to happen whenever any two tags, be it opening or closing, are separated by two or more newlines or spaces without any non-whitespace characters. With any additional character, all newlines and spaces are preserved.

Some additional examples (I show input strings only):

"<html><body><b></b>\n\n\n\n</body></html>": only one newline is left
"<html><body>  <b></b></body></html>": one space is missing
"<html><body> \n<b></b></body></html>": the space is missing
"<html><body> \n\n \n<b></b></body></html>": only one newline (and no spaces) survives
"<html><body> \n\n \na<b></b></body></html>": works as expected

I'm using Python 3.4.3 under Kubuntu 15.10, with beautifulsoup4 4.4.1, lxml 3.6.0, and libxml2 2.9.2+zdfsg1-4ubuntu0.3.
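
The pattern in the examples above can be summarized as a small model. This is only a sketch inferred from the inputs and outputs in this report: the function name `collapse_whitespace_only_run` is made up for illustration, it is not Beautiful Soup or lxml code, and it assumes the relevant whitespace is ASCII space, tab, LF, FF, or CR.

```python
# Illustrative model only: NOT taken from Beautiful Soup or lxml.
# Assumption: "whitespace" means these ASCII characters.
ASCII_WS = " \t\n\f\r"


def collapse_whitespace_only_run(text: str) -> str:
    """Model the reported behaviour for the text found between two tags.

    A run made up entirely of whitespace shrinks to one newline if it
    contains a newline, otherwise to one space. Any run containing a
    non-whitespace character is returned unchanged.
    """
    if text and all(ch in ASCII_WS for ch in text):
        return "\n" if "\n" in text else " "
    return text


if __name__ == "__main__":
    # The whitespace runs from the examples in this report.
    for run in ["\n\n\n\n", "  ", " \n", " \n\n \n", " \n\n \na"]:
        print(repr(run), "->", repr(collapse_whitespace_only_run(run)))
```

The model reproduces the lxml observations listed above, but it says nothing about which layer (the parser or Beautiful Soup) actually does the collapsing.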

Leonard Richardson (leonardr) wrote :

Beautiful Soup is designed to collapse all whitespace except where whitespace is significant (such as within a <pre> or <textarea> tag). You've identified some places where whitespace doesn't get collapsed, and my initial reaction would be to make sure it does get collapsed.

In addition, you're asking for a mode where whitespace _doesn't_ get collapsed, where every tag is treated the way we currently treat <pre> and <textarea>. I don't get a lot of requests for this but it seems reasonable.

summary: - Some newlines are missing in parsed document
+ Consistently handle whitespace -- either collapse it or don't
Changed in beautifulsoup:
status: New → Confirmed
Petr (petr-4) wrote :

"Beautiful Soup is designed to collapse all whitespace except where whitespace is significant..."

This sounds rather strange to me, as I understood Beautiful Soup to be an HTML parser, not an interpreter or the like. I would think that collapsing or not collapsing whitespace is the responsibility of the client, not of Beautiful Soup itself. Are there really many scenarios where whitespace collapsing is the only post-processing needed? I would think that clients either need the raw parsed data without any whitespace collapsing, or do some more advanced post-processing in which whitespace collapsing is only one step. So it seems strange to treat whitespace collapsing as part of Beautiful Soup's job.

Leonard Richardson (leonardr) wrote :

Collapsing whitespace is part of the job of the HTML parser. lxml collapses whitespace when it applies the rules of HTML to a document. The rules are laid out in section 9.1 ("White space") of the HTML 4 spec:

https://www.w3.org/TR/html4/struct/text.html#h-9.1

"In particular, user agents should collapse input white space sequences when producing output inter-word space."

The HTML 3.2 spec (https://www.w3.org/TR/REC-html32) is outdated but says it more explicitly:

"Except within literal text (e.g. the PRE element), HTML treats contiguous sequences of white space characters as being equivalent to a single space character (ASCII decimal 32)."

With HTML 5 it's a lot more complicated, but it's generally closer to the behavior you want. That's why html5lib behaves differently from lxml in your test.
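
For reference, the HTML 3.2 wording quoted above amounts to replacing each contiguous run of white space with a single space. A minimal sketch of that rule follows. The helper name `collapse_whitespace` is hypothetical, it is not part of Beautiful Soup or lxml, and the choice of space, tab, LF, FF, and CR as the white space characters is an assumption made for this example.

```python
import re

# Assumption for this sketch: white space is space, tab, LF, FF, or CR.
_WS_RUN = re.compile(r"[ \t\n\f\r]+")


def collapse_whitespace(text: str) -> str:
    """Replace every contiguous run of white space with one space (ASCII 32)."""
    return _WS_RUN.sub(" ", text)


if __name__ == "__main__":
    print(repr(collapse_whitespace("one \n\n  two\tthree")))
```

This applies only outside literal text such as the contents of a PRE element, which a caller would need to skip.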

Having looked into this in more detail, I no longer think that Beautiful Soup has control over whether whitespace comes in collapsed or not. There is no mention of whitespace or the strip() method in builders/_lxml.py. It looks like lxml and html.parser follow the rules of HTML 4, and html5lib follows the rules of HTML 5. Beautiful Soup will rearrange whitespace on output if you call prettify() but will otherwise use the whitespace it got from the parser.

As such, I'm closing this issue as invalid. If you want to have the lxml parser apply the whitespace rules of HTML 5, that's a feature that needs to be added to the lxml parser.

Changed in beautifulsoup:
status: Confirmed → Invalid