`decode_contents` doesn't indent multi-line content

Bug #1506200 reported by yangchenyun
This bug affects 1 person
Affects: Beautiful Soup
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

`decode_contents` indents content only once, which is a problem for tags that may contain multi-line content, such as <script> and <style>.

In those cases, only the first line gets the correct indentation.

For example, the `s` used to track modified content before printing would be:

s = [' ', u"goog.require('yt.testing.jasmine');\n\n describe('<ytm-app>', function() {\n var ytmApp\n\n it('has correct state flow', function() {\n });\n\n it('loadData works as expected', function() {\n });\n });"]

The correct version should be:

s = [
    ' ', u"goog.require('yt.testing.jasmine');\n\n",
    ' ', u"describe('<ytm-app>', function() {\n",
    ' ', u" var ytmApp\n\n",
    ' ', u" it('has correct state flow', function() {\n",
    ' ', u" });\n\n",
    ' ', u" it('loadData works as expected', function() {\n",
    ' ', u" });\n",
    ' ', u"});"
]
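
The requested behaviour can be sketched outside Beautiful Soup as a plain string operation. This is only an illustration under assumptions: `indent_each_line` is a hypothetical helper, not part of Beautiful Soup, and it relies on the standard library's `textwrap.indent`, which prefixes every non-blank line and leaves blank lines alone (matching the list above, where the `\n\n` runs get no indent of their own).

```python
import textwrap


def indent_each_line(text, indent_level, indent_unit=" "):
    """Prefix every non-blank line of `text` with the indentation.

    Hypothetical helper for illustration; Beautiful Soup itself only
    emits the indentation once, before the whole string.
    """
    prefix = indent_unit * indent_level
    return textwrap.indent(text, prefix)


script = "var a = 1;\n\nfunction f() {\n  return a;\n}"
print(indent_each_line(script, 1))
```

Joining the reporter's "correct" list with `"".join(s)` would give a string of this same shape: one indent before each line that has content.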

Leonard Richardson (leonardr) wrote:

I understand why you want this, but I think it's out of scope for this project. The content of a <script> or <style> tag is a different media type from HTML, and reindenting it requires parsing that media type. It looks like you're saying that 1) a <script> tag will only ever contain Javascript, and 2) adding whitespace where whitespace already exists won't change the meaning of a Javascript program.

1) is basically true right now, and 2) seems reasonable, but I don't know enough about Javascript to be confident that it's always true. You'd think indentation wouldn't matter for HTML, but there are specific cases where it matters -- <pre> and <textarea> tags for example -- and those have caused Beautiful Soup bugs in the past.
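
One case where assumption 2) does not hold is a multi-line JavaScript template literal (a backtick-delimited string): whitespace added at the start of a continuation line becomes part of the string value. The sketch below only manipulates the JavaScript as text from Python; `js_source` and `template_literal_body` are made up for the illustration, and `textwrap.indent` stands in for a naive per-line reindent.

```python
import textwrap

# A tiny JavaScript program whose template literal spans two lines.
js_source = "const msg = `line one\nline two`;\nconsole.log(msg);"


def template_literal_body(source):
    """Return the text between the first pair of backticks."""
    start = source.index("`") + 1
    end = source.index("`", start)
    return source[start:end]


# Naive reindent: add two spaces before every non-blank line.
indented = textwrap.indent(js_source, "  ")

before = template_literal_body(js_source)
after = template_literal_body(indented)

print(repr(before))
print(repr(after))
print("string value changed:", before != after)
```

The second line of the literal picks up the two added spaces, so the program would log a different string after reindenting.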

Basically, I don't want to be responsible for every piece of (potentially invalid) Javascript someone runs through Beautiful Soup. I'd rather leave the Javascript alone and focus on the HTML. The same is true, to a lesser extent, for the CSS code in <style> tags.

If anything, I should probably add <script> and <style> to the list of preserve_whitespace_tags (like <pre> and <textarea>) to reduce the possibility that the _initial_ indentation changes the meaning of the content inside the tag.

I'm open to reopening this issue but I'll need to be convinced that the change can be made without having to bring in parsers for other media types, that it won't break anything, and that it will on balance make people happier.

Changed in beautifulsoup:
status: New → Won't Fix