nbsp Causes Trouble with Tidy

Bug #867394 reported by nobody
This bug affects 2 people
Affects: Zorba
Status: New
Importance: Medium
Assigned to: Sorin Marian Nasoi
Milestone:

Bug Description

------------------------------------------
  Submitted:
------------------------------------------

Name: Michael Westbay
Email: ***************
Reason: other reasons
web site: http://try.zorba-xquery.com
feedbackID: 24

------------------------------------------
  Message:
------------------------------------------

A major problem with parsing HTML is the ever-present nbsp entity. As you know, it's not part of the XML specification, so it needs to be converted to &#160; before an XML parser can deal with it.

Playing with the "tidy function with options" live demo, I tried the following:

import module namespace html="http://www.zorba-xquery.com/modules/converters/html";
import schema namespace html-options="http://www.zorba-xquery.com/modules/converters/html-options";

html:parse('<title>Foo</title><p>Foo!&#nbsp; Spaces',
            <options xmlns="http://www.zorba-xquery.com/modules/converters/html-options" >
              <tidyParam name="output-xml" value="yes" />
              <tidyParam name="doctype" value="omit" />
              <tidyParam name="quote-nbsp" value="no" />
              <tidyParam name="char-encoding" value="utf8" />
              <tidyParam name="newline" value="LF" />
              <tidyParam name="tidy-mark" value="no" />
            </options>)

With the "quote-nbsp" option set to "no", it works if the paragraph is 'Foo!&#160; Spaces' but not if it is 'Foo!&#nbsp; Spaces'. It fails either way when "quote-nbsp" is set to "yes". The behavior is the same whether "output-xml" is set to "yes" or "no".

I was hoping that the tidy module would output the nbsp entity as &#160; instead of &nbsp; for XML output, but that does not seem to be the case.

Is this something that can be fixed on the Zorba side within the HTML module? Or does this issue need to be handled on the Tidy side?

------------------------------------------
  Query:
------------------------------------------
import module namespace html="http://www.zorba-xquery.com/modules/converters/html";
import schema namespace html-options="http://www.zorba-xquery.com/modules/converters/html-options";

html:parse('<title>Foo</title><p>Foo!&#nbsp; Spaces',
            <options xmlns="http://www.zorba-xquery.com/modules/converters/html-options" >
              <tidyParam name="output-xml" value="no" />
              <tidyParam name="doctype" value="omit" />
              <tidyParam name="quote-nbsp" value="no" />
              <tidyParam name="char-encoding" value="utf8" />
              <tidyParam name="newline" value="LF" />
              <tidyParam name="tidy-mark" value="no" />
            </options>)

Sorin Marian Nasoi (sorin.marian.nasoi) wrote :

In your first example, &#nbsp; should be replaced by one of the following:
- &amp;nbsp; or
- &#160;
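
For illustration, the first query with the entity escaped might look like the sketch below. It reuses the imports and options from the report unchanged; only the string literal differs. In an XQuery string literal, &amp;nbsp; yields the literal text "&nbsp;" that is handed to Tidy, whereas &#160; is a character reference that XQuery itself resolves to U+00A0. What Tidy then emits for this input is an assumption here, not something verified against a Zorba build.

```xquery
import module namespace html="http://www.zorba-xquery.com/modules/converters/html";
import schema namespace html-options="http://www.zorba-xquery.com/modules/converters/html-options";

(: &amp;nbsp; reaches Tidy as the text "&nbsp;".
   Writing &#160; instead would pass the U+00A0 character directly. :)
html:parse('<title>Foo</title><p>Foo!&amp;nbsp; Spaces',
            <options xmlns="http://www.zorba-xquery.com/modules/converters/html-options" >
              <tidyParam name="output-xml" value="yes" />
              <tidyParam name="doctype" value="omit" />
              <tidyParam name="quote-nbsp" value="no" />
              <tidyParam name="char-encoding" value="utf8" />
              <tidyParam name="newline" value="LF" />
              <tidyParam name="tidy-mark" value="no" />
            </options>)
```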

With regard to the second issue, the one related to the "quote-nbsp" option in tidy:
setting
<tidyParam name="quote-nbsp" value="yes" />
raises an error, and this seems to be a bug in the html module.
I have created SF bug #3405598:
https://sourceforge.net/tracker/?func=detail&aid=3405598&group_id=226244&atid=1067586
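
Until that is resolved, one possible workaround is to replace the entity in the input string before it reaches the html module. The sketch below is only an illustration: $raw and $clean are hypothetical names, fn:replace is the standard XQuery function, and it assumes that passing the U+00A0 character with "quote-nbsp" set to "no" behaves as described in the report.

```xquery
import module namespace html="http://www.zorba-xquery.com/modules/converters/html";
import schema namespace html-options="http://www.zorba-xquery.com/modules/converters/html-options";

(: $raw is a hypothetical input; in this literal, &amp;nbsp; denotes the text "&nbsp;". :)
let $raw := '<title>Foo</title><p>Foo!&amp;nbsp; Spaces'
(: Swap every "&nbsp;" for the U+00A0 character before parsing. :)
let $clean := replace($raw, '&amp;nbsp;', '&#160;')
return
  html:parse($clean,
             <options xmlns="http://www.zorba-xquery.com/modules/converters/html-options" >
               <tidyParam name="output-xml" value="yes" />
               <tidyParam name="doctype" value="omit" />
               <tidyParam name="quote-nbsp" value="no" />
               <tidyParam name="char-encoding" value="utf8" />
               <tidyParam name="newline" value="LF" />
               <tidyParam name="tidy-mark" value="no" />
             </options>)
```

This only covers the nbsp entity; other HTML-only named entities would need the same treatment.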
