html2xhtml produces invalid XML for MS Office HTML output

Bug #1706274 reported by Thomas Weber
This bug affects 1 person
Affects: libhtml-html5-parser-perl (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

This is the document element created by MS Office on a Mac:

<html xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="http://www.w3.org/TR/REC-html40">
<!-- -->
</html>

html2xhtml outputs the following invalid XML with two xmlns namespace declarations:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/TR/REC-html40" xmlns:x="urn:schemas-microsoft-com:office:excel"><head/><body>
</body></html>

I'm not sure what part of the Perl libraries is responsible for this and where to report this upstream. Any hints for that are very welcome.
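
The well-formedness problem in the output above can be checked independently of the Perl toolchain. The following is a minimal sketch in Python, using only the standard library's expat-based parser. The names `BROKEN`, `FIXED` and `is_well_formed` are illustrative, and `FIXED` is only an assumption about what a corrected output might look like (the XHTML default namespace kept, the Office one dropped); it is not taken from html2xhtml or its documentation.

```python
import xml.etree.ElementTree as ET

# Output reported for html2xhtml: the default namespace (xmlns) is declared twice.
BROKEN = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns="http://www.w3.org/TR/REC-html40" '
    'xmlns:x="urn:schemas-microsoft-com:office:excel">'
    '<head/><body>\n</body></html>'
)

# Assumption: one plausible corrected form, keeping only the XHTML default namespace.
FIXED = BROKEN.replace(' xmlns="http://www.w3.org/TR/REC-html40"', '')


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("reported output well-formed:", is_well_formed(BROKEN))
    print("single-xmlns variant well-formed:", is_well_formed(FIXED))
```

An XML parser has to reject a start tag that carries the same attribute name twice, so the reported output should fail this check while the single-xmlns variant passes.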

Kjetil Kjernsmo (kjetil) wrote :

I'm not 100% sure, since this module is a parser and not a serializer, but it appears HTML::HTML5::Parser just builds a DOM, and the serialization is then done by XML::LibXML. If the DOM it builds already carries the extra xmlns attribute, the serializer only writes out what it was given, so it seems likely the bug is indeed in HTML::HTML5::Parser.

The upstream bug tracker is at
https://rt.cpan.org/Public/Dist/Display.html?Name=HTML-HTML5-Parser
but unfortunately, it doesn't get a lot of attention these days. Nevertheless, please submit the bug upstream.

Changed in libhtml-html5-parser-perl (Ubuntu):
status: New → Confirmed
Kjetil Kjernsmo (kjetil) wrote :

I was able to reproduce the bug, and I have committed the example as test data to my own fork of the module: https://github.com/kjetilk/p5-html-html5-parser/commit/c4be3e6ee63d0850079c115ef4274e4c2c3befa9
I'm not a maintainer, so it doesn't bring us much closer to a solution though. Just did it since I have a fork :-)
