HTML Tidy is doing a poor job; please update to newer HTML Tidy

Bug #1660537 reported by Jeffrey Walton on 2017-01-31
This bug affects 3 people
Affects: tidy-html5 (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

In the past I ran HTML pages through HTML Tidy provided by MacPorts on OS X. I'm now working on an Ubuntu 16.10/Yakkety system, and its HTML Tidy is doing an awful job on the pages. When I diff the pages, nearly everything has changed.

For example, Ubuntu's HTML Tidy is not indenting, it's adding extra characters, and it's stripping whitespace that should remain. Others have experienced the problem, too: https://stackoverflow.com/questions/24505764/html-tidy-stripping-space-at-the-start.

Please update to a more recent version of HTML Tidy.

**********

The nice thing about this report is that the pages and the script are located at https://github.com/weidai11/website. You can reproduce the problem with the following. You don't even need to make a change; just diff after running `cleanup.sh`.

   git clone https://github.com/weidai11/website
   cd website
   ./cleanup.sh

HTML Tidy is invoked with the following in the script:

   # Cleanup HTML files
   for file in *.html
   do
      echo "**************** $file ****************"

      echo "tidy: processing file $file..."
      "$HTML_TIDY" --quiet yes --output-bom no --indent auto --wrap 90 -m "$file"

      echo "sed: processing file $file..."

      # Delete trailing whitespace
      "$SED" "${SED_OPTS[@]}" -e's/[[:space:]]*$//' "$file"

      # Delete the generator markup tag
      "$SED" "${SED_OPTS[@]}" -e'/<meta name="generator"/d' "$file"

      # Fix CRLF endings after sed
      unix2dos "$file"
   done

**********

   $ lsb_release -a
   No LSB modules are available.
   Distributor ID: Ubuntu
   Description: Ubuntu 16.10
   Release: 16.10
   Codename: yakkety

**********

$ apt-cache show tidy
Package: tidy
Priority: optional
Section: universe/web
Installed-Size: 83
Maintainer: Ubuntu Developers <email address hidden>
Original-Maintainer: Jason Thomas <email address hidden>
Architecture: amd64
Source: tidy-html5
Version: 1:5.2.0-2
Depends: libc6 (>= 2.14), libtidy5 (= 1:5.2.0-2)
Filename: pool/universe/t/tidy-html5/tidy_5.2.0-2_amd64.deb
Size: 25524
MD5sum: 06fda2013e8edb31fbc37fb2bb407e5c
SHA1: 58b1b60cd8bc2a084d78d374b24fefd24acc7783
SHA256: 6c9492519b78c37f3ac97c88237b7832e4f50d3eb303364e5afb44ecbe0ed548
...

Hans Joachim Desserud (hjd) wrote :

Thanks for reporting. It looks like in 16.10 and later, the package for tidy is tidy-html5. I've taken the liberty of moving this bug report to the right package.

affects: tidy (Ubuntu) → tidy-html5 (Ubuntu)
tags: added: upgrade-software-version

Jeffrey Walton (noloader) wrote :

Thanks for the quick response, Hans.

If you need a retest, then call it out. I'd be happy to perform it.

Sorry about the mis-classification.

Jeremy Bicha (jbicha) wrote :

It looks like tidy-html5 5.2.0 is the latest stable release available:
https://github.com/htacg/tidy-html5/releases

Thank you for taking the time to report this bug and helping to make Ubuntu better. The issue you are reporting is an upstream one and it would be nice if somebody having it could send the bug to the developers of the software at https://github.com/htacg/tidy-html5/issues . If you have done so, please tell us the number of the upstream bug (or the link), so we can add a bugwatch that will inform us about its status. Thanks in advance.

Hans Joachim Desserud (hjd) wrote :

Just looked a bit more at this, and found that http://www.html-tidy.org/ mentions the latest release is 5.2.0. This seems to be the version available in Ubuntu 16.10 and later, so I'm slightly confused now. Is there a newer version available?

>If you need a retest, then call it out. I'd be happy to perform it.
>Sorry about the mis-classification.

No problem. And just to clarify, I mainly tag bug reports and stuff, I don't necessarily upload new versions of packages.

Jeffrey Walton (noloader) wrote :

> The issue you are reporting is an upstream one and it would be nice if somebody having it could send the bug to the developers of the software at https://github.com/htacg/tidy-html5/issues .

My bad... I did not check Debian. Let me see if there's a Debian bug covering it.

Here's what MacPorts is providing:

    $ port info tidy
    tidy @5.2.0 (www)
    Variants: debug, universal

    Description: Tidy is a utility to clean up and fix broken HTML files.
    Homepage: http://www.html-tidy.org/

    Build Dependencies: cmake, libxslt
    Platforms: darwin
    License: MIT
    Maintainers: Email: <email address hidden>
                          Policy: openmaintainer

**********

It looks like Debian's HTML Tidy is producing the issue:

diff --git a/downloads.html b/downloads.html
index b1199eb..b3f1807 100644
--- a/downloads.html
+++ b/downloads.html
@@ -3,15 +3,15 @@

 <html>
 <head>
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
-
- <title>Crypto++ Library | All Downloads</title>
- <meta name="description" content=
- "free C++ library for cryptography: includes ciphers, message authentication codes, one-way hash functions, public-key cryptosystems, key agreement schemes, and deflate compression">
- <meta name="keywords" content=
- "Crypto++, CryptoPP, crypto, cryptography, cryptographic, security, free, open source, public domain, library, C++, SSE2, SSE4, AESNI, RDRAND, RDSEED, NEON, ASIMD, cipher, ciphers, code, codes, scheme, schemes, hash, digest, cryptosystem, key agreement, AES, DH, RSA, DSA, DES, SHA, HMAC, HKDF, elliptic curve">
- <link rel="stylesheet" type="text/css" href="cryptopp.css">
- <style type="text/css">
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8">^M
+^M
+<title>Crypto++ Library | All Downloads</title>^M
+<meta name="description" content=^M
+"free C++ library for cryptography: includes ciphers, message authentication codes, one-way hash functions, public-key cryptosystems, key agreement schemes, and deflate compression">^M
+<meta name="keywords" content=^M
+"Crypto++, CryptoPP, crypto, cryptography, cryptographic, security, free, open source, public domain, library, C++, SSE2, SSE4, AESNI, RDRAND, RDSEED, NEON, ASIMD, cipher, ciphers, code, codes, scheme, schemes, hash, digest, cryptosystem, key agreement, AES, DH, RSA, DSA, DES, SHA, HMAC, HKDF, elliptic curve">^M
+<link rel="stylesheet" type="text/css" href="cryptopp.css">^M
+<style type="text/css">^M
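
One way to check whether the ^M markers reflect a real change in line endings is to count the lines that end in a carriage return before and after running the script. The following is a minimal sketch, assuming a POSIX shell and a grep that supports `-c`; `count_crlf_lines` is a hypothetical helper name, not something from the repository:

```shell
# count_crlf_lines (hypothetical helper): print how many lines of a file end in CR.
count_crlf_lines() {
   cr=$(printf '\r')   # literal carriage return, portable across shells
   # grep -c prints the count; it exits non-zero when the count is 0, so mask that.
   grep -c "${cr}\$" "$1" || true
}
```

For example, `count_crlf_lines downloads.html` could be run before and after `./cleanup.sh`. Equal counts would suggest the line endings themselves did not change and that the remaining differences are indentation and wrapping.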

**********

Here's what Debian is providing:

$ apt-cache show tidy
Package: tidy
Version: 20091223cvs-1.4+deb8u1
Installed-Size: 84
Maintainer: Jason Thomas <email address hidden>
Architecture: amd64
Depends: libc6 (>= 2.14), libtidy-0.99-0 (>= 20091223cvs-1.4+deb8u1)
Suggests: tidy-doc
Description-en: HTML syntax checker and reformatter
 Corrects markup in a way compliant with the latest standards, and
 optimal for the popular browsers. It has a comprehensive knowledge
 of the attributes defined in the HTML 4.0 recommendation from W3C,
 and understands the US ASCII, ISO Latin-1, UTF-8 and the ISO 2022
 family of 7-bit encodings. In the output:
 .
 ...


Jeffrey Walton (noloader) wrote :

This looks ominous... From Debian:

    $ tidy --version
    HTML Tidy for Linux released on 25 March 2009

From MacPorts:

    $ /opt/local/bin/tidy --version
    HTML Tidy for Mac OS X version 5.2.0
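
The two banners above have different shapes: the 2009 build reports a release date, while tidy-html5 reports a version number. A script that depends on the newer behavior could check for that before processing any files. The following is a minimal sketch, assuming the two banner formats shown above are representative; `classify_tidy_version` is a hypothetical helper name, not part of HTML Tidy:

```shell
# classify_tidy_version (hypothetical helper): classify a `tidy --version` banner.
# Prints "html5" for a numbered version, "legacy" for a date-stamped release,
# and "unknown" for anything else.
classify_tidy_version() {
   case "$1" in
      *" version "[0-9]*) echo "html5" ;;
      *" released on "*)  echo "legacy" ;;
      *)                  echo "unknown" ;;
   esac
}
```

In the cleanup script it could be called as `classify_tidy_version "$("$HTML_TIDY" --version)"`, and the script could stop with a message when the result is not `html5`.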

Jeremy Bicha (jbicha) wrote :

Yes, the current stable version of Debian offers the 2009 tidy. tidy-html5 is a fork. The next stable version of Debian (expected this year) will include the same tidy-html5 5.2.0 currently in Ubuntu 16.10.

5.4.0 was released 2017-05-01, and release notes are at http://binaries.html-tidy.org/release_notes/5.4.0.html

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in tidy-html5 (Ubuntu):
status: New → Confirmed
summary: - HTML Tidy is dooing a poor job; please update to newer HTML Tidy
+ HTML Tidy is doing a poor job; please update to newer HTML Tidy

FYI there is 2:5.6.0-6 in 19.04 proposed now
