xmllint does not recognize emdash (&mdash;)

Bug #2020814 reported by Jeffrey Walton
This bug affects 1 person
Affects           Status   Importance  Assigned to  Milestone
libxml2 (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

I'm using Ubuntu 22.04.2 LTS, x86_64, fully patched. I'm using DocBook to build a PDF. One of the steps in my build script validates and formats the XML using xmllint from libxml2-utils 2.9.13+dfsg-1ubuntu0.3:

    echo "Validating book..."
    if ! xmllint --xinclude --noout --postvalid book.xml
    then
        echo "Validation failed. Exiting."
        exit 1
    fi
    echo "Complete."

    echo "Formatting source code..."
    for file in *.xml
    do
        if xmllint --format "${file}" --output "${file}.format"
        then
            mv "${file}.format" "${file}"
        fi
    done
    echo "Complete."

When I added an emdash (&mdash;) the book failed to format:

    Validating book...
    Complete.
    Formatting source code...
    ch02.xml:58: parser error : Entity 'mdash' not defined
     injections are remediated using several methods. And two output devices &mdash;
                                                                                   ^
    ch02.xml:58: parser error : Entity 'mdash' not defined
     methods. And two output devices &mdash; the printer and plaintext email &mdash;
                                                                                   ^
    Complete.

The text is:

    <para>... And two output devices &mdash; the printer and plaintext email &mdash; do not require...</para>

It seems like emdash should be recognized.

-----

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy

-----

$ xmllint --version
xmllint: using libxml version 20913
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ICU ISO8859X Unicode Regexps Automata Schemas Schematron Modules Debug Zlib Lzma

$ command -v xmllint
/usr/bin/xmllint

$ dpkg -S /usr/bin/xmllint
libxml2-utils: /usr/bin/xmllint

$ apt-cache show libxml2-utils
Package: libxml2-utils
Architecture: amd64
Version: 2.9.13+dfsg-1ubuntu0.3
Multi-Arch: foreign
Priority: optional
Section: text
Source: libxml2
Origin: Ubuntu
Maintainer: Ubuntu Developers <email address hidden>
Original-Maintainer: Debian XML/SGML Group <email address hidden>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 202
Depends: libc6 (>= 2.34), libxml2 (>= 2.9.0)
Filename: pool/main/libx/libxml2/libxml2-utils_2.9.13+dfsg-1ubuntu0.3_amd64.deb
Size: 40192
MD5sum: 3ca7de07562010fcaabf255ea8fea9c4
SHA1: 128a9cfaff49e85f2ab08578f389eecb21f17766
SHA256: c279c07caf909545e2cedb7845b5ac652e0a70f9784e5faf799a1a01441b4649
SHA512: 51600d7206c9a5568fdaeee9adddbc48962fc094cc479f4bd42c0714b3725cd3200937f8c876a897db08fc50d891005d7dbabfb6ae12ad27ad6ed416f8b6a03d
Homepage: http://xmlsoft.org
...

Thorsten Glaser (mirabilos) wrote :

I doubt this is a bug: nowhere do you pass the validator a DTD, and entities are defined in the DTD.

It’s best practice nowadays to not use entities but just write the UTF-8 characters directly.

An em dash surrounded by hair spaces is: “ — ” (for your copy/paste convenience)
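
To illustrate the point that entities are defined in the DTD: if the named entity is preferred over the literal character, one option is to declare it in the document's internal DTD subset. The following is only a minimal sketch under stated assumptions. The file name `mdash-demo.xml` is hypothetical, the sample text is borrowed from the report, and `&#8212;` is the numeric character reference for U+2014 (em dash). A DocBook book would normally carry its own DOCTYPE, so the declaration would have to be merged into that rather than copied as-is.

```shell
#!/bin/sh
# Hypothetical file name; not part of the original build script.
demo_file="mdash-demo.xml"

# Write a tiny document whose internal DTD subset declares the entity.
# &#8212; is the numeric character reference for U+2014 (em dash).
cat > "${demo_file}" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE para [
  <!ENTITY mdash "&#8212;">
]>
<para>two output devices &mdash; the printer and plaintext email &mdash; do not require</para>
EOF

# Well-formedness check only (no validation), if xmllint is installed.
if command -v xmllint >/dev/null 2>&1
then
    xmllint --noout "${demo_file}" && echo "Parsed without entity errors."
fi
```

With the declaration in place the parser has a definition for `mdash`, so the "Entity 'mdash' not defined" error should not appear for this file.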

Jeffrey Walton (noloader) wrote : Re: [Bug 2020814] Re: xmllint does not recognize emdash (&mdash;)

On Thu, May 25, 2023 at 5:35 PM Thorsten Glaser
<email address hidden> wrote:
>
> I doubt this is a bug: nowhere do you pass the validator a DTD, and
> entities are defined in the DTD.
>
> It’s best practice nowadays to not use entities but just write the UTF-8
> characters directly.
>
> An em dash surrounded by hair spaces is: “ — ” (for your copy/paste
> convenience)

I think you are right - this is not a bug. I took a quick peek at RFC
3470, and I don't see where HTML entity references are optional.

Sorry about that. I got some bad info off the internet (surprise,
surprise). It said to use the character entity reference for emdash
due to portability problems when using the character itself.

Jeff

Thorsten Glaser (mirabilos) wrote :

Yeah well, those portability problems were back in the 1990s when people used latin1 or whatever codepages.
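
For sources that already contain `&mdash;`, a conversion along the following lines may be enough. This is a sketch, not part of the original build script: the function name `mdash_to_numeric` is made up for illustration, and it rewrites the named entity as the numeric character reference `&#8212;`, which XML parsers accept without any DTD. Writing the literal UTF-8 character instead would work the same way.

```shell
#!/bin/sh
# Hypothetical helper: rewrite &mdash; as the numeric reference &#8212;.
# In a sed replacement a bare & means "the matched text", so it is escaped.
mdash_to_numeric() {
    sed 's/&mdash;/\&#8212;/g'
}

# Example: filter one line of DocBook text through the helper.
printf '%s\n' '<para>devices &mdash; the printer</para>' | mdash_to_numeric
# prints: <para>devices &#8212; the printer</para>
```

To apply it to a file, the output could be written to a temporary file and moved into place, in the same way the formatting loop in the report handles `${file}.format`.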

Changed in libxml2 (Ubuntu):
status: New → Invalid

Jeffrey Walton (noloader) wrote :

Thorsten, a quick question...

The first part of my book build script has this:

    echo "Validating book..."
    if ! xmllint --xinclude --noout --postvalid book.xml
    then
        echo "Validation failed. Exiting."
        exit 1
    fi
    echo "Complete."

Why did the book pass validation when using the &mdash; entity reference? It seems like that should have failed.

An example of the build script can be found at https://github.com/noloader/POWER8-crypto/tree/master/docbook. make-book.sh builds the book, and it includes the snippet.

Thorsten Glaser (mirabilos) wrote :

Hmm. I normally use libxml2 via xmlstarlet which has a somewhat nicer UX than xmllint.

My guess is that you didn’t give a DTD, so it could only check that all present entities are syntactically valid, but not expand them.
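
Independently of what the validation step does or does not catch, a plain-text scan can flag named entity references before xmllint runs. The sketch below is an assumption-laden approximation, not an xmllint feature: the function name `list_named_entities` is hypothetical, the regular expression covers only ASCII entity names, and it relies on `grep -o`, which GNU and BSD grep provide. It skips the five entities that XML itself predefines (`amp`, `lt`, `gt`, `apos`, `quot`) and ignores numeric references such as `&#8212;`.

```shell
#!/bin/sh
# Hypothetical helper: count named entity references per file, skipping the
# five entities predefined by XML. /dev/null is added so that grep always
# prefixes each match with its file name, even for a single input file.
list_named_entities() {
    grep -oE '&[A-Za-z_][A-Za-z0-9._-]*;' "$@" /dev/null |
        grep -vE '&(amp|lt|gt|apos|quot);$' |
        sort | uniq -c
}
```

Called as `list_named_entities *.xml` from the book directory, it prints one line per file and entity with an occurrence count. Any entity it lists needs a declaration from a DTD that the parser actually loads, or a replacement with the character itself.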
