pandoc --ascii and --self-contained conflict
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
pandoc (Ubuntu) | Fix Released | Undecided | Unassigned |
Bug Description
$ echo '"test"' | pandoc -t html5 -o - --smart
<p>“test”</p>
$ echo '"test"' | pandoc -t html5 -o - --smart --ascii
<p>“
$ echo '"test"' | pandoc -t html5 -o - --smart --ascii --self-contained
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title></title>
<!--[if lt IE 9]>
<script src="http://
<![endif]-->
</head>
<body>
<p>“test”</p>
</body>
</html>
Clearly, "pandoc -t html5 -o - --smart --ascii --self-contained" should also have emitted entities for “ and ” rather than the literal characters.
ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: pandoc 1.9.1.1-1
ProcVersionSign
Uname: Linux 3.2.0-27-generic x86_64
ApportVersion: 2.0.1-0ubuntu11
Architecture: amd64
Date: Sat Jul 14 18:18:49 2012
EcryptfsInUse: Yes
ProcEnviron:
TERM=xterm
LC_COLLATE=C
PATH=(custom, user)
LANG=en_US.UTF-8
SHELL=/bin/bash
SourcePackage: pandoc
UpgradeStatus: Upgraded to precise on 2012-02-06 (159 days ago)
Looking at src/pandoc.hs, we have:
    | htmlFormat && ascii ->
        writerFn outputFile =<< selfcontain (toEntities result)
So the code does try to do the right thing in this case. result holds the standalone HTML string, and toEntities replaces non-ASCII characters with entities. The problem is that selfcontain parses the HTML string again, which resolves the entities back into characters, and that resolved string is what gets written out. I flipped the order so that we have:
    | htmlFormat && ascii -> do
        result' <- selfcontain result
        writerFn outputFile =<< return (toEntities result')
which seems to fix the issue. I'll report this upstream and see if we can get it fixed.
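
To illustrate why the order matters, here is a minimal, self-contained sketch. It is not pandoc's code: toEntities below is a simplified version that assumes numeric character references, and reparse is a hypothetical pure stand-in for selfcontain that only models the one relevant side effect described above (an HTML re-parse resolving entities back into characters).

```haskell
import Data.Char (chr, isAscii, isDigit, ord)

-- Simplified stand-in for toEntities (assumption: numeric references).
-- Replaces every non-ASCII character with a decimal character reference.
toEntities :: String -> String
toEntities = concatMap escape
  where
    escape c
      | ord c > 127 = "&#" ++ show (ord c) ++ ";"
      | otherwise   = [c]

-- Hypothetical stand-in for selfcontain: models only the fact that
-- re-parsing HTML resolves decimal character references to characters.
reparse :: String -> String
reparse ('&' : '#' : rest)
  | (digits@(_ : _), ';' : rest') <- span isDigit rest
  = chr (read digits) : reparse rest'
reparse (c : cs) = c : reparse cs
reparse []       = []

main :: IO ()
main = do
  let html = "<p>\8220test\8221</p>"
  -- Original order: escape first, then re-parse. The entities are undone.
  print (all isAscii (reparse (toEntities html)))  -- False
  -- Flipped order: re-parse first, then escape. The output stays ASCII.
  print (all isAscii (toEntities (reparse html)))  -- True
```

Under these assumptions, escaping must be the last step before writing, because any later pass that parses and re-renders the HTML can reintroduce non-ASCII characters.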