UnicodeEncodeError when logging improperly encoded filenames

Bug #1893481 reported by Jeff Dairiki on 2020-08-28
This bug affects 1 person
Affects             Status  Importance  Assigned to  Milestone
Duplicity                   Medium      Unassigned
duplicity (Ubuntu)          Low         Unassigned

Bug Description

Attempts to log messages which contain unicode surrogate characters cause exceptions.
(These surrogate characters arise, for example, when handling files whose names are not properly encoded as UTF-8.)

NOTE: I have no idea whether this is an issue when running on python 2. (If it is, the fixes suggested below probably won't work.)

Duplicity version: 0.8.15
Python version: 3.8.5
Target filesystem: Linux

Example log output:

--- Logging error ---
Traceback (most recent call last):
  File "/opt/Python-3.8.5/lib/python3.8/logging/__init__.py", line 1084, in emit
    stream.write(msg + self.terminator)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc4' in position 45: surrogates not allowed
Call stack:
  File "/root/.local/pipx/venvs/duplicity/bin/duplicity", line 104, in <module>
    with_tempdir(main)
  File "/root/.local/pipx/venvs/duplicity/bin/duplicity", line 90, in with_tempdir
    fn()
  File "/root/.local/pipx/venvs/duplicity/lib/python3.8/site-packages/duplicity/dup_main.py", line 1531, in main
    do_backup(action)
  File "/root/.local/pipx/venvs/duplicity/lib/python3.8/site-packages/duplicity/dup_main.py", line 1655, in do_backup
    full_backup(col_stats)
  File "/root/.local/pipx/venvs/duplicity/lib/python3.8/site-packages/duplicity/dup_main.py", line 559, in full_backup
    bytes_written = write_multivol(u"full", tarblock_iter,
  File "/root/.local/pipx/venvs/duplicity/lib/python3.8/site-packages/duplicity/dup_main.py", line 417, in write_multivol
    at_end = gpg.GPGWriteFile(tarblock_iter, tdp.name, config.gpg_profile,
  File "/root/.local/pipx/venvs/duplicity/lib/python3.8/site-packages/duplicity/gpg.py", line 390, in GPGWriteFile
    data = block_iter.__next__().data
  File "/root/.local/pipx/venvs/duplicity/lib/python3.8/site-packages/duplicity/diffdir.py", line 544, in __next__
    result = self.process(next(self.input_iter)) # pylint: disable=assignment-from-no-return
  File "/root/.local/pipx/venvs/duplicity/lib/python3.8/site-packages/duplicity/diffdir.py", line 238, in get_delta_iter
    log_delta_path(delta_path, new_path, stats)
  File "/root/.local/pipx/venvs/duplicity/lib/python3.8/site-packages/duplicity/diffdir.py", line 181, in log_delta_path
    log.Info(_(u"A %s") %
  File "/root/.local/pipx/venvs/duplicity/lib/python3.8/site-packages/duplicity/log.py", line 128, in Info
    Log(s, INFO, code, extra)
  File "/root/.local/pipx/venvs/duplicity/lib/python3.8/site-packages/duplicity/log.py", line 91, in Log
    _logger.log(DupToLoggerLevel(verb_level), s,
Message: 'A home/dairiki/PRCS/junk-changelog/22_Senaste\udcc4nd,v'
Arguments: ()

Steps to reproduce:
- Have a file with funny characters in its name, encoded in latin-1 encoding. E.g. a file whose name is "Fü" encoded to latin-1 (b'F\xfc'). When duplicity handles this file, the improperly encoded character will be replaced with a unicode surrogate character.
- Attempt to create an archive containing this file, with verbosity set to 5. Duplicity will try to log each file processed. When it gets to this file, an exception will be reported (and the file will not make it into the archive.)
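
The failure can be sketched without duplicity at all. The snippet below is only an illustration, assuming a UTF-8 filesystem encoding and that the file name reaches the logger after being decoded the way Python 3 decodes file names (with the 'surrogateescape' handler); the variable names are made up for the example.

```python
# Raw bytes of the file name "Fü" encoded as latin-1 (not valid UTF-8).
raw_name = b"F\xfc"

# Python 3 decodes file names with errors='surrogateescape', so the
# undecodable byte 0xFC becomes the lone surrogate U+DCFC.
name = raw_name.decode("utf-8", errors="surrogateescape")

try:
    # A text stream opened with errors='strict' performs this same encoding
    # step when the log message is written.
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print("strict encoding failed:", exc.reason)
```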

Alternative steps to reproduce:
- If the archive is created with verbosity less than 5, the file will make it into the archive. However, if an attempt is made to list files using 'duplicity list-current-files', an exception will be reported when it gets to the file with the funny name.

Workaround
==========

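A simple workaround is to set the environment variable PYTHONIOENCODING="utf-8:surrogateescape" before running duplicity. This sets the encoding error mode for stdout to 'surrogateescape' (by default it is 'strict'); per the Python documentation, the error-handler part is ignored for stderr, which always uses 'backslashreplace'. The effect is that any surrogates are written back out as the original undecodable bytes instead of raising an exception. (A UTF-8 terminal will typically display such a byte as the unicode replacement character, U+FFFD: "�".)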
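
To illustrate what the 'surrogateescape' handler does on output, here is a small sketch that uses an in-memory stream as a stand-in for stdout (the stand-in and the variable names are assumptions for the example, not duplicity code). The surrogate is turned back into the original byte; decoding that byte with errors='replace' approximates what a UTF-8 terminal would show.

```python
import io

name = "F\udcfc"  # "F" followed by the surrogate standing in for byte 0xFC

# Stand-in for a stdout configured with "utf-8:surrogateescape".
buffer = io.BytesIO()
stream = io.TextIOWrapper(
    buffer, encoding="utf-8", errors="surrogateescape", newline="\n"
)
stream.write(name + "\n")
stream.flush()

# The surrogate is written as the original raw byte, not as U+FFFD.
written = buffer.getvalue()

# Roughly what a UTF-8 terminal displays for those bytes.
displayed = written.decode("utf-8", errors="replace")
```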

Possible Fix
============

A possible fix, at least for Py3K, is probably for duplicity to explicitly set the encoding error strategy for stdout and stderr.
For python >= 3.7 this is simple:

    sys.stdout.reconfigure(errors='surrogateescape')
    sys.stderr.reconfigure(errors='surrogateescape')

For earlier pythons (>= 3), the best option might be:

    sys.stdout = codecs.getwriter('utf-8')(sys.stdout.detach(), 'surrogateescape')

(and similarly for stderr)
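
The two variants above could be combined into one helper. This is only a sketch: the function name use_surrogateescape is hypothetical, it is not part of duplicity, and the demo runs against an in-memory stream rather than the real stdout.

```python
import codecs
import io


def use_surrogateescape(stream):
    """Return a text stream like `stream` that tolerates lone surrogates."""
    if hasattr(stream, "reconfigure"):
        # Python >= 3.7: change the error handler in place.
        stream.reconfigure(errors="surrogateescape")
        return stream
    # Older Python 3: rewrap the underlying binary buffer.
    return codecs.getwriter("utf-8")(stream.detach(), "surrogateescape")


# Demo on an in-memory stand-in for stdout.
buffer = io.BytesIO()
demo = use_surrogateescape(io.TextIOWrapper(buffer, encoding="utf-8"))
demo.write("A F\udcfc")
demo.flush()
```

In a program this would presumably be applied early in startup, e.g. sys.stdout = use_surrogateescape(sys.stdout), and likewise for sys.stderr.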

Note that python 2 doesn't know about errors='surrogateescape'. Using errors='replace' would probably work as an alternative, but it's not ideal, as it replaces the surrogates with a plain question mark rather than a unicode replacement character.

Possible Similar Issue
======================

I didn't actually verify that this fails, but it appears that there might be a similar issue when using the --log-fd command line option. Function duplicity.log.add_fd() does a:

    handler = logging.StreamHandler(os.fdopen(fd, u'w'))

In Python 3, os.fdopen (an alias for open) opens the stream with errors='strict' by default. Using

    handler = logging.StreamHandler(os.fdopen(fd, u'w', errors='surrogateescape'))

or

    handler = logging.StreamHandler(open(fd, u'w', errors='surrogateescape'))

is probably a better choice. (But neither will work in python 2.)
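
A rough way to check this idea in isolation is sketched below. It uses a pipe as a stand-in for the descriptor passed via --log-fd; the logger name is hypothetical, and encoding='utf-8' is added here only to make the example deterministic (the suggestion above relies on the locale's default encoding).

```python
import logging
import os

# Stand-in for the file descriptor that would be given to --log-fd.
read_fd, write_fd = os.pipe()

stream = open(write_fd, "w", encoding="utf-8", errors="surrogateescape")
handler = logging.StreamHandler(stream)

logger = logging.getLogger("log_fd_sketch")  # hypothetical logger name
logger.propagate = False
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# With errors='strict' this call would report a logging error instead.
logger.info("A %s", "F\udcfc")

# Detach and close the handler before closing the write end of the pipe.
logger.removeHandler(handler)
handler.close()
stream.close()

with open(read_fd, "rb") as reader:
    received = reader.read()
```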

Jeff Dairiki (dairiki) wrote :

I apologize. I just now noticed that the duplicity project is apparently moving to gitlab?

Let me know if you'd like me to re-file this bug over there.

We're still using LP for some things, so no problem there.

Thanks for the bug report and analysis!  It will help!

Changed in duplicity:
assignee: nobody → Kenneth Loafman (kenneth-loafman)
importance: Undecided → Medium
milestone: none → 0.8.16
status: New → In Progress
status: In Progress → Confirmed
Changed in duplicity:
milestone: 0.8.16 → 0.8.17

Please run locale from a terminal and copy/paste the result in a comment.

Changed in duplicity:
assignee: Kenneth Loafman (kenneth-loafman) → nobody
status: Confirmed → Fix Committed
Changed in duplicity (Ubuntu):
importance: Undecided → Low
status: New → Fix Committed
Jeff Dairiki (dairiki) wrote :

The short answer is "en_US.UTF-8".

Here's the long answer:

dairiki@hairball$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Changed in duplicity:
status: Fix Committed → Fix Released
Jeff Dairiki (dairiki) wrote :

Thank you!
