"UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc4' in position 69: surrogates not allowed" with mime.file() on path from os.walk

Bug #1677244 reported by Jamie Strandboge on 2017-03-29
This bug affects 1 person
Affects             Status  Importance  Assigned to  Milestone
file (Ubuntu)               Undecided   Unassigned
python3.5 (Ubuntu)          Undecided   Unassigned

Bug Description

The following script works fine on 16.04 LTS:

#!/usr/bin/python3

import magic
import os

dir = "/usr/share/ca-certificates/mozilla"

mime = magic.open(magic.MAGIC_MIME)
mime.load()

for root, dirnames, filenames in os.walk(dir):
    for f in filenames:
        fn = os.path.join(root, f)
        print("%s: %s" % (fn, mime.file(fn)))

For example:
$ python3 /tmp/test.py
/usr/share/ca-certificates/mozilla/TWCA_Root_Certification_Authority.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Baltimore_CyberTrust_Root.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Comodo_AAA_Services_root.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Hellenic_Academic_and_Research_Institutions_RootCA_2011.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/TC_TrustCenter_Class_3_CA_II.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Security_Communication_RootCA2.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/EBG_Elektronik_Sertifika_Hizmet_Sağlayıcısı.crt: text/plain; charset=us-ascii
...

(Note the last filename before the ellipsis, which contains non-ASCII characters.)

But on 17.04, this happens:

$ python3 /tmp/test.py
/usr/share/ca-certificates/mozilla/TWCA_Root_Certification_Authority.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Baltimore_CyberTrust_Root.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Comodo_AAA_Services_root.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Hellenic_Academic_and_Research_Institutions_RootCA_2011.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/TC_TrustCenter_Class_3_CA_II.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Security_Communication_RootCA2.crt: text/plain; charset=us-ascii
Traceback (most recent call last):
  File "/home/ubuntu/test.py", line 15, in <module>
    print("%s: %s" % (fn, mime.file(fn)))
  File "/usr/lib/python3/dist-packages/magic.py", line 130, in file
    bi = bytes(filename, 'utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc4' in position 69: surrogates not allowed

I'm guessing this is a change in python3 that python3-magic hasn't accounted for, but I'm not sure. Adding python3 task just in case.
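
One plausible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: '\udcc4' is a lone surrogate of the kind Python produces when it decodes filename bytes with the "surrogateescape" error handler, and 0xC4 is the first UTF-8 byte of the "ğ" in the EBG certificate filename (position 69 of the full path). That would happen if the filesystem encoding on the 17.04 machine were ASCII, for example under a C/POSIX locale, which is an assumption and not something this report states. The snippet below reproduces the mechanism without python3-magic; the variable and function names are illustrative only.

```python
# Raw filename bytes as they would be stored on disk (UTF-8).
raw = "Sağlayıcısı.crt".encode("utf-8")

# Assumption: the filesystem encoding is ASCII (e.g. a C/POSIX locale).
# Python then decodes filenames with "surrogateescape", mapping each
# undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("ascii", errors="surrogateescape")

# "ğ" is 0xC4 0x9F in UTF-8, so its first byte becomes '\udcc4'.
first_escaped = name[2]


def strict_utf8_fails(s):
    """Return True if a strict UTF-8 encode raises UnicodeEncodeError,
    which is what bytes(filename, 'utf-8') does in the traceback."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# Encoding with the same codec and error handler restores the bytes.
round_tripped = name.encode("ascii", errors="surrogateescape")
```

If this is indeed the cause, os.fsencode(fn) would recover the original bytes using the real filesystem encoding. Whether the installed magic.py accepts a bytes path is not something this report confirms.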

description: updated
Michael Hudson-Doyle (mwhudson) wrote :

Your test program works in artful with python 3.6 for me; I guess something got updated to fix it but am not going to dig into why unless you really want me to...

Changed in python3.5 (Ubuntu):
status: New → Fix Released