"UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc4' in position 69: surrogates not allowed" with mime.file() on path from os.walk

Bug #1677244 reported by Jamie Strandboge
This bug affects 1 person
Affects             Status        Importance  Assigned to  Milestone
file (Ubuntu)       New           Undecided   Unassigned
python3.5 (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

The following script works fine on 16.04 LTS:

#!/usr/bin/python3

import magic
import os

dir = "/usr/share/ca-certificates/mozilla"

mime = magic.open(magic.MAGIC_MIME)
mime.load()

for root, dirnames, filenames in os.walk(dir):
    for f in filenames:
        fn = os.path.join(root, f)
        print("%s: %s" % (fn, mime.file(fn)))

For example:
$ python3 /tmp/test.py
/usr/share/ca-certificates/mozilla/TWCA_Root_Certification_Authority.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Baltimore_CyberTrust_Root.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Comodo_AAA_Services_root.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Hellenic_Academic_and_Research_Institutions_RootCA_2011.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/TC_TrustCenter_Class_3_CA_II.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Security_Communication_RootCA2.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/EBG_Elektronik_Sertifika_Hizmet_Sağlayıcısı.crt: text/plain; charset=us-ascii
...

(Note the last filename before the ellipsis: it contains non-ASCII characters.)

But on 17.04, this happens:

$ python3 /tmp/test.py
/usr/share/ca-certificates/mozilla/TWCA_Root_Certification_Authority.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Baltimore_CyberTrust_Root.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Comodo_AAA_Services_root.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Hellenic_Academic_and_Research_Institutions_RootCA_2011.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/TC_TrustCenter_Class_3_CA_II.crt: text/plain; charset=us-ascii
/usr/share/ca-certificates/mozilla/Security_Communication_RootCA2.crt: text/plain; charset=us-ascii
Traceback (most recent call last):
  File "/home/ubuntu/test.py", line 15, in <module>
    print("%s: %s" % (fn, mime.file(fn)))
  File "/usr/lib/python3/dist-packages/magic.py", line 130, in file
    bi = bytes(filename, 'utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc4' in position 69: surrogates not allowed

I'm guessing this is a change in python3 that python3-magic hasn't accounted for, but I'm not sure. Adding python3 task just in case.
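
A possible explanation, offered as a sketch rather than a confirmed diagnosis: position 69 of the failing path is the "ğ" in "Sağlayıcısı", whose UTF-8 encoding starts with byte 0xC4. If Python's filesystem encoding is ASCII (for example under a C/POSIX locale; this is an assumption about the 17.04 environment), os.walk() decodes that byte to the lone surrogate '\udcc4' via the "surrogateescape" error handler, and the strict bytes(filename, 'utf-8') call in magic.py then rejects it. The snippet below reproduces that mechanism without libmagic; the shortened file name and the helper encodes_as_strict_utf8 are hypothetical, for illustration only.

```python
# "ğ" (U+011F) is the two bytes C4 9F in UTF-8.
raw = "Sa\u011f.crt".encode("utf-8")

# Assumption: an ASCII filesystem encoding, so each undecodable byte
# becomes a lone surrogate (0xC4 -> U+DCC4, 0x9F -> U+DC9F).
name = raw.decode("ascii", "surrogateescape")


def encodes_as_strict_utf8(s):
    """Return True if s survives the strict encoding done by bytes(s, 'utf-8')."""
    try:
        bytes(s, "utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Encoding with the same error handler restores the original bytes.
restored = name.encode("ascii", "surrogateescape")

print(ascii(name))
print(encodes_as_strict_utf8(name))
print(restored == raw)
```

If this is indeed the cause, os.fsencode(fn) would recover the on-disk bytes of the path in the test script. Whether mime.file() in this version of python3-magic accepts a bytes argument has not been checked here.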

Michael Hudson-Doyle (mwhudson) wrote:

Your test program works in artful with python 3.6 for me; I guess something got updated to fix it but am not going to dig into why unless you really want me to...

Changed in python3.5 (Ubuntu):
status: New → Fix Released