[11.10 beta1] UnicodeDecodeError crash on localized input in multiple encodings/languages

Bug #839609 reported by Dennis Chua
This bug affects 16 people
Affects                     Status        Importance  Assigned to       Milestone
command-not-found           Fix Released  Critical    Zygmunt Krynicki
command-not-found (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

The command-not-found package crashes when a Simplified Chinese character is entered as a bogus command. The problem was found in 11.10 beta1, on both i386 and amd64 systems. Debugging the Python script in /usr/lib/command-not-found shows that a UnicodeDecodeError is raised. The crash_guard() callback framework catches this and reports the error.

Here are further observations.

(1) With the same Simplified Chinese input, 11.04 handles the test case gracefully, returning a message
    explaining that the command is not found.

(2) Between these two series, Python has changed: 11.04 (Python 2.7.1+) versus 11.10 beta1 (Python 2.7.2+).

To elaborate on this problem, the following files have been included:

(1) Screenshots showing step by step how to reproduce the bug. As switching to Simplified Chinese is
    difficult to explain in words, a video was taken to show this process.

(2) A screenshot showing the /usr/lib/command-not-found script traced by means of the Python pdb module.
    This shows the zh_CN.UTF-8 byte stream input and the point where UnicodeDecodeError is raised.

This issue was investigated on an 11.10 beta1 host running in VirtualBox.

===

Taken from To_Reproduce_Bug.txt attachment.

01_After_ISO_Installation.png The VirtualBox VM with default English locale.

02_Open_Lanugage_Support.png Prepare to switch to Simplified Chinese locale.
    See the accompanying video for this process.

03_Enable_IBUS_Pinyin.png After switching to Simplified Chinese. Note the locale
    environment variables. Click the IBUS keyboard icon
    and select Pinyin input.

04_Pinyin_Enabled.png Ready for Pinyin input. Note the blue IBUS icon.

05_Type_Phonetic_Pinyin.png Type in two letters: 'w' followed by 'o'. Phonetically these
    correspond to the Chinese character representing 'I' or 'Myself'.
    IBUS displays options. You want the first one. Hit the space
    bar to choose it.

06_Chinese_Input_Complete.png Chinese 'wo' in zh_CN.UTF-8 is ready to be passed to Bash.
                                Hit the return key to do so.

07_Crash_command_not_found.png Bash calls command-not-found, which can't handle the input.

08_Disable_Pinyin_Input.png Instruct IBUS to disable Simplified Chinese input.

ProblemType: Bug
DistroRelease: Ubuntu 11.10
Package: command-not-found 0.2.43ubuntu1 [modified: usr/lib/command-not-found]
ProcVersionSignature: Ubuntu 3.0.0-9.15-generic 3.0.3
Uname: Linux 3.0.0-9-generic x86_64
Architecture: amd64
Date: Fri Sep 2 10:23:08 2011
InstallationMedia: Ubuntu 11.10 "Oneiric Ocelot" - Beta amd64 (20110901)
PackageArchitecture: all
SourcePackage: command-not-found
UpgradeStatus: No upgrade log present (probably fresh install)

Dennis Chua (dmcvocation) wrote :

To see the image and video file attachments, go to https://chinstrap.canonical.com/~dchua/bug_839609/

Changed in command-not-found (Ubuntu):
status: New → Confirmed
Dennis Chua (dmcvocation) wrote :

Further work on this problem yielded a likely solution (i.e. a 'hack'). First of all, the Python exception was:

     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

This was with the bogus Simplified Chinese command 我. The expected behavior of command-not-found would be something like:

root@u-VirtualBox:~# mgcc
未找到 'mgcc' 命令,您要输入的是否是:
 命令 'mlcc' 来自于包 'mlterm-tools' (universe)
 命令 'cgcc' 来自于包 'sparse' (multiverse)
 命令 'gcc' 来自于包 'gcc' (main)
 命令 'gcc' 来自于包 'pentium-builder' (universe)
mgcc:找不到命令

Now the hack involves updating two Python files in the package:

(1) /usr/lib/command-not-found (line 24) :
    cnf.install(unicode=True) ==> cnf.install(unicode=False)

(2) /usr/share/pyshared/CommandNotFound/util.py (line 9):
    _ = gettext.translation("command-not-found", fallback=True).ugettext ==>
    _ = gettext.translation("command-not-found", fallback=True).lgettext

With these edits in place, command-not-found can now handle the test case:

root@u-VirtualBox:~# 我
我:找不到命令

root@u-VirtualBox:~# mgcc
未找到 'mgcc' 命令,您要输入的是否是:
 命令 'mlcc' 来自于包 'mlterm-tools' (universe)
 命令 'cgcc' 来自于包 'sparse' (multiverse)
 命令 'gcc' 来自于包 'gcc' (main)
 命令 'gcc' 来自于包 'pentium-builder' (universe)
mgcc:找不到命令
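
As background on the `gettext.translation(..., fallback=True)` call touched by the second edit: with `fallback=True`, a missing message catalog yields a `NullTranslations` object that returns each message unchanged instead of raising an error. The sketch below uses the Python 3 `gettext` API (`ugettext` exists only in Python 2) and a made-up domain name, on the assumption that no catalog is installed for it. It is only an illustration of the API, not code from the package.

```python
import gettext

# Made-up domain: assumed to have no installed .mo catalog on this system.
translation = gettext.translation("example-domain-with-no-catalog", fallback=True)
_ = translation.gettext

# With no catalog found, the message comes back untranslated.
message = _("command not found")
print(type(translation).__name__)
print(message)
```

When a catalog for the requested domain and language is installed, the same call returns a `GNUTranslations` object and `_()` returns the translated text.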

Zygmunt Krynicki (zyga) wrote :

Thanks for the analysis, this seems solid. I need to check if the fix also works on non-CJK encodings/languages.

Out of curiosity, what is the output of `locale` on your system?

Changed in command-not-found:
importance: Undecided → Critical
summary: - [11.10 beta1] UnicodeDecodeError crash on simplified chinese input of
- fake command
+ [11.10 beta1] UnicodeDecodeError crash on localized input in multiple
+ encodings/languages
Dennis Chua (dmcvocation) wrote :

You're welcome. The command-not-found package is very useful. I'm happy to have helped; it was interesting diving into Python's facilities for multi-byte I/O and I18n/L10n/gettext. I hope this solves the problem comprehensively.

Here is my locale:

u@u-VirtualBox:~$ locale
LANG=zh_CN.UTF-8
LANGUAGE=zh_CN:en_US:en
LC_CTYPE=zh_CN.UTF-8
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE=zh_CN.UTF-8
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES=zh_CN.UTF-8
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=

Zygmunt Krynicki (zyga)
Changed in command-not-found:
status: New → In Progress
assignee: nobody → Zygmunt Krynicki (zkrynicki)
Zygmunt Krynicki (zyga) wrote :

This bug is actually caused by invalid handling of input (sys.argv), not output. When a byte string (in UTF-8) is combined with unicode strings (which are part of the translated system messages), a UnicodeDecodeError is raised because, by default, Python coerces the byte string to unicode assuming ASCII encoding.
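
The coercion described above can be reproduced outside command-not-found. The sketch below is written for Python 3, where mixing bytes and text raises a TypeError instead of coercing silently, so it performs by hand the ASCII decode that Python 2 attempted implicitly. The variable names are illustrative and not taken from the package.

```python
# The bogus command from this report (U+6211), as the UTF-8 bytes
# that a zh_CN.UTF-8 shell passes to the script.
raw_command = "\u6211".encode("utf-8")
print(raw_command)  # b'\xe6\x88\x91'

# Python 2 decoded byte strings as ASCII when mixing them with unicode text.
# Doing that decode explicitly fails on the first byte, 0xe6.
try:
    raw_command.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Decoding with the encoding the bytes were actually written in succeeds.
decoded_command = raw_command.decode("utf-8")
```

The message printed for the failed decode matches the exception quoted earlier in this report.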

A possible fix is to properly decode sys.argv arguments. I've tried this by hard-coding UTF-8 input but it would be nice to fix this in general too.
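
A minimal sketch of decoding arguments with the locale's encoding rather than a hard-coded one. This is not the code from the proposed branch: `decode_args` is a hypothetical helper, and it takes raw byte strings explicitly because Python 3 already decodes `sys.argv` itself.

```python
import locale


def decode_args(raw_args, encoding=None):
    """Decode byte-string arguments to text (hypothetical helper).

    When no encoding is given, use the one reported by the current locale.
    Undecodable bytes become U+FFFD so a bad argument cannot crash the caller.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return [arg.decode(encoding, errors="replace") for arg in raw_args]


# The command from this report as UTF-8 bytes, decoded explicitly as UTF-8.
args = decode_args([b"\xe6\x88\x91"], encoding="utf-8")
print(args == ["\u6211"])  # True
```

Using `errors="replace"` is a design choice for this sketch: a tool that only reports "command not found" arguably should degrade to a placeholder character rather than raise on malformed input.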

Changed in command-not-found:
status: In Progress → Triaged
Zygmunt Krynicki (zyga) wrote :

Fix proposed for merging. Anyone interested is free to review the branch and check that it actually fixes the problem on their system.

Zygmunt Krynicki (zyga)
Changed in command-not-found:
status: Triaged → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package command-not-found - 0.2.44ubuntu1

---------------
command-not-found (0.2.44ubuntu1) oneiric; urgency=low

  * merged lp:~zkrynicki/command-not-found/fix-839609
    LP: #839609
  * scan.data:
    - updated to current oneiric
 -- Michael Vogt <email address hidden> Tue, 20 Sep 2011 15:48:12 +0200

Changed in command-not-found (Ubuntu):
status: Confirmed → Fix Released
Dennis Chua (dmcvocation) wrote :

The screenshots of the bug reproduced using Simplified Chinese are attached.

Zygmunt Krynicki (zyga) wrote :

Dennis Chua: could you please upgrade command-not-found and confirm that the bug no longer occurs?

Dennis Chua (dmcvocation) wrote :

Reviewed this issue with Oneiric Beta2, updating command-not-found to 0.2.44ubuntu1. With the bogus Simplified Chinese test, command-not-found does not throw an exception.

However, the output text does not appear to coincide with the language encoding of the shell environment. Compare the following, Natty vs. Oneiric Beta2:

=== Natty. command-not-found 0.2.41ubuntu2 ===

u@u-VirtualBox:~$ 我
我:找不到命令

u@u-VirtualBox:~$ mgcc
未找到 'mgcc' 命令,您要输入的是否是:
 命令 'mlcc' 来自于包 'mlterm-tools' (universe)
 命令 'cgcc' 来自于包 'sparse' (multiverse)
 命令 'gcc' 来自于包 'gcc' (main)
 命令 'gcc' 来自于包 'pentium-builder' (universe)
mgcc:找不到命令
u@u-VirtualBox:~$

u@u-VirtualBox:~$ locale
LANG=zh_CN.UTF-8
LANGUAGE=zh_CN:en_US:en
LC_CTYPE="zh_CN.UTF-8"
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=

=== Oneiric Beta2. command-not-found 0.2.44ubuntu1 ===

u@u-VirtualBox:~$ 我
我: command not found

u@u-VirtualBox:~$ mgcc
No command 'mgcc' found, did you mean:
 Command 'mlcc' from package 'mlterm-tools' (universe)
 Command 'cgcc' from package 'sparse' (multiverse)
 Command 'gcc' from package 'gcc' (main)
 Command 'gcc' from package 'pentium-builder' (universe)
mgcc: command not found

u@u-VirtualBox:~$ locale
LANG=zh_CN.UTF-8
LANGUAGE=zh_CN:en_US:en
LC_CTYPE=zh_CN.UTF-8
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE=zh_CN.UTF-8
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES=zh_CN.UTF-8
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=

Dennis Chua (dmcvocation) wrote :

Clearly the changes addressed the Unicode decoding exception. Can we close this issue, and open a separate one for the mismatch in output encoding?

Reinis Zumbergs (reinis-zumbergs) wrote :

This fix solves my reported problem with Latvian special characters.
