Ubuntu

Problem in FSCK checking Chinese filename (Big5)

Reported by Oliver Li on 2006-06-10
Affects: dosfstools (Ubuntu)
Importance: Medium
Assigned to: Onno Benschop

Bug Description

When booting into Ubuntu 6.06, the system checks the FAT32 partition.

If a file has a Chinese filename containing certain characters, fsck renames the file to fsck0000.REN.

Examples are 功, 會 and 四: the Big5 encoding of each of these characters contains the byte for "\" or "|".
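
A minimal sketch of the collision described above, using Python's built-in "big5" codec. The helper name trail_byte_collisions is hypothetical, and the set of bad characters is copied from the bad_chars string in the patch quoted later in this thread; this is an illustration, not dosfsck's actual code.

```python
# ASCII characters treated as invalid in a (non-Atari) FAT name, copied from
# the bad_chars string in the patch quoted later in this thread.
BAD_CHARS = '*?<>|"\\/:'


def trail_byte_collisions(name, encoding="big5"):
    """Return (character, encoded bytes as hex, colliding ASCII character)
    for each double-byte character whose second byte is a "bad" ASCII byte."""
    hits = []
    for ch in name:
        raw = ch.encode(encoding)
        # Only double-byte characters can hide an ASCII byte in trail position.
        if len(raw) == 2 and chr(raw[1]) in BAD_CHARS:
            hits.append((ch, raw.hex(), chr(raw[1])))
    return hits


if __name__ == "__main__":
    for hit in trail_byte_collisions("功會四"):
        print(hit)
```

A checker that examines each byte independently sees only the second byte and mistakes it for a forbidden ASCII character.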

Onno Benschop (onno-itmaze) wrote :

sxian,

Thank you for your bug report. We have received one other report where a user reported files being renamed in a similar way. They reported that it only affected file names with a space " " in them.

I'm sorry, but to give us a meaningful answer, and because Ubuntu will fix any files before you can show a report, you'll need to do some extra work.

Can you please:
 * Boot into Ubuntu
 * run: sudo chmod -x /sbin/dosfsck
 * Boot into DOS/Windows
 * Create a file with an illegal character like you show
 * Boot into Ubuntu
 * run: sudo chmod +x /sbin/dosfsck
 * run: dosfsck -v -n <partition>

Then attach the output of the last command to this bug report.

Onno Benschop (onno-itmaze) wrote :

It appears as if this bug has now been reported twice. Marking the bug as confirmed. I'll also link the two.

Changed in dosfstools:
assignee: nobody → onno-itmaze
status: Unconfirmed → Confirmed

Stan (mdkrokz) wrote :

Onno,

Thanks a lot for working on the problem. I was doing a fresh Kubuntu 7.04 installation yesterday and ran into the same problem. It took me a while to figure out that it is actually a problem with fsck. It appears that once fsck has renamed the "bad file", there is no way for Windows to touch the renamed file; I can't even delete it in Windows. However, the renamed file opens in Kubuntu just fine. I haven't tried other file formats, but at least the renamed mp3 file plays just fine in Amarok. As mentioned by the original poster, this happens with only some of the (traditional) Chinese characters.

One thing that bothers me is that Kubuntu turns on fsck by default after installation. So, as far as I know, there is basically no way for me to turn fsck off before it damages some of the files on my FAT32 partitions. Do you know, by any chance, where I can turn it off during installation?

The following is the output I got after I created a file named "四_windows.txt" in Windows XP SP2 and ran dosfsck:

$ sudo dosfsck -v -n /dev/hdd1
dosfsck 2.11 (12 Mar 2005)
dosfsck 2.11, 12 Mar 2005, FAT32, LFN
Checking we can access the last sector of the filesystem
Boot sector contents:
System ID "mkdosfs"
Media byte 0xf8 (hard disk)
       512 bytes per logical sector
     16384 bytes per cluster
        32 reserved sectors
First FAT starts at byte 16384 (sector 32)
         2 FATs, 32 bit entries
  15352832 bytes per FAT (= 29986 sectors)
Root directory start at cluster 2 (arbitrary size)
Data area starts at byte 30722048 (sector 60004)
   3838159 data clusters (62884397056 bytes)
63 sectors/track, 255 heads
         0 hidden sectors
 122881122 sectors total
/fsck_problem/:5RR_windows.txt
  Bad file name.
  Auto-renaming it.
  Renamed to 000\0000000.\000XT
Checking for unused clusters.
Checking free cluster summary.
Leaving file system unchanged.
/dev/hdd1: 81 files, 37249/3838159 clusters

Thanks again for the help.

Hi, just got your message. I'm currently away from my computer but I'm
still working on this. I expect to do some more when I return next
week. Would you be able to test versions if I release them? Or could
you create a tiny FAT file system that shows the problem so I can test
stuff as I'm fixing it? If not I'd understand as well.

Emmet Hikory (persia) wrote :

Looking at dosfsck/file.c (around line 57), it appears that each character in the filename is expected to be independent. According to http://download.microsoft.com/download/1/6/1/161ba512-40e2-4cc9-843a-923143f3456c/fatgen103.doc, it appears that any of the eight or three characters in the 8.3 format may be DBCS character pairs, rather than single-byte characters (and as such, not unsigned char).

Ideally, dosfsck should take a codepage argument to assign the OEM codepage for the FAT32 filesystem, and expect either single-byte or DBCS encoding depending on the provided codepage.

Further, any automated execution of dosfsck (e.g. prior to resizing a partition in the installer) should use the default OEM codepage for the installation locale.
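
As a rough sketch of what a codepage-aware check could look like: the scan below skips the byte that follows a DBCS lead byte, so a trail byte is never tested against the bad-character list. The function name find_bad_bytes and the table DBCS_LEAD_RANGES are hypothetical; the lead-byte ranges are assumed from the Windows code page definitions and are not taken from dosfsck.

```python
# Lead-byte ranges for some double-byte OEM code pages. These are assumptions
# based on the Windows code page definitions, not values read from dosfsck.
DBCS_LEAD_RANGES = {
    932: [(0x81, 0x9F), (0xE0, 0xFC)],  # Japanese (Shift-JIS)
    936: [(0x81, 0xFE)],                # Simplified Chinese (GBK)
    949: [(0x81, 0xFE)],                # Korean
    950: [(0x81, 0xFE)],                # Traditional Chinese (Big5)
}

# Same characters as the bad_chars string quoted later in this thread.
BAD_CHARS = b'*?<>|"\\/:'


def find_bad_bytes(raw_name, codepage=None):
    """Return the offsets of bad characters in a raw short name.

    With a known DBCS codepage, the byte after a lead byte is treated as a
    trail byte and skipped. With codepage=None every byte is checked on its
    own, which is the behaviour the bug report complains about.
    """
    ranges = DBCS_LEAD_RANGES.get(codepage, [])
    bad = []
    i = 0
    while i < len(raw_name):
        b = raw_name[i]
        if any(lo <= b <= hi for lo, hi in ranges):
            i += 2  # skip lead byte and its trail byte together
            continue
        if b in BAD_CHARS:
            bad.append(i)
        i += 1
    return bad


if __name__ == "__main__":
    raw = "四_WIN".encode("big5")
    print(find_bad_bytes(raw))          # byte-wise check
    print(find_bad_bytes(raw, 950))     # Big5-aware check
    print(find_bad_bytes(b"BAD\\CHAR", 950))
```

With a codepage supplied, a genuine ASCII backslash is still reported, while the trail byte of 四 is not.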

Onno Benschop (onno-itmaze) wrote :

Determining the code page number has been the largest issue. It is as
far as I've been able to investigate not detectable on the file-system
itself, which means that the end user is expected to supply it. That
really doesn't work.

Whilst investigating this I toyed with the idea of reading the
config.sys file from the file-system to determine the codepage, but I
wasn't keen on the idea.

If you have alternative suggestions, I'm open to ideas.


Emmet Hikory (persia) wrote :

Unfortunately, the codepage cannot be determined automatically from inspection of the filesystem, although exhaustive search of the names of the files present may provide some guidance. In Windows, the codepage is set manually, and the set of valid codepages is defined at kernel compilation time. The codepage of a filename is set at the time of file creation, by the active codepage in the system, meaning that it is possible to create a FAT32 filesystem that is internally inconsistent by creating files with different DBCS encodings (which may require different installations, depending on the set of valid codepages for each installation).

While user input is not ideal, the other alternatives are to use a different default codepage for the selected language of the install (as is done for Windows), or to attempt to determine the default codepage setting for the installed version of Windows (e.g. reading CONFIG.SYS). Note that adding support for a user-selected codepage would be a significant improvement over the current state, where users must boot an alternate operating system to safely check their disks, and may need to manually partition: allowing the option to set a codepage would allow users to continue without the errors described, at the cost of some hassle.
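
The "exhaustive search of the names" idea could be sketched roughly as follows: try to decode every raw name under each candidate double-byte code page and keep the candidates that never fail. The function name plausible_codepages and the candidate list are hypothetical, and Python's codecs stand in for the Windows code pages. Several candidates will often survive, so this offers guidance at best, as noted above.

```python
# Candidate double-byte code pages, named by their Python codec aliases.
# The list is an assumption for illustration; a real tool might use others.
DBCS_CANDIDATES = ["cp950", "cp936", "cp932", "cp949"]


def plausible_codepages(raw_names, candidates=DBCS_CANDIDATES):
    """Return the candidates under which every raw name decodes cleanly."""
    plausible = []
    for codec in candidates:
        try:
            for raw in raw_names:
                raw.decode(codec)  # raises UnicodeDecodeError on invalid bytes
        except UnicodeDecodeError:
            continue
        plausible.append(codec)
    return plausible


if __name__ == "__main__":
    names = ["功.txt".encode("cp950"), "四_windows.txt".encode("cp950")]
    print(plausible_codepages(names))
```

Pure ASCII names decode under every candidate, so a filesystem without any double-byte names gives no information at all.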

Onno Benschop (onno-itmaze) wrote :

The largest problem I see with your suggestion, "have the user select the code-page", is that this is a choice they have never been required to make before. Under Windows they are unlikely to have come across this. Their administrator may have, but the average end user has not.

I suppose that an added command-line switch for the code page would be one work-around.

Another would be to leave file-names alone altogether and work on the actual structure of the drive, that is links and cross-links, rather than file-names.

We could repair those only if they fell outside "of their box" if you like, but I've not looked at the code in a little while, so my memory on the issue is not fresh.

Emmet Hikory (persia) wrote :

I suspect most users encountering this issue would be perfectly happy to have a manual process by which they would select the codepage, and not have it be default behaviour. The current state of the dosfsck tool is such that a user with a DBCS encoded FAT filesystem cannot install Ubuntu in a dual-boot configuration solely from the install CD: they must first repartition their drive using a separate DBCS-aware tool. For those with only one disk and OEM-provided "restore" CDs that do not provide a safe partitioning tool, this requires additional arrangements, such as removal of the drive for installation in another system, or purchase of dedicated partition management software.

Additionally, users are currently instructed to provide a codepage parameter in /etc/fstab when automounting FAT partitions (issues with specification of the codepage for other means of mounting the drives are similar, but unrelated bugs), as otherwise the filenames may be garbage characters, and unsuitable for use.

On the other hand, adjusting dosfsck to work on the structure directly, and leave the filenames alone would also work, although it would be unable to detect issues that may cause the filesystem to be unusable in other operating systems due to invalid filenames (although the native filesystem checker could be expected to resolve this if it occurs).

Liu Qishuai (lqs) wrote :

The backslash "\" is allowed in fat filenames, so this patch (against dosfsck/check.c) should work.

- char *bad_chars = atari_format ? "*?\\/:" : "*?<>|\"\\/:";
+ char *bad_chars = atari_format ? "*?/:" : "*?<>|\"/:";

Liu Qishuai (lqs) wrote :

Sorry, the character "|" should also be removed from bad_chars.

- char *bad_chars = atari_format ? "*?\\/:" : "*?<>|\"\\/:";
+ char *bad_chars = atari_format ? "*?/:" : "*?<>\"/:";

Onno Benschop (onno-itmaze) wrote :

Uhm, both these characters, '\' and '|' *are* bad characters as a
non-Chinese (Big5) FAT filename AFAIK.

If you have a reference that says otherwise, let me know and I'll have a
look at it.

Liu Qishuai (lqs) wrote :

The Linux kernel can handle any character in vfat.

Onno Benschop (onno-itmaze) wrote :

I don't doubt that you can use a bad character in a Linux
implementation. Also, the vfat module and dosfsck are not the same piece
of software and both read from the drive in completely different ways.

My point is that dosfsck is checking a FAT file system. The problem is
not that these characters are allowed in some cases, but as I understand
it, not in all cases - that is, under some code-pages those characters
are not allowed.

Trying to determine the correct code-page for the appropriate
file-system is the real issue. Once that has been achieved, then
determining which characters are "bad" for that code-page becomes rather
more simple.

Just arbitrarily allowing any old bad character defeats the process of
actually checking the file system.
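
To illustrate the trade-off being argued here, the sketch below models a byte-wise check with the original bad_chars set and with the set from the second proposed patch (both "\" and "|" removed). The names ORIGINAL_BAD, PATCHED_BAD and flagged are hypothetical, and this is a simplified model rather than the code in check.c.

```python
# Bad-character sets modelled on the strings in the proposed patch.
ORIGINAL_BAD = set('*?<>|"\\/:')
PATCHED_BAD = ORIGINAL_BAD - {"\\", "|"}


def flagged(raw_name, bad_chars):
    """Byte-wise check: every byte is treated as an independent character."""
    return any(chr(b) in bad_chars for b in raw_name)


samples = {
    "Big5 name": "四_WIN".encode("big5"),
    "ASCII name with backslash": b"BAD\\CHAR",
}

if __name__ == "__main__":
    for label, raw in samples.items():
        print(label, flagged(raw, ORIGINAL_BAD), flagged(raw, PATCHED_BAD))
```

In this model the patched set stops flagging the Big5 name, but it also stops flagging a literal backslash in a single-byte name, because a byte-wise check cannot tell the two cases apart.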

Liu Qishuai (lqs) wrote :

The "chkdsk" command in Windows doesn't check for bad characters. Why does dosfsck?

If users have bad characters in a filename that cause Windows to fail to access the file, they can rename it manually in Linux.

Onno Benschop (onno-itmaze) wrote :

The chkdsk command is running with a specified code-page because it's
running within DOS or Windows. I strongly suspect that you'll find that
if you change the code-page, all kinds of nasty things will happen.

Again, if you can provide an actual specification that indicates that
these characters are allowed across *all* code-pages, then the patch
you've suggested may be useful, but until such time I doubt that it will
be adopted.

Liu Qishuai (lqs) wrote :

No. I ran chkdsk on the USB disk that contains the "BAD\CHAR" directory. It reported no problems.

Onno Benschop (onno-itmaze) wrote :

When you ran chkdsk you were running DOS or Windows. This means that a
code-page is set for the operating system at that time. When you insert
a USB disk, it is also running under that code-page. When you then run
chkdsk it checks characters based on the current code-page.

It has just occurred to me that until now I did not actually have access
to a file-system that has "bad characters" on it and is by the look of
things from a Non-US code-page. So, if you are in fact running a Non-US
version of DOS/Windows, I'd appreciate it if you could create a dd-image
of a small USB thumb-drive with some "bad characters" on it, so I could
attempt some more testing.

Liu Qishuai (lqs) wrote :

This is a disk image. chkdsk on any codepage reports no problem.

Onno Benschop (onno-itmaze) wrote :

Excellent, thanks for supplying that.

Just to make sure:
* chkdsk was tested on all code-pages?
* can you please specify which specific code-pages you tested (and how you did the testing)

And I'm assuming that dosfsck incorrectly munches that file-system.

Liu Qishuai (lqs) wrote :

I switched the codepage using the "chcp" command and then ran chkdsk. The UI language of chkdsk changed.

C:\>chcp 936
活动的代码页: 936

C:\>chkdsk a:
文件系统的类型是 FAT。
卷序列号为 540C-0199
Windows 正在校验文件和文件夹...
已完成文件和文件夹校验。
Windows 已检查文件系统并确定没有问题。

磁盘空间总数 730,112 字节。
5 个文件夹: 5,120 字节。
2 个文件: 2,048 字节。
可用磁盘空间: 722,944 字节。

每个分配单元中有 1,024 字节。
磁盘上共有 713 个分配单元。
磁盘上有 706 个可用的分配单元。

C:\>dir a:
 驱动器 A 中的卷没有标签。
 卷的序列号是 540C-0199

 A:\ 的目录

2008-04-18 10:15 <DIR> 東
2008-04-18 10:16 <DIR> 嘰
2008-04-18 10:16 <DIR> BAD\CHAR
2008-04-18 10:18 <DIR> BAD
2008-04-18 10:20 531 chkdsk.txt
               1 个文件 531 字节
               4 个目录 722,944 可用字节

C:\>chcp 437
Active code page: 437

C:\>chkdsk a:
The type of the file system is FAT.
Volume Serial Number is 540C-0199
Windows is verifying files and folders...
File and folder verification is complete.
Windows has checked the file system and found no problems.

      730,112 bytes total disk space.
        5,120 bytes in 5 folders.
        2,048 bytes in 2 files.
      722,944 bytes available on disk.

        1,024 bytes in each allocation unit.
          713 total allocation units on disk.
          706 allocation units available on disk.

C:\>dir a:
 Volume in drive A has no label.
 Volume Serial Number is 540C-0199

 Directory of A:\

2008-04-18 10:15 <DIR> ?
2008-04-18 10:16 <DIR> ?
2008-04-18 10:16 <DIR> BAD\CHAR
2008-04-18 10:18 <DIR> BAD
2008-04-18 10:20 531 chkdsk.txt
               1 File(s) 531 bytes
               4 Dir(s) 722,944 bytes free
