Sharing duplicates

Bug #311818 reported by Windkracht8
26
This bug affects 2 people
Affects Status Importance Assigned to Milestone
DC++
Fix Released
Medium
Unassigned
LinuxDC++
Confirmed
Medium
Windkracht8

Bug Description

Unix systems have case-sensitive file-systems. The core however converts all file and directory-names to lower case. Which gives problems if files are shared with the same text and different cases.
example:
In the share there's a file fubar.txt and a file Fubar.txt. Then these two files will be hashed every time dc is started.

I don't know why all filenames are converted to lower case for Windows, but for other OS's it's not wanted. So I've added a possible fix which checks for OS.

Tags: filename
Revision history for this message
Windkracht8 (windkracht8) wrote :
Changed in linuxdcpp:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Steven Sheehy (steven-sheehy) wrote :

Windows DC++ team: What would be the impact of not converting to lower case for Windows as well?

Revision history for this message
Jacek Sieka (arnetheduck) wrote :

the problem is two-fold...
1) adc (explicitly) and nmdc (implicitly) are case insensitive so we can't allow files with different case in the share
2) the file system is case insensitive, so we can't try to write files that differ only in case. it is however case preserving so most of the time (when the filename comes from the filesystem itself), it shouldn't need to be lowercased...if however the filename can come from other sources (someone elses file list, user typing etc), something has to be done...

now, I don't remember how the hashmanager interacts with the rest of the code so this would need more investigation...

Changed in dcplusplus:
importance: Undecided → Medium
status: New → In Progress
Revision history for this message
Francois Botha (igitur) wrote :

Mind if I bump this issue a bit? We have a 7TB share, and not surprisingly, have a lot of these cases. It causes quite confusion among the users.

Revision history for this message
Windkracht8 (windkracht8) wrote :

Since I don't see everything becoming case sensitive for ... time. Here is another fix.
With this patch a second file with, case insensitive, the same name is ignored.

First the patch for the current linuxdcpp branch.

Revision history for this message
Windkracht8 (windkracht8) wrote :

And here the patch for the current DC++ branch.

Changed in linuxdcpp:
assignee: nobody → Windkracht8 (bartv-windkracht8)
Revision history for this message
Steven Sheehy (steven-sheehy) wrote :

I prefer your original patch, only issue was with the #ifdefs sprinkled throughout that cluttered the code. I suggest you create a wrapper func that contains the ifdef called getSharedFilePath()/getSharedFileName() that then calls Text::toLower() and Util::getFile*() as necessary for the target platform. Since arne said that we basically can't change the behavior of the windows code, this will allow us to keep the status quo for windows while adding our case sensitive code.

Arne, would this be acceptable? If a linux users serves up two files with the same case insensitive name to windows users, it should just fail when finished downloading and moving to the target destination, I believe. Or is your quote about adc and nmdc not allowing files with a different case need to apply to linux as well? If so, please explain the impact.

Revision history for this message
poy (poy) wrote :

the ADC spec forbids this: "each file or directory name must be unique in a case-insensitive context"

Revision history for this message
Steven Sheehy (steven-sheehy) wrote :

That doesn't really explain the impact of allowing case sensitive files to be shared for platforms that support it (which is nearly every platform except Windows). As Francois mentioned, it causes confusion when users share a file but it doesn't show up in the share.

Revision history for this message
Francois Botha (igitur) wrote :

It's doubly confusing. When you have foo.txt and FOO.txt, it's not just that the one is missing. It's also confusing because one day only foo.txt will show up in the file list, and a few days later only FOO.txt will show up. It seems to be quite random.

FYI, I wrote a utility that writes out a weekly report of new files added on our 16TB (and growing) share. With such a large share, it's impossible just to browse around to find new files. With this case sensitivity issue I keep picking up "new files", e.g. FOO.txt because in the previous week only foo.txt was in the file list.

Revision history for this message
Windkracht8 (windkracht8) wrote :

If you would share file's with the same case insensitive name, someone (on windows) downloading both files, would get an error.

The problem basically is with windows. Other file systems are case sensitive. So we need a solution for windows. How does the windows version have to handle case insensitive equal names?

Until the final solution, I suppose to commit the latest patch. At the moment case insensitive equal names are handled like this:
On file system:
 - FOO.bar with TTH: 111
 - foo.bar with TTH: 222
Will result in share:
 - FOO.bar with TTH: 222

And it will hash both file's every time linuxdcpp starts.

Revision history for this message
Francois Botha (igitur) wrote :

I disagree. This isn't only a Windows issue. See my previous comments. Both the 16TB user and I are on Ubuntu.

Revision history for this message
endator (endator) wrote :

I realize that the spec is clear on this subject, but I do not see that 'fixing' this issue would actually break the intention of the spec. It does however introduce the need in both win and linux client to be able to download and handle files with same name and path at the download location. Preferably by allowing the client to auto rename files to solve name conflicts (this would at the same time solve issues with incompatible file names between os:es and/or code-pages).

When the issue of virtual folders were introduced the possible name conflicts issue where discussed and the proposed solution there was quite simple, as long as the files have different TTH, show and share them both, if they have same name and TTH, choose one and show/share that file. I don't see why the issue covered by this bug report could not be handled similarly?

1d

Revision history for this message
poy (poy) wrote :

the impact of disobeying the spec? i'd rather not even think about it. producing invalid file lists could break various clients that implement the ADC spec to the letter - or they may simply completely discard such file lists. i don't know why the spec explicitly spells this out, but it does and has to be obeyed to.

instead of thinking of fictive scenarios based on a client that would produce invalid file lists, or code changes, the first focus of this discussion should be how to best ammend our current use of the protocol to allow case-insensitive dupes in a file list. here are some possibilities that spring to mind:

- add a new extension, with its own fourcc and all. this is a sure-fire way of not breaking anything and being perfectly compliant with the current spec. one flipside of this is that a client implementing this extension would have to manage 2 file lists, one with case-insensitive dupes and one without.

- use a new root in the file list, similar to the current "/", for example "/v2/" or "/extended/".

- change the version of the file list (the "version" parameter of the first tag) to "1.1" or "2.0".

- add a note in the ADC "recommandation" document saying that that particular phrase has been deprecated and should be ignored. but i doubt the ADC maintainers would allow such a frivolity...

- dynamically modify the name of a case-insensitive dupe before sharing it, for example by appending " (1)", " (2)", etc or underscores. this allows for no compat change; however, it could be bothersome to manage code-wise. this would be my favorite possibility if it were at all doable.

i don't know how practical these design changes are, and there may be other possibilities, but one has to be chosen for sure before thinking of going further.

regarding endator's last comment: no, DC++ only shares one of the files if 2 files with the same name but a different TTH are found in the same virtual directory. it would be nice if the solution adopted in this thread for case-insensitive dupes could also be generally applied to name dupes in merged directories.

Revision history for this message
Jacek Sieka (arnetheduck) wrote :

Any patches supporting renaming will gladly be accepted...assuming they handle the virtual directory duplicate filename as well. If the files share content (TTH), I think only one of them needs to be visible, preferably a deterministic choice between the offered names (now it's most probably random because hash maps are involved if I remember things correctly...)

Revision history for this message
Tehnick (tehnick) wrote :

Hi to all.

In EiskaltDC++ project we are using case-sensitive share now.
See patch to the current linuxdcpp trunk branch.

> 1) adc (explicitly) and nmdc (implicitly) are case insensitive so we can't allow files with different case in the share

I am not see hard correlation between the protocols and data content in the file list.

> 2) the file system is case insensitive, so we can't try to write files that differ only in case. it is however case preserving so most of the time (when the filename comes from the filesystem itself), it shouldn't need to be lowercased...if however the filename can come from other sources (someone elses file list, user typing etc), something has to be done...

At first, look here:
http://en.wikipedia.org/wiki/Comparison_of_file_systems#cite_note-note-36-70
http://en.wikipedia.org/wiki/Comparison_of_file_systems#Features
http://en.wikipedia.org/wiki/NTFS
As you can see this is only MS Windows trouble...

At second, DC client is should not think for the user what to do when a similar file already exists in the destination directory.

> the ADC spec forbids this: "each file or directory name must be unique in a case-insensitive context"

The file name does not matter, important is only TTH, which is really unique identifier. TTH is not depend on the file name. And this is its great advantage over torrents.

> If you would share file's with the same case insensitive name, someone (on windows) downloading both files, would get an error.

This is not DC-client problem on my opinion.

For example, file sharing using samba-server and various ftp-servers is provide access to all files, regardless of the symbols register in file names. And this is user problem, if he can not copy a few of them simultaneously...

So you are using unjustified artificial restriction, which only discourage users to engage in file-sharing...

Revision history for this message
poy (poy) wrote :

almost everything in this patch is wrong. in ascending order of importance:

- the naming of "stricmps" & "strnicmps" doesn't make sense; it sounds like you just wanted "strcmp" & "strncmp".

- i don't understand the whole CaseStringHash / CaseStringEq deal. why not just let the STL handle it?

- searching is still case insensitive, yet the bloom table has been made case sensitive; weird...

- and to the actual issue. it is beyond me why after all the discussion and rebuttals, you still went out of your way to re-propose Windkracht8's patch of producing invalid file lists.
you seem to be discussing an "artificial restriction"; as i wrote in my previous comment, this is not the point of this thread. you should have been arguing about this months ago when ADC was still being established. now it is here and can't be hacked around in such unintelligent ways. up until now, no client has had the guts of implementing the crazy idea of producing invalid file lists, so there is probably not much validation going on in client & bot software around; but don't be surprised if after your file lists get in the wild, such validation mechanisms do appear, rendering your file lists essentially useless.
it looks like you are willing to go with the 4th of my possibilities, which was ammending the "recommendation" file. properly expose your arguments to the ADC managers and hope they will side with you, then. but just making your client invalid without proper consulting, and daring to ask for the same changes to be brought over other clients, shows either a lack of comprehension or great insolence.

furthermode, i have enumerated several possibilities that could work, with the last one being my favorite as it involved no protocol-level change. it even got picked up again by Jacek in his last comment. why not implement it instead?

Revision history for this message
poy (poy) wrote :

implemented renaming in rev 3156.

because hashed files were previously stored in lower case, i had to upgrade the hashing version to 3. version 2 trees will be imported when upgrading to version 3, but not the file registry.

renaming should occur in the following situations:
- duplicates caused by directory merging.
- files that share the same name in a case-insensitive context but that are actually different in a case-sensitive context.

todo:
- directories that share the same name in a case-insensitive context but that are actually different in a case-sensitive context. this may have to be more contrived than file renaming; hopefully it's a less common situation...

it hasn't been well tested, so various share operations might break. i have been testing on Windows, on which support for a case-sensitive file system is minimal; feedback from Linux devs is very welcome to see how this can be improved.

Revision history for this message
poy (poy) wrote :

done the mentioned todo in rev 3157. directories get renamed when needed, and files they contain are correctly hashed; but i have no idea if they can be downloaded...

summary: - case-sensitivity in Unix systems
+ Sharing duplicates
Revision history for this message
poy (poy) wrote :

rev 3158: re-hashing won't actually be necessary on Windows >= Vista (as long as hashed files don't have a case-insensitive dupe, which is very unlikely on Windows).

i have also been able to test downloading with the above changes; everything seems to be going smoothly. :)

notes for Linux devs:
- make sure the same files are always renamed after a share refresh; otherwise, "file not found" errors are going to pop on peers trying to download renamed files. this is not an issue on Windows because the directory iterator always reads directories in the same order (eg A.txt always comes before a.txt, so a.txt will always be the one to be renamed to "a (1).txt"); but i don't know if opendir/readdir etc respect that on Linux.
- try to implement the hash upgrading logic in dcpp/HashManager.cpp:431, function upgradeFromV2. it's been left as a todo for now.

Changed in dcplusplus:
status: In Progress → Fix Committed
Revision history for this message
poy (poy) wrote :

Fixed in DC++ 0.810.

Changed in dcplusplus:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.