/rebuild does not remove all obsolete items from HashIndex

Bug #919424 reported by bsod
This bug affects 4 people
Affects     Status          Importance   Assigned to   Milestone
ApexDC++    New             Undecided    Unassigned    -
DC++        Fix Committed   Low          Unassigned    -

Bug Description

BCDC 0.790a.
Windows 7 SP1 x64.

Summary:
If you add a new share, hash the files, then remove it, issue a /rebuild and re-add the same share again, DC++ will not rehash the files. Furthermore, a /rebuild between adding/removing shares does not change the size of hashdata.dat or hashindex.xml.

Repro:
* add a new share. accept the default name. DC++ hashes the files.
* open your filelist to validate share is present.
* check the size of hashdata and hashindex.
* remove the share you just added.
* issue a /refresh and /rebuild.
* note the sizes of hashdata and hashindex have NOT changed.
* add the same share again, accepting the default name. DC++ does NOT hash the files.
* issue a /refresh. DC++ does not re-index the files from that share.

Tags: core
Revision history for this message
LoRenZo (lorenzo-mailbox-deactivatedaccount) wrote :

It would be great if the expired information were removed from both HashData.dat and HashIndex.xml. Since the .xml is human-readable, I can find basically all of my share duplicated in it, since I had a drive path change at some point. The point of /rebuild should be that only valid information is kept in these files.

Revision history for this message
iceman50 (bdcdevel) wrote :

I can confirm this in r3146 of DC++.

Revision history for this message
eMTee (realprogger) wrote :

In my test it purged the obsolete parts from both files.

Revision history for this message
eMTee (realprogger) wrote :

When you experience the problem, do you see a "Hash database rebuilt" message appear in the system log shortly after the /rebuild command is issued? That would indicate that the rebuild process at least started...

Changed in dcplusplus:
importance: Undecided → Low
Revision history for this message
LoRenZo (lorenzo-mailbox-deactivatedaccount) wrote :

Just issued the /rebuild command with the latest available release (r3161) and it seems that the issue is now resolved.
The expired entries were removed from HashIndex.xml, and the size of HashData.dat has noticeably decreased as well.

Revision history for this message
eMTee (realprogger) wrote :

Since the hashdata format version was recently changed, this may only be due to the automatic conversion. Some additional testing would clarify whether the issue is really gone. If not, my questions in comment #4 still apply...

Changed in dcplusplus:
status: New → Confirmed
Revision history for this message
LoRenZo (lorenzo-mailbox-deactivatedaccount) wrote :

I rebuild the hash database from time to time, and with all previous releases I saw the "Hash database rebuilt" message in the System Log, but the invalid entries were not removed despite my attempts.
I must assume that the earlier releases had not contained any related changes in the source code for a long time - unlike the current one, as you mentioned as well - which is definitely why it works now but did not work before.
Nevertheless, this was the first time it completed successfully, the way I have always expected it to.

Revision history for this message
eMTee (realprogger) wrote :

The question still remains: after this one hopeful result, can you reproduce the bug the way the reporter described?

Revision history for this message
LoRenZo (lorenzo-mailbox-deactivatedaccount) wrote :

Ok, it seems that it does not work as flawlessly as I described previously.

First, I would like you to know that the first time I started this build after the upgrade, the loading screen was stuck at "loading hash data" for about 3 minutes. (Memory usage was around 10 MB during this, so the client was only about halfway through startup at that point - I guess this is normal behaviour given the newly implemented changes.)
After that I issued the /rebuild command, which cleared up the expired entries as I mentioned in my earlier post.

I have now gone through the reproduction steps from this report, which is the confirmation you were looking for.

I added a new folder and waited until the hashing procedure had fully processed it. I removed it from the share and issued a /rebuild. Once it completed, I had to conclude that the entry was still present in HashIndex.xml. Since my older entries, which were cleaned up in the first place, belonged to a currently non-existing path, I tried renaming the folder to see what happens after another rebuild. The entry was still in the file - though there were some changes to its size. (Note that nothing was added to, removed from, or completed in the download list in the meantime.) Afterwards, I moved the folder to a different location, rehashed the data and rebuilt my list, but the previous entry was still there. After this I removed the newest entry and issued another /rebuild command. No changes were to be found after all this.

Since the first startup took longer, I gave it another try (restarted the client). The startup took slightly less time than before, and the files were still listed in HashIndex.xml. (Note that no /rebuild had been issued at this point.)
Once the client had started completely, I double-checked the share information to see whether the newly added folder was listed there - but of course it was not, since I had removed it prior to the restart. After this I gave the rebuilding another try: once it completed, I found that the newly created, expired entries were once again removed completely.

In conclusion, it seems that a client restart is required before the /rebuild command can be made to work properly.

Additional note: during my current trials I always saw the "Hash database rebuilt" notification in the System Log, too, and I also observed some changes in the sizes of the involved files.

Revision history for this message
eMTee (realprogger) wrote :

Nice find, because the problem wasn't reproducible every time. Maybe the restart is the key...

About "I have started this build after the upgrade, the loading screen was stuck at the "loading hash data" " happened probably because the hashdata had been converted and checked in result of the recent fix for bug #311818...

Revision history for this message
LoRenZo (lorenzo-mailbox-deactivatedaccount) wrote :

Well, it seems that you are right about the restart being the key, because I followed the same reproduction steps described in my last post, this time using r3146, and everything happened in exactly the same way, even though that release does not contain the new hash/share related implementations.

Revision history for this message
maksis (maksis) wrote :

Rebuild won't delete any trees that are used at some point during the session. That seems to be an intentional design choice that simplifies the code.

While I'm not sure if the behavior described in this post causes any *real* problems that would justify complicating the code, there's a bigger problem related to this design. When a user moves a file from one shared directory to another, the tree stays shared. Since the client stores each hashed file path separately, it will never delete the old paths from the database as long as the tree stays shared in another location. DC++ keeps all file paths in memory, so this slowly keeps increasing memory usage and start-up time.
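
To illustrate this design, here is a minimal sketch; the names and structures are illustrative only, not the actual DC++ code. The keep/discard decision during a rebuild is keyed on the TTH alone, so a stale path survives as long as any shared file still carries the same hash:

    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // Illustrative stand-ins; the real DC++ structures differ.
    using TTH = std::string;                                    // Tiger tree root hash
    using HashIndex = std::map<TTH, std::vector<std::string>>;  // hash -> every path ever hashed

    // Sketch of the described rebuild decision: a tree is kept if its hash is
    // still shared somewhere or was used at some point during the session.
    // No individual path is ever tested, so the old paths of a moved file
    // accumulate while the new copy keeps the same TTH "alive".
    bool keepTree(const TTH& tth,
                  const std::set<TTH>& sharedHashes,
                  const std::set<TTH>& usedThisSession) {
        return sharedHashes.count(tth) != 0 || usedThisSession.count(tth) != 0;
    }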

I've changed the hash database format in AirDC++ 2.50, which allows comparing each path in the hash database with the share. When I gave the new version to a user who had been experiencing various performance issues, it revealed that he had more than 16 million unused file entries in the database. This had increased his memory usage by several gigabytes (it used to be 5-8 gigabytes with the old version) and the size of his HashIndex.xml was more than 4 gigabytes. Of course, this also means that he had moved his files around quite a lot, because he was "only" sharing 5 million files, but he had probably been using the same database since the time when hashing support was first implemented.

Fredrik Ullner (ullner)
tags: added: core
Revision history for this message
Davidson (david.son) wrote :

So was this ever fixed? Because here I am all these years later using 0.867 x64 (portable) on Windows 10 v1709 (16299.371) and I'm experiencing the exact same issue.

I don't know whether HashData.dat is being cleaned of invalid entries or not, but certain invalid entries are definitely NOT being purged from HashIndex.xml no matter how many times I issue a /rebuild or restart the client.

These are entries for files that have been renamed (in some cases I have 3 sets of entries with the same hash, because I happened to have renamed the files twice). There are also entries for dummy 0-byte files that I create as fillers in directories to indicate missing or future files in a series. As and when I get the actual files, I naturally get rid of their corresponding 0-byte substitutes, but their entries are never purged from HashIndex.xml.

I think the common link is that all older invalid entries have hashes that are still valid. Let me explain with a couple of examples:

1) I rename/move a file - Now there are entries for the old name/path with hash XYZ and for the new name/path with the same hash XYZ.

2) I delete a 0-byte dummy file and replace it with an actual, non-zero-byte file. However, there are still other 0-byte dummy files left - In this case there are entries for the old 0-byte file, and for the new actual file.

In both cases, during a HashData/HashIndex rebuild, DC++ refuses to remove invalid name/path entries purely on the basis of whether their hashes still match valid entries, and NOT on the basis of whether a file with that name or path actually exists. This is obviously the wrong approach. FIRST the file name/path should be checked for existence, and if that test fails, DC++ should purge the entry outright. ONLY if that first test passes should the hash then be checked, and updated if required, during a rebuild. A sketch of this check order follows below.
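
To make the proposed check order concrete, here is a minimal sketch; the entry layout and names are assumptions for illustration only, not the actual HashIndex code (it uses C++17's std::filesystem):

    #include <filesystem>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical, simplified index entry.
    struct IndexEntry {
        std::string path; // absolute path recorded at hashing time
        std::string tth;  // root hash of the file's tiger tree
    };

    // Proposed rebuild filter: existence first, hash second. An entry whose
    // file is gone is purged even if another shared file has the same TTH.
    std::vector<IndexEntry> rebuildIndex(const std::vector<IndexEntry>& entries,
                                         const std::set<std::string>& sharedTTHs) {
        std::vector<IndexEntry> kept;
        for (const auto& e : entries) {
            if (!std::filesystem::exists(e.path))
                continue; // test 1 failed: file is gone, purge unconditionally
            if (sharedTTHs.count(e.tth) == 0)
                continue; // test 2 failed: hash no longer shared, purge
            kept.push_back(e); // both tests passed: keep (re-check hash if needed)
        }
        return kept;
    }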

Given the numerous advantages of having a slimmed down Hash DB sans all invalid entries, please implement this change/fix ASAP. Thanks!

Revision history for this message
eMTee (realprogger) wrote :

If anyone still cares, I suggest splitting this conversation and moving comments #12 and #13 to a new feature wishlist entry, as the original report here is about the hashdata.dat cleaning process not starting in some cases - which is still an issue with DC++ 0.867.

IIRC the error itself happens because of an incomplete refactoring of the threading code a few versions before the original report was filed. The /rebuild process does not start (it is blocked) if a manual or automatic share refresh has already started beforehand.
I am surprised that this conclusion has not been added to this thread yet - maybe a separate duplicate entry for this issue exists?

Revision history for this message
Davidson (david.son) wrote :

@eMTee (realprogger): I did think long and hard about whether I should file a new bug report for this, but ultimately decided the issue was related/similar enough to this one that adding to it might raise its profile (the so-called "bug heat" certainly went up by a few notches). If you want, I can open a new bug report and copy-paste my comments, or perhaps it's better to wait for this one to be split?

Honestly though, given your "If anyone still cares" comment and the fact that most DC++ client development (Strong, Apex etc.) seems dead and buried, realistically is there any chance at all of this being fixed? Especially considering the original bug report is 6+ years old now, and there are certainly bugs of higher importance that have been languishing for even longer.

In short, is DC++ development mostly limping along on life support these days? It certainly seems that way. Only maksis still seems to be maintaining Air somewhat, and even there basic things like searching in one's own file list have been broken for years with no fix in sight. A truly sad state of affairs indeed.

Revision history for this message
eMTee (realprogger) wrote :

@Davidson: I think you answered your own questions yourself. If this problem hits you hard then instead of waiting I'd rather suggest switching to AirDC++.

This bug thread isn't the right place to discuss DC++ development, though, so if you have further questions or thoughts, please use the more appropriate places: the Answers section on this site, the forum over at dcbase.org, or the dev hub at adcs://hub.dcbase.org:16591. Thank you!

Revision history for this message
Davidson (david.son) wrote :

@eMTee (realprogger): Thank you for pointing me to dcbase.org; I was unaware of that site.

Re. AirDC++, that is indeed one of the clients I'm actively considering switching to, especially since it's the only one whose developer(s) still seem to be around to respond to queries and hopefully fix bugs at least (I don't expect new features at this point in any case).

However, my one quibble with Air is how its devs prefer to do their own thing with respect to 'standard' DC++ files/formats. For example, instead of sticking to a simple, easily parsable and externally editable single Queue.xml file, it creates separate 'bundle' XMLs with random names. Or, as in this particular case, how the Hash DB issue has been 'solved' by completely dumping HashData.dat and HashIndex.xml and adopting some sort of Air-specific binary (I think) format. Standard files naturally make it easy to move between clients, but the way Air does things seems a bit proprietary and locked-in to me.

That is why I am hoping the DC++ devs will get around to fixing this and a few other big issues at least. If not, I guess Air it will be.

Revision history for this message
maksis (maksis) wrote :

You make it sound like the file formats were changed without proper reason. No, the hash database format wasn't changed because of this issue (I just wanted to report my findings here while doing the rewrite). Dealing with a hash index XML file that could be several gigabytes in size and that is regularly rewritten on disk (or read into memory) simply became too problematic. The same applied to storing the whole download queue in a single file vs. splitting the data into several XML files, which allows only the changed ones to be rewritten on disk.

I've also written several posts about those changes:

https://www.airdcpp.net/forum/viewtopic.php?f=4&t=4007
https://www.airdcpp.net/forum/viewtopic.php?f=4&t=1856

As written in many introductions, AirDC++ is designed to work smoothly even if the user is sharing hundreds of terabytes of data or more than 10 million files. If you don't see any difference compared to DC++ and you prefer compatibility with the old file formats instead, it may not be the right client for your use case.

Revision history for this message
Davidson (david.son) wrote :

@maksis: TBH I'm not completely averse to the changes made; the only issue is that they're specific to Air. Given the advantages, one would have expected the changes to be introduced upstream rather than have Air diverge from all other DC++ clients. Perhaps an attempt was even made in this regard but rejected by others... I'm not sure. Anyway, it does mean that moving away from Air involves subjecting one's drive to many hours of torture again to recreate the Hash DB in a format that other clients find acceptable. Of course your motive is to make it easy to move *to* Air and not away from it (hence the import feature for the old DB format), but still...

Anyway this is all going off-topic now so I don't want to continue here. My primary motive was simply to see if this bug could be fixed in DC++ and other clients that use the old DB format. If not it's good to know that there's at least one client like Air that has a solution, no matter how proprietary.

Revision history for this message
maksis (maksis) wrote :

"Of course your motive is to make it easy to move *to* Air and not away from it (hence the import feature for the old DB format), but still..."

My motivation was to make it possible for people using older versions of the application to upgrade without losing their data...

Regarding comment #14:

"If anyone still cares I suggest splitting this conversation and moving comments #12 and #13 to a new feature wishlist entry as the original report here is about the hashdata.dat cleaning process does not start in some cases"

The way I understand the original report, the rebuild was performed but it didn't produce the expected results. That's what I tried to explain in comment #12. If a tree has been used earlier during the session, it won't be removed during a rebuild. You need to restart the application after removing the files (and all other files with the same TTHs) from the share before performing a rebuild.

Revision history for this message
Davidson (david.son) wrote :

@20: "You need to restart the application after removing the files (and all other files with the same TTHs) from share before performing rebuild."

That's the problem right there. Files one has moved and/or renamed but that are still shared shouldn't need to be deleted (or moved out of the share) completely in order for their older invalid entries to be purged. The program should be doing all the work during a rebuild instead of relying on users to do things manually.

Since a rebuild is user initiated the program should check each file/path's existence, and non-existent ones should be simply purged irrespective of whether any other entries have the same hash.

Is there some complication that has prevented this change from being made for so many years, I wonder, or is it just that there has been no one to work on it?

Revision history for this message
eMTee (realprogger) wrote :

It's been a long time, but it looks like I've finally been able to figure out this problem.

Regarding the original report, the part of the complaint about hashdata.dat is invalid. It is rebuilt correctly, but the initial size of the binary file is 1 MiB, so it will never shrink below that. With a hashdata file larger than that, it works as expected.

The reason why rebuilding is more effective after a restart is well explained by maksis, and a restart is still the best practice: when shared files are removed, their hash information is not removed from the internal memory map containing the hash indexes. At restart the memory map is freshly synced with what is found in the filesystem, so checking which items are in use and which are obsolete is much more effective at that point.

Therefore, if you remove files from the share, rehash and then do a rebuild, the rebuild operation will not make your hashindex or hashdata file any slimmer unless you still have obsolete data from items removed in previous sessions.

Regarding what is removed from hashindex and what isn't, david.son is right, as is his recommended logic for the solution. Currently only items with an unshared TTH are removed.

The puzzling thing is that the code responsible for this is quite logical and would easily allow doing what david.son expects; see the sketch below.
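
As a rough, hypothetical sketch of how small the change could be (isTTHShared is a stand-in name for the existing share lookup, not the real function):

    #include <filesystem>
    #include <string>

    // Stand-in for the existing share lookup; stubbed out here only to
    // keep the sketch self-contained.
    bool isTTHShared(const std::string& tth) { return false; }

    // Current behaviour (sketch): keep = isTTHShared(tth);
    // The small, logically fitting change: also require that the recorded
    // path still exists on disk.
    bool keepEntry(const std::string& tth, const std::string& path) {
        return isTTHShared(tth) && std::filesystem::exists(path);
    }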

The first version of the rebuild code that actually does something with hashindex (and not just with hashdata) was added in https://bazaar.launchpad.net/~dcplusplus-team/dcplusplus/trunk/revision/545
It has been refactored and simplified several times since, but the feature we are missing here was never added - even though it would have been a very small and logically fitting change...

Why? I'm not sure. Maybe it was a simple oversight, or rather, back in the day, the exceptionally talented people who created and shaped DC++ thought it better to have a somewhat larger index file, with items kept for possible reuse, than to risk re-hashings. That may have been quite logical 20 years ago, considering the CPU and storage speeds and capacities of consumer computers at the time. An average share consisted of a few thousand files back then...

I'll test the fix on larger shares and will most probably add it to the next version of DC++. Testers are welcome, if there's still anyone who cares...

summary: - /rebuild does not update HashData
+ /rebuild does not remove all obsolete items from HashIndex
eMTee (realprogger)
Changed in dcplusplus:
status: Confirmed → Fix Committed