Protocol suggestion: search metadata

Bug #211753 reported by eMTee
This bug affects 1 person
Affects: DC++
Status: Expired
Importance: Wishlist
Assigned to: Unassigned

Bug Description

I don't know whether I am asking this in the right place, but if someone could tell me where to post this, I'd be happy to do so.

I have lots of ebooks, mostly PDF files. Many of them are scientific papers and have lengthy titles by nature. It's hard and cumbersome to make a file name that contains everything that describes the article/book. The 'description' field in the PDF properties would be a good place to put all of this information, but we could do even better, such as expanding the 'keywords' section of the description with things like date, ISBN, journal reference number, etc. This could be done for other file types as well. Lots of PDFs already have this metadata in them, so being able to just search that metadata would be great.

Another thing I've wanted to do is search inside an archive or compressed file. I know it would probably be quite a CPU load to scan through every compressed file for the metadata. Maybe make that optional, for people with lots of CPU power to do the task.

I guess where this suggestion needs to go is wherever the authors of DC software discuss protocol improvements. Where is this? Can someone direct me there?

Mr. X XA

eMTee (realprogger)
Changed in dcplusplus:
importance: Undecided → Wishlist
Jacek Sieka (arnetheduck) wrote :

http://adc.sf.net for (concrete) protocol suggestions...marking as incomplete until someone writes an actual suggestion...

Changed in dcplusplus:
status: New → Incomplete
Neil Harding (8-launchpad-nharding-co-uk) wrote :

I am planning on adding metadata to the filelist, and I was thinking of using key:value pairs. So metadata = "Date:2008-09;{Contents:xxxxxxxxxxxx};Source:New Scientist;" could be attached to a file. Fields inside {} are computer generated and would not be entered by hand in searches, but could be used to hold details about the files inside a compressed file. I was planning on having an external program that could be run for different file types to produce some metadata automatically, e.g. mp3 metadata might be "Song:<Title>;Rate:128Kbps;Tracks:<1|2>;Album:<Title>;Genre:<Pop|Classical|...>;". You could edit the data by hand, but the client could also get the metadata from another user who has set the metadata (or has more complete metadata). For example, you have a file yellow.mp3 {metadata Song:Yellow Submarine;} and someone else has a file with the same TTH called YellowSubmarine.mp3 with {metadata Song:Yellow Submarine;Group:Beatles, The;}. The client can see that you don't have the metadata field Group and would add it to your metadata (optional, depending on your settings).
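The merge step described in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, pairs are assumed to be separated by semicolons (so that a value such as "Beatles, The" may contain a comma), and the computer-generated {} fields are not handled.

```python
def parse_metadata(text):
    """Parse 'Key:Value;Key:Value;' into a dict (hypothetical helper)."""
    fields = {}
    for pair in text.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = pair.partition(":")
        if not sep:
            continue  # skip malformed pairs that have no colon
        fields[key.strip()] = value.strip()
    return fields


def merge_metadata(mine, theirs):
    """Add fields that are missing locally; never overwrite existing ones."""
    merged = dict(mine)
    for key, value in theirs.items():
        merged.setdefault(key, value)
    return merged


def format_metadata(fields):
    """Serialize a dict back into the 'Key:Value;' form."""
    return "".join(f"{key}:{value};" for key, value in fields.items())


# Two files with the same TTH but different amounts of metadata.
mine = parse_metadata("Song:Yellow Submarine;")
theirs = parse_metadata("Song:Yellow Submarine;Group:Beatles, The;")
merged = merge_metadata(mine, theirs)
print(format_metadata(merged))  # Song:Yellow Submarine;Group:Beatles, The;
```

Keeping the local value when both sides define a field is one possible policy; the comment does not say how conflicts should be resolved.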

Pseudonym (404emailnotfound) wrote :

I hope you're not going to use that kind of format. The filelist is already XML, so it should be extended with either additional attributes or nested tags if you need more functionality. Something like this might be too bloaty for DC++, but things like this would be a good thing to have in some mods, at least for prototyping. The ADC protocol will support stuff like this without issue in search results, though we'd need to come up with the 2-character identifiers and make sure they do not clash.
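To make the "nested tags" alternative concrete, here is a minimal Python sketch that attaches metadata to a filelist File element as child elements. The Meta tag and its Key attribute are hypothetical names chosen for illustration; they are not part of the existing filelist format, and the ADC search-result identifiers mentioned above are not modelled.

```python
import xml.etree.ElementTree as ET


def add_metadata(file_elem, fields):
    """Attach each field as a nested <Meta Key="..."> element (hypothetical tag)."""
    for key, value in fields.items():
        meta = ET.SubElement(file_elem, "Meta", {"Key": key})
        meta.text = value
    return file_elem


file_elem = ET.Element("File", {"Name": "YellowSubmarine.mp3"})
add_metadata(file_elem, {"Song": "Yellow Submarine", "Group": "Beatles, The"})
xml_text = ET.tostring(file_elem, encoding="unicode")
print(xml_text)

# A client that understands the extension reads the fields back;
# a client that does not can simply ignore the unknown child elements.
parsed = ET.fromstring(xml_text)
fields = {m.get("Key"): m.text for m in parsed.findall("Meta")}
print(fields)
```

Because the XML layer handles escaping, values can contain commas, semicolons, or braces without any extra delimiter rules.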

Neil Harding (8-launchpad-nharding-co-uk) wrote :

I was thinking of a metadata tag in the filelist since it would require the minimum amount of changes. I know the filelist is XML, but metadata can be arbitrary tag values. I was thinking of adding the metadata to the filename for searches with old clients, so even an old client (or branch) can search metadata, although it won't be able to provide metadata for its own files.

darkKlor (gav135) wrote :

I don't think the file list should be touched. I also think the TTH attribute should be renamed to Hash, and the root element should contain a HashAlgorithm attribute where the hash is defined. Oh, and the file list should include a date modified attribute. But back to the topic... once you start adding metadata, it's a question of where do you stop? And why add it to the file list? This in no way aids searching. We have a hash index which is separate from the file list and is scanned when a search for a TTH root is made. Likewise, any metadata should be stored separately too for general usage. But there are too many options for what you store metadata on, and XML is not the correct solution if you want to store a wide variety of such data. People will find themselves with metadata files containing hundreds of megabytes, and searching them will be a very slow operation. Any serious metadata scanning, e.g. that in Windows Vista, uses hashing and databases to make searching fast. This is the scalable approach. It is also a pain to implement.
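One way to read the "hashing and databases" point is a metadata store kept apart from the file list, keyed by TTH and indexed for lookup. The sketch below uses SQLite from the Python standard library; the table layout, column names, and placeholder TTH strings are assumptions made for illustration, not anything DC++ implements.

```python
import sqlite3

# In-memory database for the sketch; a client would presumably use a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata ("
    "tth TEXT NOT NULL, field TEXT NOT NULL, value TEXT NOT NULL)"
)
# The index lets lookups by (field, value) avoid scanning every row.
conn.execute("CREATE INDEX idx_field_value ON metadata (field, value)")

rows = [
    ("PLACEHOLDER-TTH-1", "Song", "Yellow Submarine"),   # placeholder hashes,
    ("PLACEHOLDER-TTH-1", "Group", "Beatles, The"),      # not real TTH roots
    ("PLACEHOLDER-TTH-2", "Source", "New Scientist"),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)


def find_by_field(conn, field, value):
    """Return the TTH roots of files whose metadata field equals value."""
    cur = conn.execute(
        "SELECT DISTINCT tth FROM metadata "
        "WHERE field = ? AND value = ? ORDER BY tth",
        (field, value),
    )
    return [row[0] for row in cur]


print(find_by_field(conn, "Group", "Beatles, The"))  # ['PLACEHOLDER-TTH-1']
```

Exact-match lookups like this are the cheap case; substring or full-text search over values would need a different index.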

Bottom line: if you want metadata related to your files, build a web site connected to a database, export all your metadata to your database such that it's searchable on your web site. Then let the web site generate magnet links to your files and tell the user which hub to be connected to.
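For the web site approach, generating a magnet link from a file's hash, size, and name could look like the Python sketch below. The parameter layout (xt=urn:tree:tiger, xl for size, dn for display name) follows the convention commonly used for TTH magnet links; the function name and the percent-encoding of the name are assumptions.

```python
from urllib.parse import quote


def make_magnet(tth, size, name):
    """Build a TTH magnet link (hypothetical helper; name is percent-encoded)."""
    return f"magnet:?xt=urn:tree:tiger:{tth}&xl={size}&dn={quote(name)}"


print(make_magnet("26UE3RC5PPFUOOA2Q3VSKK3KYHPMVUQ6F37BTWQ", 183908, "photo.jpg"))
```

The link carries no hub information, which is why the site would still have to tell the user which hub to be connected to.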

Neil Harding (8-launchpad-nharding-co-uk) wrote :

The way I was envisioning it would be a few metadata tags per file. Using the example above, the Beatles file YellowSubmarine.mp3 with {metadata Song:Yellow Submarine;Group:Beatles, The;} would appear as file YellowSubmarine.mp3 {Song:Yellow Submarine;Group:Beatles, The;}, so that an old client could search using the search string Group:Beatles, or even just Beatles, and it would still be found. Adding a website is overkill, and would not help regular users when they are doing searches. I was also thinking that you could configure a small program that would produce the metadata for a filetype, so for mp3 you might specify c:\mp3info.exe %1. This would enable some metadata to be autopopulated (although not all mp3 files have ID3 tags in them). It would also be possible to use Bitzi to populate the metadata in batch.
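The per-filetype helper program idea can be sketched as a table mapping extensions to command templates, with %1 replaced by the file path. Everything here is hypothetical: the EXTRACTORS table, the function name, and the stand-in command. A real configuration might use ["c:\\mp3info.exe", "%1"]; a Python one-liner is used instead so the sketch runs anywhere.

```python
import subprocess
import sys

# Hypothetical configuration: extension -> command template, "%1" = file path.
EXTRACTORS = {
    "mp3": [sys.executable, "-c",
            "import sys; print('File:' + sys.argv[1] + ';')", "%1"],
}


def extract_metadata(path):
    """Run the configured helper for this file type and return its output."""
    ext = path.rsplit(".", 1)[-1].lower()
    template = EXTRACTORS.get(ext)
    if template is None:
        return ""  # no helper configured for this file type
    command = [path if arg == "%1" else arg for arg in template]
    result = subprocess.run(command, capture_output=True, text=True, timeout=10)
    if result.returncode != 0:
        return ""  # treat a failing helper as "no metadata"
    return result.stdout.strip()


print(extract_metadata("yellow.mp3"))        # File:yellow.mp3;
print(repr(extract_metadata("notes.txt")))   # ''
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.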

darkKlor (gav135) wrote :

OK, firstly, the brace character '{' is not in the Windows invalid path character list, so you cannot use it to indicate the beginning of metadata, because you can have a file named 'doc { sometext }'. On Windows the invalid path characters include the asterisk, pipe, backslash, forward slash, colon, less than, greater than, and question mark.
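The delimiter objection is easy to check with a short Python sketch. The reserved set below is the usual list of characters Windows forbids in file names (the ones named above plus the double quote); the helper names are hypothetical, and other Windows naming rules (reserved device names, trailing dots or spaces) are deliberately ignored.

```python
# Characters Windows does not allow in a file name.
WINDOWS_RESERVED_CHARS = set('<>:"/\\|?*')


def has_only_legal_chars(name):
    """True if no character in name is reserved or a control character."""
    return all(ch not in WINDOWS_RESERVED_CHARS and ord(ch) >= 32 for ch in name)


def unambiguous_delimiters(candidates):
    """Keep only candidates that can never appear in a Windows file name."""
    return [ch for ch in candidates if ch in WINDOWS_RESERVED_CHARS]


# Braces are legal, so this is a perfectly valid file name and a
# '{'-based metadata marker cannot be told apart from it.
print(has_only_legal_chars("doc { sometext }"))   # True
print(unambiguous_delimiters("{}|:"))             # ['|', ':']
```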

Now, adding the metadata to the file name string to enable old clients to search is a BAD idea. How does a client that expects each file element to represent a valid path name display the non-extended path instead? This may not be an issue in file transfer, where we use named roots based on the session hash, but it IS an issue when we wish to display a file list, because the old client will not know to trim the metadata from the list displayed to the user. In any case this would require a substring operation for every single file, whether the client was using the metadata or not. Also, for the search command, the client handling the search request would have the added overhead of having to split the metadata for every single file into name/value pairs before performing the search. If the client did not do this then, using your example, a search for the string 'group' would return the song Yellow Submarine, which clearly has no relation to the word 'group'. This adds significant overhead to the processing requirements of a client. On a large hub, with a couple hundred users, where search requests are flying around every few seconds, this would quickly raise the CPU and memory requirements of DC++ significantly.
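The false-positive problem can be reproduced directly. In this Python sketch (function names are hypothetical, and the entry string follows the format proposed earlier in the thread), a plain substring search matches the key name 'Group', while a search that first splits the metadata into name/value pairs does not, at the cost of extra work per file.

```python
ENTRY = "YellowSubmarine.mp3 {Song:Yellow Submarine;Group:Beatles, The;}"


def naive_match(entry, term):
    """What an old client effectively does: substring search on the whole string."""
    return term.lower() in entry.lower()


def value_match(entry, term):
    """Split off the metadata and search only the file name and the values."""
    name, _, meta = entry.partition(" {")
    meta = meta.rstrip("}")
    # Keep the text after the first colon of each non-empty pair.
    values = [pair.partition(":")[2] for pair in meta.split(";") if pair]
    return any(term.lower() in text.lower() for text in [name] + values)


print(naive_match(ENTRY, "group"))    # True  (false positive on the key name)
print(value_match(ENTRY, "group"))    # False
print(value_match(ENTRY, "beatles"))  # True
```

The splitting in value_match is exactly the per-file, per-search overhead the comment objects to.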

Jacek Sieka (arnetheduck) wrote :

while I don't envision this in dc++ proper any time soon, I'd probably do as pseudo suggests and put the metadata in the xml file and with 2char codes for search results...anything else feels like yet another useless encoding...

why should the tth attrib be renamed? what if we add/change hash algo and want to supply both for backwards compatibility?

darkKlor (gav135) wrote :

yeah. back compat is an issue anyway with changing the tth attrib name. the reason for renaming it would be to make it more generic. since identifiers are named roots i think such a 'hash' attrib should follow the form of <algorithmName>/<Hash> e.g.
<File Name="photo.jpg" Size="183908" Hash="TTH/26UE3RC5PPFUOOA2Q3VSKK3KYHPMVUQ6F37BTWQ"/>
Of course until all clients support this you would need to keep both attributes anyway. Support for multiple hash algorithms could be provided using a delimiter such as a semi-colon (you'd want to make sure no common hash functions use this) e.g. <File Name="photo.jpg" Size="183908" Hash="TTH/26UE3RC5PPFUOOA2Q3VSKK3KYHPMVUQ6F37BTWQ;SHA1/243A49FE192DC18197FD234f40071E49BCC234A1"/>
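A client reading the proposed attribute would have to split it twice, once on the semicolon and once on the slash. A minimal Python sketch, assuming exactly the format shown in the examples above (the function name is hypothetical):

```python
def parse_hash_attr(value):
    """Parse 'ALGO/DIGEST;ALGO/DIGEST' into a dict of algorithm -> digest."""
    hashes = {}
    for item in value.split(";"):
        if not item:
            continue
        # Split on the first slash only: the algorithm name comes first.
        algo, sep, digest = item.partition("/")
        if not sep or not digest:
            raise ValueError(f"malformed hash entry: {item!r}")
        hashes[algo] = digest
    return hashes


attr = ("TTH/26UE3RC5PPFUOOA2Q3VSKK3KYHPMVUQ6F37BTWQ;"
        "SHA1/243A49FE192DC18197FD234f40071E49BCC234A1")
hashes = parse_hash_attr(attr)
print(hashes["TTH"])
print(hashes["SHA1"])
```

This only works as long as no algorithm name contains a slash and no digest encoding contains a semicolon, which is the constraint noted in the comment.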

Pseudonym (404emailnotfound) wrote :

Please don't be like the people that store XML strings in database fields like I see on TheDailyWTF. Having internal structure in the text of an attribute completely misses the point of having the filelist in a format like XML. If this were to be done (and I don't think it should be), it would have to be done like this:
<File Name="photo.jpg" Size="183908">
  <Hash Type="TTH">26UE3RC5PPFUOOA2Q3VSKK3KYHPMVUQ6F37BTWQ</Hash>
  <Hash Type="SHA1">243A49FE192DC18197FD234f40071E49BCC234A1</Hash>
</File>
We could also use magnet-like "urn:tree:tiger" instead of "TTH" (and "urn:sha1" instead of "SHA1").
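With nested elements, a standard XML parser does all of the splitting. The Python sketch below reads the Hash elements from the example and maps the Type values to the magnet-style URN prefixes mentioned above; the URN_PREFIX table is an illustrative assumption, not an agreed naming scheme.

```python
import xml.etree.ElementTree as ET

FILE_XML = """<File Name="photo.jpg" Size="183908">
  <Hash Type="TTH">26UE3RC5PPFUOOA2Q3VSKK3KYHPMVUQ6F37BTWQ</Hash>
  <Hash Type="SHA1">243A49FE192DC18197FD234f40071E49BCC234A1</Hash>
</File>"""

# Illustrative mapping from the Type attribute to a magnet-style URN prefix.
URN_PREFIX = {"TTH": "urn:tree:tiger", "SHA1": "urn:sha1"}

file_elem = ET.fromstring(FILE_XML)
# No custom delimiter handling: each hash is its own element.
hashes = {h.get("Type"): h.text for h in file_elem.findall("Hash")}
urns = [f"{URN_PREFIX[kind]}:{digest}" for kind, digest in hashes.items()]

print(file_elem.get("Name"), file_elem.get("Size"))
for urn in urns:
    print(urn)
```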

Launchpad Janitor (janitor) wrote :

[Expired for DC++ because there has been no activity for 60 days.]

Changed in dcplusplus:
status: Incomplete → Expired