author merging

Bug #128401 reported by Aaron Swartz
Affects: Open Library
Status: Confirmed
Importance: Medium
Assigned to: Karen Coyle

Bug Description

merge duplicate authors from different sources

Aaron Swartz (aaronsw) wrote :

Step One: come up with an algorithm, then assign to dbg to implement

Changed in openlibrary:
assignee: nobody → kcoyle
importance: Undecided → Medium
status: New → Confirmed
Karen Coyle (kcoyle) wrote :

I got the impression that Simon is thinking about this. Let me know if there's anything else I should do.

Daniel B. Giffin (daniel-mybuttocks) wrote :

As far as I understand, the most effective approach here may be to let author merging benefit from manifestation merging. That is, authors with similar names who are identified as creators of the same book are probably the same person.
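
A minimal sketch of that heuristic, assuming editions have already been grouped into works by manifestation merging. The input layout, the `similar` helper, the work id, and the 0.6 threshold are all illustrative assumptions, not Open Library's actual data model. Similar names on the same book can also belong to different people, so the output is a list of candidates for review, not a list of merges.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase and replace punctuation with spaces, so 'F. Reuleaux' -> 'f reuleaux'."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(cleaned.split())


def similar(a, b, threshold=0.6):
    """Hypothetical similarity test; the threshold is a guess that would need tuning."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def merge_candidates(works):
    """Yield (author_id, author_id, work_id) for similar names credited on one work.

    `works` maps a work id to a list of (author_id, author_name) tuples gathered
    from every edition that was merged into that work.
    """
    seen = set()
    for work_id, authors in works.items():
        for (id_a, name_a), (id_b, name_b) in combinations(authors, 2):
            pair = tuple(sorted((id_a, id_b)))
            if id_a != id_b and pair not in seen and similar(name_a, name_b):
                seen.add(pair)
                yield pair + (work_id,)


# Hypothetical input: the work id is invented, the author ids come from this thread.
works = {
    "W1": [("OL2405825A", "F. Reuleaux"), ("OL2556494A", "Franz Reuleaux")],
}
for candidate in merge_candidates(works):
    print(candidate)
```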

Karen Coyle (kcoyle) wrote :

Daniel, I think this is a great idea -- I'm just not sure how it could work, in part because I don't know the data load flow. Basically, though, I think we need a way to bring author pages together "after the fact." There will be times when an author gets more than one page (because of how the author was represented in the metadata). One option is to change the actual author name so that it links to the same page, but then we're changing the bibliographic metadata. Another would be to allow cross-references on author pages (which we will probably need anyway). In this solution, there could be a link from one author page to one or more others, and we'd want a way for users to make the connection between authors.

The reason I'm leaning in this direction (cross-linking rather than forcing a single page) is that forcing a single page implies that we have a single set of rules for author names and that we'll try to get all author names to conform to those rules. In a database of this nature, it makes sense for bibliographic data to come in with different forms of the name and to be cross-linked.

The real trick is going to be pulling apart names that are the same but don't represent the same author. The "John Smith" problem. LC does this by adding dates to the name, but our users won't necessarily be able to come up with that information. Personally, I find the dates to be unfriendly from a user point of view, and would rather add something relating to the title of the first work under that author's name to differentiate the author. I don't know what this does to the format for author pages, but it would be nice to be able to create a page with a URL like "smith_john_first_word_of_title" that would display to the user as "Smith, John". And then we have to figure out how to make it easy for users to create this form.
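
One way such a key could be built, as a rough sketch only. The function names, the stop-word list, and the sample title are made up for illustration, and the ASCII-only slug rule would need rethinking for names with accented characters.

```python
import re


def author_key(surname, forename, first_title):
    """Build a key such as 'smith_john_history' from a name plus the first
    significant word of a title. The stop-word list is an assumption."""
    stop_words = {"a", "an", "the"}
    title_words = [
        w for w in re.findall(r"[a-z0-9]+", first_title.lower())
        if w not in stop_words
    ]
    parts = [surname, forename] + title_words[:1]
    return "_".join(re.sub(r"[^a-z0-9]+", "_", p.lower()).strip("_") for p in parts)


def display_name(surname, forename):
    """The form shown to the user, without the disambiguating title word."""
    return f"{surname}, {forename}"


# The title below is invented purely to show the shape of the key.
print(author_key("Smith", "John", "A History of Virginia"))
print(display_name("Smith", "John"))
```

Keeping the display form separate from the key means two different John Smiths can both be shown as "Smith, John" while still living at distinct URLs.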

In the end, creating cross-links from the bibliographic record clusters is still a great way to go, if we can figure it out in practice.

solrize (solrize) wrote :

There should be some way to manually separate results if they're merged incorrectly. Some books have multiple authors with similar names. For example, Christopher Tolkien (son of J. R. R. Tolkien) has written or edited a bunch of LOTR-related books. There's also a cryptography book by Young and Yung (the authors aren't related to each other).

Changed in openlibrary:
milestone: 1.0 → 1.7
solrize (solrize) wrote :

Also there should be a way to merge two author records into a single one, when they refer to the same person. Example:

http://openlibrary.org/a/OL2405825A/F.-Reuleaux
http://openlibrary.org/a/OL2556494A/Franz-Reuleaux

are the same person (per user LA2 on irc).

Bug #240780 may also be relevant to this.
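
A toy sketch of a manual merge that can also be undone, which is what the two requests above (merging and separating) seem to need together. The `AuthorStore` class, its fields, and the book ids are hypothetical and say nothing about how Open Library stores authors; only the two author ids are taken from this thread.

```python
class AuthorStore:
    """In-memory stand-in for author records; real records carry far more data."""

    def __init__(self):
        self.books = {}      # author id -> set of book ids
        self.redirects = {}  # merged-away author id -> surviving author id
        self.history = []    # one entry per merge, newest last

    def add(self, author_id, book_ids):
        self.books[author_id] = set(book_ids)

    def resolve(self, author_id):
        """Follow redirects so links to a merged-away record still work."""
        while author_id in self.redirects:
            author_id = self.redirects[author_id]
        return author_id

    def merge(self, duplicate, master):
        duplicate, master = self.resolve(duplicate), self.resolve(master)
        if duplicate == master:
            return
        dup_books = self.books.pop(duplicate)
        # Remember which books were new to the master so undo removes only those.
        added = dup_books - self.books[master]
        self.books[master] |= dup_books
        self.redirects[duplicate] = master
        self.history.append((duplicate, master, dup_books, added))

    def undo_last_merge(self):
        duplicate, master, dup_books, added = self.history.pop()
        del self.redirects[duplicate]
        self.books[master] -= added
        self.books[duplicate] = dup_books


store = AuthorStore()
store.add("OL2405825A", {"b1", "b2"})   # book ids are placeholders
store.add("OL2556494A", {"b2", "b3"})
store.merge("OL2405825A", "OL2556494A")
print(store.resolve("OL2405825A"))
store.undo_last_merge()
print(store.resolve("OL2405825A"))
```

Recording each merge as a redirect plus a history entry, rather than rewriting the book records in place, is what makes a wrong merge cheap to reverse.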

Quarian (b-m-diaz) wrote :

Some authors separate their work by using pseudonyms, variant names, or alternate emphasis; e.g. J. Charles Cox for historical works, John Charles Cox for clerical work.
(This is an example; in this case made more complex by the publisher chosen ...).
Thus, preserving the separate book lists may be sensible: it better reflects reality and perhaps other aspects, such as the author's intention. One thinks of "Lawrence of Arabia" and his pseudonyms, for example. Victorian ladies did likewise: Charlotte Elizabeth Bowen used C.E.B., no name at all, CB and, rarely, C.R. (the R. for Richmond, her maiden name). Linking them all together reflects someone's scholarship and may of course be valuable for the casual as well as the experienced user.
Thus, rather than "merging" two author records (files), might it be an idea to link the URLs via the notion of "a collation", named after the editor who proposes the conflation of the two (or more) author lists? In the example case, if I were the editor, this would be ".../a/.../Anti-Quarian-John-Charles-Cox". Another editor might correct this later, and so on. There is the problem of what this URL contains, but I guess that could be dealt with by an alternative directory ("l" for linkage, perhaps?). I guess this would also avoid, or at least side-step, potential editorial battles.
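
To make the collation idea concrete, here is a small sketch under stated assumptions: the `Collation` class, the `/l/` path layout, the author keys, and the book titles are all hypothetical. The point it illustrates is that the linkage is a separate, editor-attributed record and the underlying author lists stay untouched.

```python
from dataclasses import dataclass, field


@dataclass
class Collation:
    editor: str                      # who proposed the linkage
    label: str                       # name chosen for the combined view
    author_keys: list = field(default_factory=list)

    def path(self):
        # "/l/" follows the "l for linkage" suggestion; the layout is a guess.
        return f"/l/{self.editor}-{self.label}"


def collated_view(collation, books_by_author):
    """Return each linked author's own list, in order, without merging them."""
    return [(key, books_by_author.get(key, [])) for key in collation.author_keys]


# Placeholder keys and titles, for illustration only.
books_by_author = {
    "J.-Charles-Cox": ["historical-work"],
    "John-Charles-Cox": ["clerical-work"],
}
cox = Collation("Anti-Quarian", "John-Charles-Cox",
                ["J.-Charles-Cox", "John-Charles-Cox"])
print(cox.path())
print(collated_view(cox, books_by_author))
```

Because each collation carries its editor's name, a second editor who disagrees can publish a different collation instead of overwriting the first.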

solrize (solrize) wrote :

Is there a way to MANUALLY merge authors in the current system? Yannf points out that we have several author records for M. K. Gandhi:

http://openlibrary.org/a/OL4271982A/Gandhi-Mahatma
http://openlibrary.org/a/OL891A/Gandhi
http://openlibrary.org/a/OL4323209A/M.-K.-Gandhi
http://openlibrary.org/a/OL335459A/Mahatma-Gandhi

Karen Coyle (kcoyle) wrote :

Yann Forget wrote:
> I found another author with multiple pages:
> http://openlibrary.org/a/OL3267683A/Lanza-del-Vasto
> http://openlibrary.org/a/OL133326A/Joseph-Jean-Lanza-del-Vasto
> http://openlibrary.org/a/OL4347893A/Giuseppe-Giovanni-Lanza-del-Vasto

There will be many such instances. This is one of the reasons to hook up
with the LoC name authority files. The record for this author shows:

Primary name: Lanza del Vasto, Joseph Jean,
    dates: 1901-1981

Other names:
Vasto, Joseph Jean Lanza del,
Del Vasto, Lanza,
Shantidas,
Del Vasto, Joseph Jean Lanza,
Lanza, Giuseppe Giovanni,
Lanza del Vasto, Giuseppe Giovanni,
Lanza di Trabia-Branciforte, Giuseppe Giovanni Luigi Enrico

This would help us bring together the various forms of the name.
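
As a rough illustration of how an authority record like the one above could bring the three pages together, the sketch below indexes every listed form under the primary name. The record layout and the order-insensitive `name_key` are assumptions made for this example (they are not the LoC file format), and ignoring word order could wrongly match unrelated names in a larger file.

```python
def name_key(name):
    """Order-insensitive key: lowercase tokens, with commas and hyphens ignored."""
    tokens = name.lower().replace(",", " ").replace("-", " ").split()
    return tuple(sorted(tokens))


def build_index(records):
    """Map every known form of a name to the record's primary name."""
    index = {}
    for record in records:
        for form in [record["primary"]] + record["others"]:
            index[name_key(form)] = record["primary"]
    return index


# Names copied from the authority record quoted above (trailing commas dropped).
record = {
    "primary": "Lanza del Vasto, Joseph Jean",
    "dates": "1901-1981",
    "others": [
        "Vasto, Joseph Jean Lanza del",
        "Del Vasto, Lanza",
        "Shantidas",
        "Del Vasto, Joseph Jean Lanza",
        "Lanza, Giuseppe Giovanni",
        "Lanza del Vasto, Giuseppe Giovanni",
        "Lanza di Trabia-Branciforte, Giuseppe Giovanni Luigi Enrico",
    ],
}
index = build_index([record])

# The name parts of the three Open Library URLs from the quoted message.
for page in ["Lanza-del-Vasto",
             "Joseph-Jean-Lanza-del-Vasto",
             "Giuseppe-Giovanni-Lanza-del-Vasto"]:
    print(page, "->", index.get(name_key(page)))
```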

