Author splitting

Bug #286995 reported by Edward Betts
Affects        Status     Importance  Assigned to    Milestone
Open Library   Confirmed  Medium      Edward Betts

Bug Description

Some authors with the same name have ended up with a single author page. We need a way to split authors.

Changed in openlibrary:
assignee: nobody → edward-debian
importance: Undecided → Medium
status: New → Confirmed
Tom Morris (tfmorris) wrote :

There are actually a large number of these conflated records. Pretty much any author record with just a first and last name (i.e. no middle name/initial, no birth/death dates) is almost guaranteed to represent multiple people.
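A rough sketch of that heuristic in Python, in case it helps with triage. The `AuthorRecord` class and its field names are hypothetical stand-ins, not the actual Open Library schema, and the rule is only the one described above: two name tokens and no dates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthorRecord:
    # Hypothetical minimal shape, not the real Open Library author schema.
    key: str
    name: str
    birth_date: Optional[str] = None
    death_date: Optional[str] = None


def likely_conflated(rec: AuthorRecord) -> bool:
    """Flag records with only a first and last name and no dates."""
    if rec.birth_date or rec.death_date:
        return False
    name = rec.name
    if "," in name:
        # Normalise "Last, First M." to "First M. Last" before counting tokens.
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    tokens = [t.strip(".") for t in name.split()]
    tokens = [t for t in tokens if t]
    return len(tokens) == 2


records = [
    AuthorRecord("OL2624799A", "Don Dinkmeyer"),
    AuthorRecord("OL302305A", "Dinkmeyer, Don C."),
]
suspects = [r.key for r in records if likely_conflated(r)]
print(suspects)  # ['OL2624799A']
```

This only produces a candidate list for review; it says nothing about how many people are actually behind a flagged record.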

George (george-archive) wrote :

(From Tom's post to ol-discuss, 4/29)

Rather than just complain about the data quality, here's a small
contribution to help improve it. I put together a little application
which shows all authors who have multiple Open Library author records,
as identified by the Freebase community.

You can find it at http://ol-dupes.freebaseapps.com/authors

The list is sorted from most to fewest duplicates, and each
entry is linked to all OL records as well as the Freebase record.
Freebase uses a slightly different schema, so the authors are linked
to Books ("works" in FRBR lingo) and those are linked to Book Editions
which equate to the Open Library book records.

I also included all the known names for the authors. Most of these
will have come from the merger of multiple records. I haven't looked
in detail, but it wouldn't surprise me if some of the bad names are
from munging on the Freebase side of things. You can see what the
name associated with each OL record is by clicking on the ID link.

The app is better for browsing than actual data cleanup, but I'd be
happy to show someone how to extract the data in a form that could be
used in the OL processes (or do it for you). The app is BSD licensed
so anyone's free to hack on it as well.

Tom

(Thanks, Tom!!)

George (george-archive) wrote :

The web app seems to run into query timeouts around 5 or 6 pages,
perhaps because of the way I'm sorting things, but the grand total is
more than you want to be paging through anyway. I count 7,138 authors
after de-duping (18,445 records on the OL side).

Here's the histogram of counts by number of duplicates:

Duplicates  Authors
2           6496
3           494
4           89
5           32
6           10
7           9
8           3
9           3
11          1
12          1
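
For anyone who wants to re-tally the histogram, here is a small sketch. It assumes each row is a pair of whitespace-separated integers (number of duplicates, then number of authors); the function name is just for illustration.

```python
HISTOGRAM_TEXT = """\
2 6496
3 494
4 89
5 32
6 10
7 9
8 3
9 3
11 1
12 1
"""


def parse_histogram(text: str) -> dict[int, int]:
    """Map number of duplicates -> number of authors with that many."""
    hist = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        dupes, authors = line.split()
        hist[int(dupes)] = int(authors)
    return hist


hist = parse_histogram(HISTOGRAM_TEXT)
total_authors = sum(hist.values())
print(total_authors)  # 7138

# Share of authors that have exactly two records.
pair_share = hist[2] / total_authors
print(f"{pair_share:.1%}")  # 91.0%
```

The author total matches the count given above, and the bulk of the cleanup work is in the two-record cases.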

It's trivial to generate a file of these dupes, but I'd also like to
figure out how this evolves going forward (i.e. as the Freebase
community identifies additional merges).

George (george-archive) wrote :

On Sat, May 1, 2010 at 1:14 AM, Michael Engel <email address hidden> wrote:
> > Freebase id: /m/05wk45p
> > Author name: Don Dinkmeyer
> > Aliases:
> > Don Dinkmeyer Jr.,Don Dinkmeyer Sr.,Don Sr Dinkmeyer,
> > Open Library records:
> > OL2624799A,OL302305A,OL2757673A,OL2757574A,OL2686700A,
> >
> > Looks like the Junior and the Senior are two different authors, see one example:

Good catch. I certainly didn't mean to imply that I think Freebase is
error-free. I think it's generally higher quality than what's in Open
Library, but not in this case. I think it also provides a nice
combination of machine-powered and human-powered reconciliation
processes. At a minimum though, the listing can be used to identify
areas that need cleanup.

There were actually two Freebase records and six Open Library records
for what are, most likely, two authors:

Freebase name: Don Dinkmeyer http://www.freebase.com/view/m/05wk45p
  Don Dinkmeyer http://openlibrary.org/a/OL2624799A
  Dinkmeyer, Don C. http://openlibrary.org/a/OL302305A
  Don Dinkmeyer Jr. http://openlibrary.org/a/OL2757673A (0 books)
  Don Dinkmeyer Sr. http://openlibrary.org/a/OL2757574A
  Don Sr Dinkeyer http://openlibrary.org/a/OL2686700A

Freebase name: Don C Dinkmeyer http://www.freebase.com/view/m/05wyhcb
  Don C Dinkmeyer http://openlibrary.org/a/OL3821345A

The Don Dinkmeyer Jr author record on Open Library has no books
associated with it, so I'm not even sure why it got created. Some of
the other OL records (e.g. Don Sr Dinkeyer) were obviously munged at
some stage in the processing pipe before getting to Freebase (perhaps
before getting to Open Library too).

It doesn't look like anyone in the Freebase community edited the
conflated record, so that's all apparently the result of overly
aggressive machine-based merging. I flagged the two separate records
for merger, which has since been voted on and completed, but now comes
the hard part - teasing apart the two authors.
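
One cheap way to catch this kind of over-merge automatically is to look for conflicting generational suffixes among the aliases of a merged record. This is only a sketch of that idea, not something Freebase or Open Library is known to do; the function names are made up, and it only knows about "Jr" and "Sr".

```python
import re

# Generational markers treated as evidence of distinct people (an assumption).
GENERATIONAL = {"jr": "Jr", "sr": "Sr"}


def suffixes_in(alias: str) -> set[str]:
    """Return the generational markers that appear anywhere in one alias."""
    tokens = re.findall(r"[A-Za-z]+", alias.lower())
    return {GENERATIONAL[t] for t in tokens if t in GENERATIONAL}


def looks_over_merged(aliases: list[str]) -> bool:
    """True if the aliases mention more than one distinct marker."""
    seen = set()
    for alias in aliases:
        seen |= suffixes_in(alias)
    return len(seen) > 1


aliases = ["Don Dinkmeyer Jr.", "Don Dinkmeyer Sr.", "Don Sr Dinkmeyer"]
print(looks_over_merged(aliases))  # True
```

A hit only means the record deserves a human look; deciding which books belong to which person still has to be done by hand or against an authority file.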

I looked at the LoC and WorldCat and they do not appear to use Jr. and
Sr. at all. They use "Don Dinkmeyer" for the father, presumably
because he was the first and only at the time, and "Don Dinkmeyer,
1958-" for the son. This is apparently a variation on the bizarre
cataloging practices that librarians use, discussed a while back by
Karen. (Why not birth years for both? Why not Sr./Jr.? Why not
...?)

Here are the LoC authority records:

Dinkmeyer, Don C.
[They know the birth date and the fact that he's Sr., but don't
include it in the main heading]
http://authorities.loc.gov/cgi-bin/Pwebrecon.cgi?AuthRecID=2362233&v1=1&HC=1&SEQ=20100501103657&PID=W1x3SwNKrlJizonRsJ0SQ7NKGR91

Dinkmeyer, Don C., 1952-
http://authorities.loc.gov/cgi-bin/Pwebrecon.cgi?AuthRecID=946372&v1=1&HC=1&SEQ=20100501103549&PID=DvTGLGauLzedNKB8tuqgiZXm6K7Y

There's more strangeness in the Open Library records for one of the
books co-authored with Gary McKay, STET
(http://openlibrary.org/books/OL11407090M/Stet). The database lists
the wrong Gary McKay (combat author
http://openlibrary.org/authors/OL370554A/Gary_McKay) on the book, but
if you click through to the author page, the book isn't listed, so the
database is internally inconsistent.
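
That kind of mismatch can be checked mechanically. Below is a minimal sketch, assuming the two directions of the link have already been pulled into plain dictionaries (book ID to author IDs, and author ID to the book IDs shown on the author page). Both dictionaries and the function name are hypothetical; the real data would have to come from whatever queries or dumps are available.

```python
def find_inconsistencies(
    book_authors: dict[str, list[str]],
    author_books: dict[str, set[str]],
) -> list[tuple[str, str]]:
    """Return (book, author) pairs where the book names the author
    but the author's own book list does not include the book."""
    problems = []
    for book, authors in book_authors.items():
        for author in authors:
            if book not in author_books.get(author, set()):
                problems.append((book, author))
    return sorted(problems)


# The STET case described above, reduced to the one link in question.
book_authors = {"OL11407090M": ["OL370554A"]}
author_books = {"OL370554A": set()}  # book missing from the author page

print(find_inconsistencies(book_authors, author_books))
# [('OL11407090M', 'OL370554A')]
```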

...
