FRBRizing and deduping

Bug #128394 reported by Aaron Swartz
Affects       Status        Importance  Assigned to   Milestone
Open Library  Fix Released  Medium      Edward Betts

Bug Description

We need to collect similar books into works, as in FRBR, and merge duplicate books from different sources.

Edward has an algorithm that works; this is ready to go live with 1.6.

The remaining question is how to handle works with an original language that isn't English. Right now the title is in the original language.

Tags: frbr
Aaron Swartz (aaronsw)
Changed in openlibrary:
assignee: nobody → kcoyle
importance: Undecided → High
status: New → Confirmed
Greg Grossmeier (greg.grossmeier) wrote :

This bug blocks David's work for bug 183050. Is there any update on the status?

Karen Coyle (kcoyle) wrote :

Nothing developed yet. I don't think we have the programmer resources to take this on at the moment. Algorithms are available that we can crib from; I'm not sure how we will represent the "work cluster" in the db design, nor what UI work it will take to present the results. This one perhaps needs a meeting to sort out the direction?

alexisrossi (alexis-archive) wrote : Re: [Bug 128394] Re: FRBRizing and deduping

We would like to make some progress towards FRBR by the end of October
(when IA has its annual meeting), but this is a huge, complicated task.
To my knowledge, no one has really implemented FRBR in the way we would
like to do it. It's easily said, but not so easily done.

solrize (solrize) wrote :

I thought Edward had coded the algorithms and that we had done significant de-duping in the current catalog, but that there was more to do. I'd like to help with this if I can. As Alexis says, it is a big messy task, but the methods involved are also of interest for the search stuff I'm doing.

We had a meeting quite a long time back where we discussed this in detail and I thought I understood it then, so maybe I'm way behind the times now.

Karen Coyle (kcoyle) wrote :

Deduping and frbr-izing are two different things:

1) Deduping: bringing together records for copies of the same edition of
the same book. We do this when new sources (e.g. new libraries) are
added to the database.

2) Frbr-izing: bringing together records for different editions of the
same book.

Note that many books are only issued in one edition; frbr-izing affects
a small but very visible part of the bibliographic universe (about 5% is
the estimate). It covers popular works like Shakespeare and Mark Twain;
it should also bring together re-printings and translations with the
original work. Think of it as a cluster of books with approximately the
same text, although published at different times by
different publishers.

kc
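
To make the distinction concrete, here is a minimal sketch of the two steps in Python. It is not Edward's algorithm: the record fields, the matching rules in `edition_key` and `work_key`, and the sample records are all assumptions made up for illustration. Exact-key matching like this would also miss translations, whose titles differ from the original.

```python
import re
from collections import defaultdict

def norm(text):
    """Lowercase and collapse punctuation/whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def edition_key(rec):
    # Assumed rule: same edition = same title, author, publisher and year.
    return (norm(rec["title"]), norm(rec["author"]),
            norm(rec["publisher"]), rec["year"])

def work_key(rec):
    # Assumed rule: same work = same title and author, any publisher or year.
    return (norm(rec["title"]), norm(rec["author"]))

def dedupe(records):
    """Step 1: group records that describe copies of the same edition."""
    editions = defaultdict(list)
    for rec in records:
        editions[edition_key(rec)].append(rec)
    return editions

def frbrize(editions):
    """Step 2: cluster the distinct editions under one work."""
    works = defaultdict(list)
    for key, recs in editions.items():
        works[work_key(recs[0])].append(key)
    return works

# Made-up sample records from two hypothetical sources.
records = [
    {"title": "Hamlet", "author": "William Shakespeare",
     "publisher": "Penguin", "year": 1980, "source": "lib_a"},
    {"title": "Hamlet.", "author": "William Shakespeare",
     "publisher": "Penguin", "year": 1980, "source": "lib_b"},
    {"title": "Hamlet", "author": "William Shakespeare",
     "publisher": "Oxford", "year": 1998, "source": "lib_a"},
    {"title": "Tom Sawyer", "author": "Mark Twain",
     "publisher": "Harper", "year": 1920, "source": "lib_a"},
]

editions = dedupe(records)   # the two Penguin 1980 records merge into one edition
works = frbrize(editions)    # the two Hamlet editions fall under one work
```

In this sketch the four records reduce to three editions and two works; a real implementation would need fuzzier matching than exact normalized keys.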

--
-----------------------------------
Karen Coyle / Digital Library Consultant
<email address hidden> http://www.kcoyle.net
ph.: 510-540-7596 skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234
------------------------------------

rejon (rejon) wrote :

Cool. Would this, and the steps to accomplish it, be a good topic for
the call this week? Greg from CC and I will be on the call this week.

OT: Alexis, do you have a URL for the meeting in October?

On Tue, 2008-07-15 at 20:11 +0000, alexisrossi wrote:
> We would like to make some progress towards FRBR by the end of October
> (when IA has its annual meeting), but this is a huge, complicated task.
> To my knowledge, no one has really implemented FRBR in the way we would
> like to do it. It's easily said, but not so easily done.
>
--
Jon Phillips
San Francisco, CA + Guangzhou + Beijing
GLOBAL +1.415.830.3884
CHINA +86.1.360.282.8624
<email address hidden>
http://www.rejon.org
IM/skype: kidproto
Jabber: <email address hidden>
IRC: <email address hidden>

rejon (rejon) wrote :

On Wed, 2008-07-16 at 00:15 +0000, Karen Coyle wrote:
> deduping and frbr-izing are two different things:
>
> 1) deduping: bringing together records for copies of the same edition of
> the same book. we do this when new sources (e.g. new libraries) are
> added to the database.
>
> 2) frbr-izing: bringing together records for different editions of the
> same book.
>
> Note that many books are only issued in one edition; frbr-izing affects
> a small but very visible part of the bibliographic universe (about 5% is
> the estimate). It covers popular works like Shakespeare and Mark Twain;
> it should also bring together re-printings and translations with the
> original work. think of it as a cluster of books with approximately the
> same text, although having been published at different times by
> different publishers.
>
> kc

Yes, and frbr-ization really helps PD/copyright determination. It will
also help as the amount of data included increases and as types of
media beyond books are recognized.

Thanks for the breakdown, Karen.

I'd love to talk more about this and get an engineering breakdown and
possible ways we can help accelerate it :)

Jon


solrize (solrize) wrote :

On Wed, 2008-07-16 at 00:40 +0000, rejon wrote:
> Cool, would this be a good topic for the call this week and steps to
> accomplish this? Greg from CC and I will be on the call this week.

I don't think FRBR is an appropriate topic for this week's call, which
is supposed to be about issues related to database performance.

rejon (rejon) wrote :

Yes, that makes sense. Possibly then the following week?

Jon


webchick (webchick) wrote :

Sounds like a plan. Let's do it.

solrize (solrize) wrote :

I don't understand what spending a bunch of time on FRBR in a conference call is going to accomplish. There are only a few people likely to work on it. Jon, are you offering to help write the code? If yes, you should probably talk with Edward and Karen. If not, I don't think putting the topic into the next phone call is going to do any good.

rejon (rejon) wrote :

Proper "pick up a shovel" call-out, Paul. I want to know how to get this
done sooner rather than later. I do have some resources to help move
this forward if there is a plan and solution in sight.

Paul, I will talk more with Edward and Karen about this, but I wanted to
get an assessment of what is needed to solve this, priorities for coming
weeks/months on OL, and how this relates.

If there is a plan as well, yes, I can CODE. Amazing, right!

Jon

On Wed, 2008-07-16 at 19:06 +0000, solrize wrote:
> I don't understand what spending a bunch of time on FRBR in a conference
> call is going to accomplish. There are only a few people likely to work
> on it. Jon, are you offering to help write the code? If yes, you
> should probably talk with Edward and Karen. If not, I don't think
> putting the topic into the next phone call is going to do any good.
>

rejon (rejon) wrote :

OK, I need to schedule a call with Karen to discuss this and what PDregistry.ca is hoping for. Karen, and others interested in this: do you have time for a call this week or early next week, say 9 AM PST next Wednesday?

rejon (rejon) wrote :

What is the link to the current info/recommendation on this? I want to pour fuel onto the FRBR process. Access Copyright really wants it integrated, so I want to help move this forward before Karen and I meet next week.

Karen Coyle (kcoyle) wrote :

The analysis that I did is at:
  http://openlibrary.org/about/frbrization
This proposes some possible structures for the creation of Work records (Work in the FRBR sense). The CC project has a slightly different need: the creation of Expression records. This is because copyright operates at the FRBR Expression level (the Work level is more abstract than the law addresses). We need to discuss whether we need both Work and Expression levels in OL, or if Expression will be sufficient.
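
As a rough illustration of the Work versus Expression question, the sketch below models both levels in Python. It is not the structure proposed in the analysis linked above, and it is not Open Library's schema: the class names, fields, and sample values are all hypothetical. The point is only that a translation sits under the same Work as the original but is its own Expression, which is the level where a copyright status would attach.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Edition:
    # One published edition of a given text (hypothetical fields).
    publisher: str
    year: int

@dataclass
class Expression:
    # One realization of the work, e.g. the original text or a translation.
    language: str
    translator: Optional[str] = None
    editions: List[Edition] = field(default_factory=list)

@dataclass
class Work:
    # The abstract work that groups all of its expressions.
    title: str
    author: str
    expressions: List[Expression] = field(default_factory=list)

    def all_editions(self):
        """Flatten to the Work-only view: every edition, regardless of expression."""
        return [ed for ex in self.expressions for ed in ex.editions]

# Made-up example: an original-language expression plus one translation.
hamlet = Work(
    title="Hamlet",
    author="William Shakespeare",
    expressions=[
        Expression(language="eng", editions=[
            Edition("Publisher A", 1980),
            Edition("Publisher B", 1998),
        ]),
        Expression(language="ger", translator="Example Translator", editions=[
            Edition("Publisher C", 1965),
        ]),
    ],
)
```

If only the Work level were kept, `all_editions()` is all a consumer would see, and the distinction between the original and the translation would have to be recovered from each edition's own metadata.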

Edward Betts (edwardbetts) wrote :

Jon,

What format would you like for work URLs on Access Copyright?

Do you want the author name and work title in the URL, one of these:

/Hamlet_by_William_Shakespeare
/works/Hamlet_by_William_Shakespeare
/works/William_Shakespeare/Hamlet
/William_Shakespeare/works/Hamlet

Or do you want to use a number for the work URL:

/works/2432

If you let me know I'll have a go at adding works to the Access Copyright site.
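
For comparison, here is a small sketch that generates each of the candidate URL formats above from an author, a title, and a numeric ID. The function names and dictionary keys are hypothetical and do not reflect how either site builds its URLs; spaces become underscores as in the examples, and anything else unsafe is percent-encoded with the standard library's `urllib.parse.quote`.

```python
from urllib.parse import quote

def slug(text):
    """Join words with underscores, then percent-encode anything URL-unsafe."""
    return quote("_".join(text.split()), safe="")

def work_urls(author, title, work_id):
    """Return every candidate URL format for one work (hypothetical helper)."""
    a, t = slug(author), slug(title)
    return {
        "title_by_author": f"/{t}_by_{a}",
        "works_title_by_author": f"/works/{t}_by_{a}",
        "works_author_title": f"/works/{a}/{t}",
        "author_works_title": f"/{a}/works/{t}",
        "numeric": f"/works/{work_id}",
    }

urls = work_urls("William Shakespeare", "Hamlet", 2432)
```

One trade-off worth noting: the name-based forms are readable but change whenever an author name or title is corrected, while the numeric form stays stable.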

rejon (rejon) wrote :

I want to make sure our path doesn't diverge from OL. The last option
you lay out seems the most likely, but I haven't synced with Edward
about what his implementation supports.

On Sat, 2009-01-10 at 01:57 +0000, Karen Coyle wrote:
> The analysis that I did is at:
> http://openlibrary.org/about/frbrization
> This proposes some possible structures for the creation of Work records (Work in the FRBR sense). The CC project has a slightly different need: the creation of Expression records. This is because copyright operates at the FRBR Expression level (the Work level is more abstract than the law addresses). We need to discuss whether we need both Work and Expression levels in OL, or if Expression will be sufficient.
>

rejon (rejon) wrote :

On Mon, 2009-01-12 at 16:43 +0000, Edward Betts wrote:
> Jon,
>
> What format would you like for work URLs on Access Copyright?
>
> Do you want the author name and work title in the URL, one of these:
>
> /Hamlet_by_William_Shakespeare
> /works/Hamlet_by_William_Shakespeare
> /works/William_Shakespeare/Hamlet
> /William_Shakespeare/works/Hamlet
>
> Or do you want to use a number for the work URL:
>
> /works/2432

I want to stay in line with what Open Library does or will do. After reading
Karen's synopsis, we might have to consider more than just works. As for
URL preference, I don't have a strong one; what matters more is mapping
to the FRBR concepts, so I defer to Karen on this one. I like the third
option in terms of clarity.

> If you let me know I'll have a go at adding works to the Access
> Copyright site.

Thanks. I'm in chat right now if you want to discuss directly, you
saint :)

Jon


raj (raj-archive)
Changed in openlibrary:
assignee: kcoyle → edward-debian
milestone: 1.0 → 1.6
description: updated
Changed in openlibrary:
status: Confirmed → In Progress
Changed in openlibrary:
importance: High → Medium
Changed in openlibrary:
status: In Progress → Fix Released