non-numeric death date

Bug #423523 reported by solrize
Affects: Open Library
Status: New
Importance: Undecided
Assigned to: Edward Betts

Bug Description

LA2 points out a non-numeric death date (basically the author put in a blurb saying he was still alive and writing) in:

   http://openlibrary.org/a/OL84694A/ALBERTO-(ambit)-MANALON-MADRILEJOS

I'm opening this bug to request general comments/discussion about what we should do about stuff like this. Should we apply some kind of syntax checks to the death date field? We don't, and basically can't, for other fields like publication date.

Edward Betts (edwardbetts) wrote :

We need to invent a date type that can handle dates like:

1952
10th July 1952
2nd Century
1832 or 3
ca. 1642

We could add a boolean to the author page with a name like 'alive' or 'living person'.
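
A minimal sketch of what such a date type might look like, covering only the five example forms listed above. The class `FuzzyDate`, its field names, and `parse_fuzzy_date` are all hypothetical and are not part of the Open Library code; the reading of "1832 or 3" as "1832 or 1833" is also an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


@dataclass
class FuzzyDate:
    """Hypothetical date type; field names are illustrative only."""
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None
    century: Optional[int] = None   # e.g. 2 for "2nd Century"
    alt_year: Optional[int] = None  # second candidate in "1832 or 3"
    circa: bool = False             # True for a leading "ca."


def parse_fuzzy_date(text: str) -> Optional[FuzzyDate]:
    """Return a FuzzyDate, or None if the text fits none of the known forms."""
    s = text.strip()
    circa = False
    m = re.match(r"ca\.\s*", s)
    if m:
        circa, s = True, s[m.end():]
    # "2nd Century"
    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th) [Cc]ent(?:ury|\.)", s)
    if m:
        return FuzzyDate(century=int(m.group(1)), circa=circa)
    # "10th July 1952"
    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th)? ([A-Za-z]+) (\d{1,4})", s)
    if m and m.group(2).lower() in MONTHS:
        return FuzzyDate(year=int(m.group(3)),
                         month=MONTHS[m.group(2).lower()],
                         day=int(m.group(1)), circa=circa)
    # "1832 or 3": assume the short number replaces the trailing digits
    m = re.fullmatch(r"(\d{1,4}) or (\d{1,4})", s)
    if m:
        first, second = m.group(1), m.group(2)
        if len(second) > len(first):
            return None
        alt = first[:len(first) - len(second)] + second
        return FuzzyDate(year=int(first), alt_year=int(alt), circa=circa)
    # "1952"
    if re.fullmatch(r"\d{1,4}", s):
        return FuzzyDate(year=int(s), circa=circa)
    return None
```

Anything the parser returns None for (such as a blurb saying the author is still alive) could then be flagged for review rather than stored as a date.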

LA2 (lars-aronsson) wrote :

By looking at death_date and changing all digits to 9, the following patterns are the most common (in authors.json.gz of 29 July, 2009 having 6.47 million author records):

occurrences  value pattern              comment
     164289  9999.                      trailing period should be removed
     125446  9999                       nice year
       4598  *                          just an asterisk, should be blank?
       4153  .                          just a period, should be blank
       3499  ,                          just a comma, should be blank
       3311  9999,                      trailing comma should be removed
       2738  [from old catalog]         imported from LoC? garbage?
       2485  9999?
       2395  )                          just a closing parenthesis, should be blank
       2150  9999. [from old catalog]   LoC?
       1717  .·                         just a period and a mid-dot, should be blank
       1453  ca. 9999
       1400  999
        765  .*                         just a period and an asterisk, should be blank
        762  ca. 9999.
        701  9999 or 9.
        424  9999]                      trailing bracket should be removed
        406  999.                       trailing period should be removed
        314  9999)                      trailing parenthesis should be removed
        263  ca. 999
        233  ed.                        huh?
        225  9999 or 99.                trailing period should be removed
        207  9999?.                     trailing period should be removed
        191  ).                         just parenthesis and period, should be blank
        173  999?
        172  9999.·                     trailing period and mid-dot should be removed
        162  ]                          just a closing bracket, should be blank
        129  9999.*                     trailing period and asterisk should be removed
        121  9999 or 9
        114  c                          huh?
         98  99
         97  9999, [from old catalog]   LoC?
         86  comp.                      huh?
         86  9999 or 99
         84  ca. 999.                   trailing period should be removed
         83  ·                          just a mid-dot, should be blank
         62  999 B.C.                   keep trailing period after B.C.
         61  999 or 9.                  trailing period should be removed
         56  99.                        trailing period should be removed
         50  99th cent.                 keep trailing period after cent.
         42  99.99.9999                 is this day.month.year or month.day.year?
         38  ....                       four periods?
         35  l999                       lowercase L instead of digit 1
         33  9999.  [from old catalog]  period and two spaces before the bracket, LoC import?
         32  ca.9999                    should have space after ca.
         31  9999).                     trailing parenthesis and period should be removed
         30  9
         28  ca. 999 B.C.
         27  9999 .                     trailing space and period should be removed
         26  99/99/9999                 is this day/month/year or month/day/year?
         26  ?
         25  l999.                      lowercase L instead of digit 1
         25  99 B.C.
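
The counting method behind the table above (change every digit to 9, then tally the resulting patterns) can be sketched roughly as follows. The function names are made up for illustration, and the sample list stands in for the death_date values that would really be read from the authors dump; how that dump is parsed is left out because its exact layout is not given here.

```python
import re
from collections import Counter
from typing import Iterable


def digit_pattern(value: str) -> str:
    """Replace every ASCII digit with 9, so '1952.' becomes '9999.'."""
    return re.sub(r"[0-9]", "9", value)


def pattern_counts(death_dates: Iterable[str]) -> Counter:
    """Tally how often each digit pattern occurs."""
    return Counter(digit_pattern(d) for d in death_dates)


# Stand-in sample; real input would be the death_date field of each record.
sample = ["1952.", "1871.", "1900", "ca. 1642", "l899", "*"]
for pattern, count in pattern_counts(sample).most_common():
    print(f"{count:7d} {pattern}")
```

Because letters and punctuation are left untouched, oddities such as a lowercase l in place of 1 show up as their own pattern instead of being merged with valid years.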

Edward Betts (edwardbetts) wrote :

I have a function called fix_l_in_date which fixes dates containing a lowercase l instead of the digit 1. This is used for all newly loaded records.

http://github.com/openlibrary/openlibrary/blob/master/openlibrary/catalog/utils/__init__.py#L68
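
For readers who do not want to follow the link, here is a rough sketch of the idea. It is not the code of fix_l_in_date; the function name `replace_l_with_one` and the heuristic (an l counts as a mistyped 1 only when it sits directly next to a digit) are assumptions made for this illustration.

```python
import re

# Assumed heuristic: a lowercase "l" directly before or after a digit
# is a mistyped 1. This is not the implementation behind the link above.
_L_NEAR_DIGIT = re.compile(r"l(?=\d)|(?<=\d)l")


def replace_l_with_one(date: str) -> str:
    """Return the date with digit-adjacent lowercase l replaced by 1."""
    return _L_NEAR_DIGIT.sub("1", date)
```

Restricting the replacement to digit-adjacent positions leaves values such as "ed." or "April 1900" unchanged.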

LA2 (lars-aronsson) wrote :

The problem is actually two problems:

1) Bad input, which can be verified and corrected, e.g. 32nd of April. This is quite easy to analyze and fix just by looking at the JSON dumps.

2) Vandalism, self-promotion or slander in the input: what we tolerate, what we revert, which users should be blocked, etc. This has a lot more to do with community building and communication.
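
The first kind of problem, an impossible date such as the 32nd of April, can be caught mechanically once day, month and year have been extracted. A small sketch using Python's standard datetime module; the function name is hypothetical, and the check assumes the proleptic Gregorian calendar and years 1 to 9999, so B.C. dates would need separate handling.

```python
from datetime import date


def is_valid_calendar_date(year: int, month: int, day: int) -> bool:
    """True if the triple names a real day in the proleptic Gregorian calendar."""
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True
```

A check like this says nothing about the second kind of problem: a vandalized date can be perfectly well formed.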

Edward Betts (edwardbetts) wrote :

Agreed. For number 2 we need to have people patrolling recent changes and a way to contact editors. To patrol recent changes there needs to be an option to hide edits by bots; see Bug #364191.

Changed in openlibrary:
assignee: nobody → Edward Betts (edwardbetts)
Karen Coyle (kcoyle) wrote :

[from old catalog] LoC?

Yes, all of these are from LoC. Bracketed phrase should be deleted. Also appears in subfield with name (not just date), subjects, and in titles:

http://openlibrary.org/works/OL102799W/California_and_Oregon_Trail._from_old_catalog

http://openlibrary.org/authors/OL6712855A/Willis_N._from_old_catalog_Bugbee

http://openlibrary.org/subjects/history_and_criticism._%5Bfrom_old_catalog%5D
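
Deleting the bracketed phrase from a field value could be done along these lines. This is only a sketch: the function name is invented, and it handles the literal "[from old catalog]" form seen in the data, not the underscore variants that appear in the URLs above.

```python
import re

# Matches the literal LoC marker plus any whitespace in front of it.
_OLD_CATALOG = re.compile(r"\s*\[from old catalog\]")


def strip_old_catalog(value: str) -> str:
    """Remove every '[from old catalog]' marker and trim the result."""
    return _OLD_CATALOG.sub("", value).strip()
```

A value consisting of nothing but the marker comes back as an empty string, which matches the "should be blank" treatment suggested for similar garbage values.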

George (george-archive) wrote :

I've manually corrected those first 2 examples from Karen, but can't edit the subject.

(That's an awesome list from LA2 btw!)
