In bulk metadata download review dialog, downloaded metadata fields incorrectly shown as blank or undefined if existing field value matches downloaded field value

Bug #1657904 reported by Adrianna Pińska on 2017-01-19
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
calibre
Undecided
Unassigned

Bug Description

I'm running Calibre 2.77 on Ubuntu Linux 16.10 (installed from a tarball via the regular update script).

At first I thought that this was a bug in the source plugin I'm working on, but I have verified that the plugin is returning the correct data, and that the problem can be reproduced with any set of sources configured. I have tested it with the Amazon and ISBNDB sources as well as the forked ISFDB plugin which I primarily use. The affected fields appear to be the publisher, date of publication, languages and comments (it depends on which fields are available at the source, and under some circumstances some of these fields are inexplicably unaffected).

Steps to reproduce:

1. Create an arbitrary book. The extent of the bug is easiest to see if you configure a single source and start with an empty book with only a single unambiguous identifier for that source, because it is easier to see what fields are fetched and filled in during the download (and easier to demonstrate that the exact same metadata is being fetched every time).

2. Download metadata for the book by right-clicking and selecting "edit metadata" --> "download metadata and covers" (using the less interactive bulk metadata downloading tool, not the more interactive button from inside the "edit metadata individually" dialog). You can download metadata only for simplicity; the covers are irrelevant.

3. Review the downloaded metadata. Look in particular at the fields "publisher", "published", "languages" and "comments". If they are shown to be changed, click "ok" and go back to step 2 (without changing anything in the configuration of the sources, thereby ensuring that the exact same metadata will eventually be downloaded again, taking into account that some sources may initially behave differently depending on what values are already filled in). Otherwise, note that on the left hand side (i.e. the downloaded metadata) some or all of these fields are shown as undefined (the publication date) or blank (the other affected fields), even though they are *not* blank or undefined in the fetched data (which can be verified via the log if you view it before reviewing the metadata).

Specific example: turn off all sources except Amazon; add a completely empty book, enter amazon:0060515236 into the identifiers field, and save. The first time you fetch metadata, all four affected fields will be filled in. The second time you fetch the metadata, all four fields will exhibit this bug.

It appears that if the downloaded values for these fields match the existing values in the current metadata, the downloaded values are shown to be blank. This does not cause the book metadata to become corrupted (because either these fields are really treated as empty and the current values are preserved, or this is a display issue and the values are actually being overwritten with identical values), but it is misleading (particularly during debugging of source plugins).

I haven't been able to narrow down what could be causing this in the code. I also can't 100% consistently reproduce it with all the affected fields and all sources -- on one occasion a publication date which appeared identical was unaffected, and when I use my forked ISFDB plugin, comments are never affected (i.e. apparently identical comments are not marked as having changed, but do appear correctly on both sides). I assume that this means that there is a subtle difference between the comparison used to determine whether a field has changed (for the purposes of bolding and italicising the field name), and whatever comparison is involved in this bug. ISFBD comments are much more heavily formatted than comments from other sources; I don't know if this causes some kind of change in whitespace when they are stored which would affect some string comparisons and not others.

I vaguely recall that this was deliberate to simplify reviewing (no need
to read the same data twice), but I'm not a 100% sure. The relevant code
can be rtaced starting from the __call__() method at line 518 in
metadata/diff.py

Kovid Goyal (kovid) wrote :

Yes, it is deliberate, see line 45 onwards in metadata/sources/worker.py

Not for ease of review but as a performance optimization.

Kovid Goyal (kovid) wrote :

And note that it is explicitly undone when doing bulk downloads for the
title and author fields at line 227 in actions/edit_metadata.py --
probably the sa,e could be done for the other fields, although I am
50-50 on whether that is a good idea or not.

Adrianna Pińska (confluence) wrote :

Thank you for your response. What is the "extra work" in set_metadata that this optimisation is preventing? Is it just the necessity of checking that the fields have changed and not reapplying identical metadata values in all the set_metadata functions for all the different formats?

Would it not be possible to move this optimisation to occur after the values are displayed for review but before the set_metadata functions are called?

Kovid Goyal (kovid) wrote :

See my last response.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers