Bug #1251394 “Metabib Display Fields” : Bugs : Evergreen

Revision history for this message

Bill Erickson (berick) wrote on 2013-11-14:

#1

Branch started at

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/berick/lp1251394-metabib-display-fields

It's largely a copy/paste job from facet_entry, w/ some slight modifications. It seems to work, in as much as values are extracted, but it will need lots of eyes.

TODO:
-- API / Views(?) for easier programmatic access.
-- managing the re-ingest on upgrade

Revision history for this message

Bill Erickson (berick) wrote on 2013-11-14:

#2

... and IDL stuff.

Revision history for this message

Mike Rylander (mrylander) wrote on 2013-11-14:

#3

Awesome, Bill!

I pushed a collab branch with some minor schema realignment and a fix for repeatable display fields. See: http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/miker/lp1251394-metabib-display-fields

Revision history for this message

Bill Erickson (berick) wrote on 2013-11-14:

#4

Thanks, Mike. Changes pulled w/ sign-off and some additional fixes applied into user/berick/lp1251394-metabib-display-fields . I've confirmed I'm now getting multiple ISBN and author entries.

Revision history for this message

Bill Erickson (berick) wrote on 2013-11-15:

#5

Further testing, updates, and questions:

Accessing display fields via the various auto-APIs is working well.

E.g. with no authtoken (~100ms)

request open-ils.fielder open-ils.fielder.mfde.atomic {"query":{"source”:[1,2,3,4,5,6,7,8,9,10]},”fields":["source","value","label","name","field_class"]}

E.g. with an authtoken (~35ms)

request open-ils.pcrud open-ils.pcrud.search.mfde.atomic "0febd15e08dd0150bdfdccf3def1b6ac", {"source":[1,2,3,4,5,6,7,8,9,10]}

---

Each of these returns lists of fields for each record, which raises a question. Given a set of display fields for a record, how do we choose which to show by default in a UI? In the MVR/simple-record world, we have something called "title". Do we need conventions (configuration?) for determining which title (proper, uniform, ...), author (personal, corporate...) , etc. to display by default? (This does not preclude access to all fields, of course, but it allows us to display a saner/smaller default set, which is best for most UIs).

If such a convention exists and is generally universal, it would be extremely convenient to have a DB view or similar which codified this, such that clients could say, "give me the title", while still having access to all display fields. I imagine this would be useful for reporting as well.

------------------

TODO:
-- pgtap test

-- re-ingest for upgrades. I'll need assistance with this. The re-ingest in my upgrade script is a step, maybe, but I'm sure there are better ways.

-- right now, most fields are set as display_fields (see below). Do we need to trim those down any?

UPDATE config.metabib_field SET display_field = FALSE WHERE field_class = 'keyword' OR name = 'complete';

-- Do we need any default xpath or normalizers for display fields?

Revision history for this message

Mike Rylander (mrylander) wrote on 2013-11-15:

#6

Bill,

There's no convention yet that would choose the best of field to represent the class. Only title and author really fall into that category of having a "main" value. I wonder if, perhaps, the simplest solution is to augment config.search_class with a "representative_field" column that points to one CMF row. Then a view could certainly be constructed to limit the values return to only those of "representative_field" provenance. Have an inverse view for "the rest", and the view you have now for the combination of them both.

As for reporting, things are a little more complicated. There are some array-type fields for ISBN and such which would need more consideration, but for (nominally) unique-per-record fields, the existing reporting fields could be augmented to provide better data, for sure.

Thoughts?

Revision history for this message

Bill Erickson (berick) wrote on 2013-11-18:

#7

"representative_field" is exactly what I was picturing, but was hesitant to suggest it, since most field classes won't need it.

+1 from me as it solves that problem once and for all.

--

Agreed that creating a display_field-driven analog to simple_record would be a step in the right direction, which I think is what you were suggesting. It's not clear to me how to cram multi-value fields into a single column any better than what we already have with arrays (or if indeed we need to for reporting), so I can't really comment there.

Revision history for this message

Bill Erickson (berick) wrote on 2013-11-18:

#8

I've pushed some code to add support for representative_field for review.

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-18:

#9

Bill, I think this is certainly an area of need, and the direction you have chosen is definitely viable. I have concerns about continually increasing disk space requirements, but if this ultimately does wholly replace reporter.material_simple_record, that problem is ameliorated.

That said (and I am only raising this point for discussion), I wonder if we should be thinking bigger. In some parts of the code, we transform to MODS, which is often only sort of what we need, then try to further manipulate the MODS output (through normalizers or other stock functions) to get a final result. In other places (especially the OPAC), we have given up on MODS, and instead deal with the MARC directly.

This approach has some known weaknesses:
1) Duplication of logic in multiple places (reporter, MVR, OPAC)
2) Inconsistency when new logic is not propagated completely
3) Not every tool is equally suited to manipulate XML/MARC data

While it has come up a few times in various past discussions, I would like to propose that we reconsider adopting a custom XSL approach, which would mean at least one and perhaps many XSL stylesheets dedicated to MARC data extraction, for indexing, display, or either/both.

In this case, I am imagining that the data we would want in various client interfaces is exactly what is already in the results list of the OPAC. It would therefore seem very beneficial for the data extraction and selection logic to be consolidated into a single place. And, while not a holy grail, XSL happens to be really good at transforming XML data, especially when speed is a factor.

In some ways, we are already doing this, as our MODS XSL slowly drifts away from the original standard version. What I am asking is that we be more explicit about it, and also that perhaps we also separate some logic into some more special-purpose sheets. I know there will be some concerns about multiple transforms, but the cost of the transforms are very cheap compared to many things we already do, and in fact are even cheaper when the XSL is trim and focused. In a basic test on my aging DB server, a simplified in-DB transform which did a MODS-style title and author extraction only went at a clip of around 1,000 records per second.

Again, this is all just for discussion, as I haven't done enough testing (or thinking) to wholesale endorse any of this. Also, I know there are plenty of in-between approaches, and I recognize that this proposal isn't even necessarily in conflict with what we are doing here, but in fact could be done in a more or less complementary way. Still, it appears we find ourselves defining best practices for display of fundamental things like 'title' and 'author', so I think we should consider moving that discussion up (down?) a level into the MARCXML transformation logic itself.

Thoughts?

Bill, I think this is certainly an area of need, and the direction you have chosen is definitely viable.  I have concerns about continually increasing disk space requirements, but if this ultimately does wholly replace reporter.material_simple_record, that problem is ameliorated.

That said (and I am only raising this point for discussion), I wonder if we should be thinking bigger.  In some parts of the code, we transform to MODS, which is often only sort of what we need, then try to further manipulate the MODS output (through normalizers or other stock functions) to get a final result.  In other places (especially the OPAC), we have given up on MODS, and instead deal with the MARC directly.

This approach has some known weaknesses:
1) Duplication of logic in multiple places (reporter, MVR, OPAC)
2) Inconsistency when new logic is not propagated completely
3) Not every tool is equally suited to manipulate XML/MARC data

While it has come up a few times in various past discussions, I would like to propose that we reconsider adopting a custom XSL approach, which would mean at least one and perhaps many XSL stylesheets dedicated to MARC data extraction, for indexing, display, or either/both.

In this case, I am imagining that the data we would want in various client interfaces is exactly what is already in the results list of the OPAC.  It would therefore seem very beneficial for the data extraction and selection logic to be consolidated into a single place.  And, while not a holy grail, XSL happens to be really good at transforming XML data, especially when speed is a factor.

In some ways, we are already doing this, as our MODS XSL slowly drifts away from the original standard version.  What I am asking is that we be more explicit about it, and also that perhaps we also separate some logic into some more special-purpose sheets.  I know there will be some concerns about multiple transforms, but the cost of the transforms are very cheap compared to many things we already do, and in fact are even cheaper when the XSL is trim and focused.  In a basic test on my aging DB server, a simplified in-DB transform which did a MODS-style title and author extraction only went at a clip of around 1,000 records per second.

Again, this is all just for discussion, as I haven't done enough testing (or thinking) to wholesale endorse any of this.  Also, I know there are plenty of in-between approaches, and I recognize that this proposal isn't even necessarily in conflict with what we are doing here, but in fact could be done in a more or less complementary way.  Still, it appears we find ourselves defining best practices for display of fundamental things like 'title' and 'author', so I think we should consider moving that discussion up (down?) a level into the MARCXML transformation logic itself.

Thoughts?

Revision history for this message

Bill Erickson (berick) wrote on 2013-11-21:

#10

The type of customization and configurability you're describing, Dan, is fully supported by the display_field code. We're adding a display_xpath column for storing display-specific value extraction. That XPATH can be run against MARCXML, MODS, or any number of local custom style sheets (config.xml_transform). One or more custom style sheets that do what the TPAC does (and more) would be a great addition to this code and would ideally be developed in tandem to reduce re-index churn.

FWIW, field values can also be run through index normalizers.

As far as the space goes, I do think we can find some savings, e.g. reporter.simple_record, but we may also be trading some disk space for speed (pre-extracted data is a lot faster than on-the-fly MVR generation) and configurability.

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-21:

#11

Thanks for the response, Bill. I know my post was a little long-winded and hedged, so I'll to try to respond and summarize my views at the same time.

1) My main push was to get us to consider officially adopting some custom stylesheets as part of this development. It sounds like you are on board with the idea, so I can definitely put some man-power behind that to try and get something going in tandem.

2) My secondary (unclear) goal was to try and advocate for some amount of optimization and/or refactoring as we are all looking at this code again. I know that config.metabib_field is very customizable, and therefore capable of doing what we need. As you said, it was largely a matter of just duplicating the facet extraction code. However, I have concerns about following this path of least resistance, as it burdens an already slow process with still more to do.

On our system, we can reingest right around 10 records per second on a single thread. As noted in my previous post, I can extract display fields on the fly at a rate of around 1000 per second, also on a single thread. Obviously, the reingest does much more work, but 100 times more? What are the bottlenecks, and how do they apply to display field extraction? Time will tell as more testing gets done, but I am afraid if we keep this addition too generic, we might leave a lot of performance on the table.

It certainly makes sense to leverage existing infrastructure, but given that display fields are potentially a much simpler beast, perhaps we should also ensure that we can provide a path through the code which is properly streamlined to take advantage of this simplicity.

Revision history for this message

Bill Erickson (berick) wrote on 2013-11-21:

#12

Thanks for clarifying, Dan. I can't really speak to the ingest internals, at least not without doing some research first. It seems likely there's room for optimization, but I'm not sure.

Is it fair to say that if this feature produced a net zero change in ingest speed, that your concerns would be addressed? Or, do you see this feature as something that must be addressed as a separate process from ingest so that it can run significantly faster and, if so, why exactly? (The obvious benefit would be quicker propagation of display field modifications, but I don't know if that justifies using a separate structure).

I'm curious, Dan, in your 1k records / second test, were you writing output to database rows or to STDOUT?

Revision history for this message

Bill Erickson (berick) wrote on 2013-11-21:

#13

I was curious, so I ran some tests in my desktop dev vm w/ a stock Postgres 9.1 install. I ran batches of ingests of 100 records, repeated 10 times, the first batch without display field extraction and the second with:

Total Time: 74540.091 ms — no display fields
Total Time: 77274.786 ms — with display fields

If my math is right, that's a 3.7% increase in overall time. Or, roughly an additional minute for every 30 minutes of ingest time.

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-21:

#14

Bill, I do think display field changes will happen considerably more frequently and more commonly than index/browse/facet field changes, but it will probably still be not terribly often in the grand scheme. So, you are right, that alone probably doesn't justify a separate structure.

I guess the opportunity I am feeling is that I believe display extraction code can be done very simply if we allow the XSL to do the heavy lifting, especially if we place limits on the resulting transformed structure. If an experimental effort to write such code proved successful, it would not only provide a possible path for fast display field propagation, but also inform ways we might tweak the existing extraction process for more general speed gains.

As I said to start, I am just opening the discussion and "testing the waters" on this idea. If there is enough interest and/or no clear opposition, I can produce some code (which would probably clarify things and generate more discussion) and see if this has any real legs, or is just a dead-end. Thoughts?

P.S. Yes, the 1k/second was just a quick and dirty "on the fly" extraction test to STDOUT, no writing of rows. I know that writing rows will slow that to some degree :)

Revision history for this message

Bill Erickson (berick) wrote on 2013-11-21:

#15

If you have a structure in mind, I'm very curious. Faster propagation of display fields would be a great thing to have. (Ditto a faster DB upgrade for this feature). Thanks for keeping the thoughts rolling, Dan.

Revision history for this message

Kathy Lussier (klussier) wrote on 2014-04-07:

#16

We have a lot of interest in this code. Is this something that is ready for testing or is everyone still working out ways and means?

Thanks!
Kathy

Revision history for this message

Mike Rylander (mrylander) wrote on 2014-04-08:

#17

I'd like to spit-ball on a couple things above...

1) Speed concerns - I agree with having the capability to do a display-only reingest. We can leverage the existing infrastructure that uses flags, and extend it with a wrapper function that targets specifically a display reingest. (Likewise, a browse reingest.) With that, we have one code path (internally) during full ingest of a record, and a way to invoke display-only from the outside. In fact, this may tie into my thoughts on queued reingest, where the reingest might be typed (optionally) to allow faster processing. We might not gain /all/ possible speed increases, but I think it's likely we'll gain most of them.

2) Space concerns - I think we should consider leveraging something akin to the browse data and uncontrolled attribute value tables, which store unique, post-normalization values to which other, slimmer tables link. There is the option of reducing the number of secondary XPath expressions, and thus collapsing, say, facets, browse data and display data into a single table, and having all three use cases share the same underlying strings. I'm not sure about that, though .... there are cases where facets need to be different due to syntax constraints on search (today). Also, keeping the underlying structures separate means it's safer and easier to do parallel search+facet/browse/display ingests.

Thoughts?

Kathy Lussier (klussier) on 2014-07-18

Changed in evergreen:
status:	New → Triaged
importance:	Undecided → Wishlist

Revision history for this message

Mike Rylander (mrylander) wrote on 2014-09-25:

#18

I've pushed a master-rebased version of this branch (no conflicts!) and will be digging into it a bit more soon. Branch here: http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp1251394-metabib-display-fields

Changed in evergreen:
assignee:	Bill Erickson (erickson-esilibrary) → Mike Rylander (mrylander)

Mike Rylander (mrylander) on 2015-02-05

Changed in evergreen:
assignee:	Mike Rylander (mrylander) → nobody

Revision history for this message

Kathy Lussier (klussier) wrote on 2015-02-19:

#19

We had somebody at MassLNC who wanted to take a look at this branch, and I loaded it on a Sandbox this morning. However, after loading it, I get an Internal Server error when navigating to the OPAC.

I don't see any errors in the Apache logs (not sure if we have any Apache logging configured), but I've pasted the errors I see in the osrfsys log at http://pastebin.com/8QQLKa99.

Revision history for this message

Mike Rylander (mrylander) wrote on 2015-02-19:

#20

Kathy,

The paste suggests that either the DB load failed (all those "schema does not exist" errors) or the opensrf.xml file was pointing at the wrong database. Since this is an all-DB feature so far, the former is more likely. Where there any errors captured during the database creation?

Revision history for this message

Kathy Lussier (klussier) wrote on 2015-02-19:

#21

The current error I'm seeing during database creation is:

psql:002.schema.config.sql:188: ERROR: function config.metabib_representative_field_is_valid(integer, text) does not exist
HINT: No function matches the given name and argument types. You might need to add explicit type casts.

Revision history for this message

Dan Wells (dbw2) wrote on 2015-02-20:

#22

'representative_field' (and its check function) create circular references between metabib_class and metabib_field. The most straightforward fix for the fresh DB setup is to deal with those fields separate from the initial table creation and field insertion.

Working branch:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbwells/lp1251394-metabib-display-fields

working/user/dbwells/lp1251394-metabib-display-fields

Revision history for this message

Terran McCanna (tmccanna) wrote on 2016-11-15:

#23

This one hasn't gotten any attention in a while - does it need a pullrequest tag?

Revision history for this message

Mike Rylander (mrylander) wrote on 2016-11-16:

#24

Terran,

No, there's certainly enough drift that it will need some work. It's rather low priority ATM, also, since making use of it everywhere will be a large undertaking, and disruptive to ongoing work like the web staff client.

Revision history for this message

Bill Erickson (berick) wrote on 2017-04-26:

#25

I was curious about the state of this, so I rebased the most recent branch, updated a few functions to match current master, did some light squashing, and added a sign-off for Dan's SQL fix commit:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/berick/lp1251394-metabib-display-fields-rebase-circa-2.12

Code still works. And it's still fast (and very satisfying) to do stuff like this and get a list of display fields for a bib record:

srfsh# request open-ils.pcrud open-ils.pcrud.search.mfde "ANONYMOUS", {"source":1}

(Now even faster w/ anon-pcrud mode -- ~10ms).

What's our bar for getting this merged? If we want to demonstrate its use in live code, I can think of a few web staff UI's that would be fairly simple to port to display fields. (I'm thinking specifically of a few that use normalized reporter data for title, author, etc., display). In any event, I can help push this along if there's interest / feedback on what's left.

Other TODO items include PGTAP tests and release notes.

Revision history for this message

Galen Charlton (gmc) wrote on 2017-04-27:

#26

Thanks for rebasing, Bill.

I think it would be better if the cmf.display_field column defaulted to FALSE and that we supplied some stock cmf definitions that are specifically tuned for display; this would reduce potential requirements conflict between setting up definitions for indexing versus rules for generating human-readable display labels.

For example: the "author" of a work could be taken from author|personal, author|corporate, and so forth. While the set of author|* cmf rows is fine for indexing, a single display field definition that extracts the first mods:name (or however we choose to set it up) would be easier to use.

As a minimal set, I suggest title, author, and publication date (which is particularly interesting given that nowadays both the 260 and 264 must be consulted). We might even go so far as create a 'display' cmf field class.

That's not to say that I think it should be verboten to have a cmf row that's used for both search/browse/facet generation and a display field, but I do think that existing stock definitions should be checked individually before turning on display_field.

One thing that I think would be nice to have but which I don't consider a blocker for merging: a database function to perform a display field reingest that takes bib ID and a list of display field IDs.

Revision history for this message

Galen Charlton (gmc) wrote on 2017-04-27:

#27

(Also, I can and will help out a bit with the seed data)

Revision history for this message

Bill Erickson (berick) wrote on 2017-05-01:

#28

Thanks for the feedback, Galen. Agreed on all points. Seed data help would certainly be appreciated.

I don't have a strong opinion on creating a "display" field class, but if we do we'll need an alternate way to specify what class each display field is part of. As it stands, config.metabib_field.field_class is how we know that a display field is a title, an author, etc.

On that note, there's another wrinkle I want to iron. One of the goals for this bug is to provide the basis for replacing MVR's. This will mean, for example, creating display fields for ISBN's and allowing UI's to grab displayable ISBN's. As it stands, I don't believe the code supports that use case.

It can say "give me the representative title" or "give me all of the displayable identifier fields", but not "give me the list of displayable ISBN's". That is, unless the caller assumes the field has a certain name, like "isbn", which is not a controlled vocabulary at this point. Are we ready to make the leap to defining a well-known base set of field names for display fields?

In any event, a display-focused reingest function will make all future steps easier, so I'll look at that. We can change the function guts as needed.

Revision history for this message

Galen Charlton (gmc) wrote on 2017-05-02: Re: [Bug 1251394] Re: Metabib Display Fields

#29

> It can say "give me the representative title" or "give me all of the
> displayable identifier fields", but not "give me the list of displayable
> ISBN's". That is, unless the caller assumes the field has a certain
> name, like "isbn", which is not a controlled vocabulary at this point.
> Are we ready to make the leap to defining a well-known base set of field
> names for display fields?

I think we should -- it would be preferable to using opaque IDs to
refer to them.

Looking at it from the POV of replacing MVRs, we've got several fields
that don't fit into the author/title/subject/identifier/series field
classes, including publisher, pubdate, edition, and toc. While I'm
not strongly attached to the notion of using 'display' as a new field
class, if we added it, we could refer to fields like this:

display|representative_contributor
display|title
display|pubdate

and so forth.

> In any event, a display-focused reingest function will make all future
> steps easier, so I'll look at that. We can change the function guts as
> needed.

Great, thanks!

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-05-02:

#30

I don't see a problem with the template's conception of field names being required to match extant CMF rows. We're effectively doing that by hard coding xpath to extract fields now in any case.

[thinking...] Extending the representative field concept to point through a new table mapping from the display field controlled-names to actual fields (a la config.metabib_search_alias) seems perfectly doable to me. IOW, we put a table between config.metabib_class.representative_field and CMF that looks like:

CREATE TABLE config.display_field_map (
  id serial,
  name text,
  cmf int references config.metabib_field (id),
  multi bool default false
);

The templates would reference config.display_field_map.name, we would construct a "display fields" object with config.display_field_map.name as the keys, and values of, perhaps, serialized json containing a scalar where config.display_field_map.multi is false or an array where it's true. That can, at the SQL level, be of type JSON (available in future-required PG versions) and built with JSON contruction functions, also available. Templates would just JSON.parse() (in angular interfaces that use pcrud) or JSON2perl() (in OPAC-style templates) to reconstitute the data.

In fact, we could use a view over metabib.display_field_entry to construct these objects for use by both the templates (they fetch directly by record id), and the simple-record view which can simply de-JSON-ify each field in a predictable way by looking at the mapping table (bonus: including nicer formatting that removes the {}s on arrays).

The first view would be a 3-column view (record id, field name, json data) and the simple-record view would pivot specific rows from that to a wide projection with prescribed columns.

This provides us vocab control, programmatic mapping, and backward compatibility, I think...

With that, I don't see a good argument for creating a new display class -- we have the catch-all keyword class, and can just map through to that.

Thoughts?

I don't see a problem with the template's conception of field names being required to match extant CMF rows.  We're effectively doing that by hard coding xpath to extract fields now in any case.

[thinking...] Extending the representative field concept to point through a new table mapping from the display field controlled-names to actual fields (a la config.metabib_search_alias) seems perfectly doable to me.  IOW, we put a table between config.metabib_class.representative_field and CMF that looks like:

CREATE TABLE config.display_field_map (
  id serial,
  name text,
  cmf int references config.metabib_field (id),
  multi bool default false
);

The templates would reference config.display_field_map.name, we would construct a "display fields" object with config.display_field_map.name as the keys, and values of, perhaps, serialized json containing a scalar where config.display_field_map.multi is false or an array where it's true.  That can, at the SQL level, be of type JSON (available in future-required PG versions) and built with JSON contruction functions, also available.  Templates would just JSON.parse() (in angular interfaces that use pcrud) or JSON2perl() (in OPAC-style templates) to reconstitute the data.

In fact, we could use a view over metabib.display_field_entry to construct these objects for use by both the templates (they fetch directly by record id), and the simple-record view which can simply de-JSON-ify each field in a predictable way by looking at the mapping table (bonus: including nicer formatting that removes the {}s on arrays).

The first view would be a 3-column view (record id, field name, json data) and the simple-record view would pivot specific rows from that to a wide projection with prescribed columns.

This provides us vocab control, programmatic mapping, and backward compatibility, I think...

With that, I don't see a good argument for creating a new display class -- we have the catch-all keyword class, and can just map through to that.

Thoughts?

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-05-02:

#31

Also, the new table described above could (should?) inform the ingest process about whether it should attempt to store multiple values from a display_field CMF, forcing some structure on "bad" records.

Revision history for this message

Bill Erickson (berick) wrote on 2017-05-04:

#32

Hey Mike, I think this sounds good. This gets us all the way back, as alluded, to a reporter.simple_record replacement.

Am I right to assume the 3-column view could contain both controlled fields plus any locally added custom fields, while the simple-record view would only include the controlled fields (a la reporter.simple_record)? That is, unless someone locally modified the view and its IDL class.

And of course we no longer need the representative_field column now, since the purpose of each field will be implicit in its name.

For the 3-column view, it sounds like we're essentially morphing the "mfde" IDL (view) class from the working branch, teaching it to understand config.display_field_map / multi, to JSON-ify the data, and to drop a few unneeded fields. And I assume the "name" column would be config.display_field_map.name, which we'll require to be unique.

Revision history for this message

Bill Erickson (berick) wrote on 2017-05-11:

#33

Pushed a few commits to:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/berick/lp1251394-metabib-display-fields-rebase-circa-2.12

* Adds the config.display_field_map table. Using the 'name' column as the primary key.

* Tweaks metabib.flat_display_entry view to return JSON scalars and arrays. It's a 4-column view, including the 'field' (config.metabib_field) column so that fleshing display labels (for example) does not require an intermediate flesh of the config.display_field_map.

One concern about my implementation of metabib.flat_display_entry is how/when the data is grouped and JSON-ified. It's not clear to me whether PG is smart enough to avoid JSON-ifying /all the values/ when fetching rows for a single bib record. Comments appreciated there.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=blob;f=Open-ILS/src/sql/Pg/upgrade/XXXX.schema.metabib-display-field.sql;h=2d69e817ad3d7c8880c19363824b7694fec8a439;hb=8c025c8861d26f6f21455f7e47504609a12ad90f#l20

* Adds a few stub seed data values to the YYYY file for testing.

* Adds a ZZZ.UNDO.*.sql file for reverting all SQL changes added by the XXX and YYYY files. Very useful for testing. This will eventually go away.

* Beware the base schema is not in sync w/ the update scripts.

Revision history for this message

Bill Erickson (berick) wrote on 2017-05-12:

#34

Fleshing this out a little more. With an eye toward making data retrieval as flexible as possible, given this data will be used by a variety of clients, here's a breakdown of the current way clients can fetch the display data.

1. metabib.display_entry (table)

Thin table of display entry values.

2. metabib.flat_display_entry (view)

One row per value, combined with data from config.metabib_field and config.display_field_map. Useful for clients that want a list of values and field metadata, like field names and labels.

3. metabib.compressed_display_entry (view)

Same as flat_display_entry, except there's only 1 row per config.display_field_map, i.e. one row per field type. Display entry values are returned as JSON arrays for multi=true fields and JSON scalars for multi=false fields.

4. metabib.wide_display_entry (view)

Analogous to reporter.simple_record.

One row per source (bib record) with a column per well-known field. As with compressed_display_entry, values are JSON-ified and returned as arrays or scalars, depending on multi-ness.

===

Next steps are to flesh out the stock display field configuration data, then finalize the columns for metabib.wide_display_entry.

Still using user/berick/lp1251394-metabib-display-fields-rebase-circa-2.12 for now. Base schema is still out of sync.

Revision history for this message

Bill Erickson (berick) wrote on 2017-05-12:

#35

And last for today, a proof of concept using wide_display_entry as a drop-in replacement for reporter.simple_record in the browser client Items Out list:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=commitdiff;h=ed348aef5384e368010cbf3d1154bf8ab75e857c

(BTW, some of these views could have better names. At some point I get lazy w/ names).

Revision history for this message

Bill Erickson (berick) wrote on 2017-05-17:

#36

Rebased to current master. Pushed a commit adding an egBibDisplay Angular service that can de-JSON-ify the wide_display_entry data and present flat_display_entries data in various formats, which is useful because flat display entries are not beholden to the IDL structure -- they include all display fields.

Revision history for this message

Bill Erickson (berick) wrote on 2017-06-09:

#37

After experimenting and sleeping on this a bit, I have some thoughts.

Adding the new config.display_field_map table means also adding code/documentation for a new interface to manage the table. It also means staff have to visit two UI's to add a new display field. The new table only gives us one column: multi. (It also gives us a 'name' column, but that's not strictly necessary -- more below). This seems like a lot for one column.

I'd like to revisit Galen's suggestion of adding a new "display" metabib class. I see a few benefits to this:

1. It creates a clear separation between display fields and index, etc. fields. I think in practice, few metabib_field's would be both a display and index field and separating them from the start will make them easier to manage and faster to re-ingest for display-only ingests.

2. We can use the config.metabib_field.name column however we need to support named fields, since the unique restriction on names is per-metabib class. I.e. we don't need config.display_field_map.name.

3. No new config.display_field_map table would be required, so no new interface is required. We can add a new display_multi:bool field to config.metabib_field and drop the 2 new display_field and display_xpath fields, since they will be covered by metabib_class='display' and the existing 'xpath' column.

My only concern is whether adding a new metabib class (which we rarely do) will have unexpected consequences in the code.

Thoughts appreciated.

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-06-10:

#38

Download full text (4.9 KiB)

Hi Bill,

First, thanks for continually pushing this forward. I personally really appreciate it.

One of the near-term future benefits of display fields, which I'm hoping to work on in the next few months, is the ability to highlight search terms in the fields that were actually searched and matched by the user query. If we add display-centered adornments to the existing searched/browsed/faceted fields, then we know, for certain, that a matching field for search (or browse, or facet) is what we need to highlight inside of for display. A separate display class, though, would keep us from having a 1-to-1 correspondence between search and display.

We could add a different, external mapping table to tie search and display fields together, but then we're back to having a mapping table and we lose the direct knowledge.

I don't actually agree that display and search fields will be mostly orthogonal. In fact, display will be primarily a subset of search. We may certainly display less than we search (think: subfield h of 245), but I don't think we'd ever show more than we'd allow searching on, except in circumstances that would likely be covered by existing (or useful) facet or browse fields. For instance, facets are a step toward display fields already, in that the values we extract for facet limiting are the same we use for display of facet data. The only facet field in stock that is not also a search field is the authority id facet. Browse is in a similar position: there are 5 of 17 browse fields that are not search fields, but one is because of a sorting need special to browse and could probably be addressed without the separate field, and the others are driven by how cataloging is performed rather than separate needs for search/browse as performed by a user per se.

I do see where you're coming from WRT reingest, but I think we have much bigger fish to fry in that area. Separating out display fields for that reason will not make any of the existing TODOs any easier, and they still need design and doing. (I think queued ingest along with a wholesale rewrite of the ingest process will be needed, and I hope that starts sooner rather than later, but that's a separate concern and I don't think it should have much weight on this bug's direction.) That said, I do think special care should be taken to make sure that display-only reingest is as efficient as possible on day one.

WRT config, I don't see the UI issue the way you describe. It may be just a difference in design philosophy, but I don't see why we need to have a UI per table. I think the existing (or, perhaps, angularized) CMF config UI can be made to deal with display mapping. When configuring a field, if display_field is true, offer a modal action that creates or changes the mapping. We save a UI, and lead the user to what they need at the time they need it. And as far as having a mapping table in place, it will be tiny and not factor into performance (not that you claimed it would, I just want to make that clear.)

Also, I'm glad you mention the code/maintenance effect of adding a new class. Currently, the classes are not special -- they can all be used for any purpose, an...

Hi Bill,

First, thanks for continually pushing this forward.  I personally really appreciate it.

One of the near-term future benefits of display fields, which I'm hoping to work on in the next few months, is the ability to highlight search terms in the fields that were actually searched and matched by the user query.  If we add display-centered adornments to the existing searched/browsed/faceted fields, then we know, for certain, that a matching field for search (or browse, or facet) is what we need to highlight inside of for display.  A separate display class, though, would keep us from having a 1-to-1 correspondence between search and display.

We could add a different, external mapping table to tie search and display fields together, but then we're back to having a mapping table and we lose the direct knowledge.

I don't actually agree that display and search fields will be mostly orthogonal.  In fact, display will be primarily a subset of search.  We may certainly display less than we search (think: subfield h of 245), but I don't think we'd ever show more than we'd allow searching on, except in circumstances that would likely be covered by existing (or useful) facet or browse fields.  For instance, facets are a step toward display fields already, in that the values we extract for facet limiting are the same we use for display of facet data.  The only facet field in stock that is not also a search field is the authority id facet.  Browse is in a similar position: there are 5 of 17 browse fields that are not search fields, but one is because of a sorting need special to browse and could probably be addressed without the separate field, and the others are driven by how cataloging is performed rather than separate needs for search/browse as performed by a user per se.

I do see where you're coming from WRT reingest, but I think we have much bigger fish to fry in that area.  Separating out display fields for that reason will not make any of the existing TODOs any easier, and they still need design and doing.  (I think queued ingest along with a wholesale rewrite of the ingest process will be needed, and I hope that starts sooner rather than later, but that's a separate concern and I don't think it should have much weight on this bug's direction.)  That said, I do think special care should be taken to make sure that display-only reingest is as efficient as possible on day one.

WRT config, I don't see the UI issue the way you describe.  It may be just a difference in design philosophy, but I don't see why we need to have a UI per table.  I think the existing (or, perhaps, angularized) CMF config UI can be made to deal with display mapping.  When configuring a field, if display_field is true, offer a modal action that creates or changes the mapping.  We save a UI, and lead the user to what they need at the time they need it.  And as far as having a mapping table in place, it will be tiny and not factor into performance (not that you claimed it would, I just want to make that clear.)

Also, I'm glad you mention the code/maintenance effect of adding a new class.  Currently, the classes are not special -- they can all be used for any purpose, and their semantic importance is only defined by the data mapping that humans require.  That is, we need (at least) 6 "slot types" and then an arbitrary set of named "keys" within each slot.  Elevating the display fields to a slot type, rather than a way of using the data (is it a facet? can I search it? can I browse it?) significantly changes the where the "specialness" lies.  (Of course, there is higher level code that cares about classes, but it's fine for higher level code to care.  Code at the level of CMF should /not/ treat classes differently, only do with the data what the configuration tells it to do.)

Put another way, instead of saying "I can search, browse, facet, and display values in key=FOO of slot=BAR" as we do today, we'd be saying "I can do those things for these 6 slots, but slot 7 is special and I need to act differently with that one."

I think one of the strengths of the Evergreen codebase is that we have managed to maintain a pretty high degree of conceptual layering purity, and once you understand the /concept/ of, say, "there are classes into which we separate data from bib records, and within those classes there are named fields that you can configure for use in search/browse/faceting", then you can both configure the system and, potentially, extend the system in new ways that don't break old assumptions.  And, yes, there are tons of details to CMF and friends configuration, and we haven't been perfect in this regard (we assume USMARC), but those details are /the same/ for all cases within the problem domain.  I see making a new class for a special purpose that breaks those assumptions as increasing the total maintenance cost and (more importantly) conceptual burden on current and future devs.

So, those are my thoughts.  Thanks, if you read this far... :)

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-06-12:

#39

Also, I should note that I think the code as-is is pretty darn close to commit-ready, IMO.

I want to think more about if/how we might start fast-pathing all non-full reingests -- hopefully we can shard what we have and put pieces back together as compossible units without incurring too much overhead (extra UDF calls) or doing too much repeated work (XML re-parsing), while also making each part simpler. I'll try to find some time in the next couple weeks to look at that ... I'm hopeful, but not overly optimistic, that it's doable with restricted tuits.

Mike Rylander (mrylander) on 2017-08-18

Changed in evergreen:
assignee:	nobody → Mike Rylander (mrylander)

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-08-18:

#40

I've rebased this to master and added some reingest streamlining. I got rid of the apparently redundant field definitions, but if they had a reason to live, please let me know about that. Branch at:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1251394-display-fields

Revision history for this message

Bill Erickson (berick) wrote on 2017-08-18:

#41

Thanks, Mike. Streamlining++

The redundant definitions were added as a step toward addressing Galen's concerns from #26, i.e. that some stock index definitions will not match what we want to use for display data. The XPATH is identical because we never got around to actually creating the display-friendly definitions.

Also, why the return of the 'representative_field' column? That should no longer be needed with the new display_field_map table. If someone wants the title, it should be enough to ask for the display_field_map row whose name is 'title'.

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-08-18:

#42

Updated with minor fixes. I'm pullrequesting this now...

Bill, do you think you'll have time to check this out soon-ish?

TIA

Changed in evergreen:
assignee:	Mike Rylander (mrylander) → nobody
tags:	added: pullrequest webstaffclient

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-08-18:

#43

Re the representative_field, I grabbed it because it was in the baseline schema ... my bad, I didn't realize it was going away TBH.

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-08-18:

#44

I'll get rid of the rep-field stuff in the upgrade and baseline, and adjust the baseline seed data a bit with display_xpath and joiners.

FWIW, I think we can drop the subject|topic display mapping ... the complete one (after setting the joiner to --) is really what we want, I think, to start with anyway.

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-08-18:

#45

I've updated the branch as described.

Other than making sure that all the new views are represented somewhere in the baseline schema, I think it's pretty much reified.

For reference, regarding streamlining of reingest, I'm getting about 100 records/second for display-only reingest as shown in the YYYY upgrade file on a slow VM and a small data set.

Revision history for this message

Bill Erickson (berick) wrote on 2017-08-24:

#46

Thanks, Mike! I'm going to look at this tomorrow.

Bill Erickson (berick) on 2017-08-25

Changed in evergreen:
assignee:	nobody → Bill Erickson (berick)

Revision history for this message

Bill Erickson (berick) wrote on 2017-08-25:

#47

Here be another branch:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/berick/lp-1251394-display-fields-review-08-25

Stuff I did:

copied some remaining bits into the base schema / seed data
other misc. cleanup
replaced wide entry references to 'topic_subject' with the new 'creators' field.
added display_xpath to 'author' field to upgrade script
remove ZZZZ.UNDO script
made the reingest call more generic in case we make more field changes (and for release notes)
expanded inline documentation for webstaff egBibDisplay service
remove webstaff items-out proof-of-concept code (see below)
sign-off's for Mike's commits.
release notes

TODO:
More sign-offs needed.

Other thoughts:

From where I'm sitting, the code itself is ready. More eyes are great, obviously, but I think it's there. The seed data is very slim, though. As it stands, the data we have is only useful in addition to other data, for example it's not yet a full replacement for reporter.simple_record. (This is why I removed the browser client display code proof-of-concept, since it has the effect of reducing the number of bib fields in the grid).

Having said that, though, I'm OK with merging as-is, with the understanding we'll be expanding the seed data and using it in the code, coming soon to a release near you. If we go this route, we should probably remove (comment-out) the display reingest from the upgrade script, since it's probably not yet warranted.

Bill Erickson (berick) on 2017-08-25

Changed in evergreen:
assignee:	Bill Erickson (berick) → nobody

Mike Rylander (mrylander) on 2017-09-01

Changed in evergreen:
assignee:	nobody → Mike Rylander (mrylander)

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-09-01:

#48

Merged to master! And the people did feast upon the lambs, and sloths, and carp, and anchovies, and orangutans, and breakfast cereals, and fruit bats, and large chu...

Thanks, Dan(s), Bill, and all!

Changed in evergreen:
assignee:	Mike Rylander (mrylander) → nobody
status:	Triaged → Fix Committed

Andrea Neiman (aneiman) on 2017-10-02

Changed in evergreen:
status:	Fix Committed → Fix Released

Galen Charlton (gmc) on 2017-10-02

Changed in evergreen:
milestone:	none → 3.0-beta

Evergreen

Metabib Display Fields

Bug Description

Other bug subscribers

Remote bug watches