Frequent 'inner _print_tree' error

Bug #567505 reported by Steven Chan
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Medium
Jason Etheridge

Bug Description

On staff clients version 1.6.0.x and Evergreen version 1.6, we have been getting reports of the error whenever checkout stations need to print checkout receipts. The error is also reported on the Item Status and Check In screens. The commonality is the action to print a list of items. The errors occur frequently; at our largest site, they were experiencing the error on about 10% of checkouts. The error seems to persist across retries. When it occurs during checkout, the librarian has to ask the patron to go without a receipt or copy and paste it in another tool for printing. Some have reported that switching to the Items Out screen and printing the list there can work.

The error report on the screen complains about a Javascript typing error when trying to evaluate 'obj.columns[j]'. The error is caught in file trunk/Open-ILS/xul/staff_client/chrome/content/util/list.js, function _print_tree(), which is a method of the util.list class. Before being able to print, the internal data structure representing the items list needs to be dumped out. Internally, the list is a table of rows and columns held in a DOM tree. Dumping is done by the method _dump_tree_with_keys(). It executes the following code fragment:

for (var j = 0; j < treerow.childNodes.length; j++) {
    row[ obj.columns[j].id ] = treerow.childNodes[j].getAttribute('label');
}

The for loop needs to evaluate obj.columns[j], but the index j is too large, yielding an undefined value, and trying to evaluate an id property of an undefined yields a typing error. The for loop uses the length of each row array as the authority, but this can sometimes conflict with the length of the column headers, which is the obj.columns array. To fix the problem, we have patched the for statement as follows:

for (var j = 0; j < obj.columns[j].length; j++) {

A similar patch needs to be done for the methods _dump_tree_csv() and _dump_tree_selection_with_keys(). This will prevent other output actions (for example, Export, Copy to Clipboard) for an item list from suffering the same error.

To help diagnose the problem, we installed an error trap to dump out the list tree whenever the inner _print_tree_error occurs. The trap calls the _dump_tree() method, converts the object to JSON text and URI-encodes the text. It then sends the text as an HTTP GET request to an unhandled URL, where it is logged on the server in the Apache access log. See below for the error trap code:

'_debug': function (label, params) {
    var q = '?' + label + '=' + encodeURIComponent(js2JSON({ rows: this._dump_tree() }));
    try { new JSAN.Request().getText('/debug' + q); } catch (e) { ; }
 }

At the server, we recover the tree objects using a Perl program to parse the access log. (The parsing is complicated by having an error report broken into several possibly non-consecutive lines; also, some reports are truncated dued to a maximum limit imposed by the Apache logging facility.) Once parsed, we can inspect the tree data for every occurrence of the error. Here are several data samples, showing the number of columns in each row of checkout lists:

a. 47 47 47 47 94 47
b. 94 47 47 47 47 47 47 47
c. 94 94 47
d. 141 47 47 47 47
e. 0 47 94

The normal number of columns for a checkout list is 47, but cases a and b show rows in which they are doubled. Inspection of the data rows show they have a duplicate appended to the end. Case c shows a list with two duplicated rows. Case d shows a list with the first row having a triplicate (3 x 47 = 141). Case e shows a list with the first row missing data values completely and the third duplicated. Cases a and b are the most common; cases d and e are the least.

Before the file list.js was patched, we were trapping between 100 and 200 errors daily. After the patch, we saw traps of case e only, since our patch does not fix zero-length rows.

The error seems to have characteristics of a timing error or a race condition, occurring when the list of items is being incrementally built. It would be good to find the root cause. I also wonder whether the error has been reported by others.

Revision history for this message
Jason Etheridge (phasefx) wrote :

Steven, I really love the error trap scheme.

So with these cases that you say are duplicates, you mean, if a list should normally have:

<treeitem><treerow><treecell label="barcode 1" /></treerow></treeitem>
<treeitem><treerow><treecell label="barcode 2" /></treerow></treeitem>
<treeitem><treerow><treecell label="barcode 3" /></treerow></treeitem>

you might instead see:

<treeitem><treerow><treecell label="barcode 1" /><treecell label="barcode 1" /></treerow></treeitem>
<treeitem><treerow><treecell label="barcode 2" /></treerow></treeitem>
<treeitem><treerow><treecell label="barcode 3" /></treerow></treeitem>

?

For the case where you have

<treeitem<treerow></treerow></treeitem>

Do you know whether any data is being lost or if it's just a spurious row?

Would you be willing to try an experiment? I'm interested in seeing if commenting out all instances of obj.put_retrieving_label(treerow); (or putting an immediate return in that function) would make a difference. This shouldn't have any negative effects, just a cosmetic one (if rows are off-screen waiting to be rendered and suddenly come into view, they won't say "Retrieving..." while they flesh themselves).

There are just two functions that ultimately call document.createElement('treecell'); and that's one of them. The other is _map_row_to_treecell. If those are somehow being called multiple times for a given row before xulrunner gets a chance to update the DOM with the changes from that function, I can imagine redundant treecells being created. That'd be more complicated to test, so I want to eliminate the easy function first (put_retrieving_label).

Thanks!

-- Jason

Revision history for this message
Steven Chan (schan2) wrote :

Jason, I've just noticed your above comment.

The answer to question one is, yes, although I only see the tree object dumped out in JSON-like text. If the row has more than one cell, the first sequence of cells appear, then followed by one or more duplicated sequences.

The answer to question two is, probably spurious, thus no data loss. We probably would have had reports of print receipts not matching the screen display if there were data rows omitted from the printout. But this type of error is rare and the checkout attendant is often busy and so may not notice.

I'll see whether it is feasible to comment out obj.put_retrieving_label(treerow) in the production code. I think the ideal is to somehow trigger the error by simulating lots of inputs to the staff client, perhaps printing statuses of items.

Revision history for this message
Jason Etheridge (phasefx) wrote :

Since it's just XUL files, you could set things up where the changes are OPT-IN for volunteer testers, by using alternate hostnames, IP-based redirection, clients with special build id's, etc.

Revision history for this message
Ben Shum (bshum) wrote :

Hmm, I think we're seeing this problem with our system as well. Running 1.6.0.2 (with some modifications).

It's only affecting one of our library sites in particular, but they also have weird printer issues in general. The following is an example of the error messages we've seen. If it's unrelated, I apologize.

inner _print_tree

{
 "message":"obj.columns
 [
  j
 ]
  is undefined",
 "fileName":"http://server/xul/rel_1_6_0_2/server/main/JSAN.js",
 "lineNumber":1253,
 "stack":"((void 0))@http://server/xul/rel_1_6_0_2/server/main/JSAN.js:1253\n()@http://server/xul/rel_1_6_0_2/server/main/JSAN.js:1239\n()@http://server/xul/rel_1_6_0_2/server/main/JSAN.js:1370\n(5)@http://server/xul/rel_1_6_0_2/server/main/JSAN.js:987\n",
 "name":"TypeError"
}

Revision history for this message
Steven Chan (schan2) wrote :

Benjamin's error report shows that it is related to the same problem.

Revision history for this message
Bill Ott (bott) wrote :

This looked very promising, and initial results were very positive. Suddenly, we started seeing receipts with the template variables printed instead of the values they should have contained.

e.g.
%title%
Barcode: %barcode%
Due: %due_date%

Can relatively easily reproduce this by changing the for loop, but can not understand how one would be at all related to the other.

Revision history for this message
Jason Etheridge (phasefx) wrote :

Bill, to be clear, did you try just Steven's changes, or my experiment as well?

I can't imagine how the template variables not being interpolated would be related either.

Revision history for this message
Bill Ott (bott) wrote :

Jason, I ended up trying both. Together, and separately. From your experiment, I simply returned immediately from put_retrieving_label.

After adding an alert, I consistently received 47, much like Steven's debug info. Strangely, hard coding 47 into the loop resolved the template variable issue. Even though the result of obj.columns[j].length was consistently 47, using the calculated length results in the template variables getting printed.

I suspect that leaving the 47 in place would break other things that call this function.

Revision history for this message
Jason Etheridge (phasefx) wrote :

If you guys would, give this changeset a spin and see if it helps:
http://svn.open-ils.org/trac/ILS/changeset/16688

Revision history for this message
Bill Ott (bott) wrote :

Been testing it here with good results over the last couple days.

Revision history for this message
Jason Etheridge (phasefx) wrote :

Excellent, thanks. This has made its way into the 1.6.1 branch, so will be seen in a packaged release soon.

Revision history for this message
James Fournie (jfournie) wrote :

Hi,

I tried to backport this to 1.6.0 but I get this error when I do a patron search:

Please tell your friendly system administrator or software developer the following:

patron/search_result.xul Error: Could not create namespace[util.list]: SyntaxError: invalid label

I get other "Could not create namespace" errors doing other things in the staff client.

Is there any way to backport this to 1.6.0?

~James

Revision history for this message
Jason Etheridge (phasefx) wrote :

James, see if this patch does trick. If you can't find any problems, I'll commit it to rel_1_6_0. It bothers me a bit because the whitespace changes make it hard to read.

-- Jason

Revision history for this message
Jason Etheridge (phasefx) wrote :
Galen Charlton (gmc)
Changed in evergreen:
status: New → Fix Committed
assignee: nobody → Jason Etheridge (phasefx)
importance: Undecided → Medium
Revision history for this message
James Fournie (jfournie) wrote :

Fix released in 1.6.1

Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.