webtrees

relations: some cleanup in get_relationship() code

Bug #1072365 reported by Dirk Bosmans on 2012-10-28

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	webtrees	Invalid	Undecided	Unassigned

Bug Description

This is meant as a request for improvement, though I suspect a bug that will surface sooner or later.

I am reporting against 1.3.2
source file webtrees\includes\functions\functions.php, function get_relationship($pid1, $pid2, $followspouse=true, $maxlength=0, $ignore_cache=false, $path_to_find=0)

The relations diagram is the one I like the most, but also the one that gives me the most trouble (server limits) in execution time, database queries and memory usage. Maybe a few minor improvements can mean a lot. Here is what I read in the code.

1- families are marked visited ($visited[$family->getXref()] = true;) twice, on lines 1259 and 1303, but never checked and skipped when visited. In both cases, all spouses and all children of the retrieved family are visited, there is no need to do this again once the family has been visited. This will save an exponential amount (numchildren^2 I guess) of database access.
(NOTE: I checked this again, the effect will not be that dramatic, as the same family is visited again only for a next generation, formerly acting as parent, now as child, or vice-versa. Insertion of code "if (isset($visited[$family->getXref()])) continue;" did however reduce database queries by about 1/5th, in a testcase.)

2- Using visited families, end branches or dead leaves could be removed from both p1nodes and NODE_CACHE: if all of a persons families have been visited, no new relationship can be found throug that person, and all paths endiing with that person can be removed from memory.

3- NODE_CACHE is never emptied, not even when both person 1 and person 2 are different from the previous call, in which case the contents of NODE_CACHE are not relevant for that querie.

4- NODE_CACHE is underused:
   a* all current calls to get_relationship() are with ignore_cache == true
   b* relationship pid1 --> pid2 could be deduced from pid2 --> pid1 if present in NODE_CACHE (one might indeed want to reverse the parameter order, as in bug https://bugs.launchpad.net/webtrees/+bug/899785 )
   c* when no cached relationship pid1 --> pid2 is found, NODE_CACHE is only searched for relationships of pid1 to the children of pid2; depending on the direction of lineage between pid1 and pid2, parents of pid2 can be more relevant than pid2's children.
   d* The same work as in c* could be done for checking the relationship to pid2, of children and parents of pid1.

5- I suspect a memory leak. For a 10-generation straight-line relationship I, printed out the complete NODE_CACHE and p1nodes as text, with full paths and relationship descriptions, which gives me a textfile of 280kB. Compared to a 1-generation relationship (me to my father), memory usage however, as reported by the web server, went from 35MB to 54MB, which is 19MB or a factor of 67 times more than the full print of the 2 big arrays.

6- NODE_CACHE_LENGTH is set only when no relationship is found, in the block at line 1161, depending on it's value compared to that of maxlength. NODE_CACHE_LENGTH is for the rest only referred to in the block at line 1120 , executed if ignore_cache is false and the relationship is not found in the cache. The code is
   if ((!empty($NODE_CACHE_LENGTH))&&($maxlength>0)) {
   if ($NODE_CACHE_LENGTH>=$maxlength)
    return false;
  }
I strongly suspect this to be a bug, that has not surfaced yet because at the moment get_relationship() is called only with maxlength == 0 or 4

7- The algorithm used in get_relationship() draws ever growing circles around pid1, until that circle contains pid2. If R is the radius of that last circle, the amount of work done is proportional to its surcace pi*R^2. Another aproach is to draw equal sized but ever growing circles around both pid1 and pid2, till these circles intersect. That will be at radius R/2. The amount of work done will be proportional to 2*pi*(R/2)^2 == (pi*R^2) /2, which is half of the present algorithm. The complexity will of course be a bit larger, with frequent calls to the function in_arrayr($needle, $haystack)

See original description

Dirk Bosmans (dirk-bosmans) on 2012-10-28

description:	updated
description:	updated

Dirk Bosmans (dirk-bosmans) on 2012-10-28

description:

updated

Dirk Bosmans (dirk-bosmans) on 2012-10-28

description:

updated

Revision history for this message

fisharebest (fisharebest) wrote on 2012-10-28:

<<4a* all current calls to get_relationship() are with ignore_cache == true>>

This is because the NODE_CACHE logic was broken.

It would only cache one path of each length. So, if there were four paths, with lengths 7,9,9 and 12, then it would only show three of them: 7, (the first) 9 and 12.

I disabled it, while I considered how to fix it.

Most of the cost is fetching the records from the database and checking their privacy. Iterating over spouses/children/parents is an inexpensive operation. These are cached elsewhere.

I suspect that the node-cache costs more to maintain than it saves.

IMHO, we should delete it.

<<5- I suspect a memory leak>>

For each individual/family on the path, we must load the record and calculate its privacy (which often means loading its close relatives).

I think this is more likely to cause the high memory usage, than a memory leak.

Revision history for this message

fisharebest (fisharebest) wrote on 2012-10-28:

If we only wanted to find the shortest path, there are many improvements we could make.

But, we need to be able to find *all* possible paths between two individuals. For example, if you marry your cousin, webtrees must find the two relationships "cousin" and "wife".

The algorithm (I did not write it) appears to be based on Dijkstra's well-known shortest-path algorithm.

The logic for finding the next shortest path does not appear to be based on the traditional extensions to Dijkstra. Typically, one would exclude each node in turn from the path, and call the algorithm again.

I am not entirely sure how it was intended to work. (I know that it fails in many cases!)

It needs a complete rewrite....

Revision history for this message

Dirk Bosmans (dirk-bosmans) wrote on 2012-10-28:

<<For each individual/family on the path, we must load the record and calculate its privacy (which often means loading its close relatives).>>
As far as my understanding goes, the privacy check is bypassed with the WT_PRIV_HIDE parameter.

<<It needs a complete rewrite....>>
I may have bad manners in programming, but for the heavy work I go low level. So for maxlenght=0, I would load the complete set of person to family relationships from the link table at once, and do the rest with array indices instead of database access. Checking privacy can be done on the resulting path, and if rejected, just search another path.

Revision history for this message

fisharebest (fisharebest) wrote on 2012-10-28:

<<the privacy check is bypassed with the WT_PRIV_HIDE parameter.>>

Of course. You are right. That is the purpose of the parameter.

<<I would load the complete set of person to family relationships from the link table at once, and do the rest with array indices instead of database access.>>

I have already written a complete replacement - which runs as a MySQL stored procedure - using this technique. It creates a temporary table, based on the wt_links. Because we are using a database, each additional "circle" can be searched with a single DB query, so it is fairly efficient.

It also uses the new relationship/translation system that is described on the wiki.

I just need to convert it from MySQL to PHP.

Meanwhile, I have made a few adjustments to the algorithm. It has some heuristics, so that parent/child links have a lower cost than spouse/sibling links. This provides a 10% improvement for all the test-cases that I tried.

I have looked at the code using a profiling tool. Approx 20% of the time is spent in the function and 80% is spent in other functions. So, there should be opportunity to

I should be able to save a little more time/memory by replacing the array 'relations' with a string.

We always convert this array to a string, so it is easier to create the string initially.

Revision history for this message

meliza (meliza) wrote on 2012-11-01:

I agree that we should not show subsequent trivial relationships.
When I look on the svn demo site at i14-i5 (Camilla Rosemary Shand and Prince Andrew, Duke of York) I do not want to see the link between Prince Charles and Prince Andrew via their parents or siblings. These paths are not of interest.

Erroneous paths:
In path numbered 3 I see to the right Prince Andrew twice, the second time as his own husband?!?

1st path
The 1st shown path showing brother-in-law, (Camilla, Prince Charles and prince Andrew) is not numbered and I cannot return to it. The numbering starts only from the second path.

Missing path
On my site, when I look at the relationship between third cousins I2345 and I1, I see paths via I1's mother and others via the father.
When I look at relationships of I1-I2345, I see only relationships via the mother and then "No other link between the two individuals could be found.".

I have sent fisharebest e-mails on these issues.

Meliza

Revision history for this message

Dirk Bosmans (dirk-bosmans) wrote on 2012-11-01:

<< the link between Prince Charles and Prince Andrew via their parents or siblings>>

This could be caused by attributing to a direct path (edge) between siblings a length that is greater than twice the length attributed to a parent-to-child edge.

<< I see to the right Prince Andrew twice, the second time as his own husband?!?>>

This is another example of how royals behave according to a different morality than mortals. But in se there is nothing wrong with auto-marriage

Revision history for this message

kiwi (kiwi3685-deactivatedaccount) wrote on 2012-11-02:

Can you PLEASE discuss this on the forum. That way the wider community can be involved. They will not look here for a feature request or change, only for actual bugs.

This, as Dirk said at the start is not a bug in the true sense. Therefore it should not be here.

Revision history for this message

fisharebest (fisharebest) wrote on 2014-09-28:

The relationship code needs to be rewritten. Any further comments should be made at https://github.com/fisharebest/webtrees/issues/256