relations: some cleanup in get_relationship() code
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
webtrees |
Invalid
|
Undecided
|
Unassigned |
Bug Description
This is meant as a request for improvement, though I suspect a bug that will surface sooner or later.
I am reporting against 1.3.2
source file webtrees\
The relations diagram is the one I like the most, but also the one that gives me the most trouble (server limits) in execution time, database queries and memory usage. Maybe a few minor improvements can mean a lot. Here is what I read in the code.
1- families are marked visited ($visited[
(NOTE: I checked this again, the effect will not be that dramatic, as the same family is visited again only for a next generation, formerly acting as parent, now as child, or vice-versa. Insertion of code "if (isset(
2- Using visited families, end branches or dead leaves could be removed from both p1nodes and NODE_CACHE: if all of a persons families have been visited, no new relationship can be found throug that person, and all paths endiing with that person can be removed from memory.
3- NODE_CACHE is never emptied, not even when both person 1 and person 2 are different from the previous call, in which case the contents of NODE_CACHE are not relevant for that querie.
4- NODE_CACHE is underused:
a* all current calls to get_relationship() are with ignore_cache == true
b* relationship pid1 --> pid2 could be deduced from pid2 --> pid1 if present in NODE_CACHE (one might indeed want to reverse the parameter order, as in bug https:/
c* when no cached relationship pid1 --> pid2 is found, NODE_CACHE is only searched for relationships of pid1 to the children of pid2; depending on the direction of lineage between pid1 and pid2, parents of pid2 can be more relevant than pid2's children.
d* The same work as in c* could be done for checking the relationship to pid2, of children and parents of pid1.
5- I suspect a memory leak. For a 10-generation straight-line relationship I, printed out the complete NODE_CACHE and p1nodes as text, with full paths and relationship descriptions, which gives me a textfile of 280kB. Compared to a 1-generation relationship (me to my father), memory usage however, as reported by the web server, went from 35MB to 54MB, which is 19MB or a factor of 67 times more than the full print of the 2 big arrays.
6- NODE_CACHE_LENGTH is set only when no relationship is found, in the block at line 1161, depending on it's value compared to that of maxlength. NODE_CACHE_LENGTH is for the rest only referred to in the block at line 1120 , executed if ignore_cache is false and the relationship is not found in the cache. The code is
if ((!empty(
if ($NODE_
return false;
}
I strongly suspect this to be a bug, that has not surfaced yet because at the moment get_relationship() is called only with maxlength == 0 or 4
7- The algorithm used in get_relationship() draws ever growing circles around pid1, until that circle contains pid2. If R is the radius of that last circle, the amount of work done is proportional to its surcace pi*R^2. Another aproach is to draw equal sized but ever growing circles around both pid1 and pid2, till these circles intersect. That will be at radius R/2. The amount of work done will be proportional to 2*pi*(R/2)^2 == (pi*R^2) /2, which is half of the present algorithm. The complexity will of course be a bit larger, with frequent calls to the function in_arrayr($needle, $haystack)
description: | updated |
description: | updated |
description: | updated |
description: | updated |
<<4a* all current calls to get_relationship() are with ignore_cache == true>>
This is because the NODE_CACHE logic was broken.
It would only cache one path of each length. So, if there were four paths, with lengths 7,9,9 and 12, then it would only show three of them: 7, (the first) 9 and 12.
I disabled it, while I considered how to fix it.
Most of the cost is fetching the records from the database and checking their privacy. Iterating over spouses/ children/ parents is an inexpensive operation. These are cached elsewhere.
I suspect that the node-cache costs more to maintain than it saves.
IMHO, we should delete it.
<<5- I suspect a memory leak>>
For each individual/family on the path, we must load the record and calculate its privacy (which often means loading its close relatives).
I think this is more likely to cause the high memory usage, than a memory leak.