incorrect initialisation of detectors near partition boundaries in parallel

Bug #967093 reported by Hannah Hiester
Affects: Fluidity
Status: Fix Released
Importance: Low
Assigned to: Michael Lange

Bug Description

In parallel, when a detector is initialised in a region that spans 3 different partitions, it gets initialised to the origin rather than to the specified coordinate. If debugging is enabled, the simulation fails at the assert on line 1834 of Diagnostic_variables.F90.

This behaviour was seen initially in a run on cx2 and was reproduced with the attached files when run on 3 processors on dannii.


Revision history for this message
James Maddison (jamesmadd) wrote :

Hi Hannah,

create_single_detector looks suspect to me. The picker_inquire call has global = .false., which is a bit odd (presumably it's trying to guarantee that all containing elements are owned?). Then:

1377: if (.not.element_owned(xfield,element)) return

can return on all processes - local ownership inquiries can return different elements on different processes, and it's possible for all processes to determine that the node lies in their halo.
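
To make that failure mode concrete, here is a minimal, self-contained toy program. This is not Fluidity code: the element numbers and ownership flags are made up, and the loop over "processes" just stands in for what each rank would decide independently. It shows how a purely local inquiry followed by an ownership check can reject the detector on every process at once.

    ! Toy illustration only: on each "process", the locally found element
    ! is a halo element, so the ownership test fails everywhere and no
    ! process ever creates the detector.
    program lost_detector_sketch
      implicit none
      integer, parameter :: nprocs = 3
      ! Hypothetical per-process results of a local inquiry: the element
      ! each process found, and whether that process owns it.
      integer :: found_element(nprocs) = (/ 5, 9, 2 /)
      logical :: owned_locally(nprocs) = (/ .false., .false., .false. /)
      integer :: p
      logical :: created

      created = .false.
      do p = 1, nprocs
        ! Mirrors "if (.not.element_owned(xfield,element)) return":
        ! the process gives up if its candidate element is not owned.
        if (owned_locally(p)) created = .true.
      end do
      print *, 'locally found elements: ', found_element
      ! Prints F: the detector was silently dropped by all processes.
      print *, 'detector created on some process: ', created
    end program lost_detector_sketch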

Revision history for this message
Hannah Hiester (h-hiester07) wrote :

This seems to do the trick. I have discussed updating one of the current detector tests with Jon to account for this behaviour, so I will get that sorted and then hopefully commit.

Jon Hill (jon-hill)
Changed in fluidity:
assignee: nobody → Hannah Hiester (h-hiester07)
importance: Undecided → Low
Revision history for this message
Hannah Hiester (h-hiester07) wrote :

Does anyone know about initialising detectors from a binary file?

I was trying to parallelise the test binary_detectors so that this fix can be continually tested without needing a whole new test. However, it falls over when trying to initialise a detector array from a binary file. Is this a known issue and/or is there an easy fix?

Jon Hill (jon-hill)
Changed in fluidity:
status: New → Fix Committed
status: Fix Committed → In Progress
Revision history for this message
Mark Filipiak (mjf-q) wrote :

This bug also appears when using flredecomp instead of fldecomp to partition the mesh in the square-convection-parallel-trivial test.

There is a detector at position (0.5,0.5), which is a point shared by 6 elements. Flredecomp splits the domain into 2 parts through this point, so some elements are owned by one process and some by another.

The call to picker_inquire with global=.false. is OK in this case because there are haloes and there is enough information in each process to decide whether an element owns a detector. But in general, global should be .true. when running in parallel.
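
As a rough sketch of what a global inquiry has to achieve (this is not Fluidity's picker code, and the "rank 1 owns it" outcome is hard-coded purely for illustration): the processes must agree collectively on a single owner rather than each deciding locally.

    program global_agreement_sketch
      use mpi
      implicit none
      integer :: ierr, rank, my_claim, owner

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, rank, ierr)

      ! Hypothetical local result: pretend only rank 1 finds an element
      ! it actually owns; the others found nothing (or only halo elements).
      if (rank == 1) then
        my_claim = rank
      else
        my_claim = huge(0)
      end if

      ! All processes agree on one owner (the lowest claiming rank), so
      ! exactly one process initialises the detector.
      call mpi_allreduce(my_claim, owner, 1, mpi_integer, mpi_min, &
                         mpi_comm_world, ierr)
      if (rank == owner) print *, 'rank ', rank, ' initialises the detector'

      call mpi_finalize(ierr)
    end program global_agreement_sketch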

The error I find is not in create_single_detector(Diagnostic_variables.F90) but in node_owner_finder_find_multiple_positions(Node_Owner_Finder_Fortran.F90). In find_serial() or find_parallel(), the elements that might own the detector are found using external C++ code. Then there is a loop which exits as soon as an element that owns the detector is found. The order of elements in this loop can be different on different processes, so there is no guarantee that the same element is found on different processes when there are several elements that could own the detector. In the case I tested, this meant that process 1 found an element owned by process 2, and process 2 found an element owned by process 1; so the detector was lost.

To fix this, instead of coming out of the loop at the first match, you could choose the element with the lowest universal number. That should give the same element on all processes. However, 1/ I don't think that deals with 'closest_miss' cases, and 2/ there are calls to this subroutine for meshes that do not have haloes. The solution, I think, for the parallel case is to consider only elements that this process owns.
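
A minimal sketch of the owned-elements-only idea, with made-up candidate data rather than the real finder structures: because exactly one process owns the containing element, restricting the loop to owned elements makes the result independent of the candidate ordering.

    program owned_candidates_sketch
      implicit none
      integer, parameter :: ncand = 3
      ! Hypothetical candidates returned by the spatial search, with
      ! ownership and containment flags as this process sees them.
      integer :: candidate(ncand)    = (/ 12, 7, 31 /)
      logical :: owned(ncand)        = (/ .false., .true., .false. /)
      logical :: contains_det(ncand) = (/ .true., .true., .false. /)
      integer :: i, element

      element = -1
      do i = 1, ncand
        ! Skip halo (non-owned) elements: only the owning process can
        ! claim the detector, whatever order the candidates arrive in.
        if (.not. owned(i)) cycle
        if (contains_det(i)) then
          element = candidate(i)
          exit
        end if
      end do
      ! Prints 7 here; a non-owning process would be left with -1.
      print *, 'chosen element: ', element
    end program owned_candidates_sketch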

A patch is attached that 1/ makes the call in create_single_detector use global=.true. and 2/ modifies the find_parallel() case of node_owner_finder_find_multiple_positions to work in parallel.

find_serial() is used in several places and I don't know whether these calls are safe in parallel or not:
 1. interpolate_boundary_values(Geostrophic_Pressure.F90) needs to find the not-owned elements for its interpolation to work, I think. It is using the haloes as they should be used, to do an interpolation off-process without communication. However, it does interpolate to the halo region, which will not be correct.
 2. horizontal_picker(Vertical_Extrapolation.F90) also wants the not-owned elements, but I reckon that in general the vertical extrapolation (VerticalExtrapolation etc.) will not work for general meshes and general decompositions, where a surface element vertically above a point may be on a different process.
 3. distribute_detectors(Detectors_Parallel.F90) does not want not-owned elements, but probably should be doing a global check because of possible closest_miss elements.
 4. find_reference_pressure_node_from_coordinates(Boundary_Conditions_From_Options.F90) probably doesn't care because it does a check on the universal numbers afterwards (perhaps it never has meshes without halos defined?)

Changed in fluidity:
assignee: Hannah Hiester (h-hiester07) → Michael Lange (michael-lange)
Revision history for this message
James Maddison (jamesmadd) wrote :

The parallel detector algorithm is precision robust in parallel when using global=.true. However, as written it doesn't guarantee that the returned element is owned (it could be a halo element), and it assumes that the detector lies within the domain somewhere (i.e. the caller guarantees that it lies within some element). The former can be relaxed by populating the rtree with only owned elements, while the latter can be relaxed by checking the element local coordinates, and then checking the element-element list for the appearance of the domain boundary where a local coordinate is negative. Note that the latter requires the former to be implemented first, otherwise a detector could be found to lie outside the local partition when it actually lies on the edge of a halo.
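
For reference, the local-coordinate check mentioned above looks roughly like the following for a single 2D triangle (a standalone toy, not Fluidity's picker, which works with its own element types and tolerances): a negative local coordinate flags the point as lying outside the element, and which coordinate is negative indicates the face through which it leaves.

    program local_coord_sketch
      implicit none
      real :: x(2,3)      ! triangle vertex coordinates
      real :: p(2)        ! query point (e.g. a detector position)
      real :: l(3), detT
      real, parameter :: tol = -1.0e-6

      x(:,1) = (/ 0.0, 0.0 /)
      x(:,2) = (/ 1.0, 0.0 /)
      x(:,3) = (/ 0.0, 1.0 /)
      p      = (/ 0.25, 0.25 /)

      ! Barycentric (local) coordinates of p with respect to the triangle.
      detT = (x(2,2)-x(2,3))*(x(1,1)-x(1,3)) + (x(1,3)-x(1,2))*(x(2,1)-x(2,3))
      l(1) = ((x(2,2)-x(2,3))*(p(1)-x(1,3)) + (x(1,3)-x(1,2))*(p(2)-x(2,3))) / detT
      l(2) = ((x(2,3)-x(2,1))*(p(1)-x(1,3)) + (x(1,1)-x(1,3))*(p(2)-x(2,3))) / detT
      l(3) = 1.0 - l(1) - l(2)

      ! A (sufficiently) negative local coordinate means p lies outside
      ! this element; its position in l says which face it crosses.
      print *, 'local coords: ', l
      print *, 'contained:    ', all(l >= tol)
    end program local_coord_sketch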

> To fix, instead of coming out of the loop at the first match, you could choose the element with the lowest universal number.

I think that's unsafe: the element with the lowest universal number could in principle have a large negative local coordinate (a large "miss"), and hence clearly not contain the given detector. Note that decreasing the out-of-bounds tolerance doesn't really help to avoid this problem, as epsilon-ball approaches are inappropriate here.

> 1. interpolate_boundary_values(Geostrophic_Pressure.F90) needs to find the not-owned elements for its interpolation to work, I think.

This is a corner case where a local (process) inquiry is fine, as it is performing consistent interpolation of boundary values between matching meshes with different basis functions.

Revision history for this message
James Maddison (jamesmadd) wrote :

> A patch is attached that ...

The attached patch for femtools/Node_Owner_Finder_Fortran.F90 looks OK, although it's a little inefficient, as the rtree is populated with, and returns, unwanted non-owned elements.

Changed in fluidity:
status: In Progress → Fix Committed
Changed in fluidity:
status: Fix Committed → Fix Released