reading facet connections in parallel is extremely slow

Bug #928342 reported by Marco Morandini
Affects: DOLFIN
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: 1.1.0

Bug Description

The attached patch (against 1.0.x) is an attempt to fix the most evident scalability issues
(mostly by replacing std::find on vectors, but also addressing a more complex problem at the end of MeshPartitioning.h).
On my PC it reduces the initial startup time for a 321668-element mesh, about half an hour, by more than an order of magnitude.
Please check carefully that the first part of the patch, for MeshDistributed, is correct: entity_indices
has a lot of repeated values, and by using a simple set I drop them all together in a single pass
if they are in global_index. Is this correct?
Any advice is welcome.
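
For illustration only, a minimal sketch of the idea (with hypothetical container types and names standing in for the real DOLFIN data structures, not the patch itself):

#include <cstddef>
#include <set>
#include <vector>

// entity_indices contains many repeated values; global_index was
// previously searched with std::find on a std::vector.
void mark_known_entities(const std::vector<std::size_t>& entity_indices,
                         const std::set<std::size_t>& global_index)
{
  // Building a std::set drops the repeated values in a single pass.
  std::set<std::size_t> unique_entities(entity_indices.begin(),
                                        entity_indices.end());

  // Each membership test is now O(log n) instead of an O(n) scan of a vector.
  for (std::set<std::size_t>::const_iterator it = unique_entities.begin();
       it != unique_entities.end(); ++it)
  {
    if (global_index.count(*it) > 0)
    {
      // entity is in global_index: handle it once, not once per duplicate
    }
  }
}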

Thanks,

Marco

Revision history for this message
Marco Morandini (morandini) wrote :
Revision history for this message
Garth Wells (garth-wells) wrote : Re: [Bug 928342] Re: reading facet connections in parallel is extremely slow

Thanks for the patch. I'll take a look. It may be worth also testing
boost::unordered_set/map.
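
Something like this minimal sketch, assuming the container is only used for insertion and membership tests (names are illustrative only, not from the patch):

#include <cstddef>
#include <boost/unordered_set.hpp>

// Drop-in replacement for std::set when ordering is not needed:
// average O(1) insert and lookup instead of O(log n).
boost::unordered_set<std::size_t> global_index;

// Usage is unchanged:
//   global_index.insert(i);
//   if (global_index.count(i) > 0) { ... }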

Garth

On 7 February 2012 16:21, Marco Morandini <email address hidden> wrote:
> ** Patch added: "diff.log"
>   https://bugs.launchpad.net/bugs/928342/+attachment/2721968/+files/diff.log

Revision history for this message
Garth Wells (garth-wells) wrote : Re: [Bug 928342] [NEW] reading facet connections in parallel is extremely slow

On 7 February 2012 16:21, Marco Morandini <email address hidden> wrote:
> Public bug reported:
>
> The attached patch (against 1.0.x.) is an attempt to fix the more evident scalability issues
> (mostly replacing std::find on vectors, but also with a more complex problem at the end of MeshPartitioning.h).
> It reduces the initial startup time of about half an hour, required for a 321668 elements mesh, by more than an order of magnitude on my pc.

This sounds strange. Do you have compiler optimisation turned on? The code

#include <dolfin.h>
using namespace dolfin;
int main()
{
  UnitCube mesh(64, 64, 64);
  cout << dolfin::MPI::sum(mesh.num_cells()) << endl;
}

builds and distributes a Mesh with > 1.5M cells in 32s on my laptop
with 4 MPI processes. It seems strange that a mesh with 321668 cells
could require more than ~10s.

The massive bottleneck that I see, and have seen for some time, is the call to

  // Generate facet - cell connectivity if not generated
  mesh.init(D - 1, D);

from inside BoundaryComputation::compute_boundary_common. Of the 32s
total runtime, 23s is spent on this function call.

I'm hesitant to investigate any mesh optimisations before looking at
the init functions since in my experience their cost dwarfs the other
steps.

Garth


Revision history for this message
Marco Morandini (morandini) wrote :

>This sounds strange.

My initial report was missing a test case and an analysis and, worst of all, the
title was wrong. It should have been something like "reading a facet mesh function
from a file in parallel is extremely slow".
Apologies for that.

1) Test case: run the following code first with a single process, in order to
build the input file, and then in parallel.

#include <dolfin.h>
using namespace dolfin;
int main(void)
{
  UnitCube mesh(50, 50, 50);
  if (MPI::num_processes() == 1)
  {
    MeshFunction<uint> mfc(mesh, 2, 0);
    File mfcf("mesh_facet_markers.xml.gz");
    mfcf << mfc;
  }
  MeshFunction<uint> mfc(mesh, "mesh_facet_markers.xml.gz");
  cout << dolfin::MPI::sum(mesh.num_cells()) << endl;
  return 0;
}

I'm getting this on my laptop:
marco@darkstar:~/Programmi/Dolphin> time ./TestCase && time mpirun -np 2 TestCase
750000

real 0m25.556s
user 0m24.827s
sys 0m0.665s
Process 0: Number of global vertices: 132651
Process 0: Number of global cells: 750000
Process 0: Partitioned mesh, edge cut is 6083.
Process 1: Partitioned mesh, edge cut is 6083.
Process 0: 375030
Process 1: 374970

real 16m55.749s
user 33m39.199s
sys 0m6.482s

2) Analysis.
2.1) MeshPartitioning.h line 223:
std::find on a vector is an
O(global_entity_indices.size()) operation,
and it is performed inside a loop spanning [0, ldata.size()),
so roughly global_entity_indices.size() * ldata.size() operations in total.
The proposed solution brings this down to N*log(N) operations on average.
I agree with you that hashed containers should be investigated,
to check that the better scalability
they could offer does not cost too much for
smaller meshes.

2.2) MeshPartitioning.h lines 245-273:
The second loop, at line 251, is performed only to check
the condition at line 260. If I've understood correctly,
for the test case the condition is true
only four times for each run of the whole loop
(i.e. for the four facets of each cell); the loop can be
avoided by pre-building a map (or hashed container) map_of_ldata,
as in the sketch after this list.
2.3) MeshDistributed.cpp lines 89 and 145: same problem as 2.1:
std::find on a vector performed inside a loop.
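
A minimal sketch of the idea in 2.2, with hypothetical types and names (the real ldata layout in MeshPartitioning.h is different); the point is to build the lookup table once instead of re-scanning ldata inside the outer loop:

#include <cstddef>
#include <map>
#include <vector>

struct LocalData { std::size_t key; int value; };  // hypothetical entry type

void process_facets(const std::vector<LocalData>& ldata,
                    const std::vector<std::size_t>& facet_keys)
{
  // Build the map once: O(ldata.size() * log(ldata.size()))
  std::map<std::size_t, int> map_of_ldata;
  for (std::size_t i = 0; i < ldata.size(); ++i)
    map_of_ldata[ldata[i].key] = ldata[i].value;

  // One cheap lookup per facet replaces a full scan of ldata,
  // since the condition is true only for a handful of facets.
  for (std::size_t i = 0; i < facet_keys.size(); ++i)
  {
    std::map<std::size_t, int>::const_iterator it =
      map_of_ldata.find(facet_keys[i]);
    if (it != map_of_ldata.end())
    {
      // condition that was previously detected by scanning ldata
    }
  }
}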

Revision history for this message
Garth Wells (garth-wells) wrote :

Could you post your "mesh_facet_markers.xml.gz" or upload it somewhere. We can use it as a benchmark to measure performance improvements.

Revision history for this message
Marco Morandini (morandini) wrote : Re: [Bug 928342] Re: reading facet connections in parallel is extremely slow

> Could you post your "mesh_facet_markers.xml.gz" or upload it somewhere.
> We can use it as a benchmark to measure performance improvements.
>

I'm out of office right now. I could post it tomorrow; in the meantime it
should be enough to do a serial run of the test case I provided in comment #4
to generate the file on your disk.
Let me know if you still need it.

Revision history for this message
Marco Morandini (morandini) wrote :

> Could you post your "mesh_facet_markers.xml.gz" or upload it somewhere.

http://www.aero.polimi.it/morandini/Misc/DolfinFiles/mesh_facet_markers.xml.gz : facet markers built using the test case from comment #4 (mesh: UnitCube(50, 50, 50))

www.aero.polimi.it/morandini/Misc/DolfinFiles/naca0012.xml.gz : mesh for a naca0012 wing
www.aero.polimi.it/morandini/Misc/DolfinFiles/naca0012_face_markers.xml.gz : face markers for the naca wing mesh

Feel free to re-distribute the mesh.
Please let me know if you have problems downloading them and when I can take them off the website.

Thanks,

Marco

Revision history for this message
Garth Wells (garth-wells) wrote :

I'm having trouble applying this patch. Could you push a branch to Launchpad, e.g.

    bzr push lp:~morandini/dolfin/facet-connections

and request a merge. I won't apply this change to 1.0, so please request a merge against lp:dolfin and make sure that you're up-to-date with the lp:dolfin tip.

Changed in dolfin:
status: New → Confirmed
importance: Undecided → Medium
milestone: none → 1.1.0
Changed in dolfin:
status: Confirmed → Fix Committed
Changed in dolfin:
status: Fix Committed → Fix Released