LinearOperator does not work in parallel

Bug #1088175 reported by Garth Wells
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
DOLFIN
Fix Released
Medium
Anders Logg

Bug Description

LinearOperator is not written to work in parallel. It does not respect the parallel layout of matrices and vectors. This leads to crashes.

The unit test has been disabled in parallel.

Changed in dolfin:
assignee: nobody → Anders Logg (logg)
Revision history for this message
Anders Logg (logg) wrote : Re: [Bug 1088175] [NEW] LinearOperator does not work in parallel

I can take a look but object to the description. It is indeed written
to work in parallel. It has also been tested to work in parallel
before.

--
Anders

On Sun, Dec 09, 2012 at 01:00:59PM -0000, Garth Wells wrote:
> Public bug reported:
>
> LinearOperator is not written to work in parallel. It does not respect
> the parallel layout of matrices and vectors. This leads to crashes.
>
> The unit test has been disabled in parallel.
>
> ** Affects: dolfin
> Importance: Undecided
> Assignee: Anders Logg (logg)
> Status: New
>
> ** Changed in: dolfin
> Assignee: (unassigned) => Anders Logg (logg)
>

Revision history for this message
Garth Wells (garth-wells) wrote :

On Sun, Dec 9, 2012 at 1:20 PM, Anders Logg <email address hidden> wrote:
> I can take a look but object to the description. It is indeed written
> to work in parallel. It has also been tested to work in parallel
> before.
>

Tested where?

It's clear from the code that it will only work in parallel for very
special cases.

Garth

> --
> Anders
>
>
> On Sun, Dec 09, 2012 at 01:00:59PM -0000, Garth Wells wrote:
>> Public bug reported:
>>
>> LinearOperator is not written to work in parallel. It does not respect
>> the parallel layout of matrices and vectors. This leads to crashes.
>>
>> The unit test has been disabled in parallel.
>>
>> ** Affects: dolfin
>> Importance: Undecided
>> Assignee: Anders Logg (logg)
>> Status: New
>>
>> ** Changed in: dolfin
>> Assignee: (unassigned) => Anders Logg (logg)
>>
>
> --
> You received this bug notification because you are a member of DOLFIN
> Core Team, which is subscribed to DOLFIN.
> https://bugs.launchpad.net/bugs/1088175
>
> Title:
> LinearOperator does not work in parallel
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/dolfin/+bug/1088175/+subscriptions

Revision history for this message
Anders Logg (logg) wrote :

On Sun, Dec 09, 2012 at 01:27:50PM -0000, Garth Wells wrote:
> On Sun, Dec 9, 2012 at 1:20 PM, Anders Logg <email address hidden> wrote:
> > I can take a look but object to the description. It is indeed written
> > to work in parallel. It has also been tested to work in parallel
> > before.
> >
>
> Tested where?

I thought the unit test was running fine on the buildbots but I see
now it was never added to the list of tests (in the main test.py).

--
Anders

> It's clear from the code that it will only work in parallel for very
> special cases.
> Garth
>
> >
> >
> > On Sun, Dec 09, 2012 at 01:00:59PM -0000, Garth Wells wrote:
> >> Public bug reported:
> >>
> >> LinearOperator is not written to work in parallel. It does not respect
> >> the parallel layout of matrices and vectors. This leads to crashes.
> >>
> >> The unit test has been disabled in parallel.
> >>
> >> ** Affects: dolfin
> >> Importance: Undecided
> >> Assignee: Anders Logg (logg)
> >> Status: New
> >>
> >> ** Changed in: dolfin
> >> Assignee: (unassigned) => Anders Logg (logg)
> >>
> >
> >
> > Title:
> > LinearOperator does not work in parallel
> >
> > To manage notifications about this bug go to:
> > https://bugs.launchpad.net/dolfin/+bug/1088175/+subscriptions
>

Revision history for this message
Garth Wells (garth-wells) wrote :

On Sun, Dec 9, 2012 at 2:01 PM, Anders Logg <email address hidden> wrote:
> On Sun, Dec 09, 2012 at 01:27:50PM -0000, Garth Wells wrote:
>> On Sun, Dec 9, 2012 at 1:20 PM, Anders Logg <email address hidden> wrote:
>> > I can take a look but object to the description. It is indeed written
>> > to work in parallel. It has also been tested to work in parallel
>> > before.
>> >
>>
>> Tested where?
>
> I thought the unit test was running fine on the buildbots but I see
> now it was never added to the list of tests (in the main test.py).
>

The unit test is a special case for which it may run on 3 processes
because the number of dofs is divisible by 3. In general, the parallel
layouts will not match.

Garth

> --
> Anders
>
>> It's clear from the code that it will only work in parallel for very
>> special cases.
>> Garth
>>
>> >
>> >
>> > On Sun, Dec 09, 2012 at 01:00:59PM -0000, Garth Wells wrote:
>> >> Public bug reported:
>> >>
>> >> LinearOperator is not written to work in parallel. It does not respect
>> >> the parallel layout of matrices and vectors. This leads to crashes.
>> >>
>> >> The unit test has been disabled in parallel.
>> >>
>> >> ** Affects: dolfin
>> >> Importance: Undecided
>> >> Assignee: Anders Logg (logg)
>> >> Status: New
>> >>
>> >> ** Changed in: dolfin
>> >> Assignee: (unassigned) => Anders Logg (logg)
>> >>
>> >
>> >
>> > Title:
>> > LinearOperator does not work in parallel
>> >
>> > To manage notifications about this bug go to:
>> > https://bugs.launchpad.net/dolfin/+bug/1088175/+subscriptions
>>
>
> --
> You received this bug notification because you are a member of DOLFIN
> Core Team, which is subscribed to DOLFIN.
> https://bugs.launchpad.net/bugs/1088175
>
> Title:
> LinearOperator does not work in parallel
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/dolfin/+bug/1088175/+subscriptions

Revision history for this message
Anders Logg (logg) wrote :

On Sun, Dec 09, 2012 at 02:32:19PM -0000, Garth Wells wrote:
> On Sun, Dec 9, 2012 at 2:01 PM, Anders Logg <email address hidden> wrote:
> > On Sun, Dec 09, 2012 at 01:27:50PM -0000, Garth Wells wrote:
> >> On Sun, Dec 9, 2012 at 1:20 PM, Anders Logg <email address hidden> wrote:
> >> > I can take a look but object to the description. It is indeed written
> >> > to work in parallel. It has also been tested to work in parallel
> >> > before.
> >> >
> >>
> >> Tested where?
> >
> > I thought the unit test was running fine on the buildbots but I see
> > now it was never added to the list of tests (in the main test.py).
> >
>
> The unit test is a special case for which it may run on 3 processes
> because the number of dofs is divisible by 3. In general, the parallel
> layouts will not match.

It doesn't even work for that case. So what happened is that I thought
the test was run on the buildbot and therefore worked in parallel. I
see now why it doesn't work but it should be relatively easy to fix.

--
Anders

> Garth
>
> >
> >> It's clear from the code that it will only work in parallel for very
> >> special cases.
> >> Garth
> >>
> >> >
> >> >
> >> > On Sun, Dec 09, 2012 at 01:00:59PM -0000, Garth Wells wrote:
> >> >> Public bug reported:
> >> >>
> >> >> LinearOperator is not written to work in parallel. It does not respect
> >> >> the parallel layout of matrices and vectors. This leads to crashes.
> >> >>
> >> >> The unit test has been disabled in parallel.
> >> >>
> >> >> ** Affects: dolfin
> >> >> Importance: Undecided
> >> >> Assignee: Anders Logg (logg)
> >> >> Status: New
> >> >>
> >> >> ** Changed in: dolfin
> >> >> Assignee: (unassigned) => Anders Logg (logg)
> >> >>
> >> >
> >> >
> >> > Title:
> >> > LinearOperator does not work in parallel
> >> >
> >> > To manage notifications about this bug go to:
> >> > https://bugs.launchpad.net/dolfin/+bug/1088175/+subscriptions
> >>
> >
> >
> > Title:
> > LinearOperator does not work in parallel
> >
> > To manage notifications about this bug go to:
> > https://bugs.launchpad.net/dolfin/+bug/1088175/+subscriptions
>

Revision history for this message
Anders Logg (logg) wrote :

On Sun, Dec 09, 2012 at 04:02:03PM +0100, Anders Logg wrote:
> On Sun, Dec 09, 2012 at 02:32:19PM -0000, Garth Wells wrote:
> > On Sun, Dec 9, 2012 at 2:01 PM, Anders Logg <email address hidden> wrote:
> > > On Sun, Dec 09, 2012 at 01:27:50PM -0000, Garth Wells wrote:
> > >> On Sun, Dec 9, 2012 at 1:20 PM, Anders Logg <email address hidden> wrote:
> > >> > I can take a look but object to the description. It is indeed written
> > >> > to work in parallel. It has also been tested to work in parallel
> > >> > before.
> > >> >
> > >>
> > >> Tested where?
> > >
> > > I thought the unit test was running fine on the buildbots but I see
> > > now it was never added to the list of tests (in the main test.py).
> > >
> >
> > The unit test is a special case for which it may run on 3 processes
> > because the number of dofs is divisible by 3. In general, the parallel
> > layouts will not match.
>
> It doesn't even work for that case. So what happened is that I thought
> the test was run on the buildbot and therefore worked in parallel. I
> see now why it doesn't work but it should be relatively easy to fix.

I've made an attempt to fix this but it still crashes. I'm
initializing the local range for the mat shell matrix using the same
local range as for the solution vector:

  // Get local range
  std::size_t m_local = M;
  std::size_t n_local = N;
  if (MPI::num_processes() > 1)
  {
    std::pair<std::size_t, std::size_t> local_range = x.local_range();
    m_local = local_range.first;
    n_local = local_range.second;
  }

  // Initialize PETSc matrix
  A.reset(new Mat, PETScMatrixDeleter());
  MatCreateShell(PETSC_COMM_WORLD, m_local, n_local, M, N, (void*) this, A.get());
  MatShellSetOperation(*A, MATOP_MULT, (void (*)()) usermult);

What am I missing? Does the solution vector not have the same parallel
layout as the right-hand side b (which gets multiplied by the matrix
in the Krylov solver)?

--
Anders

Revision history for this message
Garth Wells (garth-wells) wrote :

On Tue, Dec 11, 2012 at 11:21 PM, Anders Logg <email address hidden> wrote:
> On Sun, Dec 09, 2012 at 04:02:03PM +0100, Anders Logg wrote:
>> On Sun, Dec 09, 2012 at 02:32:19PM -0000, Garth Wells wrote:
>> > On Sun, Dec 9, 2012 at 2:01 PM, Anders Logg <email address hidden> wrote:
>> > > On Sun, Dec 09, 2012 at 01:27:50PM -0000, Garth Wells wrote:
>> > >> On Sun, Dec 9, 2012 at 1:20 PM, Anders Logg <email address hidden> wrote:
>> > >> > I can take a look but object to the description. It is indeed written
>> > >> > to work in parallel. It has also been tested to work in parallel
>> > >> > before.
>> > >> >
>> > >>
>> > >> Tested where?
>> > >
>> > > I thought the unit test was running fine on the buildbots but I see
>> > > now it was never added to the list of tests (in the main test.py).
>> > >
>> >
>> > The unit test is a special case for which it may run on 3 processes
>> > because the number of dofs is divisible by 3. In general, the parallel
>> > layouts will not match.
>>
>> It doesn't even work for that case. So what happened is that I thought
>> the test was run on the buildbot and therefore worked in parallel. I
>> see now why it doesn't work but it should be relatively easy to fix.
>
> I've made an attempt to fix this but it still crashes.

If you're using test/unit/la/python/LinearOperator.py to test, the line:

    b = Vector(V.dim())

will certainly cause a problem. We have the function
GenericMatrix::.resize(GenericVector&x, std::size_t dim) to re-size a
vector consistently with a matrix operator,

Garth

> I'm
> initializing the local range for the mat shell matrix using the same
> local range as for the solution vector:
>
> // Get local range
> std::size_t m_local = M;
> std::size_t n_local = N;
> if (MPI::num_processes() > 1)
> {
> std::pair<std::size_t, std::size_t> local_range = x.local_range();
> m_local = local_range.first;
> n_local = local_range.second;
> }
>
> // Initialize PETSc matrix
> A.reset(new Mat, PETScMatrixDeleter());
> MatCreateShell(PETSC_COMM_WORLD, m_local, n_local, M, N, (void*) this, A.get());
> MatShellSetOperation(*A, MATOP_MULT, (void (*)()) usermult);
>
> What am I missing? Does the solution vector not have the same parallel
> layout as the right-hand side b (which gets multiplied by the matrix
> in the Krylov solver)?
>
> --
> Anders
>
> --
> You received this bug notification because you are a member of DOLFIN
> Core Team, which is subscribed to DOLFIN.
> https://bugs.launchpad.net/bugs/1088175
>
> Title:
> LinearOperator does not work in parallel
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/dolfin/+bug/1088175/+subscriptions

Revision history for this message
Anders Logg (logg) wrote :

On Wed, Dec 12, 2012 at 08:58:26AM -0000, Garth Wells wrote:
> On Tue, Dec 11, 2012 at 11:21 PM, Anders Logg <email address hidden> wrote:
> > On Sun, Dec 09, 2012 at 04:02:03PM +0100, Anders Logg wrote:
> >> On Sun, Dec 09, 2012 at 02:32:19PM -0000, Garth Wells wrote:
> >> > On Sun, Dec 9, 2012 at 2:01 PM, Anders Logg <email address hidden> wrote:
> >> > > On Sun, Dec 09, 2012 at 01:27:50PM -0000, Garth Wells wrote:
> >> > >> On Sun, Dec 9, 2012 at 1:20 PM, Anders Logg <email address hidden> wrote:
> >> > >> > I can take a look but object to the description. It is indeed written
> >> > >> > to work in parallel. It has also been tested to work in parallel
> >> > >> > before.
> >> > >> >
> >> > >>
> >> > >> Tested where?
> >> > >
> >> > > I thought the unit test was running fine on the buildbots but I see
> >> > > now it was never added to the list of tests (in the main test.py).
> >> > >
> >> >
> >> > The unit test is a special case for which it may run on 3 processes
> >> > because the number of dofs is divisible by 3. In general, the parallel
> >> > layouts will not match.
> >>
> >> It doesn't even work for that case. So what happened is that I thought
> >> the test was run on the buildbot and therefore worked in parallel. I
> >> see now why it doesn't work but it should be relatively easy to fix.
> >
> > I've made an attempt to fix this but it still crashes.
>
> If you're using test/unit/la/python/LinearOperator.py to test, the line:
>
> b = Vector(V.dim())
>
> will certainly cause a problem. We have the function
> GenericMatrix::.resize(GenericVector&x, std::size_t dim) to re-size a
> vector consistently with a matrix operator,

Yes, that line was obviously wrong. I've replaced it with a proper
right-hand side now in the unit test but it still breaks.

It looks like I need to initialize the local rows and columns separately:

http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatCreateShell.html

The local rows are determined by the result vector y and the local
columns by the vector x in y = Ax. But in a Krylov solve, the result
vector y will again be used to multiply the matrix, so it looks to me
that it needs to be the same?

--
Anders

Revision history for this message
Anders Logg (logg) wrote :

On Wed, Dec 12, 2012 at 10:39:37AM +0100, Anders Logg wrote:
> On Wed, Dec 12, 2012 at 08:58:26AM -0000, Garth Wells wrote:
> > On Tue, Dec 11, 2012 at 11:21 PM, Anders Logg <email address hidden> wrote:
> > > On Sun, Dec 09, 2012 at 04:02:03PM +0100, Anders Logg wrote:
> > >> On Sun, Dec 09, 2012 at 02:32:19PM -0000, Garth Wells wrote:
> > >> > On Sun, Dec 9, 2012 at 2:01 PM, Anders Logg <email address hidden> wrote:
> > >> > > On Sun, Dec 09, 2012 at 01:27:50PM -0000, Garth Wells wrote:
> > >> > >> On Sun, Dec 9, 2012 at 1:20 PM, Anders Logg <email address hidden> wrote:
> > >> > >> > I can take a look but object to the description. It is indeed written
> > >> > >> > to work in parallel. It has also been tested to work in parallel
> > >> > >> > before.
> > >> > >> >
> > >> > >>
> > >> > >> Tested where?
> > >> > >
> > >> > > I thought the unit test was running fine on the buildbots but I see
> > >> > > now it was never added to the list of tests (in the main test.py).
> > >> > >
> > >> >
> > >> > The unit test is a special case for which it may run on 3 processes
> > >> > because the number of dofs is divisible by 3. In general, the parallel
> > >> > layouts will not match.
> > >>
> > >> It doesn't even work for that case. So what happened is that I thought
> > >> the test was run on the buildbot and therefore worked in parallel. I
> > >> see now why it doesn't work but it should be relatively easy to fix.
> > >
> > > I've made an attempt to fix this but it still crashes.
> >
> > If you're using test/unit/la/python/LinearOperator.py to test, the line:
> >
> > b = Vector(V.dim())
> >
> > will certainly cause a problem. We have the function
> > GenericMatrix::.resize(GenericVector&x, std::size_t dim) to re-size a
> > vector consistently with a matrix operator,
>
> Yes, that line was obviously wrong. I've replaced it with a proper
> right-hand side now in the unit test but it still breaks.
>
> It looks like I need to initialize the local rows and columns separately:
>
> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatCreateShell.html
>
> The local rows are determined by the result vector y and the local
> columns by the vector x in y = Ax. But in a Krylov solve, the result
> vector y will again be used to multiply the matrix, so it looks to me
> that it needs to be the same?

Works now. The problem was this...

  m_local = local_range.first;
  n_local = local_range.second;

:-)

Now replaced by this:

  m_local = local_range_y.second - local_range_y.first;
  n_local = local_range_x.second - local_range_x.first;

A LinearOperator must now be initialized with vectors x and y matching
the product y = Ax. In most cases, it will work fine to use the same
vector (vector b or u.vector()) since they will have the same parallel
layout.

--
Anders

Changed in dolfin:
status: New → Fix Committed
importance: Undecided → Medium
milestone: none → 1.1.0
Changed in dolfin:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.