Bug #959052 “expand_indices does not scale” : Bugs : UFL

Revision history for this message

Johan Hake (johan-hake) wrote on 2012-03-19:

#1

visco.py Edit (5.1 KiB, text/x-python)

Martin Sandve Alnæs (martinal) on 2012-03-20

Changed in ufl:
importance:	Undecided → Critical
status:	New → In Progress
assignee:	nobody → Martin Sandve Alnæs (martinal)

Revision history for this message

Martin Sandve Alnæs (martinal) wrote on 2012-03-20:

#2

I have tried to improve this situation as follows.

1) Avoid caching repr strings for all non-terminal expressions. This reduces memory usage drastically; memory usage was something like O(n^2) in number of nodes in an expression. This makes repr(expr) slower, because it must be reconstructed each time.

2) Avoid using repr in __eq__ and some other places. This means a == b will now potentially have to compare all subexpressions of a and b, and == can thus be costly for large and (almost) equal a and b. But comparing their large repr strings would probably be almost as costly anyway.

3) Improve hashing algorithm. The new algorithm causes significantly fewer collisions, and thus reduces the need to call == a lot in algorithms depending on dicts or sets of expressions.

Together these improvements are quite significant. But it is difficult to foresee side effects, as the performance characteristics of core operations have changed quite a bit. Therefore I'm reluctant to just push, and want some more testing first. I have not run the ffc and dolfin tests yet, will do that tomorrow.

Anyone who wants to test should get this branch:
bzr pull lp:~martinal/ufl/work
You will also need a tiny patch to ffc:
bzr pull lp:~martinal/ffc/trunk

Any ideas for improvement to the hashing algorithm (in operatorbase.py) and the __eq__ implementation (in exproperators.py) are welcome!

In addition, I found an old attempt at improving expand_indices, which I may take a look at later to see if we can get further reductions in run time.

Another improvement could be to implement a more compact signature string to write to generated code instead of the repr string, which becomes quite large in this case.

Revision history for this message

Johan Hake (johan-hake) wrote on 2012-03-21:

#3

Nice the memory consumption for the visco case went down ~ 50%. However it is still pretty high... 2.8 GB instead of 4.6 GB.

Revision history for this message

Johan Hake (johan-hake) wrote on 2012-03-21:

#4

The repr string within the visco.py example might be huge, but the total string only takes some 18MB in memory.

Revision history for this message

Martin Sandve Alnæs (martinal) wrote on 2012-03-21:

#5

Status now is that I found some more optimizations today, giving significant memory savings for the Jacobi representation after expand_indices.

I accidentally pushed to trunk instead of my buildbot, so I hope the dolfin tests pass :)

It would be nice if someone with applications with complex forms can check the performance of the latest ufl. I don't think there are any assembly time changes, just preprocess/compile time.

In the long run, the most scalable solution would be to avoid using expand_indices, which will require some more index handling logic in the form compilers. I'm not going down that road at the moment, so for now lets hope at least these improvements open up some more doors.

Revision history for this message

Johan Hake (johan-hake) wrote on 2012-03-22: Re: [Bug 959052] Re: expand_indices does not scale

#6

Thanks Martin!

The memory now peaks at 1.2GB (from 4.8GB) and code for the jacobian
form is now generated within 4 min, as opposed to some 8 min using SFC.
FFC can still not generate optimized code, the memory peaks at 7GB and I
interrupted it after 10 min. When optimization is turned off the code is
generated but everything is flattened into a single line which does not
compile (gcc).

Some further optimization would be nice as 1.2GB is still quite large
but doable on most laptops these days.

Johan

On 03/21/2012 05:37 PM, Martin Sandve Alnæs wrote:
> Status now is that I found some more optimizations today, giving
> significant memory savings for the Jacobi representation after
> expand_indices.
>
> I accidentally pushed to trunk instead of my buildbot, so I hope the
> dolfin tests pass :)
>
> It would be nice if someone with applications with complex forms can
> check the performance of the latest ufl. I don't think there are any
> assembly time changes, just preprocess/compile time.
>
> In the long run, the most scalable solution would be to avoid using
> expand_indices, which will require some more index handling logic in the
> form compilers. I'm not going down that road at the moment, so for now
> lets hope at least these improvements open up some more doors.
>

Revision history for this message

Anders Logg (logg) wrote on 2012-03-22:

#7

On Wed, Mar 21, 2012 at 04:37:58PM -0000, Martin Sandve Alnæs wrote:
> Status now is that I found some more optimizations today, giving
> significant memory savings for the Jacobi representation after
> expand_indices.
>
> I accidentally pushed to trunk instead of my buildbot, so I hope the
> dolfin tests pass :)

Look good! :-)

--
Anders

> It would be nice if someone with applications with complex forms can
> check the performance of the latest ufl. I don't think there are any
> assembly time changes, just preprocess/compile time.
>
> In the long run, the most scalable solution would be to avoid using
> expand_indices, which will require some more index handling logic in the
> form compilers. I'm not going down that road at the moment, so for now
> lets hope at least these improvements open up some more doors.
>

Revision history for this message

Anders Logg (logg) wrote on 2012-03-22:

#8

On Thu, Mar 22, 2012 at 07:24:42AM -0000, Johan Hake wrote:
> Thanks Martin!
>
> The memory now peaks at 1.2GB (from 4.8GB) and code for the jacobian
> form is now generated within 4 min, as opposed to some 8 min using SFC.
> FFC can still not generate optimized code, the memory peaks at 7GB and I
> interrupted it after 10 min. When optimization is turned off the code is
> generated but everything is flattened into a single line which does not
> compile (gcc).

Why is the code flattened? I assume the quadrature representation is
used which shouldn't flatten loops.

--
Anders

> Some further optimization would be nice as 1.2GB is still quite large
> but doable on most laptops these days.
>
> Johan
>
> On 03/21/2012 05:37 PM, Martin Sandve Alnæs wrote:
> > Status now is that I found some more optimizations today, giving
> > significant memory savings for the Jacobi representation after
> > expand_indices.
> >
> > I accidentally pushed to trunk instead of my buildbot, so I hope the
> > dolfin tests pass :)
> >
> > It would be nice if someone with applications with complex forms can
> > check the performance of the latest ufl. I don't think there are any
> > assembly time changes, just preprocess/compile time.
> >
> > In the long run, the most scalable solution would be to avoid using
> > expand_indices, which will require some more index handling logic in the
> > form compilers. I'm not going down that road at the moment, so for now
> > lets hope at least these improvements open up some more doors.
> >
>

Revision history for this message

Johan Hake (johan-hake) wrote on 2012-03-22:

#9

On 03/22/2012 08:41 AM, Anders Logg wrote:
> On Thu, Mar 22, 2012 at 07:24:42AM -0000, Johan Hake wrote:
>> Thanks Martin!
>>
>> The memory now peaks at 1.2GB (from 4.8GB) and code for the jacobian
>> form is now generated within 4 min, as opposed to some 8 min using SFC.
>> FFC can still not generate optimized code, the memory peaks at 7GB and I
>> interrupted it after 10 min. When optimization is turned off the code is
>> generated but everything is flattened into a single line which does not
>> compile (gcc).
>
> Why is the code flattened? I assume the quadrature representation is
> used which shouldn't flatten loops.

When optimized is set to true in the ffc_options it is not flattened,
but that process is the one that fails to complete.

Johan

> --
> Anders
>
>
>> Some further optimization would be nice as 1.2GB is still quite large
>> but doable on most laptops these days.
>>
>> Johan
>>
>> On 03/21/2012 05:37 PM, Martin Sandve Alnæs wrote:
>>> Status now is that I found some more optimizations today, giving
>>> significant memory savings for the Jacobi representation after
>>> expand_indices.
>>>
>>> I accidentally pushed to trunk instead of my buildbot, so I hope the
>>> dolfin tests pass :)
>>>
>>> It would be nice if someone with applications with complex forms can
>>> check the performance of the latest ufl. I don't think there are any
>>> assembly time changes, just preprocess/compile time.
>>>
>>> In the long run, the most scalable solution would be to avoid using
>>> expand_indices, which will require some more index handling logic in the
>>> form compilers. I'm not going down that road at the moment, so for now
>>> lets hope at least these improvements open up some more doors.
>>>
>>
>

Revision history for this message

Kristian B. Ølgaard (k.b.oelgaard) wrote on 2012-03-22:

#10

On 22 March 2012 09:02, Johan Hake <email address hidden> wrote:
> On 03/22/2012 08:41 AM, Anders Logg wrote:
>> On Thu, Mar 22, 2012 at 07:24:42AM -0000, Johan Hake wrote:
>>> Thanks Martin!
>>>
>>> The memory now peaks at 1.2GB (from 4.8GB) and code for the jacobian
>>> form is now generated within 4 min, as opposed to some 8 min using SFC.
>>> FFC can still not generate optimized code, the memory peaks at 7GB and I
>>> interrupted it after 10 min. When optimization is turned off the code is
>>> generated but everything is flattened into a single line which does not
>>> compile (gcc).
>>
>> Why is the code flattened? I assume the quadrature representation is
>> used which shouldn't flatten loops.

Quadrature never unroll loops.
However, the expression to compute entries in the element tensor is
just one line:

loop i
loop j
A_ij += some_complex_expr

> When optimized is set to true in the ffc_options it is not flattened,
> but that process is the one that fails to complete.

which will result in:

loop i
  loop j
    A_map0(i,j) += simple_expr0
    A_map1(i,j) += simple_expr1
    A_map2(i,j) += simple_expr2

Kristian

> Johan
>
>> --
>> Anders
>>
>>
>>> Some further optimization would be nice as 1.2GB is still quite large
>>> but doable on most laptops these days.
>>>
>>> Johan
>>>
>>> On 03/21/2012 05:37 PM, Martin Sandve Alnæs wrote:
>>>> Status now is that I found some more optimizations today, giving
>>>> significant memory savings for the Jacobi representation after
>>>> expand_indices.
>>>>
>>>> I accidentally pushed to trunk instead of my buildbot, so I hope the
>>>> dolfin tests pass :)
>>>>
>>>> It would be nice if someone with applications with complex forms can
>>>> check the performance of the latest ufl. I don't think there are any
>>>> assembly time changes, just preprocess/compile time.
>>>>
>>>> In the long run, the most scalable solution would be to avoid using
>>>> expand_indices, which will require some more index handling logic in the
>>>> form compilers. I'm not going down that road at the moment, so for now
>>>> lets hope at least these improvements open up some more doors.
>>>>
>>>
>>
>
> --
> You received this bug notification because you are subscribed to UFL.
> https://bugs.launchpad.net/bugs/959052
>
> Title:
> expand_indices does not scale
>
> Status in Unified Form Language:
> In Progress
>
> Bug description:
> Then memory is filled pretty fast for large forms. This happens for
> both SFC and FFC within the UFL function:
>
> expand_indices
>
> The memory consumption for the attached script peaks at more than 4.6
> GB.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ufl/+bug/959052/+subscriptions

On 22 March 2012 09:02, Johan Hake <959052@bugs.launchpad.net> wrote:
> On 03/22/2012 08:41 AM, Anders Logg wrote:
>> On Thu, Mar 22, 2012 at 07:24:42AM -0000, Johan Hake wrote:
>>> Thanks Martin!
>>>
>>> The memory now peaks at 1.2GB (from 4.8GB) and code for the jacobian
>>> form is now generated within 4 min, as opposed to some 8 min using SFC.
>>> FFC can still not generate optimized code, the memory peaks at 7GB and I
>>> interrupted it after 10 min. When optimization is turned off the code is
>>> generated but everything is flattened into a single line which does not
>>> compile (gcc).
>>
>> Why is the code flattened? I assume the quadrature representation is
>> used which shouldn't flatten loops.

Quadrature never unroll loops.
However, the expression to compute entries in the element tensor is
just one line:

loop i
  loop j
    A_ij += some_complex_expr

> When optimized is set to true in the ffc_options it is not flattened,
> but that process is the one that fails to complete.

which will result in:

loop i
  loop j
    A_map0(i,j) += simple_expr0
    A_map1(i,j) += simple_expr1
    A_map2(i,j) += simple_expr2

Kristian

> Johan
>
>> --
>> Anders
>>
>>
>>> Some further optimization would be nice as 1.2GB is still quite large
>>> but doable on most laptops these days.
>>>
>>> Johan
>>>
>>> On 03/21/2012 05:37 PM, Martin Sandve Alnæs wrote:
>>>> Status now is that I found some more optimizations today, giving
>>>> significant memory savings for the Jacobi representation after
>>>> expand_indices.
>>>>
>>>> I accidentally pushed to trunk instead of my buildbot, so I hope the
>>>> dolfin tests pass :)
>>>>
>>>> It would be nice if someone with applications with complex forms can
>>>> check the performance of the latest ufl. I don't think there are any
>>>> assembly time changes, just preprocess/compile time.
>>>>
>>>> In the long run, the most scalable solution would be to avoid using
>>>> expand_indices, which will require some more index handling logic in the
>>>> form compilers. I'm not going down that road at the moment, so for now
>>>> lets hope at least these improvements open up some more doors.
>>>>
>>>
>>
>
> --
> You received this bug notification because you are subscribed to UFL.
> https://bugs.launchpad.net/bugs/959052
>
> Title:
>  expand_indices does not scale
>
> Status in Unified Form Language:
>  In Progress
>
> Bug description:
>  Then memory is filled pretty fast for large forms. This happens for
>  both SFC and FFC within the UFL function:
>
>    expand_indices
>
>  The memory consumption for the attached script peaks at more than 4.6
>  GB.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ufl/+bug/959052/+subscriptions

Revision history for this message

Martin Sandve Alnæs (martinal) wrote on 2012-04-10:

#11

I'm marking this issue as fixed. Although improvements to expand_indices may be possible, I think the better approach is to avoid calling it by improving the code generation pipeline. I'll get back to this later.

Changed in ufl:
status:	In Progress → Fix Committed

Martin Sandve Alnæs (martinal) on 2013-01-04

Changed in ufl:
milestone:	none → 1.1.0-alpha

Martin Sandve Alnæs (martinal) on 2013-01-07

Changed in ufl:
status:	Fix Committed → Fix Released

UFL

expand_indices does not scale

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches