Performance bug in Advection_Diffusion_DG.F90

Bug #1081598 reported by Fiona Reid
Affects: Fluidity
Status: Incomplete
Importance: Undecided
Assigned to: Michael Lange

Bug Description

Since trunk revision 3981.1.75 was added, the performance of the threaded version of Fluidity has been adversely affected. The problem only seems to appear when the code is configured/compiled with OpenMP enabled and more than one OpenMP thread is used. The MPI performance of the code appears unaffected by this.

I've tracked the problem down to the following lines (lines 2657-2659), which were added to Advection_Diffusion_DG.F90 in trunk revision 3981.1.75:

if (neumann) then
    call addto(RHS_diff, T_face, shape_rhs(T_shape, detwei * ele_val_at_quad(bc_value, face)))
end if

The differences between 3981.1.74 and 3981.1.75 can be viewed here: http://bazaar.launchpad.net/~fluidity-core/fluidity/fluidity-petsc-3.3/revision/3981.1.75

Commenting out these 3 lines restores the expected performance of the threaded code. The numbers below were obtained using the lock_exchange_3d_dg long test. Most of my testing has been carried out on revision 3996 of the fluidity-petsc-3.3 branch, as after seeing the odd performance with the trunk I went back to a point where the code still performed correctly. fluidity-petsc-3.3 has been merged into the main trunk for some time, but I suspect very few people are running Fluidity with threads and so haven't noticed any issues. Results from revision 4128 of the trunk are also included to show the problem exists there too.

Temperature::advection_diffusion_dg_loop times

                     branch3996+3981.1.74  branch3996+3981.1.75  branch3996+3981.1.75_no_neumann_addto  trunk 4128
32 MPI ( 1 thread)          56.53                 56.77                        56.56                       56.74
16 MPI ( 2 threads)         62.85                 63.89                        62.90                       64.10
 8 MPI ( 4 threads)         52.85                 78.67                        53.47                       70.92
 4 MPI ( 8 threads)         46.50                 87.62                        46.63                       88.05
 2 MPI (16 threads)         42.58                 94.27                        42.94                       94.08
 1 MPI (32 threads)        123.66                210.466                      119.25                      194.96

Basically, the effect is that the threaded code now almost never goes faster than the MPI version, which means all the benefit from using threads has been lost.

I guess removing the 3 lines above isn't really an option, so any ideas / suggestions would be appreciated.

Cheers,
Fiona

Stephan Kramer (s-kramer) wrote :

First of all, this change has nothing to do with the petsc 3.3 branch. It's actually commit 3879.2.1 on the trunk (bzr commit numbers aren't consistent like that between different branches), merged in at commit 4056.

I'm a bit puzzled by what you are seeing. Are you using the unmodified flml from longtests/lock_exchange_3d_dg/? Because in that .flml the only scalar field doesn't actually have a Neumann boundary condition, so 'neumann' should always be .false. Can you check whether it is, and if not, why?

Fiona Reid (fiona-3) wrote :

Stephan,

Apologies for the confusion over revision numbers. The issue is indeed due to changes in the trunk, but I'd reverted to a version with good threaded performance to try to work out where the problem occurred; that version happened to be in a branch.

My flml file is *identical* to the one in longtests/lock_exchange_3d_dg/lock_exchange_3d.flml, so that's not the problem. Generally I'd only ever change the finish time to reduce the overall runtime, but for this benchmark I didn't change anything.

HECToR is down for maintenance this afternoon so I'll need to wait until tomorrow to check whether neumann is always false.

I'll test that out tomorrow and get back to you.

Fiona Reid (fiona-3) wrote :

Okay, I've now checked the flml file on HECToR too and it's identical to the one in longtests.

I've also checked the value of neumann as suggested, using ewrite statements inserted in several places, including directly before the if (neumann) block and also inside the conditional to see if we ever enter this block.

neumann is always ***FALSE***, which means that, as Stephan says, this block of code should not be executed. The ewrite statements I inserted inside the if (neumann) block don't appear in the output. FWIW I've also tried commenting out the call to addto and inserting an ewrite statement instead, e.g.

if (neumann) then
    ewrite(1,*)"FR debug: inside neumann if SHOULD NOT BE here, neumann = ",neumann," threadid = ",omp_get_thread_num()
! call addto(RHS_diff, T_face, shape_rhs(T_shape, detwei * ele_val_at_quad(bc_value, face)))
end if

The performance with the ewrite replacing the addto call is still poor (just as bad as with the addto), which suggests something very odd is going on here. I am struggling to believe that an if/end if test that is never true should cause such a big performance problem. I'm starting to wonder if the compiler / OpenMP runtime is doing something odd, as the performance of the MPI version compiled without --enable-openmp is unchanged.
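To help narrow that down, here is a minimal, self-contained sketch (entirely hypothetical, not Fluidity code) of the kind of microbenchmark one could try: an OpenMP loop containing a never-taken branch that calls an array-valued function, to see whether the mere presence of such a call inside the branch changes threaded timings on the same compiler/runtime. All names, sizes and parameters in it are made up for illustration; compile with and without OpenMP (e.g. gfortran -fopenmp) and compare.

program dead_branch_test
  ! Hypothetical microbenchmark: does a never-taken branch that calls an
  ! array-valued function affect OpenMP timings?  Not Fluidity code.
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 256, niter = 2000000
  real(dp) :: a(n), b(n), total, t0, t1
  logical  :: neumann
  integer  :: it

  a = 1.0_dp
  b = 2.0_dp
  neumann = .false.       ! mirrors the flag that is always .false. in this report
  total = 0.0_dp

  t0 = omp_get_wtime()
  !$omp parallel do default(shared) reduction(+:total)
  do it = 1, niter
     if (neumann) then
        ! Never executed at run time, but the array-valued call below may still
        ! make the compiler set up temporaries / extra stack at the call site.
        total = total + sum(scaled(a, b))
     end if
     total = total + a(mod(it, n) + 1)
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()

  print *, 'total =', total, '  time =', t1 - t0, 's'

contains

  function scaled(x, y) result(z)
    ! Array-valued function returning x*y elementwise, loosely analogous to
    ! detwei * ele_val_at_quad(bc_value, face) in the snippet above.
    real(dp), intent(in) :: x(:), y(:)
    real(dp) :: z(size(x))
    z = x * y
  end function scaled

end program dead_branch_test

If the timings change when the dead branch is present, that would point at how the compiler handles the array temporaries rather than at the Fluidity code itself.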

Jon Hill (jon-hill) wrote :

Not sure if this is still an issue. Assigning to Michael Lange to confirm and if needed close it.

Changed in fluidity:
status: New → Incomplete
assignee: nobody → Michael Lange (michael-lange)