ttDM EFT production : madspin issue (?)

Bug #1435389 reported by mia tosi
This bug affects 2 people
Affects: MadGraph5_aMC@NLO
Status: Won't Fix
Importance: Undecided
Assigned to: Pierre Artoisenet

Bug Description

I'm trying to generate LHE files using the EffDM_UFO-no_b_mass model.
The process I'm interested in is ttbar + DM DM + 0/1/2 jets,
and I'm using MadGraph5_aMCatNLO [version 2.2.2]
and MadSpin for the top and W decays.

While I am able to produce LHE files up to 1-jet multiplicity,
I'm not able to get the output for 2 jets in the final state:
the MadGraph step works, while apparently MadSpin runs out of memory.

Please find attached:
* the MS_debug file
* the datacards
* the model

In case additional info is needed, please let me know.
Thanks much.

Revision history for this message
mia tosi (mia-tosi-1) wrote :
  • run card (13.6 KiB)
Revision history for this message
mia tosi (mia-tosi-1) wrote :
  • output (63.6 MiB)
Revision history for this message
mia tosi (mia-tosi-1) wrote :

I forgot to stress one probably important point:

I was able to privately produce 1M LHE events using
MG5v1.5.11_CERN_23082013_patched19092013.tar.gz

Revision history for this message
mia tosi (mia-tosi-1) wrote :

while the MadSpin step was successfully done with MG5_aMC_v2_1_2_beta2

Changed in mg5amcnlo:
assignee: nobody → Pierre Artoisenet (partois)
Revision history for this message
Kristian Hahn (kristian-hahn) wrote :

I find similar behavior when generating t t~ DM DM with an explicit mediator in a simplified model. MadSpin ultimately succeeds but claims ~60 GB of physical memory. This result is with v2.2.3. I noticed a related update in v2.2.2: "OM: Reduce the amount of RAM used by MadSpin in gridpack mode." Is the MadSpin memory optimization for gridpacks still ongoing?

Revision history for this message
Kristian Hahn (kristian-hahn) wrote :

Failed to mention that the above was for +0,1,2j. Memory consumption with +0,1j peaks at about 3.5 GB.
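
For reference, peak figures like these can be obtained on Linux by polling VmRSS in /proc/<pid>/status while the job runs. A minimal, generic sketch (this is not part of MadGraph/MadSpin; the function name is made up):

    import time

    def peak_rss_mb(pid, interval=5.0):
        """Poll /proc/<pid>/status and return the largest VmRSS seen, in MB."""
        peak = 0
        while True:
            try:
                with open('/proc/%d/status' % pid) as f:
                    for line in f:
                        if line.startswith('VmRSS:'):
                            peak = max(peak, int(line.split()[1]) // 1024)  # kB -> MB
            except (IOError, OSError):   # the process has exited
                return peak
            time.sleep(interval)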

Revision history for this message
mia tosi (mia-tosi-1) wrote :

thanks for your feedback,
I've just launched a new test asking for 80 GB (last time I had already set up the LSF jobs to request 60 GB, actually)

hope to have some good news

but, in the meantime, let me cross-check:
is there any patch available to fix such an issue?
thanks

Revision history for this message
mia tosi (mia-tosi-1) wrote :

is there any news on a possible fix for this memory issue?
thanks much
  mia

Revision history for this message
Pierre Artoisenet (partois) wrote :

Hi Olivier, Mia,

I have no idea. Do I understand correctly from the log file that the compilation of the helas library fails?
I will try to reproduce the bug tomorrow.

Pierre

Revision history for this message
mia tosi (mia-tosi-1) wrote :

thanks much
 mia

Revision history for this message
mia tosi (mia-tosi-1) wrote :

ciao,
is there any news on this issue?

please find below the output I got from my last (still failing) test:

Resource usage summary:

    CPU time : 39237.86 sec.
    Max Memory : 76613 MB
    Max Swap : 77789 MB

    Max Processes : 33
    Max Threads : 33

  mia

Revision history for this message
Pierre Artoisenet (partois) wrote :

I could not reproduce the bug because the compilation of the code in the *production event* phase takes forever on my laptop ...
How much time did it take on your machine?

Pierre

Revision history for this message
mia tosi (mia-tosi-1) wrote :

I'm running on LSF queues;
it takes at least 8 hours.
  mia

Revision history for this message
Kristian Hahn (kristian-hahn) wrote :

Some more information:

- gridpack MadSpin for ttbar+chichi+0,1,2j just barely succeeds on our 64 GB head node. MadSpin memory consumption plateaus at 60 GB. I actually don't understand why MadSpin works on this machine, given what I observe on our compute nodes (below).

- the same process fails on our 96 GB compute nodes. Here MadSpin memory consumption climbs to 77 GB during the calculation of the full ME. Once it plateaus (~10 hours), MadSpin continues at 77 GB for ~1 hour and then crashes (MS_debug attached).

INFO: generating the full square matrix element (with decay)
INFO: generate p p > t t~ chi chi~ , (t~ > b~ w- , w- > all all QCD=99), (t > b w+ , w+ > all all QCD=99) @ 0 --no_warning=duplicate;add process p p > t t~ chi chi~ j , (t~ > b~ w- , w- > all all QCD=99), (t > b w+ , w+ > all all QCD=99) @ 1 --no_warning=duplicate;add process p p > t t~ chi chi~ j j , (t~ > b~ w- , w- > all all QCD=99), (t > b w+ , w+ > all all QCD=99) @ 2 --no_warning=duplicate;
Command "launch" interrupted with error:
MadGraph5Error : Impossible to compile /projects/d20385/gridpacking/work/work_S_MFM_10_MMed_1000_gSM_1.0_gDM_1.0/gpack/DMScalar_ttbar012j_mphi_1000_mchi_10_gSM_1p0_gDM_1p0/DMScalar_ttbar012j_mphi_1000_mchi_10_gSM_1p0_gDM_1p0_gridpack/work/process/madspingrid/full_me/Source directory
        Trying to launch make command returns:
            [Errno 12] Cannot allocate memory
        In general this means that your computer is not able to compile.
Please report this bug to developers

           More information is found in 'MS_debug'.

           Please attach this file to your report.

The crash seems to occur in the various.misc compile call, I believe in the call to subprocess.Popen. Apparently Popen calls fork, which has led to some documented OOM problems for Python users (e.g. http://stackoverflow.com/questions/1216794/python-subprocess-popen-erroring-with-oserror-errno-12-cannot-allocate-memory). I've tried one of the simplest suggested fixes (i.e. manually calling gc.collect before the Popen), but this does nothing.
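
One workaround that comes up in those threads is to fork all later subprocesses from a small helper process created before the parent balloons, so the fork never has to duplicate a 70+ GB address space. A rough, illustrative sketch only (class and method names are invented; this is not how MadSpin is actually structured):

    import subprocess

    class CompileHelper(object):
        """Tiny long-lived worker: spawn it while the parent is still small,
        then send it shell commands (e.g. the 'make' calls) over stdin."""

        def __init__(self):
            self.proc = subprocess.Popen(['/bin/sh'], stdin=subprocess.PIPE,
                                         universal_newlines=True)

        def run(self, cmd, cwd='.'):
            # no quoting or status reporting here; a real version would need both
            self.proc.stdin.write('cd %s && %s\n' % (cwd, cmd))
            self.proc.stdin.flush()

        def close(self):
            self.proc.stdin.close()
            self.proc.wait()

    # usage sketch:
    #   helper = CompileHelper()      # create at the very start of the run
    #   ...                           # parent grows while building the full ME
    #   helper.run('make', cwd='full_me/Source')
    #   helper.close()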

What is actually being held in this 77 GB? It seems that at the time of the crash MadSpin is just writing / compiling Fortran ...

Thanks,
Kristian

Changed in mg5amcnlo:
status: New → Won't Fix