Comment 20 for bug 1435389

Kristian Hahn (kristian-hahn) wrote :

Some more information:

- The gridpack MadSpin step for ttbar+chichi+0,1,2j just barely succeeds on our 64 GB headnode. MadSpin memory consumption plateaus at 60 GB. I don't understand why MadSpin works on this machine, given what I observe on our compute nodes (below).

- The same process fails on our 96 GB compute nodes. Here MadSpin memory consumption climbs to 77 GB during the calculation of the full ME. Once it plateaus (after ~10 hours), MadSpin continues at 77 GB for ~1 hour and then crashes (MS_debug attached).

INFO: generating the full square matrix element (with decay)
INFO: generate p p > t t~ chi chi~ , (t~ > b~ w- , w- > all all QCD=99), (t > b w+ , w+ > all all QCD=99) @ 0 --no_warning=duplicate;add process p p > t t~ chi chi~ j , (t~ > b~ w- , w- > all all QCD=99), (t > b w+ , w+ > all all QCD=99) @ 1 --no_warning=duplicate;add process p p > t t~ chi chi~ j j , (t~ > b~ w- , w- > all all QCD=99), (t > b w+ , w+ > all all QCD=99) @ 2 --no_warning=duplicate;
Command "launch" interrupted with error:
MadGraph5Error : Impossible to compile /projects/d20385/gridpacking/work/work_S_MFM_10_MMed_1000_gSM_1.0_gDM_1.0/gpack/DMScalar_ttbar012j_mphi_1000_mchi_10_gSM_1p0_gDM_1p0/DMScalar_ttbar012j_mphi_1000_mchi_10_gSM_1p0_gDM_1p0_gridpack/work/process/madspingrid/full_me/Source directory
        Trying to launch make command returns:
            [Errno 12] Cannot allocate memory
        In general this means that your computer is not able to compile.
Please report this bug to developers

           More information is found in 'MS_debug'.

           Please attach this file to your report.

The crash seems to occur in the compile function of various.misc, I believe in the call to subprocess.Popen. Popen calls fork, which has led to documented out-of-memory problems for Python users (e.g. http://stackoverflow.com/questions/1216794/python-subprocess-popen-erroring-with-oserror-errno-12-cannot-allocate-memory): forking a very large process can fail with "Cannot allocate memory" even when the child itself needs very little. I tried one of the simplest suggested fixes (manually calling gc.collect before the Popen), but it made no difference.
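One workaround often suggested for this kind of fork failure is to start a small helper process early, while the parent is still small, and have it launch the external commands later. The following is only a sketch of that idea under stated assumptions, not MadGraph code: the names SmallForker and _command_worker are hypothetical, it assumes Python 3 on Linux (it uses the "fork" start method), and I have not tried it inside MadSpin.

```python
import multiprocessing
import subprocess
import sys


def _command_worker(conn):
    """Hypothetical helper loop: run each command received over the pipe."""
    while True:
        request = conn.recv()
        if request is None:  # sentinel: the parent asked us to stop
            break
        argv, cwd = request
        try:
            proc = subprocess.Popen(argv, cwd=cwd,
                                    stdout=subprocess.PIPE,
                                    stderr=subprocess.STDOUT)
            output, _ = proc.communicate()
            conn.send((proc.returncode, output))
        except OSError as exc:
            # Report launch failures instead of killing the helper.
            conn.send((-1, str(exc).encode()))


class SmallForker(object):
    """Create this BEFORE the memory-heavy phase, so that later forks
    are made from the small helper and not from the large parent."""

    def __init__(self):
        ctx = multiprocessing.get_context("fork")  # assumption: Unix only
        self._conn, child_conn = ctx.Pipe()
        self._proc = ctx.Process(target=_command_worker, args=(child_conn,))
        self._proc.daemon = True
        self._proc.start()

    def run(self, argv, cwd=None):
        """Run argv in the helper; return (returncode, combined output)."""
        self._conn.send((argv, cwd))
        return self._conn.recv()

    def close(self):
        self._conn.send(None)
        self._proc.join()


if __name__ == "__main__":
    forker = SmallForker()  # created while the process is still small
    # ... the memory-heavy work would happen here ...
    returncode, output = forker.run([sys.executable, "-c", "print('ok')"])
    print(returncode, output)
    forker.close()
```

The point of the design is that the fork which spawns make happens in the helper, whose address space stays small, so it should not hit the same allocation limit as a fork from the 77 GB parent.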

What is actually being held in this 77 GB? At the time of the crash MadSpin seems to be just writing and compiling Fortran.
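A rough way to probe this from inside the Python process could look like the sketch below. It uses only the standard library; the function names are hypothetical, and it assumes Linux, where ru_maxrss is reported in kilobytes. Note that gc.get_objects only lists objects tracked by the garbage collector and gives counts, not sizes, so this is a first hint at best.

```python
import gc
import resource
from collections import Counter


def peak_rss_kib():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def top_object_types(limit=10):
    """Most common types among the objects tracked by the garbage collector."""
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(limit)


if __name__ == "__main__":
    print("peak RSS (KiB):", peak_rss_kib())
    for type_name, count in top_object_types():
        print(type_name, count)
```

Calling these just before the compile step would show whether a few object types dominate the count.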

Thanks,
Kristian