Pythia8 problem in Cluster mode

Bug #1655965 reported by Doyoun Kim
This bug affects 1 person
Affects: MadGraph5_aMC@NLO
Status: Fix Released
Importance: Undecided
Assigned to: Valentin Hirschi
Milestone: (none)

Bug Description

I am running MG/ME in cluster mode (SGE, CentOS 6.8, GCC 4.9.2, Python 2.7.8).
When I run in single run mode, there is no problem.
But in cluster mode, when the parallelized Pythia8 split_N jobs are merged,
it fails to find the PY8_log.txt file. The message looks like this:

=====================
Splitting .lhe event file for PY8 parallelization...
Submitting Pythia8 jobs...
WARNING: cluster.get_job_identifier runs unexpectedly. This should be fine but report this message if you have problem.
WARNING: cluster.get_job_identifier runs unexpectedly. This should be fine but report this message if you have problem.
WARNING: cluster.get_job_identifier runs unexpectedly. This should be fine but report this message if you have problem.
WARNING: cluster.get_job_identifier runs unexpectedly. This should be fine but report this message if you have problem.
WARNING: cluster.get_job_identifier runs unexpectedly. This should be fine but report this message if you have problem.
WARNING: cluster.get_job_identifier runs unexpectedly. This should be fine but report this message if you have problem.
WARNING: cluster.get_job_identifier runs unexpectedly. This should be fine but report this message if you have problem.
WARNING: cluster.get_job_identifier runs unexpectedly. This should be fine but report this message if you have problem.
WARNING: cluster.get_job_identifier runs unexpectedly. This should be fine but report this message if you have problem.
WARNING: cluster.get_job_identifier runs unexpectedly. This should be fine but report this message if you have problem.
INFO: All jobs finished
Pythia8 shower jobs: 0 Idle, 0 Running, 10 Done [7 seconds]
Merging results from the split PY8 runs...
Command "generate_events run_04" interrupted with error:
IOError : [Errno 2] No such file or directory: '/home/abistp00/test/test/MG5_aMC_v2_5_2/pp2gg/Events/run_04/PY8_parallelization/split_0/PY8_log.txt'
Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in '/home/abistp00/test/test/MG5_aMC_v2_5_2/pp2gg/run_04_tag_1_debug.log'.
Please attach this file to your report.
INFO: storing files of previous run
INFO: Done
INFO:

INFO:

quit
INFO:

INFO:

MG5_aMC>

=====================

However, those files do exist in all split_N directories, and the contents of the log files look sane. After some testing, I found that cluster.wait(...) does not wait until the relevant files (PY8_log.txt, djrs.dat, events.hepmc, pts.dat) have been generated. So it should work if the merging step is made to start only after the parallel Pythia8 jobs have finished.

Doyoun
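
For illustration, the timing problem described above could be worked around by polling for the expected output files before merging. The following is only a minimal sketch under that assumption; `missing_outputs` and `wait_for_outputs` are hypothetical helpers, not part of MadGraph5_aMC@NLO, and only the four file names are taken from the report.

```python
import os
import time

# File names listed in the report; everything else here is hypothetical.
EXPECTED_FILES = ("PY8_log.txt", "djrs.dat", "events.hepmc", "pts.dat")


def missing_outputs(split_dirs, filenames=EXPECTED_FILES):
    """Return the expected output paths that do not exist yet."""
    return [os.path.join(d, f)
            for d in split_dirs
            for f in filenames
            if not os.path.isfile(os.path.join(d, f))]


def wait_for_outputs(split_dirs, filenames=EXPECTED_FILES,
                     timeout=300.0, poll=5.0):
    """Poll until every expected file exists or the timeout expires.

    Returns the list of files still missing (empty on success).
    """
    deadline = time.time() + timeout
    missing = missing_outputs(split_dirs, filenames)
    while missing and time.time() < deadline:
        time.sleep(poll)
        missing = missing_outputs(split_dirs, filenames)
    return missing
```

Note that a file existing does not guarantee it has been fully written, so a check like this only narrows the race; it does not replace a correct wait on the cluster jobs themselves.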

Doyoun Kim (abistp00) wrote :
Changed in mg5amcnlo:
assignee: nobody → Valentin Hirschi (valentin-hirschi)
Valentin Hirschi (valentin-hirschi) wrote :

Could you run the PY8 shower job by hand in:

/home/abistp00/test/test/MG5_aMC_v2_5_2/pp2gg/Events/run_04/PY8_parallelization/split_0

with

./run_PY8.sh

This should give more insight as to what went wrong with the parallelization there. Let me know.
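
If it is more convenient to do this from Python and keep the output for the report, a sketch along these lines might help. The helper name `run_split_job` is hypothetical; the only things taken from this thread are the script name and the idea of running it from inside the split directory.

```python
import subprocess


def run_split_job(split_dir, script="./run_PY8.sh"):
    """Run the shower script inside split_dir; return (exit code, output).

    stderr is merged into stdout so that the full log is captured in order.
    """
    proc = subprocess.Popen([script], cwd=split_dir,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out.decode("utf-8", "replace")
```

A non-zero exit code, or an error near the end of the captured output, would point to a problem in the shower job itself rather than in the merging step.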

Doyoun Kim (abistp00) wrote :

Valentin,

here I attach a tarball of
test/test/MG5_aMC_v2_5_2/pp2gg/Events/run_04/PY8_parallelization/split_0
after running ./run_PY8.sh.

As said above, all the files required for the merging process seem to have been created correctly.
But please have a look.

Doyoun

Valentin Hirschi (valentin-hirschi) wrote : Re: [Bug 1655965] Re: Pythia8 problem in Cluster mode

Sorry for the late reply here, but let me ask you one more thing; you are
not running Pythia8 on the cluster but in multicore mode, correct?

It will be hard for me to fix this issue since I cannot reproduce it and I
never had the warning:

  WARNING: cluster.get_job_identifier runs unexpectedly. This should be
fine but report this message if you have problem.

which must surely be related to the problem.

On Fri, Jan 13, 2017 at 6:12 AM, Doyoun Kim <email address hidden>
wrote:

> Valentin,
>
> here I attach a tarball of
> test/test/MG5_aMC_v2_5_2/pp2gg/Events/run_04/PY8_parallelization/split_0
> after running ./run_PY8.sh.
>
> As said above, all the files required for the merging process seem to
> have been created correctly.
> But please have a look.
>
> Doyoun
>
> ** Attachment added: "split_0.tar.gz"
> https://bugs.launchpad.net/mg5amcnlo/+bug/1655965/+attachment/4803693/+files/split_0.tar.gz

Michele Papucci (mpapucci) wrote :

Hi,

I was going to file a bug, but then I saw this thread. I had the same problem on the cluster I'm using (it has an SGE queue) and I tracked down a solution. The problem is that the wait() function doesn't wait for the submitted Pythia8 jobs to finish, so the log files containing the cross sections are not yet available when the merge starts (they become available later). In my case, the reason was that wait() looks for jobs under a different name, so it thinks everything has already finished while the jobs are in fact still running. My temporary fix was to add:

        elif 'PY8_parallelization' in path:
            target = path.rsplit('/PY8_parallelization',1)[0]

in the get_jobs_identifier function at line 260 of cluster.py (it's in madgraph/various; I'm using v2.5.2). It may not be exactly line 260, since I had to introduce some print statements to track it down; the previous elif is exactly the same but for MCatNLO. In my case I also had to remove cluster.pyo to force it to recompile, but that may not be necessary. Not sure it helps, but it doesn't hurt either.
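
To make the effect of the extra branch concrete, here is a simplified, self-contained sketch. It is not the actual cluster.py code: `job_group_target` and its `with_fix` switch are hypothetical, and the fallback of returning the path unchanged is an assumption. Only the MCatNLO and PY8_parallelization branches mirror what is described above.

```python
def job_group_target(path, with_fix=True):
    """Return the directory used to group jobs belonging to one run.

    Simplified stand-in for the branch logic described in this thread.
    """
    if 'MCatNLO' in path:
        return path.rsplit('/MCatNLO', 1)[0]
    elif with_fix and 'PY8_parallelization' in path:
        # The branch added by the temporary fix.
        return path.rsplit('/PY8_parallelization', 1)[0]
    # Assumed fallback: the path is used as-is.
    return path


run_dir = '/tmp/proc/Events/run_04'  # illustrative path only
split_0 = run_dir + '/PY8_parallelization/split_0'
split_1 = run_dir + '/PY8_parallelization/split_1'

# With the fix, both split jobs map to the same run-level target.
same_target = job_group_target(split_0) == job_group_target(split_1)

# Without it, each split directory yields its own target.
different_targets = (job_group_target(split_0, with_fix=False)
                     != job_group_target(split_1, with_fix=False))
```

Under these assumptions, all split_N jobs of a run share one target once the branch is present, which is consistent with wait() then finding the jobs it is supposed to track.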

Doyoun Kim (abistp00) wrote :

Thank you Michele!

I'm away from my office for several days, so I will try your fix when I return!

Cheers
Doyoun

Doyoun Kim (abistp00) wrote :

Hi,

1. I tried Michele's fix for a 'p p > a a' process, and the merging now completes successfully, thank you. But I still get some error messages, as follows (sorry for just copying and pasting a long stretch of standard output):

=====================================================================================
=====================================================================================
INFO: Update the dependent parameter of the param_card.dat
Generating 10000 events with run name run_02
survey run_02
INFO: compile directory
compile Source Directory
Error: no display specified
Using random number seed offset = 30
INFO: Running Survey
Creating Jobs
Working on SubProcesses
INFO: P1_qq_aa
INFO: Idle: 1, Running: 0, Completed: 0 [ current time: 20h56 ]
INFO: Idle: 1, Running: 0, Completed: 0 [ 0.023s ]
INFO: All jobs finished
INFO: Idle: 0, Running: 0, Completed: 1 [ 10s ]
INFO: End survey
refine 10000
Creating Jobs
INFO: Refine results to 10000
INFO: Generating 10000.0 unweigthed events.
INFO: Effective Luminosity 80.4073974806 pb^-1
INFO: need to improve 1 channels
Current estimate of cross-section: 149.24 +- 1.6106
    P1_qq_aa
INFO: Idle: 12, Running: 0, Completed: 0 [ current time: 20h56 ]
INFO: All jobs finished
INFO: Idle: 0, Running: 0, Completed: 12 [ 10s ]
INFO: Combining runs
INFO: finish refine
refine 10000
Creating Jobs
INFO: Refine results to 10000
INFO: Generating 10000.0 unweigthed events.
INFO: Effective Luminosity 80.5044948343 pb^-1
INFO: need to improve 0 channels
Current estimate of cross-section: 149.06 +- 0.38742
    P1_qq_aa
INFO: Idle: 0, Running: 0, Completed: 0 [ current time: 20h56 ]
INFO: All jobs finished
INFO: Combining runs
INFO: finish refine
INFO: Combining Events
  === Results Summary for run: run_02 tag: tag_1 ===

     Cross-section : 149.1 +- 0.3874 pb
     Nb of events : 10000

INFO: can not run systematics since can not link python to lhapdf
store_events
INFO: Storing parton level results
INFO: End Parton
reweight -from_cards
decay_events -from_cards
INFO: Running Pythia8 [arXiv:1410.3012]
Splitting .lhe event file for PY8 parallelization...
Submitting Pythia8 jobs...
Pythia8 shower jobs: 10 Idle, 0 Running, 0 Done [7 seconds]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [17 seconds]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [27 seconds]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [37 seconds]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [47 seconds]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [57 seconds]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [1m07s]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [1m17s]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [1m27s]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [1m37s]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [1m47s]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [1m57s]
Pythia8 shower jobs: 0 Idle, 10 Running, 0 Done [2m07s]
Pythia8 shower jobs: 0 Idle, 9 Running, 1 Done [2m17s]
Pythia8 shower jobs: 0 Idle, 5 Running, 5 Done [2m28s]
INFO: All jobs finished
Pythia8 shower jobs: 0 Idle, 0 Running, 10 Done [2m38s]
Merging results f...

Doyoun Kim (abistp00) wrote :

I think all the issues regarding the parallelization have been addressed and solved.
There are still problems, but they are related to either HepMC or Delphes, so it's better to post them in another discussion.

Thank you

Changed in mg5amcnlo:
status: New → Fix Released