Fail to read the number of unweighted events in the combine.log file

Bug #1280051 reported by Matthew Low
This bug affects 2 people
Affects: MadGraph5_aMC@NLO
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi,

I occasionally hit a bug when running on a SLURM batch system: once all the jobs have been submitted and have returned, the following error is raised:

   Fail to read the number of unweighted events in the combine.log file

The combine.log file is there, however. I think the problem is that in madevent_interface.py, in the function do_combine_events, `output` is read from combine.log too early, before the file has been fully written. It looks like when reading combine.log fails, the code sleeps briefly but then does not read the file again before checking the output. I've added the line:

     output = misc.mult_try_open(pjoin(self.me_dir,'SubProcesses','combine.log')).read()

after time.sleep(10), and this seems to alleviate the problem (together with increasing the sleep time to be safe).
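The idea behind this fix (read combine.log again after each sleep, instead of re-checking the string obtained before the sleep) can be sketched as a retry loop. This is a minimal sketch, not the actual code in madevent_interface.py: `read_log_with_retry` and `parse_unweighted` are hypothetical names, and the regular expression assumes a log line such as "1234 unweighted events", because the real wording of combine.log is not given in this report.

```python
import re
import time
from typing import Callable, Optional


def parse_unweighted(text: str) -> Optional[int]:
    """Hypothetical parser: the real combine.log wording may differ."""
    match = re.search(r"(\d+)\s+unweighted events", text)
    return int(match.group(1)) if match else None


def read_log_with_retry(
    path: str,
    parse: Callable[[str], Optional[int]],
    attempts: int = 5,
    delay: float = 10.0,
) -> int:
    """Re-read `path` until `parse` extracts a value or the attempts run out.

    The key point: the file is opened and read again on every attempt,
    so a log that is still being written (or not yet visible on a shared
    filesystem) gets another chance after each sleep.
    """
    for attempt in range(attempts):
        try:
            with open(path) as handle:
                output = handle.read()  # fresh read on every attempt
        except OSError:
            output = ""  # file not visible yet; treat as "no value found"
        value = parse(output)
        if value is not None:
            return value
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before re-reading
    raise RuntimeError(
        "Fail to read the number of unweighted events in the combine.log file"
    )
```

In this sketch the path would be built the same way as in the line above, i.e. from the process directory, 'SubProcesses' and 'combine.log'; the number of attempts and the delay are arbitrary defaults, in the spirit of "increasing the sleep time to be safe".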

- Matthew

P.S. This is for 1.5.14. I don't know if this is also the case in the v2 series.

Olivier Mattelaer (olivier-mattelaer) wrote :

Thanks so much for this :-)

Olivier

Changed in mg5amcnlo:
status: New → Fix Committed
Changed in mg5amcnlo:
status: Fix Committed → Fix Released
Niklas Garner (nkgarner) wrote :

I seem to be having a similar problem on a Condor cluster. I tried the suggested fix and increased the sleep time to 120 s, but the problem does not seem to be resolved. I tried the fix on both 1.5.14 and 2.0.1, and both returned the same error.

There is a combine.log file in the SubProcesses folder, but it contains only:

"/var/lib/condor/execute/dir_4921/condor_exec.exe: line 6: ../bin/internal/combine_events: No such file or directory"

combine_events is where it should be, so I don't think that is the real problem. Any suggestions?

Thanks,

Niklas

Matthew Low (mattlow) wrote :

Hi,

Actually, in some instances I have also still experienced this problem on SLURM. After the failure, the combine.log file looks fine. Is there a way to still use the run and create all the LHE/HEP/ROOT files afterwards? I have tried running bin/madevent combine_events run_01 --tag=tag_1, but this does not work, and it appears to mess up combine.log as well.

Perhaps there is a way to skip the combine step during generate_events and then run it manually afterwards?

Thanks,
- Matthew

Andrea Peterson (adp777) wrote :

Hi,
I am also having this issue in version 2.1.0 on Condor. Is there a way to still use the runs to create the LHE files?
Thanks,
Andrea

Olivier Mattelaer (olivier-mattelaer) wrote : Re: [Bug 1280051] Fail to read the number of unweighted events in the combine.log file

Hi Andrea,

Sorry for the late reply.

The only possibility would be to run

./bin/madevent
and then type combine_events [RUN_NAME]

But I have no guarantee that this is going to work.
Actually, I think the problem might be that this step requires a central disk (read-only access), while no other part of the code needs one. Making this step work without a central disk is on my to-do list.

Cheers,

Olivier

On Mar 23, 2014, at 3:36 PM, Andrea Peterson <email address hidden> wrote:

> Hi,
> I am also having this issue in version 2.1.0 on Condor. Is there a way to still use the runs to create the LHE files?
> Thanks,
> Andrea