madgraph crash when combining runs

Bug #1749632 reported by Antonios Leisos
Affects: MadGraph5_aMC@NLO
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

In MadGraph 2.6.1 I am generating 100k events. After the jobs completed, during the combination of runs, I got a crash:
INFO: Idle: 0, Running: 1, Completed: 159 [ 9h 2m ]
INFO: Idle: 0, Running: 1, Completed: 159 [ 9h 12m ]
INFO: All jobs finished
INFO: Idle: 0, Running: 0, Completed: 160 [ 9h 22m ]
INFO: Combining runs
Error when reading /afs/cern.ch/work/l/leisos/public/MG5_aMC_v2_6_1/VBS_LT012_ZZ_FT0_new/SubProcesses/P1_qq_zzqq_z_ll_z_ll/G103i2/results.dat
Command "generate_events " interrupted with error:
ValueError : empty string for float()
Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in '/afs/cern.ch/work/l/leisos/public/MG5_aMC_v2_6_1/VBS_LT012_ZZ_FT0_new/run_01_tag_1_debug.log'.
Please attach this file to your report.
quit
INFO:

The folder G103i2 contains 4 files:
-bash-4.1$ more results.dat
end-code not correct 2
-bash-4.1$

-bash-4.1$ more input_sg.txt
    5000 9 3
    -3745428.24213
2
1
0
103
-bash-4.1$

-bash-4.1$ more moffset.dat
61
-bash-4.1$

events.lhe is empty

The STDOUT of ajob160 is the following:
-bash-4.1$ more /afs/cern.ch/work/l/leisos/public/MG5_aMC_v2_6_1/VBS_LT012_ZZ_FT0_new/SubProcesses/P1_qq_zzqq_z_ll_z_ll/LSFJOB_143428705/STDOUT
@(#)CERN job starter $Date: 2010/06/23 14:22:16 $
Working directory is </pool/lsf/leisos/143428705> on <b6a36a9c52.cern.ch>

At line 398 of file unwgt.f (unit = 25, file = '')
Fortran runtime error: Connection timed out
rm: cannot remove `results.dat': No such file or directory
ERROR DETECTED

Job finished at Wed Feb 14 19:03:35 CET 2018 on node
 under linux version Scientific Linux CERN SLC release 6.9 (Carbon)

CERN statistics: This process used approximately : 0:09:22 KSI2K hours (562 KSI2K seconds)
                 This process corresponds to : 0:36:33 HS06 hours (2193 HS06 seconds)

Is there a way to rerun only this specific job and continue from this point, without rerunning all the jobs?
Or is it possible to combine the rest of the events, i.e. excluding this directory?

Antonios Leisos (leisos) updated the description.

Olivier Mattelaer (olivier-mattelaer) wrote :

Hi,

A connection timeout probably means that some disk space could not be mounted on the machine that you were using. Therefore I do not think that I can do anything about this.

If you relaunch the same job, do you get the same output/bug?
I would guess that you will not reproduce it (unless, which is unlikely, this is a hardware problem on some node and you end up on the same node again).

Cheers,

Olivier
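
Before relaunching, it can help to check whether other channel directories were hit by the same problem. The following is only a minimal sketch, not part of MadGraph: it assumes that a healthy results.dat starts with a line of whitespace-separated numbers that Python's float() can parse, and it flags every directory where that is not the case (for example a file containing "end-code not correct 2", or an empty file). The function names are hypothetical, and the exact layout of results.dat may differ between versions.

```python
import os
import sys


def first_line_is_numeric(path):
    """Return True if the first line of `path` is non-empty and every
    whitespace-separated token on it parses as a float.

    ASSUMPTION: a valid results.dat begins with such a numeric line.
    """
    try:
        with open(path) as handle:
            tokens = handle.readline().split()
    except OSError:
        # Unreadable file (e.g. filesystem hiccup): treat as suspect.
        return False
    if not tokens:
        return False
    try:
        for token in tokens:
            float(token)
    except ValueError:
        return False
    return True


def find_suspect_channels(subprocesses_dir):
    """Walk `subprocesses_dir` and return, sorted, every directory whose
    results.dat does not start with a numeric line."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(subprocesses_dir):
        if "results.dat" in filenames:
            candidate = os.path.join(dirpath, "results.dat")
            if not first_line_is_numeric(candidate):
                suspects.append(dirpath)
    return sorted(suspects)


if __name__ == "__main__":
    # Usage: python check_results.py /path/to/process_dir/SubProcesses
    for directory in find_suspect_channels(sys.argv[1]):
        print(directory)
```

Run on the SubProcesses folder of the process directory, this would be expected to list G103i2 for the case reported here. It only reports suspect directories and does not modify or resubmit anything.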

Antonios Leisos (leisos) wrote : Re: [Bug 1749632] Re: madgraph crash when combining runs

Thank you Olivier,

you are right. I resubmitted the jobs and everything worked fine.
Cheers,
Antonios

> On 20 Feb 2018, at 21:10, Olivier Mattelaer <email address hidden> wrote:
> [...]

Changed in mg5amcnlo:
status: New → Invalid