MadGraph5_aMC@NLO

madgraph crash when combining runs

Bug #1749632 reported by Antonios Leisos on 2018-02-15

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	MadGraph5_aMC@NLO	Invalid	Undecided	Unassigned

Bug Description

In MadGraph 2.6.1 I am generating 100k events. After all jobs completed, the run crashed while combining the runs:
INFO: Idle: 0, Running: 1, Completed: 159 [ 9h 2m ]
INFO: Idle: 0, Running: 1, Completed: 159 [ 9h 12m ]
INFO: All jobs finished
INFO: Idle: 0, Running: 0, Completed: 160 [ 9h 22m ]
INFO: Combining runs
Error when reading /afs/cern.ch/work/l/leisos/public/MG5_aMC_v2_6_1/VBS_LT012_ZZ_FT0_new/SubProcesses/P1_qq_zzqq_z_ll_z_ll/G103i2/results.dat
Command "generate_events " interrupted with error:
ValueError : empty string for float()
Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in '/afs/cern.ch/work/l/leisos/public/MG5_aMC_v2_6_1/VBS_LT012_ZZ_FT0_new/run_01_tag_1_debug.log'.
Please attach this file to your report.
quit
INFO:

The folder G103i2 contains 4 files:
-bash-4.1$ more results.dat
end-code not correct 2
-bash-4.1$

-bash-4.1$ more input_sg.txt
5000 9 3
-3745428.24213
2
1
0
103
-bash-4.1$

-bash-4.1$ more moffset.dat
61
-bash-4.1$

events.lhe is empty
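
For anyone hitting the same symptom, a minimal sketch of how to locate affected channel directories before the combine step. It assumes the `SubProcesses/P*/G*/results.dat` layout seen in the path above, and it assumes that a healthy `results.dat` begins with a numeric field (the broken one here begins with `end-code`). The function names are hypothetical and not part of MadGraph.

```python
from pathlib import Path


def first_field_is_numeric(path):
    """Return True if the first field on the first line of `path` parses as a float."""
    try:
        with open(path) as handle:
            fields = handle.readline().split()
    except OSError:
        # Missing or unreadable file: treat it as not numeric.
        return False
    if not fields:
        # Empty file or blank first line.
        return False
    try:
        float(fields[0])
    except ValueError:
        return False
    return True


def find_suspect_channels(process_dir):
    """List G* channel directories whose results.dat is missing, empty, or not numeric.

    Assumption: a valid results.dat starts with a number. This is a heuristic,
    not the check MadGraph itself performs.
    """
    suspects = []
    for channel in sorted(Path(process_dir).glob("SubProcesses/P*/G*")):
        if not channel.is_dir():
            continue
        if not first_field_is_numeric(channel / "results.dat"):
            suspects.append(channel)
    return suspects
```

Calling `find_suspect_channels("VBS_LT012_ZZ_FT0_new")` from the directory that contains the process folder would be expected to list `G103i2` in the case reported here; it only reports directories and does not modify anything.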

The STDOUT of ajob160 is the following
-bash-4.1$ more /afs/cern.ch/work/l/leisos/public/MG5_aMC_v2_6_1/VBS_LT012_ZZ_FT0_new/SubProcesses/P1_qq_zzqq_z_ll_z_ll/LSFJOB_143428705/STDOUT
@(#)CERN job starter $Date: 2010/06/23 14:22:16 $
Working directory is </pool/lsf/leisos/143428705> on <b6a36a9c52.cern.ch>

At line 398 of file unwgt.f (unit = 25, file = '')
Fortran runtime error: Connection timed out
rm: cannot remove `results.dat': No such file or directory
ERROR DETECTED

Job finished at Wed Feb 14 19:03:35 CET 2018 on node
under linux version Scientific Linux CERN SLC release 6.9 (Carbon)

CERN statistics: This process used approximately : 0:09:22 KSI2K hours (562 KSI2K seconds)
This process corresponds to : 0:36:33 HS06 hours (2193 HS06 seconds)
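
A log like the one above can be searched for across all jobs of a run. The following is a rough sketch, assuming the batch output lands in `SubProcesses/P*/LSFJOB_<id>/STDOUT` as in this report and that the two marker strings copied from that log are representative. The names `ERROR_MARKERS` and `find_failed_job_logs` are hypothetical.

```python
from pathlib import Path

# Marker strings taken from the STDOUT shown in this report; other failures
# may print different messages.
ERROR_MARKERS = ("Fortran runtime error", "ERROR DETECTED")


def find_failed_job_logs(process_dir):
    """Map each LSF STDOUT file that contains an error marker to its matching lines."""
    failures = {}
    pattern = "SubProcesses/P*/LSFJOB_*/STDOUT"
    for log in sorted(Path(process_dir).glob(pattern)):
        try:
            text = log.read_text(errors="replace")
        except OSError:
            # Skip logs that cannot be read (e.g. the file system is unavailable).
            continue
        hits = [
            line.strip()
            for line in text.splitlines()
            if any(marker in line for marker in ERROR_MARKERS)
        ]
        if hits:
            failures[log] = hits
    return failures
```

Printing the returned dictionary gives one entry per failed job, which makes it easier to see whether the failures share a cause such as the time-out reported here.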

Is there a way to rerun only the specific job and continue from this point without rerunning all the jobs?
Or combine the rest of the events i.e. excluding this directory?



Antonios Leisos (leisos) wrote on 2018-02-15:

Attachment: run_01_tag_1_debug.log (26.8 KiB, text/plain)

Antonios Leisos (leisos) on 2018-02-15

description:

updated


Olivier Mattelaer (olivier-mattelaer) wrote on 2018-02-20:

Hi,

A connection timeout probably means that some disk space could not be mounted on the machine you were using. I therefore do not think I can do anything about this.

If you relaunch the same job, do you get the same output/bug?
I expect that you will not reproduce it (unless, in the unlikely case, this is a hardware problem on a particular node and your job lands on that same node again).

Cheers,

Olivier


Antonios Leisos (leisos) wrote on 2018-02-22: Re: [Bug 1749632] Re: madgraph crash when combining runs


Thank you Olivier,

You are right. I resubmitted the jobs and everything worked fine.
Cheers,
Antonios



Olivier Mattelaer (olivier-mattelaer) on 2018-04-11

Changed in mg5amcnlo:
status:	New → Invalid
