running MG5 on a condor-based cluster

Bug #1914835 reported by Oscar Eboli
This bug affects 1 person
Affects: MadGraph5_aMC@NLO
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Good evening,

I'm having trouble running MG5 on a condor cluster, since it always ends with the error message

Error when reading /home/oeboli/Tools/3.0.3-neworders/T1/SubProcesses/P1_qq_ll/G1a0/results.dat
Command "generate_events " interrupted with error:
ValueError : could not convert string to float:
Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in '/home/oeboli/Tools/3.0.3-neworders/T1/run_05_tag_1_debug.log'.
Please attach this file to your report.

I checked the file results.dat and its content is

end-code not correct 127

I have attached the log file. If I run the process in multicore mode, it finishes without any problem.

Thanks a lot for your help, Oscar
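
The ValueError above is consistent with the run driver trying to read numbers from a results.dat that holds an error message instead. Below is a minimal Python sketch of a more defensive reader. It assumes, hypothetically, that a healthy file starts with a whitespace-separated row of numbers; `read_first_row` is an illustrative helper and not part of MG5_aMC.

```python
def read_first_row(text):
    """Parse the first line of a results.dat-like text into floats.

    Hypothetical helper: on failure it raises a ValueError that quotes
    the offending line, which is easier to act on than the bare
    "could not convert string to float" message.
    """
    lines = text.splitlines()
    first = lines[0] if lines else ""
    try:
        return [float(tok) for tok in first.split()]
    except ValueError:
        raise ValueError("results.dat does not start with numbers: %r" % first)


if __name__ == "__main__":
    # Made-up numeric row, for illustration only.
    print(read_first_row("1.5 0.02 3\n"))
    try:
        # The content reported in this bug.
        read_first_row("end-code not correct 127\n")
    except ValueError as err:
        print(err)
```

With such a check, the failing job's own message ("end-code not correct 127") would show up directly in the error report.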

Olivier Mattelaer (olivier-mattelaer) wrote : Re: [Bug 1914835] [NEW] running MG5 on a condor-based cluster

Hi,

Please merge your development branch with the latest stable branch (version 2.9.1.2 of the code).
This python3 issue has been fixed in the official version; we do not merge such fixes into all the alpha versions of the code.
This branch has been merged into an upper branch, so we will not update it anymore.

I know that the SMEFTNLO paper suggests using this type of branch, but this is not something that we can support.
They try to track where we are in the merging procedure in order to change their recommendation, but the true statement is that
MG5aMC@NLO does not support the SMEFTNLO model for the moment.

3.0.4 will soon be in a beta release, and then we will have our first official support for such a model.

Cheers,

Olivier

Oscar Eboli (oeboli) wrote :

Dear Olivier,

thanks for the prompt reply. However, I have a question: how do I merge two
different branches?

Thanks a lot, Oscar

Olivier Mattelaer (olivier-mattelaer) wrote :

Hi,

I will let the SMEFTNLO authors see what they can do to help you with such a merge (they are in copy of this thread, but you can also contact them directly at <email address hidden>).
If you do not know how to do such a merge, you are even less likely to understand how to fix all the conflicts that the procedure will create.

Cheers,

Olivier

Oscar Eboli (oeboli) wrote :

Dear Olivier,

I'm probably missing something but I still have the same problems as before
with version 2.9.1.2 of MG5; see the enclosed files.

What should I try?

thanks a lot, Oscar

Olivier Mattelaer (olivier-mattelaer) wrote :

Hum, then this is not the same bug as the one I fixed before.
Which version of condor do you use?

Could you check in your cluster manual, or with your sys-admin, whether the following feature is supported:
https://htcondor.readthedocs.io/en/latest/users-manual/file-transfer.html
and that no cluster-specific option/plugin modifies its default behaviour?

Then finally, could you add a "print(text)"
after
        text = """Executable = %(prog)s
                  output = %(stdout)s
                  error = %(stderr)s
                  log = %(log)s
                  %(argument)s
                  should_transfer_files = YES
                  when_to_transfer_output = ON_EXIT
                  transfer_input_files = %(input_files)s
                  %(output_files)s
                  Universe = vanilla
                  notification = Error
                  Initialdir = %(cwd)s
                  %(requirement)s
                  getenv=True
                  queue 1
               """

in madgraph/various/cluster.py (around line 958)

and then report the associated printout here? If you contact your sys-admin, you can ask him whether that file is correct for your cluster.
On my side, the key element to check is that "transfer_input_files" contains the associated executable, "madevent" or "madevent_forhel" (for LO runs), if this is not the path indicated in the Executable field (since your end code, 127, seems to indicate that the executable was not found).

Another potential issue suggested by your bug report is that the code does not find some of the standard Fortran libraries on the node on which you run. This can happen if you need to perform some action on the node before such libraries become available. On most clusters it is enough to set that up on the submission node, and it will be propagated to the execution node (thanks to getenv=True), but you might also want to check your cluster documentation on how to use Fortran code on your cluster.
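
One way to test the missing-library hypothesis is to look at what `ldd` reports for the executable on a worker node. The sketch below only parses `ldd`-style text; `missing_shared_libs` is a hypothetical helper, and the sample output is illustrative rather than taken from this report.

```python
def missing_shared_libs(ldd_output):
    """Return the library names that ldd reported as "not found"."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved libraries are printed as "<name> => not found".
        if "=>" in line and "not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


if __name__ == "__main__":
    # Illustrative ldd output, not taken from the bug report.
    sample = (
        "\tlinux-vdso.so.1 (0x00007ffd0b5f2000)\n"
        "\tlibgfortran.so.5 => not found\n"
        "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f1c2a000000)\n"
    )
    print(missing_shared_libs(sample))
```

On a node, the input could come from something like `subprocess.run(["ldd", "./madevent"], capture_output=True, text=True).stdout`, assuming the executable is in the job's working directory.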

Cheers,

Olivier

Oscar Eboli (oeboli) wrote : Re: [Bug 1914835] Re: running MG5 on a condor-based cluster

Dear Olivier,

The answers to your questions are as follows:

On Fri, 12 Feb 2021 at 17:01, Olivier Mattelaer <<email address hidden>> wrote:

> Hum, then this is not the same bug as the one I was fixing before.
> Which version of condor did you use?
>
>
8.8.12-1.el7

> Could you check within your cluster manual/sys admin if the following
> feature is supported:
> https://htcondor.readthedocs.io/en/latest/users-manual/file-transfer.html
> and that no cluster specific option/plugin modifies his default behaviour?
>
>
From what I saw, the file transfer feature is enabled. We don't have a
/home directory on the nodes, so I imagine the files are transferred for
the run.

> Then finally, could you add a "print(text)"
> after
> text = """Executable = %(prog)s
> output = %(stdout)s
> error = %(stderr)s
> log = %(log)s
> %(argument)s
> should_transfer_files = YES
> when_to_transfer_output = ON_EXIT
> transfer_input_files = %(input_files)s
> %(output_files)s
> Universe = vanilla
> notification = Error
> Initialdir = %(cwd)s
> %(requirement)s
> getenv=True
> queue 1
> """
>
> in madgraph/various/cluster.py (around line 958)
>
> and then report here the associated printout if you contact your
> sys-admin, you can ask him if that file is correct for your cluster.
> On my side, the key element to check is that the "transfer_input_files"
> you have the associated executable "madevent" or "madevent_forhel" (for LO
> run) if this is not the path indicated in the Executable field. (since your
> flag seems to indicate that you fail to find the executable.
>
>
the output for transfer_input_files is

transfer_input_files = %(input_files)s

Thanks a lot Oscar

Olivier Mattelaer (olivier-mattelaer) wrote : Re: [Bug 1914835] running MG5 on a condor-based cluster

> the output for transfer_input_files is
>
> transfer_input_files = %(input_files)s

That is not informative, actually.
A bit below, you should have a line like

        dico = {'prog': prog, 'cwd': cwd, 'stdout': stdout,
                'stderr': stderr,'log': log,'argument': argument,
                'requirement': requirement, 'input_files':input_files,
                'output_files':output_files}

Could you add a print(dico) so that we can see that information?
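
The reason print(text) only shows placeholders is that the template is filled in later with Python's `%` formatting. A minimal sketch of that step, using a shortened copy of the template and made-up values; `render_submit` and `executable_is_shipped` are hypothetical names, not functions from cluster.py:

```python
import os

# Shortened copy of the submit template quoted earlier in this thread.
TEMPLATE = """Executable = %(prog)s
transfer_input_files = %(input_files)s
%(output_files)s
queue 1
"""


def render_submit(dico):
    """Fill the %(name)s placeholders from a dictionary."""
    return TEMPLATE % dico


def executable_is_shipped(dico, names=("madevent", "madevent_forhel")):
    """True if one of `names` appears among the transferred input files."""
    shipped = {os.path.basename(p) for p in dico["input_files"].split(",")}
    return any(name in shipped for name in names)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    dico = {
        "prog": "/path/to/SubProcesses/survey.sh",
        "input_files": "madevent,input_app.txt",
        "output_files": "transfer_output_files = G1,G2",
    }
    print(render_submit(dico))
    print(executable_is_shipped(dico))
```

Printing the rendered text (or the dictionary itself) shows the actual file list, which is the information needed here.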

Thanks,

Olivier

Oscar Eboli (oeboli) wrote :

the output is

{'prog': '/home/oeboli/Tools/MG5_aMC_v2_9_1_2/T1/SubProcesses/survey.sh',
'cwd': '/home/oeboli/Tools/MG5_aMC_v2_9_1_2/T1/SubProcesses/P1_qq_ll',
'stdout': '/dev/null', 'stderr': '/dev/null', 'log': '/dev/null',
'argument': 'Arguments = 0 1 2', 'requirement': '', 'input_files':
'madevent,input_app.txt,symfact.dat,iproc.dat,dname.mg,/home/oeboli/Tools/MG5_aMC_v2_9_1_2/T1/SubProcesses/randinit,/home/oeboli/Tools/MG5_aMC_v2_9_1_2/T1/lib/Pdfdata/NNPDF23_lo_as_0130_qed_mem0.grid',
'output_files': 'transfer_output_files = G1,G2'}

Thanks

Oscar Eboli (oeboli) wrote :

I don't know if this helps, but
while searching the logs I found the file SubProcesses/..../G2/log.txt, whose
content is

-bash: les: command not found

The similar file G1/log.txt looked ok.

Olivier Mattelaer (olivier-mattelaer) wrote :

Thanks, this is helpful since it indicates that the issue is not that the executable is missing but something else.

So it looks like the code is trying to execute some "les" command, and this explains why it returns end code 127. Now, this is really weird since
the G1 directory is created and run before G2.
It is the first time someone reports an issue on G2 while G1 goes through (in the past the opposite, G1 failing but not G2, happened due to slow file transfer making the executable/input missing for G1 but available for G2).
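
For reference, 127 is the exit status a POSIX shell returns when a command cannot be found, which matches the "command not found" message in log.txt. A small Python sketch that reproduces this, assuming a POSIX `/bin/sh` is available; `shell_exit_code` is just an illustrative helper:

```python
import subprocess


def shell_exit_code(command):
    """Run `command` through the shell and return its exit status."""
    return subprocess.run(
        command,
        shell=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode


if __name__ == "__main__":
    # The command name is deliberately nonsensical so that no such program exists.
    print(shell_exit_code("no_such_command_for_this_demo"))
    print(shell_exit_code("true"))
```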

Now the issue is that we do not have any "les" command in that file.
The only line that could match in that script is a comment line (and it contains "less", not "les"):
# Perform some cleaning to keep less file on disk/transfer less file.

Looks like "les" is a network program for ATM in Linux:
https://command-not-found.com/les

One potential solution is to split that job in two, so that G1 and G2 are handled by two separate jobs. You can do that by having the following line in the run_card:
   1 = survey_nchannel_per_job ! control how many Channel are integrated inside a single job on cluster/multicore

But this obviously does not solve the above issue, so it might pop up again...
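
If the run_card has to be changed for many process directories, the edit can be scripted. The sketch below assumes the usual "value = name ! comment" layout of run_card lines; `set_run_card_value` is a hypothetical helper, and the sample card lines are made up.

```python
import re


def set_run_card_value(card_text, name, value):
    """Set `value = name` in run_card-style text, appending the line if absent."""
    pattern = re.compile(
        r"^(\s*)\S+(\s*=\s*%s\b)" % re.escape(name), re.MULTILINE
    )
    # Keep the indentation and everything from " = name" onwards (comment included).
    new_text, count = pattern.subn(
        lambda m: m.group(1) + str(value) + m.group(2), card_text
    )
    if count == 0:
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += "  %s = %s\n" % (value, name)
    return new_text


if __name__ == "__main__":
    # Made-up card fragment, for illustration only.
    card = (
        "  10000 = nevents ! Number of events\n"
        "  2 = survey_nchannel_per_job ! channels per job\n"
    )
    print(set_run_card_value(card, "survey_nchannel_per_job", 1))
```

The helper only handles values without spaces, which is enough for an integer setting like this one.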

Olivier

Oscar Eboli (oeboli) wrote : Re: [Bug 1914835] Re: running MG5 on a condor-based cluster

Dear Olivier,

I'm still struggling with the bug. I made sure that the server and nodes
have the same packages.
The jobs run on the server but some of the condor runs do not end as
expected. I have two additional
pieces of information that might give you a hint:

1. once I got a message that the condor run terminated without the
generation of a results.dat

2. I found out that for p p > w+ w- the run finishes without error
depending on the seed of the random
numbers!

Do you have any suggestions?

Thanks a lot, Oscar

Olivier Mattelaer (olivier-mattelaer) wrote : Re: [Bug 1914835] running MG5 on a condor-based cluster

> Do you have any suggestions?

Not really. All those elements point to an issue related to some node/...
The only suggestion is to set in the run_card
>> 1 = survey_nchannel_per_job ! control how many Channel are integrated

and to increase cluster_nb_retry in mg5_configuration.txt to a large number.
After that, if a job fails, it should be re-submitted enough times to make it through.
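
To illustrate why a large retry count helps with intermittent node failures, here is a generic retry loop in Python. It is only a sketch of the idea and not how MG5_aMC implements cluster_nb_retry; `run_with_retries` and the fake job are hypothetical.

```python
def run_with_retries(job, nb_retry):
    """Call `job()` until it returns 0, allowing up to `nb_retry` re-submissions.

    Returns the number of attempts used, or raises RuntimeError if every
    attempt failed.
    """
    for attempt in range(1, nb_retry + 2):  # first try + nb_retry retries
        if job() == 0:
            return attempt
    raise RuntimeError("job failed after %d attempts" % (nb_retry + 1))


if __name__ == "__main__":
    # Fake job that fails twice with code 127, then succeeds.
    codes = iter([127, 127, 0])
    print(run_with_retries(lambda: next(codes), nb_retry=5))
```

If the failures are random (for example, tied to one bad node), each extra retry lowers the chance that a channel never completes, at the cost of a longer run.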

Cheers,

Olivier

> On 24 Feb 2021, at 18:48, Oscar Eboli <email address hidden> wrote:
>
> Dear Olivier,
>
> I'm still struggling with the bug. I made sure that the server and nodes
> have the same packages.
> The jobs run on the server but some of the condor runs do not end as
> expected. I have two additional pieces of information that might give
> you a hint:
>
> 1. Once, I got a message that the condor run terminated without the
> generation of a results.dat.
>
> 2. I found out that for p p > w+ w- the run finishes without error
> depending on the seed of the random numbers!
>
> Do you have any suggestions?
>
> Thanks a lot, Oscar
>
>
>
> On Wed, 17 Feb 2021 at 05:15, Olivier Mattelaer
> <email address hidden> wrote:
>
>> Thanks, this is helpful since it indicates that the issue is not that
>> the executable is missing but something else.
>>
>> So it looks like the code is trying to execute some "les" command, and this
>> explains why it returns end code 127. Now this is really weird since
>> the G1 directory is created and run before G2.
>> It is the first report where someone reports an issue on G2 while G1 is
>> going through (in the past the opposite, G1 failing but not G2, was happening
>> due to slow file transfer making the executable/input missing for G1 but
>> available for G2).
>>
>> Now the issue is that we do not have any "les" command in that file.
>> The only line that could match in that script is a comment line (and not
>> les but less)
>> # Perform some cleaning to keep less file on disk/transfer less file.
>>
>> Looks like "les" is a network program for ATM in Linux:
>> https://command-not-found.com/les
>>
>> One potential solution is to split that job in two such that G1 and G2 are
>> handled by two separate jobs. You can do that by having the following line
>> in the run_card:
>> 1 = survey_nchannel_per_job ! control how many Channel are integrated
>> inside a single job on cluster/multicore
>>
>> But this obviously does not solve the above issue so it might pop
>> again...
>>
>> Olivier
>>

Revision history for this message
Olivier Mattelaer (olivier-mattelaer) wrote :

One thing that you can check is whether the package
"atm-tools"
is available on all nodes.
This is not a module needed by MG5aMC.
I only pointed you to that one because of this website:
http://bryan-murdock.blogspot.com/2010/10/how-to-disable-ubuntu-command-not-found.html

Another idea to check is whether your condor cluster can suspend your job, and whether the crash is linked to a suspended job (where the bad atm-tools is making the suspension fail).

Actually, at this stage your local sys-admin might know better than me what can cause such an issue.
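The check on each node can be sketched in a few lines of Python. This is a minimal illustration, not part of MG5aMC: it looks for a "les" executable on the PATH and shows that a POSIX shell reports exit status 127 when asked to run a command it cannot find, which is the end code seen in results.dat. The function names and the placeholder command name are hypothetical; to probe the worker nodes, the script would have to be run there, for instance as a small test job.

```python
import shutil
import subprocess


def command_available(name: str) -> bool:
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


def shell_exit_code(command: str) -> int:
    """Run `command` through `sh -c` and return its exit status."""
    result = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    # "les" is the command name discussed in this thread.
    print("les on PATH:", command_available("les"))
    # A deliberately nonexistent (hypothetical) command name.
    print("exit status:", shell_exit_code("hypothetical_missing_command_xyz"))
```

Comparing the output between the submit machine and a node that produced a failing job would show whether the two environments really differ for this command.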

Cheers,

Olivier


Revision history for this message
Oscar Eboli (oeboli) wrote :

Olivier,

thanks a lot for the help. I'll try to get hold of the system manager for
the Alice experimental group because my system manager (me) is rather
incompetent :-)

A last question: our system is configured in such a way that users have a home
directory on the nodes that is empty, i.e., different from where MG5 is
installed. Does MG5 require access to its installation?

Greetings, Oscar


Revision history for this message
Olivier Mattelaer (olivier-mattelaer) wrote :

Hi,

> A last question: our system is configured in such a way that users have a home
> directory on the nodes that is empty, i.e., different from where MG5 is
> installed. Does MG5 require access to its installation?

No, this should not be a problem. We do not even need the filesystem where MG5/the process directory is installed to be mounted on the nodes of the cluster.

Cheers,

olivier


Changed in mg5amcnlo:
status: New → Incomplete
Revision history for this message
Oscar Eboli (oeboli) wrote : Re: [Bug 1914835] Re: running MG5 on a condor-based cluster

Dear Olivier,

Thanks for the help in making MadGraph work on our system.
We were finally able to run MadGraph on our cluster. As far as
I understand, MadGraph was not transferring all the files that are needed,
so we started exporting the disk to all machines and MG5 started working.
We still have some issues with PYTHIA, but we are trying to pinpoint the
cause.

Best, Oscar


Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MadGraph5_aMC@NLO because there has been no activity for 60 days.]

Changed in mg5amcnlo:
status: Incomplete → Expired