MadGraph5_aMC@NLO

thread.error: can't start new thread

Bug #1784866 reported by Dirk Hufnagel on 2018-08-01

This bug affects 1 person

Affects              Status    Importance   Assigned to   Milestone
MadGraph5_aMC@NLO    Invalid   Undecided    Unassigned

Bug Description

In CMS we are using MadGraph internally to generate LHC collision events. In such production jobs on KNL machines we observe very high failure rates from within MadGraph. The problem is that MadGraph seems to hit the system process/thread limits.

  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 608, in __init__
    self.start_demon()
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 615, in start_demon
    t.start()
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread

We are running this on KNL nodes with 272 logical cores, with several of these processes in parallel on the same machine. The system process/thread limit (ulimit -u) is 2048.
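
To put those numbers together, here is a rough sketch of the arithmetic. It assumes, purely as a hypothesis, that every process starts roughly one thread per logical core. The helper name `max_processes_before_limit` is made up for illustration; `resource.getrlimit` and `multiprocessing.cpu_count` are standard-library calls, and the `resource` part is Linux/Unix only.

```python
import multiprocessing
import resource


def max_processes_before_limit(nproc_limit, threads_per_process):
    """How many identical processes fit under a per-user thread limit.

    Hypothetical helper: on Linux, threads count against RLIMIT_NPROC
    (the value `ulimit -u` reports), so all processes share one budget.
    """
    if threads_per_process <= 0:
        raise ValueError("threads_per_process must be positive")
    return nproc_limit // threads_per_process


if __name__ == "__main__":
    # Numbers from this report: limit of 2048, 272 logical cores.
    print(max_processes_before_limit(2048, 272))  # 7

    # The same estimate for the machine this script runs on.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    cores = multiprocessing.cpu_count()
    if soft == resource.RLIM_INFINITY:
        print("no per-user process/thread limit set")
    else:
        print(max_processes_before_limit(soft, cores))
```

Under that assumption, about seven such processes would be enough to exhaust a limit of 2048 on a 272-core node, before counting anything else the same user is running.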

Is it possible that MadGraph is internally spawning as many threads as there are cores in the system? If so, is there a way to prevent this from happening or to limit it?
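
For illustration only, this is what "limiting it" could look like in generic Python: a worker pool whose size is the core count clipped to a fixed cap. This is not MadGraph's code; the function `start_workers` and the default cap of 4 are assumptions made up for this sketch.

```python
import multiprocessing
import queue
import threading


def start_workers(tasks, max_threads=4):
    """Start at most `max_threads` daemon threads draining the queue `tasks`.

    Illustrative only: the cap (default 4) is an arbitrary assumption.
    Each item in `tasks` must be a callable taking no arguments.
    """
    # Never start more threads than the cap, whatever the core count is.
    n_threads = max(1, min(multiprocessing.cpu_count(), max_threads))

    def worker():
        while True:
            job = tasks.get()
            try:
                job()
            finally:
                # Mark the job as handled even if it raised.
                tasks.task_done()

    threads = []
    for _ in range(n_threads):
        t = threading.Thread(target=worker)
        t.daemon = True  # do not keep the process alive at exit
        t.start()
        threads.append(t)
    return threads
```

With a cap like this, a 272-core node would start the same number of threads as a 4-core one, so the thread count per process no longer scales with the hardware.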

We are occasionally also seeing this error on non-KNL nodes, but it's especially bad on KNL, which makes me suspect something related to the high number of logical cores on these machines.

Cheers

Dirk

Dirk Hufnagel (hufnagel6) on 2018-08-01

description:	updated

Kenneth Long (kdlong-e) wrote on 2018-08-01:

Something seems a little off with runmode=0.

If I run "./run.sh 100 1" with runmode=0 on a 36-core machine, I see 72 threads created. They mostly seem to be idle, however. If I set runmode=2 but nb_core=1, I see 3 threads. Is this intentional?
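
One way to get such thread counts on Linux is to read the "Threads:" field of /proc/<pid>/status. A minimal sketch follows; the helper name `count_threads` is made up here, and the approach is Linux-specific.

```python
import os


def count_threads(pid):
    """Return the number of threads of process `pid` (Linux only).

    Reads the "Threads:" field from /proc/<pid>/status.
    """
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError("no Threads: field found for pid %d" % pid)


if __name__ == "__main__":
    # Thread count of the current Python process.
    print(count_threads(os.getpid()))
```

Pointing it at the pid of the running gridpack process at intervals would show whether the count follows the number of cores or the nb_core setting.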

I'm using a gridpack created in 2.4.2 but can test a newer version if something has changed.

Thanks,

Kenneth

Olivier Mattelaer (olivier-mattelaer) wrote on 2018-08-06: Re: [Bug 1784866] thread.error: can't start new thread

Hi,

The "cluster" support in the gridpack has changed quite a lot since 2.4.2,
so can you test with the latest version to see whether you can still reproduce this?

It would also be interesting to see the full traceback, so that one can spot where the
multicore cluster is initialised.
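
As a sketch of how such a traceback could be collected, the thread start can be wrapped so that a failure logs the complete call stack before re-raising. The wrapper `start_thread_logged` is hypothetical and not part of MadGraph. It is written for Python 3, where a failed start raises RuntimeError; the Python 2.7 in the report raises thread.error instead.

```python
import threading
import traceback


def start_thread_logged(thread, log=print):
    """Start `thread`; on failure, log the thread count and full call stack.

    Hypothetical helper, not part of MadGraph. Python 3 raises RuntimeError
    when a thread cannot be started (Python 2 raised thread.error).
    """
    try:
        thread.start()
    except RuntimeError:
        log("thread start failed with %d live Python threads"
            % threading.active_count())
        # format_stack() lists every caller frame leading to this point,
        # which shows where the thread creation was requested from.
        log("".join(traceback.format_stack()))
        raise
```

The exception is re-raised, so the job still fails as before; only the extra log lines are new.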

Cheers,

Olivier

Olivier Mattelaer (olivier-mattelaer) on 2018-11-10

Changed in mg5amcnlo:
status:	New → Invalid

Efe Yazgan (efe-yazgan) wrote on 2018-12-19:

Hi Olivier,

The latest official version we have in CMS is 2.6.0. Do you expect the problem to be solved in 2.6.0 already, or only in later versions?

Thanks,
Efe

Olivier Mattelaer (olivier-mattelaer) wrote on 2018-12-19:

Hi,

Well, the main change in the gridpack handling that I was referring to above is in 2.6.1.
For the rest, I'm still "waiting" for the full traceback of when this occurs, so that I can see whether that call is still present (in 2.6.0 or later...).

So the short answer is I do not know.

Cheers,

Olivier
