thread.error: can't start new thread
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MadGraph5_aMC@NLO | Invalid | Undecided | Unassigned |
Bug Description
In CMS we are using madgraph internally to generate LHC collision events. In such production jobs on KNL machines we observe very high failure rates from within madgraph. The problem is that madgraph seems to hit the system process/thread limit.
File "/tmp/glide_
self.
File "/tmp/glide_
t.start()
File "/cvmfs/
_start_
thread.error: can't start new thread
We are running this on KNL nodes with 272 logical cores. We are running several of these processes in parallel on the same machine. The system process/thread limit (ulimit -u) is 2048.
Is it possible that madgraph is internally spawning as many threads as there are cores in the system? If so, is there a way to prevent this from happening or to limit it?
We are occasionally also seeing this error on non-KNL nodes, but it's especially bad on KNL, which makes me suspect something related to the high number of logical cores on these machines.
Cheers
Dirk
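To make the suspicion above concrete, here is a rough Python sketch of the arithmetic. It assumes, purely for illustration, that each job starts one thread per logical core (or two per core); this is not confirmed madgraph behaviour. The helper names `max_parallel_jobs` and `current_nproc_limit` are hypothetical and not part of madgraph.

```python
def max_parallel_jobs(limit, threads_per_job, reserved=0):
    """Return how many jobs fit under a per-user process/thread limit.

    limit           -- the value reported by `ulimit -u`
    threads_per_job -- assumed number of threads one job creates
    reserved        -- slots kept free for shells, daemons, etc.
    """
    if threads_per_job <= 0:
        raise ValueError("threads_per_job must be positive")
    return max((limit - reserved) // threads_per_job, 0)


def current_nproc_limit():
    """Read the soft RLIMIT_NPROC limit (what `ulimit -u` shows); Unix only."""
    import resource  # imported here because the module is not available on all platforms
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return None if soft == resource.RLIM_INFINITY else soft


# Numbers from the report: limit 2048, 272 logical cores.
# Assumption: one thread per logical core.
print(max_parallel_jobs(2048, 272))   # 7
# Assumption: two threads per logical core.
print(max_parallel_jobs(2048, 544))   # 3
```

Under these assumptions only a handful of jobs fit on one node before new threads can no longer be started, which would be consistent with the failures being much worse on KNL than on nodes with fewer cores.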
description: | updated |
Changed in mg5amcnlo: | |
status: | New → Invalid |
Something seems a little off with runmode=0
If I run "./run.sh 100 1" with runmode=0 on a 36 core machine, I see 72 threads created. They mostly seem to be idle, however. If I set runmode=2 but nb_core=1, I see 3 threads. Is this intentional?
I'm using a gridpack created in 2.4.2 but can test a newer version if something has changed.
Thanks,
Kenneth
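For anyone who wants to reproduce thread counts like the ones above, here is a minimal Python sketch that reads the `Threads:` field from `/proc/<pid>/status` on Linux. This is an assumption about how one might measure it, not necessarily how the numbers in this thread were obtained; the function names are hypothetical.

```python
def parse_thread_count(status_text):
    """Extract the thread count from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def thread_count(pid="self"):
    """Return the number of threads of a process (Linux only, needs /proc)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_thread_count(fh.read())


# Parsing a sample status excerpt; the value 72 is only illustrative.
sample = "Name:\tpython\nThreads:\t72\nvoluntary_ctxt_switches:\t5\n"
print(parse_thread_count(sample))  # 72
```

Calling `thread_count(pid)` with the PID of a running job while it is generating events would show whether the count follows the number of cores or the nb_core setting.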