2018-08-01 13:55:14 |
Dirk Hufnagel |
bug |
|
|
added bug |
2018-08-01 13:57:14 |
Dirk Hufnagel |
bug |
|
|
added subscriber Alexander Grohsjean |
2018-08-01 13:57:30 |
Dirk Hufnagel |
bug |
|
|
added subscriber Kenneth Long |
2018-08-01 13:58:30 |
Dirk Hufnagel |
description |
In CMS we are using madgraph internally to generate PHC collision events. In such production jobs on KNL machines I observe very high failure rates from within madgraph. The problem is that madgraph seems to hit the system process/thread limits.
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 608, in __init__
    self.start_demon()
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 615, in start_demon
    t.start()
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
We are running this on KNL nodes with 272 logical cores, with multiple of these processes in parallel on the same machine. The system process/thread limit (ulimit -u) is 2048.
Is it possible that madgraph is internally spawning as many threads as there are cores in the system? If so, is there a way to prevent this from happening or to limit it?
We occasionally also see this error on non-KNL nodes, but it is especially bad on KNL, which makes me suspect something related to the high number of logical cores on these machines.
Cheers
Dirk
PS: Madgraph version seems to be V5_2.4.2 |
In CMS we are using madgraph internally to generate LHC collision events. In such production jobs on KNL machines I observe very high failure rates from within madgraph. The problem is that madgraph seems to hit the system process/thread limits.
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 608, in __init__
    self.start_demon()
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 615, in start_demon
    t.start()
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
We are running this on KNL nodes with 272 logical cores, with multiple of these processes in parallel on the same machine. The system process/thread limit (ulimit -u) is 2048.
Is it possible that madgraph is internally spawning as many threads as there are cores in the system? If so, is there a way to prevent this from happening or to limit it?
We occasionally also see this error on non-KNL nodes, but it is especially bad on KNL, which makes me suspect something related to the high number of logical cores on these machines.
Cheers
Dirk
PS: Madgraph version seems to be V5_2.4.2 |
|
2018-08-01 13:58:48 |
Dirk Hufnagel |
description |
In CMS we are using madgraph internally to generate LHC collision events. In such production jobs on KNL machines I observe very high failure rates from within madgraph. The problem is that madgraph seems to hit the system process/thread limits.
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 608, in __init__
    self.start_demon()
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 615, in start_demon
    t.start()
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
We are running this on KNL nodes with 272 logical cores, with multiple of these processes in parallel on the same machine. The system process/thread limit (ulimit -u) is 2048.
Is it possible that madgraph is internally spawning as many threads as there are cores in the system? If so, is there a way to prevent this from happening or to limit it?
We occasionally also see this error on non-KNL nodes, but it is especially bad on KNL, which makes me suspect something related to the high number of logical cores on these machines.
Cheers
Dirk
PS: Madgraph version seems to be V5_2.4.2 |
In CMS we are using madgraph internally to generate LHC collision events. In such production jobs on KNL machines we observe very high failure rates from within madgraph. The problem is that madgraph seems to hit the system process/thread limits.
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 608, in __init__
    self.start_demon()
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 615, in start_demon
    t.start()
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
We are running this on KNL nodes with 272 logical cores, with multiple of these processes in parallel on the same machine. The system process/thread limit (ulimit -u) is 2048.
Is it possible that madgraph is internally spawning as many threads as there are cores in the system? If so, is there a way to prevent this from happening or to limit it?
We occasionally also see this error on non-KNL nodes, but it is especially bad on KNL, which makes me suspect something related to the high number of logical cores on these machines.
Cheers
Dirk
PS: Madgraph version seems to be V5_2.4.2 |
|
2018-08-01 14:36:59 |
Dirk Hufnagel |
description |
In CMS we are using madgraph internally to generate LHC collision events. In such production jobs on KNL machines we observe very high failure rates from within madgraph. The problem is that madgraph seems to hit the system process/thread limits.
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 608, in __init__
    self.start_demon()
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 615, in start_demon
    t.start()
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
We are running this on KNL nodes with 272 logical cores, with multiple of these processes in parallel on the same machine. The system process/thread limit (ulimit -u) is 2048.
Is it possible that madgraph is internally spawning as many threads as there are cores in the system? If so, is there a way to prevent this from happening or to limit it?
We occasionally also see this error on non-KNL nodes, but it is especially bad on KNL, which makes me suspect something related to the high number of logical cores on these machines.
Cheers
Dirk
PS: Madgraph version seems to be V5_2.4.2 |
In CMS we are using madgraph internally to generate LHC collision events. In such production jobs on KNL machines we observe very high failure rates from within madgraph. The problem is that madgraph seems to hit the system process/thread limits.
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 608, in __init__
    self.start_demon()
  File "/tmp/glide_yRHYSp/execute/dir_248087/job/WMTaskSpace/cmsRun1/lheevent/process/madevent/bin/internal/cluster.py", line 615, in start_demon
    t.start()
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
We are running this on KNL nodes with 272 logical cores, with multiple of these processes in parallel on the same machine. The system process/thread limit (ulimit -u) is 2048.
Is it possible that madgraph is internally spawning as many threads as there are cores in the system? If so, is there a way to prevent this from happening or to limit it?
We occasionally also see this error on non-KNL nodes, but it is especially bad on KNL, which makes me suspect something related to the high number of logical cores on these machines.
Cheers
Dirk |
|
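[Editor's note] The failure mode reported above can be illustrated with a minimal Python sketch. This is not madgraph's actual code; the function and worker below are hypothetical stand-ins. It only shows the arithmetic of the report: if a scheduler sizes its thread pool from the logical core count (272 on these KNL nodes) and several such jobs share one per-user task limit (ulimit -u of 2048), Thread.start() can fail exactly as in the traceback.

```python
import multiprocessing
import threading
import time

def worker():
    # Placeholder for real work (madevent subprocess management in the report).
    time.sleep(0.1)

def start_one_thread_per_core():
    """Mimic a pool sized from the machine's logical core count."""
    n_cores = multiprocessing.cpu_count()  # 272 on the KNL nodes in the report
    threads = []
    try:
        for _ in range(n_cores):
            t = threading.Thread(target=worker)
            # If the per-user task limit (ulimit -u) is exhausted, this raises
            # RuntimeError on Python 3 (thread.error on the Python 2.7 in the log).
            t.start()
            threads.append(t)
    finally:
        for t in threads:
            t.join()
    return len(threads)
```

With, say, eight such jobs on one node, the pools alone would request 8 x 272 = 2176 threads, already above the 2048 limit, which matches the observation that KNL nodes fail far more often than machines with fewer logical cores.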
2018-11-10 19:35:12 |
Olivier Mattelaer |
mg5amcnlo: status |
New |
Invalid |
|
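[Editor's note] The ticket was closed as Invalid without a recorded fix. A plausible mitigation, offered here as an assumption rather than anything confirmed in this thread, is to cap the core count MadGraph autodetects via its configuration file, so each job's pool size is fixed instead of scaling with the 272 logical cores:

```
# input/mg5_configuration.txt (MadGraph5_aMC@NLO); values are illustrative
run_mode = 2   # multicore mode
nb_core = 8    # fixed worker count instead of autodetecting all logical cores
```

Whether the V5_2.4.2 release used here honours these options in the embedded madevent cluster.py is not established by the log above.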