> Hi,
> I did not find any previous notes on this.
> I have a cluster of 4 machines, each with 8 hyperthreaded cores, so I
> can run 16 engines per machine, or 64 in all. It is amazingly easy and
> useful, thanks so much for providing this.

Great, glad it is useful to you.

> However, when using ipcluster on one of these machines in ssh mode,
> with this clusterfile,
>
> send_furl = False
> engines = { 'tev1' : 16,
>             'tev2' : 16,
>             'tev3' : 16,
>             'tev4' : 16
> }
>
> I typically get about 50 engines to actually start. Since there seems to
> be no log file for ipcluster (in spite of code that seems like it should
> record which engines it tried to start), I can't send that. The
> ipcontroller log file looks fine, except for recording fewer than the 64
> engines that I expected.
>
> I have an alternative, very kludgy method that starts a controller, then
> executes 64 ssh commands to the respective machines to simply run
> ipengine. I found the same problem, which went away when I introduced a
> 1 second delay after each ssh call, which of course takes more than a
> minute to run, and leaves all those ssh processes running.

I think I know what the issue is here. We have found that sometimes the
engines start up so fast that the controller is not yet up and running.
The engines that try to connect before the controller is running fail.
Twisted is fully capable of handling many simultaneous connections, so I
don't think that is the problem.

The good news is that all of this is fixed in trunk (ipcluster is much
improved). The bad news is that I haven't yet gotten the ssh mode cluster
working with the new ipcluster in trunk. It shouldn't be difficult, and
Vishal knows this code as well.

In the meantime, I would suggest looking through ipcluster.py - you
should be able to put a delay between when the controller is started and
when the engines are started (a rough sketch of one way to do this is
appended below).

Cheers,

Brian

> So I suspect that the same thing would work in the loop in this method
> of ipcluster.SSHEngineSet:
>
>   def _ssh_engine(self, hostname, count):
>       exec_engine = "ssh %s sh %s/%s-sshx.sh %s" % (
>           hostname, self.temp_dir,
>           os.environ['USER'], self.engine_command
>       )
>       cmds = exec_engine.split()
>       dlist = []
>       log.msg("about to start engines...")
>       for i in range(count):
>           log.msg('Starting engines: %s' % exec_engine)
>           d = getProcessOutput(cmds[0], cmds[1:], env=os.environ)
>           dlist.append(d)
>       return gatherBoth(dlist, consumeErrors=True)
>
> but that would be inelegant, given that the real problem is probably
> related to the controller not responding properly to multiple requests.
>
> Thanks for looking at this.
>
> --Toby Burnett
>
> ** Affects: ipython
>     Importance: Undecided
>         Status: New
>
> --
> ipcluster does not start all the engines
> https://bugs.launchpad.net/bugs/509015
> You received this bug notification because you are a member of IPython
> Developers, which is subscribed to IPython.
>
> Status in IPython - Enhanced Interactive Python: New
>
> Bug description:
> As reported on the mailing list...
>
> ---------- Forwarded message ----------
> From: Toby Burnett
> Date: Sat, Jan 16, 2010 at 8:58 AM
> Subject: [IPython-user] ipcluster does not start all the engines
> To: "
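
The sketch mentioned above: this is an illustrative, untested example of the
kind of delay being discussed, not the code that ships with IPython or the fix
in trunk. The function name start_engines_staggered and the initial_delay and
stagger parameters are made up for this example; it reuses the same
getProcessOutput/gatherBoth calls as SSHEngineSet._ssh_engine, but schedules
each ssh launch with Twisted's task.deferLater instead of sleeping, so the
reactor is never blocked while the launches are spaced out.

    # Rough sketch only -- not the shipped IPython code.
    import os
    from twisted.internet import reactor, task
    from twisted.internet.defer import gatherBoth
    from twisted.internet.utils import getProcessOutput
    from twisted.python import log

    def start_engines_staggered(cmds, count, initial_delay=2.0, stagger=1.0):
        """Launch `count` engine processes: wait `initial_delay` seconds for
        the controller to come up, then space the launches `stagger` seconds
        apart."""
        dlist = []
        for i in range(count):
            # deferLater fires getProcessOutput after the given number of
            # seconds without blocking; extra args and kwargs are passed
            # through to the call.
            d = task.deferLater(reactor, initial_delay + i * stagger,
                                getProcessOutput, cmds[0], cmds[1:],
                                env=os.environ)
            dlist.append(d)
        log.msg("scheduled %d engine launches" % count)
        return gatherBoth(dlist, consumeErrors=True)

Calling something like start_engines_staggered(cmds, count) in place of the
plain loop in _ssh_engine would give roughly the same one-second spacing as
the sleep-based workaround, but because the delays are scheduled on the
reactor rather than slept through, the process stays responsive while the
engines come up.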