slurmd runs as root. I did not modify the installation, and I used a slurm config file generated on the slurm website.
The same setup works for me on Debian and (as I tried recently) on Ubuntu Intrepid Ibex alpha as well.
Here are some details; this bug is specific to Ubuntu Hardy Heron.
# scontrol show config | grep SlurmdLog
SlurmdLogFile = (null)
even though the log file exists:
# ls /var/run/slurm-llnl/
slurmd slurmd.log slurmd.pid
Here is the SlurmdLog of a failed job:
[Oct 23 11:43:11] setup for a batch_job
[Oct 23 11:43:11] entering batch_job_create
[Oct 23 11:43:11] [412] Message thread started pid = 21069
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] debug3: _rpc_batch_job: return from _forkexec_slurmstepd
[Oct 23 11:43:11] [412] Entered job_manager for 412.4294967294 pid=21069
[Oct 23 11:43:11] [412] alloc LLLP
[Oct 23 11:43:11] [412] task affinity plugin loaded
[Oct 23 11:43:11] [412] mpi type = (null)
[Oct 23 11:43:11] [412] Entering _setup_normal_io
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] Uncached user/gid: gs/100
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] stdin file name = /dev/null
[Oct 23 11:43:11] [412] stdout file name = /home/gs/calc/tubes/cnt-4.0/tt1/slurm-412.out
[Oct 23 11:43:11] [412] stderr file name = /home/gs/calc/tubes/cnt-4.0/tt1/slurm-412.out
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] Leaving _setup_normal_io
[Oct 23 11:43:11] [412] debug level = 2
[Oct 23 11:43:11] [412] Before call to spank_init()
[Oct 23 11:43:11] [412] spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
[Oct 23 11:43:11] [412] After call to spank_init()
[Oct 23 11:43:11] [412] num tasks on this node = 1
[Oct 23 11:43:11] [412] New fdpair[0] = 12, fdpair[1] = 13
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] Uncached user/gid: gs/100
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_CPU in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_FSIZE in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_DATA in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_STACK in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_CORE in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_RSS in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_NPROC in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_NOFILE in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_MEMLOCK in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_AS in environment
[Oct 23 11:43:11] [412] task 0 (21074) started Oct 23 11:43:11
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] Unblocking 412.4294967294 task 0, writefd = 13
[Oct 23 11:43:11] [412] affinity task_pre_launch: 412.4294967294, task 0
[Oct 23 11:43:11] [412] Using sched_affinity for tasks
[Oct 23 11:43:11] [412] execve(): /var/run/slurm-llnl/slurmd/job00412/script: Permission denied
[Oct 23 11:43:11] [412] task 0 (21074) exited status 0x0d00 Oct 23 11:43:11
[Oct 23 11:43:11] [412] affinity task_post_term: 412.4294967294, task 0
[Oct 23 11:43:11] [412] Aggregated 1 task exit messages
[Oct 23 11:43:11] [412] sending task exit msg for 1 tasks
[Oct 23 11:43:11] [412] Before call to spank_fini()
[Oct 23 11:43:11] [412] After call to spank_fini()
[Oct 23 11:43:11] [412] job 412 completed with slurm_rc = 0, job_rc = 3328
[Oct 23 11:43:11] [412] sending REQUEST_COMPLETE_BATCH_SCRIPT
[Oct 23 11:43:11] [412] auth plugin for Munge (Chris Dunlap, LLNL) loaded
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] false, shutdown
[Oct 23 11:43:11] [412] Message thread exited
[Oct 23 11:43:11] [412] done with job
[Oct 23 11:43:11] debug3: in the service_connection
[Oct 23 11:43:11] debug2: got this type of message 6010
[Oct 23 11:43:11] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[Oct 23 11:43:11] debug: _rpc_terminate_job, uid = 64030
[Oct 23 11:43:11] debug: task_slurmd_release_resources: 412
[Oct 23 11:43:11] debug3: release LLLP job [412.*]
[Oct 23 11:43:11] debug3: job state 412: ctime:081023114311 expires:691231160000
[Oct 23 11:43:11] debug: credential for job 412 revoked
[Oct 23 11:43:11] debug2: No steps in jobid 412 to send signal 18
[Oct 23 11:43:11] debug2: No steps in jobid 412 to send signal 15
[Oct 23 11:43:11] debug4: sent ALREADY_COMPLETE
[Oct 23 11:43:11] debug3: job state 412: ctime:081023114311 revoked:081023114311 expires:081023114311
[Oct 23 11:43:11] debug2: set revoke expiration for jobid 412 to 081023115311
[Oct 23 11:45:06] debug3: in the service_connection
[Oct 23 11:45:06] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
[Oct 23 11:45:06] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[Oct 23 11:45:06] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[Oct 23 11:45:06] debug2: _slurm_send_timeout: Socket no longer there.
[Oct 23 11:45:06] error: slurm_msg_sendto: Transport endpoint is not connected
[Oct 23 11:46:18] debug3: in the service_connection
[Oct 23 11:46:18] debug2: got this type of message 1008
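
For reference, the numbers in the log are consistent with each other: "exited status 0x0d00" and "job_rc = 3328" are the same value, and the high byte is 13. The sketch below decodes it, assuming the usual Linux wait-status layout (exit code in bits 8-15, terminating signal in the low 7 bits) and assuming the step daemon exits the child with the errno of the failed execve(). The function name decode_wait_status is only for illustration and is not part of slurm.

```python
import errno
import os


def decode_wait_status(status):
    """Split a 16-bit wait status into (exit_code, signal_number).

    Assumes the conventional Linux encoding: the exit code lives in
    bits 8-15 and the terminating signal in the low 7 bits.
    """
    exit_code = (status >> 8) & 0xFF
    signal_number = status & 0x7F
    return exit_code, signal_number


# Value taken from the log line "task 0 (21074) exited status 0x0d00".
status = 0x0D00
exit_code, signal_number = decode_wait_status(status)

print("raw status:", status)            # decimal form, compare with job_rc
print("exit code:", exit_code)
print("signal:", signal_number)

# Assumption: the exit code is the errno left by the failed execve().
print("errno name:", errno.errorcode.get(exit_code, "unknown"))
print("message:", os.strerror(exit_code))
```

If that assumption holds, exit code 13 is EACCES, which matches the "execve(): ... Permission denied" line, so the task never started running the script at all.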