Comment 2 for bug 271518

gs (gs-orst) wrote :

slurmd runs as root. I did not modify the installation, and I used a slurm config file generated at the slurm website.
The same setup works for me on Debian and (as I tried recently) on the Ubuntu Intrepid Ibex alpha as well.

Here are some details. Note that this bug is specific to Ubuntu Hardy Heron.

# scontrol show config | grep SlurmdLog
SlurmdLogFile = (null)

even though the log file exists:
# ls /var/run/slurm-llnl/
slurmd slurmd.log slurmd.pid

Here is the SlurmdLog of a failed job:

[Oct 23 11:43:11] setup for a batch_job
[Oct 23 11:43:11] entering batch_job_create
[Oct 23 11:43:11] [412] Message thread started pid = 21069
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] debug3: _rpc_batch_job: return from _forkexec_slurmstepd
[Oct 23 11:43:11] [412] Entered job_manager for 412.4294967294 pid=21069
[Oct 23 11:43:11] [412] alloc LLLP
[Oct 23 11:43:11] [412] task affinity plugin loaded
[Oct 23 11:43:11] [412] mpi type = (null)
[Oct 23 11:43:11] [412] Entering _setup_normal_io
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] Uncached user/gid: gs/100
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] stdin file name = /dev/null
[Oct 23 11:43:11] [412] stdout file name = /home/gs/calc/tubes/cnt-4.0/tt1/slurm-412.out
[Oct 23 11:43:11] [412] stderr file name = /home/gs/calc/tubes/cnt-4.0/tt1/slurm-412.out
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] Leaving _setup_normal_io
[Oct 23 11:43:11] [412] debug level = 2
[Oct 23 11:43:11] [412] Before call to spank_init()
[Oct 23 11:43:11] [412] spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
[Oct 23 11:43:11] [412] After call to spank_init()
[Oct 23 11:43:11] [412] num tasks on this node = 1
[Oct 23 11:43:11] [412] New fdpair[0] = 12, fdpair[1] = 13
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] Uncached user/gid: gs/100
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_CPU in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_FSIZE in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_DATA in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_STACK in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_CORE in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_RSS in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_NPROC in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_NOFILE in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_MEMLOCK in environment
[Oct 23 11:43:11] [412] Couldn't find SLURM_RLIMIT_AS in environment
[Oct 23 11:43:11] [412] task 0 (21074) started Oct 23 11:43:11
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] Unblocking 412.4294967294 task 0, writefd = 13
[Oct 23 11:43:11] [412] affinity task_pre_launch: 412.4294967294, task 0
[Oct 23 11:43:11] [412] Using sched_affinity for tasks
[Oct 23 11:43:11] [412] execve(): /var/run/slurm-llnl/slurmd/job00412/script: Permission denied
[Oct 23 11:43:11] [412] task 0 (21074) exited status 0x0d00 Oct 23 11:43:11
[Oct 23 11:43:11] [412] affinity task_post_term: 412.4294967294, task 0
[Oct 23 11:43:11] [412] Aggregated 1 task exit messages
[Oct 23 11:43:11] [412] sending task exit msg for 1 tasks
[Oct 23 11:43:11] [412] Before call to spank_fini()
[Oct 23 11:43:11] [412] After call to spank_fini()
[Oct 23 11:43:11] [412] job 412 completed with slurm_rc = 0, job_rc = 3328
[Oct 23 11:43:11] [412] sending REQUEST_COMPLETE_BATCH_SCRIPT
[Oct 23 11:43:11] [412] auth plugin for Munge (Chris Dunlap, LLNL) loaded
[Oct 23 11:43:11] [412] eio: handling events for 1 objects
[Oct 23 11:43:11] [412] Called _msg_socket_readable
[Oct 23 11:43:11] [412] false, shutdown
[Oct 23 11:43:11] [412] Message thread exited
[Oct 23 11:43:11] [412] done with job
[Oct 23 11:43:11] debug3: in the service_connection
[Oct 23 11:43:11] debug2: got this type of message 6010
[Oct 23 11:43:11] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[Oct 23 11:43:11] debug: _rpc_terminate_job, uid = 64030
[Oct 23 11:43:11] debug: task_slurmd_release_resources: 412
[Oct 23 11:43:11] debug3: release LLLP job [412.*]
[Oct 23 11:43:11] debug3: job state 412: ctime:081023114311 expires:691231160000
[Oct 23 11:43:11] debug: credential for job 412 revoked
[Oct 23 11:43:11] debug2: No steps in jobid 412 to send signal 18
[Oct 23 11:43:11] debug2: No steps in jobid 412 to send signal 15
[Oct 23 11:43:11] debug4: sent ALREADY_COMPLETE
[Oct 23 11:43:11] debug3: job state 412: ctime:081023114311 revoked:081023114311 expires:081023114311
[Oct 23 11:43:11] debug2: set revoke expiration for jobid 412 to 081023115311
[Oct 23 11:45:06] debug3: in the service_connection
[Oct 23 11:45:06] debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
[Oct 23 11:45:06] error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
[Oct 23 11:45:06] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[Oct 23 11:45:06] debug2: _slurm_send_timeout: Socket no longer there.
[Oct 23 11:45:06] error: slurm_msg_sendto: Transport endpoint is not connected
[Oct 23 11:46:18] debug3: in the service_connection
[Oct 23 11:46:18] debug2: got this type of message 1008
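
For anyone reading the log above, the two status values it reports ("exited status 0x0d00" and "job_rc = 3328") are the same number, and the high byte is 13. The short sketch below decodes it, assuming the value follows the usual wait(2) layout (exit code in the high byte, signal in the low seven bits). The helper name decode_wait_status is made up for illustration and is not part of slurm. That exit code 13 equals EACCES numerically fits the "Permission denied" from execve(), but whether slurmstepd deliberately passes errno through as the exit code is only an assumption here.

```python
import errno


def decode_wait_status(status):
    """Split a 16-bit wait(2)-style status into (exit_code, signal).

    Hypothetical helper for illustration; assumes the conventional layout
    with the exit code in bits 8-15 and the terminating signal in bits 0-6.
    """
    exit_code = (status >> 8) & 0xFF
    signal = status & 0x7F
    return exit_code, signal


# Value taken from the log line "task 0 (21074) exited status 0x0d00".
status = 0x0D00

exit_code, signal = decode_wait_status(status)

# Look up the symbolic errno name for the exit code, if there is one.
# Treating the exit code as an errno value is an assumption, not something
# the log itself states.
errno_name = errno.errorcode.get(exit_code, "unknown")

print("decimal status:", status)
print("exit code:", exit_code)
print("signal:", signal)
print("errno name for exit code:", errno_name)
```

If that reading is right, the job itself never started: the failure happened at the execve() of /var/run/slurm-llnl/slurmd/job00412/script, before any user code ran.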