A missing directory and wrong init-scripts in torque-2.3.6 in Jaunty

Bug #360827 reported by Katsura
This bug affects 8 people
Affects: torque (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

I am reporting bugs in the torque-server, torque-mom, and torque-sched packages of torque-2.3.6 in Jaunty.

pbs-server fails to start, with the following messages:

* Starting Torque batch queue server
PBS_Server: No such file or directory (2) in pbsd_init, unable to change to directory /var/lib/torque/server_priv/arrays/
PBS_Server: PBS_Server, pbsd_init failed

The "arrays" directory is in fact missing.
After I created an empty "arrays" directory in /var/lib/torque/server_priv/,
pbs-server started successfully.

In addition, the locations of the torque files in /etc/init.d/{torque-server, torque-mom, torque-sched} are wrong:
"/var/spool/torque/" in these init scripts should be changed to "/var/lib/torque/".

Also, the PIDFILE name in /etc/init.d/torque-sched is

PIDFILE=/var/lib/torque/server_priv/server.lock

The file name should be changed to "sched.lock".

After these modifications, torque works fine in my environment.

Katsura (katsura)
security vulnerability: yes → no
Katsura (katsura)
visibility: private → public
Revision history for this message
Hiro Protagonist (surfer) wrote :

these changes were necessary (and worked) for me as well.

small exception: /etc/init.d/torque-sched is called /etc/init.d/torque-scheduler

and: i had to create (and # chmod 1777) the following directories:
/var/lib/torque/spool
/var/lib/torque/undelivered

now torque works fine in my environment.
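
A minimal sketch of the directory step above, for anyone repeating it on several machines. The function name make_spool_dirs is made up for illustration; the two directory names and mode 1777 are taken from this comment, and whether 1777 is the mode the package intends is not confirmed anywhere in this report.

```shell
#!/bin/sh
# Hypothetical helper: create the spool and undelivered directories under a
# given TORQUE home and set them world-writable with the sticky bit (1777).
# The default home /var/lib/torque is the location used in this report.
make_spool_dirs() {
    torque_home="${1:-/var/lib/torque}"
    for d in spool undelivered; do
        mkdir -p "$torque_home/$d" || return 1
        chmod 1777 "$torque_home/$d" || return 1
    done
}
```

Run as root, it would be called as: make_spool_dirs /var/lib/torque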

Jordi Mallach (jordi) wrote :

These issues seem to be fixed in SVN.

schmitz (m-schmitz) wrote :

A similar bug exists in version 2.1.8 (locally backported to feisty):

Setting up torque-mom (2.1.8+dfsg-0ubuntu1-bionmr2) ...
 * Starting Torque Mom:
pbs_mom: No such file or directory (2) in chk_file_sec, Security violation with
"/var/lib/torque/mom_priv/jobs"
invoke-rc.d: initscript torque-mom, action "start" failed.
dpkg: error processing torque-mom (--install):
 subprocess post-installation script returned error exit status 3
Setting up torque-client (2.1.8+dfsg-0ubuntu1-bionmr2) ...
Errors were encountered while processing:
 torque-mom

I had to create /var/lib/torque/mom_priv/jobs here, and chmod og+w /var/lib/torque/spool/ (and perhaps ../undelivered/ as well) before mom would run any jobs. Is 1777 the correct mode to use?

Installing torque-server threw quite a number of errors because it tried to start without the directories /var/lib/torque/server_priv/jobs, ../queues and ../accounting.
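
Since several different directories turned out to be missing, a quick check before starting the daemons may save a few failed attempts. The sketch below only looks for the directories that commenters in this report have named; it is not an authoritative list of what TORQUE requires, and the function name list_missing_dirs is made up for illustration.

```shell
#!/bin/sh
# Print every directory from the list below that does not exist under the
# given TORQUE home. The list is assembled from the comments in this report
# only; TORQUE may well need other directories that are not checked here.
list_missing_dirs() {
    torque_home="${1:-/var/lib/torque}"
    for d in server_priv/arrays server_priv/jobs server_priv/queues \
             server_priv/accounting mom_priv/jobs spool undelivered; do
        [ -d "$torque_home/$d" ] || printf '%s\n' "$torque_home/$d"
    done
    return 0
}
```

Calling list_missing_dirs with no argument checks /var/lib/torque; no output means none of the listed directories is missing.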

I'm glad I have used DQS before, otherwise I'd have barfed on configuring the server ...

  Michael
  (<email address hidden> FWIW)

Dan Kortschak (dan-kortschak) wrote :

After spending a bit of time, more than necessary, I got the server working, but only after a number of times seeing /etc/init.d/torque-server clobbering established node files.

To help others out in the same situation (i.e. people who don't read carefully) I want to clarify the first bug report instructions - particularly that the check for existing serverdb needs to change too:

case "$1" in
  start)
        log_daemon_msg "Starting $DESC"

        if [ ! -r /var/lib/torque/server_priv/serverdb ]; then
                DAEMON_SERVER_OPTS="-t create $DAEMON_SERVER_OPTS"
        fi

In the provided package, the test is for the non-existent /var/spool/torque/server_priv/serverdb file.
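
One way to catch leftovers like this is to search each init script for the old location after editing it. This is only a sketch: the function name find_stale_paths is made up, and it relies on nothing beyond plain grep.

```shell
#!/bin/sh
# Print, with line numbers, every line of the given file that still mentions
# the old /var/spool/torque location. The exit status is grep's own:
# 0 if at least one such line was found, 1 if none was.
find_stale_paths() {
    grep -n '/var/spool/torque' "$1"
}
```

For example, find_stale_paths /etc/init.d/torque-server should print nothing once every path in that script, including the serverdb test, has been changed.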

Lorin Hochstein (lorinh) wrote :

Confirmed that this is still an issue in Karmic.

Steffen Möller (moeller-debian) wrote :

Dear all,

Dominique, Morten and Jordi have now completed a 2.4-based packaging of Torque, and it should hit the Ubuntu servers soon. For the time being it is in Debian unstable, where it has been for the last four days.
To build it from there, add
deb-src http://ftp2.de.debian.org/debian/ unstable main contrib non-free
to /etc/apt/sources.list, then run apt-get update, apt-get source -b torque, and dpkg -i *.deb.

If there are any surprising or difficult problems while migrating then please drop us emails so we can add respective notes to the documentation.

Best,

Steffen

Steffen Möller (moeller-debian) wrote :

The torque 2.4 packages have arrived in Maverick. Please kindly investigate if the bug is still valid.

Bram Metsch (metsch) wrote :

This bug also affects Lucid.

hamish (hamish-b) wrote :

[Lucid pkgs] Everything starts up and runs OK with the fixes listed above. One last detail: '/etc/init.d/torque-server stop' fails. It seems the server.lock file in /var/lib/torque/server_priv/ does not contain a PID (??; look for another, closed Launchpad bug report which speaks of this, maybe related).

Hamish
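
A rough way to check what a lock file actually holds, sketched under assumptions: that a healthy lock file carries the daemon's PID as a decimal number on its first line, and that the check runs with enough privileges to signal that process (as root, for a root-owned daemon). The function name lock_file_status is made up for illustration.

```shell
#!/bin/sh
# Report whether the first line of a lock file is a decimal PID and whether a
# process with that PID can be signalled. Assumes the PID, if present, is on
# the first line; that layout is a guess, not confirmed by this report.
lock_file_status() {
    lock="$1"
    [ -r "$lock" ] || { echo "unreadable: $lock"; return 2; }
    pid=$(head -n 1 "$lock" | tr -d '[:space:]')
    case "$pid" in
        ''|*[!0-9]*) echo "no PID in $lock"; return 1 ;;
    esac
    # kill -0 sends no signal; it only tests that the process exists and
    # that the caller is allowed to signal it.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running: $pid"
    else
        echo "stale PID: $pid"
        return 1
    fi
}
```

For the case described above it would be called as: lock_file_status /var/lib/torque/server_priv/server.lock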

hamish (hamish-b) wrote :

Hi,

For any Lucid / 10.04 LTS users who find themselves stuck on this:

I had to get it working on a number of machines, so I put together all the above steps, and a few more, into a fix-it script. Upgrading to the maverick+ packages presumably also solves these problems without such manual intervention, but I prefer to fix the official Lucid version. YMMV

{{{
## Local fixes to get Lucid's Torque server running.
sudo su

mkdir /var/lib/torque/server_priv/arrays/
mkdir /var/lib/torque/sched_priv/accounting/

sed -i -e 's+/var/spool/torque/+/var/lib/torque/+' /etc/init.d/torque-server
sed -i -e 's+/var/spool/torque/+/var/lib/torque/+' /etc/init.d/torque-mom
sed -i -e 's+/var/spool/torque/+/var/lib/torque/+' /etc/init.d/torque-scheduler
sed -i -e 's+sched_priv/server.lock+sched_priv/sched.lock+' \
  /etc/init.d/torque-scheduler

#########

# set up for local use as per /usr/share/doc/torque-base/README.Debian
hostname --long > /var/lib/torque/server_priv/nodes
hostname --long > /var/lib/torque/server_name
hostname --long > /var/lib/torque/mom_priv/config

cat << EOF >> /etc/services

#(torque
# Standard PBS services
pbs 15001/tcp # pbs server (pbs_server)
pbs 15001/udp # pbs server (pbs_server)
pbs_mom 15002/tcp # mom to/from server
pbs_mom 15002/udp # mom to/from server
pbs_resmom 15003/tcp # mom resource management requests
pbs_resmom 15003/udp # mom resource management requests
pbs_sched 15004/tcp # scheduler
pbs_sched 15004/udp # scheduler
#)
EOF

/etc/init.d/torque-mom restart
/etc/init.d/torque-server restart
/etc/init.d/torque-scheduler restart

qmgr -c "s s scheduling=true"
qmgr -c "c q batch queue_type=execution"
qmgr -c "s q batch started=true"
qmgr -c "s q batch enabled=true"
qmgr -c "s q batch resources_default.nodes=1"
qmgr -c "s q batch resources_default.walltime=3600"
qmgr -c "s s default_queue=batch"

/etc/init.d/torque-mom restart
/etc/init.d/torque-server restart
/etc/init.d/torque-scheduler restart

qmgr -c "s n `/bin/hostname --long` state=free" -e

## ubu bug #441063, https://bugs.launchpad.net/ubuntu/+source/torque/+bug/441063
cd /tmp
wget "https://bugs.launchpad.net/ubuntu/+source/torque/+bug/441063/+attachment/765094/+files/xpbs-tclIndex"
wget "https://bugs.launchpad.net/ubuntu/+source/torque/+bug/441063/+attachment/765095/+files/xpbsmon-tclIndex"
patch -p0 < xpbs-tclIndex
patch -p0 < xpbsmon-tclIndex

exit

#########
# test: in Prefs.. menu change torqueserver to your $HOSTNAME, auto-update to 1 minute
xpbs &
xpbsmon &
# submit a job
echo "sleep 30" | qsub
}}}

regards,
Hamish

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in torque (Ubuntu):
status: New → Confirmed