A client asked me to write some quick documentation on how to deploy a computing grid on an Ubuntu distribution. I had never worked on grid solutions, so the challenge was interesting (if a bit painful, given the various tutorials lying around).
Theory of operations
The user gsadmin will be the job submitter on master node and the job executor on remote nodes.
For convenience, we will use a shared home directory to ensure that the code is the same on every node and to gather the output on the master node (assuming the code writes its output into its home directory).
SSH connection:
The user gsadmin on the master should connect to worker nodes without password via SSH.
For NFS to work correctly, we must also ensure that the user gsadmin has the same UID/GID on every node of the grid.
Preparing the master node
Create the user with fixed UID/GID
groupadd -g 600 gsadmin
useradd \
  --uid 600 \
  --gid gsadmin \
  -d /home/gsadmin \
  -m -N --shell /bin/bash \
  gsadmin
Prepare the passwordless SSH connection by setting a password for the gsadmin user and generating an SSH key pair.
passwd gsadmin
su - gsadmin
# generate and copy ID locally
ssh-keygen
ssh-copy-id gsadmin@master
# enter the password
# check the success with:
ssh gsadmin@master
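To confirm that the key really works without a prompt, something like the following sketch can help. The function name check_passwordless is my own; BatchMode=yes tells ssh to fail rather than ask for a password.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 when key-based login works.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_passwordless() {
    local target="$1"   # e.g. gsadmin@master
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true 2>/dev/null; then
        echo "OK: $target accepts key-based login"
    else
        echo "FAIL: $target still asks for a password (or is unreachable)"
        return 1
    fi
}

# Usage (run as gsadmin):
#   check_passwordless gsadmin@master
```

Since the home directory is shared over NFS later on, the same key pair and authorized_keys file should be visible on the workers too, so the same check can be repeated against each worker.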
Now install the NFS server package and configure it to share the home directory of the new user gsadmin.
apt-get install nfs-kernel-server
Now that the NFS kernel server is installed, add the share to the exports file and restart the NFS service.
echo "/home/gsadmin *(rw,sync,no_subtree_check)" >> /etc/exports
exportfs -a
service nfs-kernel-server restart
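Appending with >> adds a duplicate line every time the command is re-run. A minimal sketch of an idempotent variant follows; add_export is a name I made up, and the second argument exists only so the function can be tried on a scratch file first.

```shell
#!/bin/bash
# Hypothetical helper: append an export line only if it is not there yet.
add_export() {
    local line="$1"
    local file="${2:-/etc/exports}"
    # -x: match the whole line, -F: fixed string, -q: no output
    if grep -qxF -- "$line" "$file" 2>/dev/null; then
        echo "already present: $line"
    else
        printf '%s\n' "$line" >> "$file"
        echo "added: $line"
    fi
}

# Usage (as root on the master):
#   add_export "/home/gsadmin *(rw,sync,no_subtree_check)"
#   exportfs -a
```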
Install relevant packages
# Install software suite on master
apt-get install gridengine-master
apt-get install gridengine-client
Start the daemon and fix any problem that arises by checking the log output
/etc/init.d/gridengine-master start
tail -f /var/spool/gridengine/qmaster/messages
Configure the permissions for gsadmin user and create a queue and a hostlist and link them together.
Every configuration change in Grid Engine is done via qconf.
We begin by allowing the root user to interact with the configuration.
sudo -u sgeadmin qconf -am root
sudo -u sgeadmin qconf -au root users
# Now we can keep working as root
qconf -am gsadmin
qconf -au gsadmin users
# we configure the hosts and the queue with 2 slots each
# 2 correspond to the number of CPU on the worker node
qconf -as master
# create a hostlist named allhosts
qconf -ahgrp @allhosts
# save and exit
# add the master into the list
qconf -aattr hostgroup hostlist master @allhosts
# create a work queue main.q and link it to the @allhosts group
qconf -aq main.q
# save and exit
qconf -aattr queue hostlist @allhosts main.q
# replace 2 with your worker CPU number
# and set this number to the work slots available
qconf -aattr queue slots "2" main.q
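After these steps it is worth reading the configuration back. The sketch below wraps the qconf "show" options (-sm, -shgrp, -sq) in a helper; the name show_grid_config is mine, and the grep only keeps the two queue attributes touched above.

```shell
#!/bin/bash
# Hypothetical helper: print the parts of the configuration set up above.
show_grid_config() {
    if ! command -v qconf >/dev/null 2>&1; then
        echo "qconf not found: run this on the master" >&2
        return 1
    fi
    echo "== managers =="
    qconf -sm
    echo "== host group @allhosts =="
    qconf -shgrp @allhosts
    echo "== queue main.q (hostlist and slots only) =="
    qconf -sq main.q | grep -E '^(hostlist|slots)'
}

# Usage (on the master):
#   show_grid_config
```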
Now that the master works, we can export its data to the rest of the grid via NFS and prepare the user home directory to be mounted on remote nodes.
Of course, in the real world, you should tighten NFS security (for example, by restricting the exports to your grid's subnet instead of *).
# skip the next two lines if you already ran them in the NFS step above
apt-get install nfs-kernel-server
echo "/home/gsadmin *(rw,sync,no_subtree_check)" >> /etc/exports
# export the cell configuration in read-only mode
echo "/var/lib/gridengine/default *(ro,sync,no_subtree_check)" >> /etc/exports
exportfs -a
service nfs-kernel-server restart
Worker node configuration
First of all, check that name resolution works by issuing a ping master. If it fails, add the master's IP address to /etc/hosts.
Create the gsadmin user with the same characteristics as the one on the master
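If the ping fails, the hosts entry can be added with a small guard so that it is not duplicated. This is only a sketch: ensure_host_entry is a made-up name, and 192.168.1.10 in the usage line is a placeholder for your master's real address.

```shell
#!/bin/bash
# Hypothetical helper: add "IP name" to a hosts file unless the name is
# already listed on a non-comment line.
ensure_host_entry() {
    local ip="$1" name="$2" file="${3:-/etc/hosts}"
    if awk -v n="$name" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$file"; then
        echo "$name already listed in $file"
    else
        printf '%s\t%s\n' "$ip" "$name" >> "$file"
        echo "added $name ($ip) to $file"
    fi
}

# Usage (as root on the worker; the IP is a placeholder):
#   ensure_host_entry 192.168.1.10 master
```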
groupadd -g 600 gsadmin
useradd \
  --uid 600 \
  --gid gsadmin \
  -d /home/gsadmin \
  -m -N --shell /bin/bash \
  gsadmin
apt-get install nfs-common
Configure the shared home directory for gsadmin on the worker and ensure it is mounted automatically on reboot.
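Because NFS maps file ownership by numeric ID, a quick comparison of the IDs on each node can save some head-scratching. A minimal sketch, with check_ids being a name of my own choosing:

```shell
#!/bin/bash
# Hypothetical helper: compare a user's UID/GID with the expected values.
check_ids() {
    local user="$1" want_uid="$2" want_gid="$3"
    local uid gid
    uid=$(id -u "$user") || return 2
    gid=$(id -g "$user") || return 2
    if [ "$uid" = "$want_uid" ] && [ "$gid" = "$want_gid" ]; then
        echo "OK: $user is $uid:$gid"
    else
        echo "MISMATCH: $user is $uid:$gid, expected $want_uid:$want_gid"
        return 1
    fi
}

# Usage: take the expected values from "id gsadmin" on the master,
# then run on every worker, for example:
#   check_ids gsadmin 600 600
```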
Add this line to the /etc/fstab file
master:/home/gsadmin /home/gsadmin nfs auto
Then issue a mount -a and check that it is correctly mounted by entering mount at the prompt. You should see /home/gsadmin listed as NFS.
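The same check can be scripted by reading /proc/mounts instead of eyeballing the mount output. In this sketch mount_fstype is a hypothetical helper; depending on the NFS version negotiated, the reported type may be nfs or nfs4.

```shell
#!/bin/bash
# Hypothetical helper: print the filesystem type of a mount point.
# /proc/mounts fields: device mountpoint fstype options dump pass
mount_fstype() {
    local mountpoint="$1" mounts="${2:-/proc/mounts}"
    # If several entries share the mount point, the last one wins.
    awk -v m="$mountpoint" '
        $2 == m { t = $3 }
        END { if (t) print t; else exit 1 }' "$mounts"
}

# Usage (on the worker):
#   case "$(mount_fstype /home/gsadmin)" in
#       nfs*) echo "home is on NFS" ;;
#       *)    echo "home is NOT on NFS" ;;
#   esac
```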
Install and configure the grid software Executor
apt-get install gridengine-exec
# configure with the same information as the master (postfix too)
# Anyway, we will reuse the cell configuration exported by the master via NFS
echo 'master:/var/lib/gridengine/default /var/lib/gridengine/default nfs auto,ro' >> /etc/fstab
mount -a
# check if all goes well
Add the worker to the master configuration (take care: this happens on the master server!)
qconf -ah worker
qconf -aattr hostgroup hostlist worker @allhosts
Now the master is aware of the new worker node. We can start the executor on the node.
/etc/init.d/gridengine-exec start
The new worker should show up on the master with its memory/CPU information.
root@master:~# qhost -q
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
worker                  lx26-amd64      2  0.01    2.0G   99.7M     0.0     0.0
   main.q               BIP   0/0/2
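With more than a couple of workers, scanning this table by eye gets tedious. The sketch below filters qhost output for hosts whose LOAD column is still "-", which I take to mean that the execution daemon has not reported in yet; unreachable_hosts is a made-up name, and the column layout is assumed to match the output shown above.

```shell
#!/bin/bash
# Hypothetical helper: read qhost output on stdin and print the hosts
# whose LOAD column (4th field) is "-".
unreachable_hosts() {
    # Skip the two header lines and the "global" pseudo-host; the queue
    # lines printed by "qhost -q" are indented, so they are skipped too.
    awk 'NR > 2 && $1 != "global" && !/^[[:space:]]/ && $4 == "-" { print $1 }'
}

# Usage (on the master):
#   qhost | unreachable_hosts
```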
Launching a job
We will use a simple bash script and check that it runs correctly.
We must use the gsadmin user to submit jobs to worker nodes, so become gsadmin (su - gsadmin) to edit and launch jobs.
Edit hello_world.sh and paste these lines in
#!/bin/bash
echo "Start Date : " `date`
# let's sleep a bit in order to show up as a running task
sleep 100
# print some information
echo "Working dir : " `pwd`
echo "Host : " `hostname`
echo "Stop Date : " `date`
Then issue a chmod +x hello_world.sh
gsadmin@master:~$ qsub hello_world.sh
Your job 3 ("hello_world.sh") has been submitted
gsadmin@master:~$ qstat
You will find the script’s results in files named after the script name and job ID.
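Rather than re-running qstat by hand, a small polling loop can wait for the job to leave the queue. This is a sketch under one assumption worth verifying on your version: that qstat -j exits with a non-zero status once the job ID is no longer known to the scheduler. The name wait_for_job is mine.

```shell
#!/bin/bash
# Hypothetical helper: block until a job has left the queue.
# Assumes "qstat -j <id>" fails once the job is finished.
wait_for_job() {
    local job_id="$1" interval="${2:-5}"
    while qstat -j "$job_id" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "job $job_id finished"
}

# Usage (as gsadmin on the master, with the ID printed by qsub):
#   wait_for_job 3
```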
– scriptName.e+JobID : contains Error (stderr) output
– scriptName.o+JobID : contains stdout output
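To make the naming concrete, here is a tiny sketch that prints the two file names for a given script and job ID, following the pattern described above. job_output_files is a hypothetical helper, not part of Grid Engine.

```shell
#!/bin/bash
# Hypothetical helper: print the stdout and stderr file names that the
# naming pattern above gives for a script and a job ID.
job_output_files() {
    local script="$1" job_id="$2"
    local name
    name=$(basename "$script")
    echo "$name.o$job_id"
    echo "$name.e$job_id"
}

# Usage:
#   job_output_files hello_world.sh 3
# prints hello_world.sh.o3 and hello_world.sh.e3, one per line
```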