Install-Slurm
Introduction
Install Slurm on a CentOS 7 virtual cluster.
Preparation
Cluster Server and Computing Nodes
List of master node and computing nodes within the cluster.
| Hostname | IP Addr |
|---|---|
| master | 10.0.1.5 |
| node1 | 10.0.1.6 |
| node2 | 10.0.1.7 |
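Slurm and Munge address nodes by hostname, so every node should be able to resolve the others. If no cluster DNS is available, one option (an assumption, not part of the original setup) is to append entries for the table above to /etc/hosts on each node. This sketch just prints the lines to append:

```shell
# Print /etc/hosts entries for the cluster table above; append the
# output to /etc/hosts on every node (only needed without cluster DNS).
for entry in "10.0.1.5 master" "10.0.1.6 node1" "10.0.1.7 node2"; do
  echo "$entry"
done
```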
(Optional) Delete a failed installation of Slurm
Remove database:
yum remove mariadb-server mariadb-devel -y
Remove Slurm and Munge:
yum remove slurm munge munge-libs munge-devel -y
Delete the users and corresponding folders:
userdel -r slurm
userdel -r munge
Create the global users
Slurm and Munge require consistent UID and GID across every node in the cluster. For all the nodes, before you install Slurm or Munge:
export MUNGEUSER=971
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=972
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
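Mismatched IDs silently break Munge authentication, so it is worth verifying the UIDs on every node after creating the users. This `check_uid` helper is a hypothetical convenience, not part of the original procedure:

```shell
# check_uid USER EXPECTED_UID: compare a user's actual UID against the
# value it should have on every node in the cluster.
check_uid() {
  actual=$(id -u "$1" 2>/dev/null)
  if [ "$actual" = "$2" ]; then
    echo "OK: $1 uid=$actual"
  else
    echo "MISMATCH: $1 uid=${actual:-missing} (expected $2)"
  fi
}
# Run on every node; all nodes must print the same OK lines.
check_uid munge 971
check_uid slurm 972
```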
Install Munge
Get the latest EPEL repository:
yum install epel-release -y
Install Munge:
yum install munge munge-libs munge-devel -y
Create a secret key on the master node. First install rng-tools to properly create the key:
yum install rng-tools -y
rngd -r /dev/urandom
/usr/sbin/create-munge-key -r
Or generate the key directly:
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
Alternatively, install Munge from source:
wget https://github.com/dun/munge/releases/download/munge-0.5.14/munge-0.5.14.tar.xz
tar xf munge-0.5.14.tar.xz, then cd into the source directory and run ./configure, make, and make install.
Add a locallib.conf file containing /usr/local/lib to /etc/ld.so.conf.d:
echo /usr/local/lib > /etc/ld.so.conf.d/locallib.conf
ldconfig -v
yum install rng-tools -y
rngd -r /dev/urandom
dd if=/dev/urandom bs=1 count=1024 > /usr/local/etc/munge/munge.key
chown munge:munge /usr/local/etc/munge
chown munge:munge /usr/local/etc/munge/munge.key
chmod 0700 /usr/local/etc/munge
chown -R munge:munge /usr/local/var/log/munge
chown -R munge:munge /usr/local/var/lib/munge
chown -R munge:munge /usr/local/var/run/munge/
systemctl enable munge
systemctl start munge
Send this key to all of the compute nodes:
scp /etc/munge/munge.key root@10.0.1.6:/etc/munge
scp /etc/munge/munge.key root@10.0.1.7:/etc/munge
SSH into every node and correct the permissions as well as start the Munge service:
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl enable munge
systemctl start munge
To test Munge, try to access another node with Munge from the master node:
munge -n
munge -n | munge
munge -n | ssh 10.0.1.6 unmunge
remunge
If you encounter no errors, then Munge is working as expected.
Install Slurm
Install a few dependencies:
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y
Download the latest version of Slurm in the shared folder:
cd /nfsshare
wget https://download.schedmd.com/slurm/slurm-19.05.4.tar.bz2
If you don't have rpmbuild yet:
yum install rpm-build
rpmbuild -ta slurm-19.05.4.tar.bz2
Check the rpms created by rpmbuild:
cd /root/rpmbuild/RPMS/x86_64
Move the Slurm rpms to the shared folder so all nodes can install them:
mkdir /nfsshare/slurm-rpms
cp * /nfsshare/slurm-rpms
On every node, install these rpms:
cd /nfsshare/slurm-rpms
yum --nogpgcheck localinstall * -y
Alternatively, install Slurm from source (see https://slurm.schedmd.com/quickstart_admin.html):
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc lua readline-devel ncurses-devel man2html libibmad libibumad -y
yum install mariadb-server mariadb-devel -y
cd /usr/local
wget https://download.schedmd.com/slurm/slurm-21.08.6.tar.bz2
tar xf slurm-21.08.6.tar.bz2
cd slurm-21.08.6
./configure --sysconfdir=/etc/slurm
make
make install
Add /usr/local/lib/slurm to /etc/ld.so.conf.d/locallib.conf:
/sbin/ldconfig -v
Copy the example config from the etc directory of the source tree:
cp slurm.conf.example /etc/slurm/slurm.conf
/usr/local/sbin/slurmd -C
Copy the service files from the etc directory of the source tree to /etc/systemd/system.
On the master node:
vim /etc/slurm/slurm.conf
Copy the slurm.conf from Configs and paste it into /etc/slurm/slurm.conf.
Notice: we manually add these lines under #COMPUTE NODES:
NodeName=node1 NodeAddr=10.0.1.6 CPUs=1 State=UNKNOWN
NodeName=node2 NodeAddr=10.0.1.7 CPUs=1 State=UNKNOWN
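With more than a couple of nodes, the compute-node lines can be generated rather than typed by hand. This is a hypothetical convenience, assuming CPUs=1 for every node as above:

```shell
# Emit one slurm.conf NodeName line per "name addr" pair; paste the
# output under #COMPUTE NODES (CPUs=1 assumed, as in this cluster).
while read -r name addr; do
  printf 'NodeName=%s NodeAddr=%s CPUs=1 State=UNKNOWN\n' "$name" "$addr"
done <<'EOF'
node1 10.0.1.6
node2 10.0.1.7
EOF
```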
Now that the master node has the correct slurm.conf, send this file to the other compute nodes:
scp /etc/slurm/slurm.conf root@10.0.1.6:/etc/slurm/
scp /etc/slurm/slurm.conf root@10.0.1.7:/etc/slurm/
On the master node, make sure that the master has all the right configurations and files:
mkdir /var/spool/slurm
chown slurm: /var/spool/slurm/
chmod 755 /var/spool/slurm/
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
On the computing nodes node[1-2], make sure that all the computing nodes have the right configurations and files:
mkdir /var/spool/slurm
chown slurm: /var/spool/slurm
chmod 755 /var/spool/slurm
touch /var/log/slurmd.log
chown slurm: /var/log/slurmd.log
Use the following command to make sure that slurmd is configured properly:
slurmd -C
You should get something like this:
NodeName=node1 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=990 UpTime=0-07:45:41
Disable the firewall on the computing nodes node[1-2]:
systemctl stop firewalld
systemctl disable firewalld
On the master node, open the default ports that Slurm uses:
firewall-cmd --permanent --zone=public --add-port=6817/udp
firewall-cmd --permanent --zone=public --add-port=6817/tcp
firewall-cmd --permanent --zone=public --add-port=6818/udp
firewall-cmd --permanent --zone=public --add-port=6818/tcp
firewall-cmd --permanent --zone=public --add-port=6819/udp
firewall-cmd --permanent --zone=public --add-port=6819/tcp
firewall-cmd --reload
If opening the ports does not work, stop the firewall for testing.
Sync clocks on the cluster. On every node:
yum install ntp -y
chkconfig ntpd on
ntpdate pool.ntp.org
systemctl start ntpd
On the computing nodes node[1-2]:
systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
Setting up MariaDB database: master
Install MariaDB:
yum install mariadb-server mariadb-devel -y
Start the MariaDB service:
systemctl enable mariadb
systemctl start mariadb
systemctl status mariadb
Create the Slurm database user:
mysql
In MariaDB (use your own password instead of 1234):
MariaDB[(none)]> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY '1234' with grant option;
MariaDB[(none)]> SHOW VARIABLES LIKE 'have_innodb';
MariaDB[(none)]> FLUSH PRIVILEGES;
MariaDB[(none)]> CREATE DATABASE slurm_acct_db;
MariaDB[(none)]> quit;
Verify the database grants for the slurm user:
mysql -p -u slurm
Type the password for slurm (1234). In MariaDB:
MariaDB[(none)]> show grants;
MariaDB[(none)]> quit;
Create a new file /etc/my.cnf.d/innodb.cnf containing:
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
To implement this change you have to shut down the database and move/remove logfiles:
systemctl stop mariadb
mv /var/lib/mysql/ib_logfile? /tmp/
systemctl start mariadb
You can check the current setting in MySQL like so:
MariaDB[(none)]> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
Create slurmdbd configuration file:
vim /etc/slurm/slurmdbd.conf
Set up files and permissions:
chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
touch /var/log/slurmdbd.log
chown slurm: /var/log/slurmdbd.log
Copy the slurmdbd.conf from Configs and paste it into /etc/slurm/slurmdbd.conf.
Some of the variables are:
DbdAddr=localhost
DbdHost=localhost
DbdPort=6819
StoragePass=1234
StorageLoc=slurm_acct_db
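For reference, a minimal slurmdbd.conf sketch consistent with the variables above and the MariaDB setup in this guide; the remaining lines are standard slurmdbd.conf keys with common default values, so verify the paths against your own install:

```
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=1234
StorageLoc=slurm_acct_db
```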
Try running slurmdbd manually to see the log:
slurmdbd -D -vvv
Terminate the process with Control+C once the test is OK.
Start the slurmdbd service:
systemctl enable slurmdbd
systemctl start slurmdbd
systemctl status slurmdbd
On the master node:
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
Notes adapted to install Slurm on a CentOS 8 server running a Galaxy server
The notes above mostly work, but there are some changes to note:
wget https://download.schedmd.com/slurm/slurm-21.08.2.tar.bz2
mkdir /var/log/slurm
touch /var/log/slurm/slurmdbd.log
chown slurm /var/log/slurm/slurmdbd.log
Install drmaa packages and dependencies to build from source:
https://pypi.org/project/drmaa/
https://github.com/natefoo/slurm-drmaa
cd /usr/local
wget http://www.colm.net/files/ragel/ragel-6.10.tar.gz
tar xf ragel-6.10.tar.gz
cd ragel-6.10
./configure
make
sudo make install
cd /usr/local
wget http://ftp.gnu.org/pub/gnu/gperf/gperf-3.1.tar.gz
tar xf gperf-3.1.tar.gz
cd gperf-3.1
./configure
make
sudo make install
cd /usr/local
git clone https://github.com/natefoo/slurm-drmaa.git
cd slurm-drmaa
git submodule init && git submodule update
./autogen.sh
./configure, or point it at the Slurm libraries explicitly:
./configure --with-slurm-lib=/usr/local/lib/slurm
make
sudo make install
sudo /sbin/ldconfig -v | grep drmaa
libdrmaa.so.1 -> libdrmaa.so.1.0.8
pip install drmaa
Copy the example conf files in /etc/slurm to actual conf files and edit slurm.conf. You can set the partition name; see the Quick Start User Guide.
Use the Slurm Version 21.08 Configuration Tool to generate slurm.conf.
hostname -s
vclv99-252
vi slurm.conf
SlurmctldHost=vclv99-252
NodeName=vclv99-252 NodeAddr=152.7.99.252 CPUs=8 State=UNKNOWN
PartitionName=testing Nodes=ALL Default=YES MaxTime=INFINITE State=UP
Some useful commands to check whether it is working. It may be necessary to run these commands and check the log files for errors (usually permission errors) until they succeed. Note that for Slurm installed from source, these commands are located in /usr/local/bin and /usr/local/sbin, so you may need to prefix them with the full path.
slurmdbd -D -vv
sinfo -Ne
NODELIST NODES PARTITION STATE
vclv99-252 1 testing* idle
slurmd -C
NodeName=vclv99-252 CPUs=8 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=15580
UpTime=2-18:01:50
scontrol ping
Slurmctld(primary) at vclv99-252 is UP
Start 2 handlers, or restart them whenever job_conf.xml is updated.
cd /usr/local/galaxy
cd .venv/bin
source activate
cd /usr/local/galaxy
./scripts/galaxy-main -c config/galaxy.yml --server-name handler0 --attach-to-pool job-handlers --pid-file handler0.pid --daemonize
./scripts/galaxy-main -c config/galaxy.yml --server-name handler1 --attach-to-pool job-handlers --pid-file handler1.pid --daemonize
Edit job_conf.xml.
If workflows do not run, possibly with the message 'Flushed transaction for WorkflowInvocation', change assign_with="db-skip-locked" to assign_with="db-self".
Depending on the Galaxy version you may also need to set job_config_file: config/job_conf.xml
<?xml version="1.0"?>
<job_conf>
<plugins>
<plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="4"/>
<plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner" workers="4"/>
</plugins>
<handlers assign_with="db-skip-locked">
<handler id="handler0">
<plugin id="slurm"/>
</handler>
<handler id="handler1">
<plugin id="slurm"/>
</handler>
</handlers>
<destinations default="slurm">
<destination id="local" runner="local"/>
<destination id="slurm" runner="slurm">
<param id="native_specification">--mem=4000 --ntasks=2</param>
</destination>
</destinations>
<limits>
<limit type="registered_user_concurrent_jobs">2</limit>
<limit type="destination_total_concurrent_jobs" id="slurm">8</limit>
</limits>
</job_conf>
To restart all for new slurm config:
systemctl stop slurmd
systemctl stop slurmctld
systemctl start slurmd
systemctl start slurmctld
ps aux | grep handler
kill 2882149 2882193
sudo su galaxy
cd /usr/local/galaxy
cd .venv/bin
source activate
cd /usr/local/galaxy
./scripts/galaxy-main -c config/galaxy.yml --server-name handler0 --attach-to-pool job-handlers --pid-file handler0.pid --daemonize
./scripts/galaxy-main -c config/galaxy.yml --server-name handler1 --attach-to-pool job-handlers --pid-file handler1.pid --daemonize
After a reboot you may need to fix the node state (check sinfo -Ne), and the error log may show: error: cgroup_dbus_attach_to_scope: cannot connect to dbus system daemon: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
sudo systemctl start dbus.service
scontrol update nodename=vclvm178-23 state=idle
To reset a node that still has jobs running and is in state drng:
scontrol update nodename=vclvm178-23 state=resume
On some servers the slurm commands may need the full path, i.e. if you get scontrol: command not found:
sudo /usr/local/bin/scontrol update nodename=vclvm178-23 state=resume
An invalid node state can be caused by resources found on the server not matching what is shown in /etc/slurm/slurm.conf. For example, if the server's memory is reduced, the line with RealMemory=350000 needs to be set to less than the real memory on the server. After fixing, set the state as above.
NODELIST NODES PARTITION STATE
vclvm178-26 1 debug* inval
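To see the figure RealMemory must stay below, one option (a convenience sketch, not from the original notes) is to read the machine's total RAM straight from /proc/meminfo:

```shell
# Print the machine's total RAM in MB; RealMemory in slurm.conf must be
# less than this value or the node ends up in an invalid state.
awk '/^MemTotal/ {printf "RealMemory must be < %d MB\n", int($2/1024)}' /proc/meminfo
```

Compare the printed value against the RealMemory line in /etc/slurm/slurm.conf, or against the output of slurmd -C.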