K-Scale Cluster

The K-Scale Labs clusters are shared clusters for robotics research. This page contains notes on how to access them.
=== Onboarding ===
To get onboarded, send us the SSH public key that you want to use and, optionally, your preferred username.
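
If you don't already have a key pair, you can generate one and print the public half like this (a standard OpenSSH workflow; the key type and path are just common defaults):

<syntaxhighlight lang="bash">
# Generate an RSA key pair (press Enter to accept the default path ~/.ssh/id_rsa)
ssh-keygen -t rsa -b 4096

# Print the public key to send to us
cat ~/.ssh/id_rsa.pub
</syntaxhighlight>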
 
=== Lambda Cluster ===
After being onboarded, you should receive the information needed to add an entry to your <code>~/.ssh/config</code> along these lines (the host address and username below are placeholders; use the values you were given):

<syntaxhighlight lang="bash">
Host cluster
    HostName <cluster-address>
    User <your-username>
    IdentityFile ~/.ssh/id_rsa
</syntaxhighlight>
 
After setting this up, you can use the command <code>ssh cluster</code> to connect directly.
=== Notes ===
* You may need to restart <code>ssh</code> to get it working.
* You may be sharing your part of the cluster with other users. If so, it is a good idea to avoid using all the GPUs. If you're training models in PyTorch, you can do this using the <code>CUDA_VISIBLE_DEVICES</code> environment variable (see the example after this list).
* Avoid storing data files and model checkpoints in your root directory. Instead, use the <code>/ephemeral</code> directory. Your home directory should come with a symlink to a subdirectory there which you have write access to.
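
For example, to restrict a PyTorch training run to a subset of GPUs (the script name and GPU indices here are just placeholders):

<syntaxhighlight lang="bash">
# Only GPUs 0 and 1 will be visible to the process; PyTorch sees them as cuda:0 and cuda:1
CUDA_VISIBLE_DEVICES=0,1 python train.py
</syntaxhighlight>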
 
=== Andromeda Cluster ===
 
The Andromeda cluster is a separate cluster which uses Slurm for job management. Authentication is handled differently from the Lambda cluster; Ben will provide instructions directly.
 
Don't do anything computationally expensive on the main node, or you will crash it for everyone. Instead, when you need to run experiments, reserve a GPU (see below).
 
==== SLURM Commands ====
 
Show all currently running jobs:
 
<syntaxhighlight lang="bash">
squeue
</syntaxhighlight>
 
Show your own running jobs:
 
<syntaxhighlight lang="bash">
squeue --me
</syntaxhighlight>
 
Show the available partitions on the cluster:
 
<syntaxhighlight lang="bash">
sinfo
</syntaxhighlight>
 
You'll see something like this:
 
<syntaxhighlight lang="bash">
$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
compute*   up     infinite       8  idle   compute-permanent-node-[68,285,493,580,625-626,749,801]
</syntaxhighlight>
 
This means:
 
* There is one partition, called <code>compute</code> (the <code>*</code> marks it as the default partition)
* It has 8 nodes, all currently in the <code>idle</code> state
* The node names look like <code>compute-permanent-node-68</code>
 
To launch a job, use [https://slurm.schedmd.com/srun.html srun] or [https://slurm.schedmd.com/sbatch.html sbatch].
 
* '''srun''' runs a command directly with the requested resources
* '''sbatch''' queues the job to run when resources become available
 
For example, suppose I have the following shell script:
 
<syntaxhighlight lang="bash">
#!/bin/bash
 
echo "Hello, world!"
 
nvidia-smi
</syntaxhighlight>
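
Assuming the script is saved as <code>test.sh</code> (as in the <code>srun</code> call below), make it executable first:

<syntaxhighlight lang="bash">
chmod +x test.sh
</syntaxhighlight>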
 
I can use <code>srun</code> to run this script with the following result:
 
<syntaxhighlight lang="bash">
$ srun --gpus 8 ./test.sh
Hello, world!
Sat May 25 00:02:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
 
... truncated
</syntaxhighlight>
 
Alternatively, I can queue the job using <code>sbatch</code>, which gives me the following result:
 
<syntaxhighlight lang="bash">
$ sbatch --gpus 16 test.sh
Submitted batch job 461
</syntaxhighlight>
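
By default, <code>sbatch</code> writes the job's output to a file named after the job ID in the directory you submitted from (here, <code>slurm-461.out</code>), which you can inspect once the job has run:

<syntaxhighlight lang="bash">
cat slurm-461.out
</syntaxhighlight>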
 
We can specify <code>sbatch</code> options inside the shell script itself using <code>#SBATCH</code> directives:
 
<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --gpus 16
 
echo "Hello, world!"
</syntaxhighlight>
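
A few other commonly used <code>sbatch</code> directives, shown here as a sketch (the job name, values, and <code>train.py</code> are placeholders; pick what your job actually needs):

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --job-name=my-experiment   # name shown in squeue
#SBATCH --partition=compute        # partition from sinfo
#SBATCH --gpus=8                   # number of GPUs to request
#SBATCH --time=04:00:00            # wall-clock time limit
#SBATCH --output=%x-%j.out         # log file: <job name>-<job id>.out

python train.py
</syntaxhighlight>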
 
After launching the job, we can see it running using our original <code>squeue</code> command:
 
<syntaxhighlight lang="bash">
$ squeue --me
JOBID  PARTITION  NAME     USER  ST  TIME  NODES  NODELIST(REASON)
  461  compute    test.sh  ben    R  0:37      1  compute-permanent-node-285
</syntaxhighlight>
 
We can cancel an in-progress job by passing its job ID to <code>scancel</code>:
 
<syntaxhighlight lang="bash">
scancel 461
</syntaxhighlight>
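
To cancel all of your own jobs at once (a standard <code>scancel</code> option):

<syntaxhighlight lang="bash">
scancel -u $USER
</syntaxhighlight>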
 
[https://github.com/kscalelabs/mlfab/blob/master/mlfab/task/launchers/slurm.py#L262-L309 Here is a reference] <code>sbatch</code> script for launching machine learning jobs.
 
==== Reserving a GPU ====
 
Here is a script you can use for getting an interactive node through Slurm.
 
<syntaxhighlight lang="bash">
gpunode () {
local job_id=$(squeue -u $USER -h -t R -o %i -n gpunode)
if [[ -n $job_id ]]
then
echo "Attaching to job ID $job_id"
srun --jobid=$job_id --partition=$SLURM_GPUNODE_PARTITION --gpus=$SLURM_GPUNODE_NUM_GPUS --cpus-per-gpu=$SLURM_GPUNODE_CPUS_PER_GPU --pty $SLURM_XPUNODE_SHELL
return 0
fi
echo "Creating new job"
srun --partition=$SLURM_GPUNODE_PARTITION --gpus=$SLURM_GPUNODE_NUM_GPUS --cpus-per-gpu=$SLURM_GPUNODE_CPUS_PER_GPU --interactive --job-name=gpunode --pty $SLURM_XPUNODE_SHELL
}
</syntaxhighlight>
 
Example environment variables:
<syntaxhighlight lang="bash">
export SLURM_GPUNODE_PARTITION='compute'
export SLURM_GPUNODE_NUM_GPUS=1
export SLURM_GPUNODE_CPUS_PER_GPU=4
export SLURM_XPUNODE_SHELL='/bin/bash'
</syntaxhighlight>
 
Add the function (and the environment variables above) to your shell configuration, then run <code>gpunode</code>.
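
For example, assuming you put both snippets in <code>~/.bashrc</code>:

<syntaxhighlight lang="bash">
source ~/.bashrc   # reload the shell configuration
gpunode            # attach to an existing interactive job, or create a new one
</syntaxhighlight>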
 
You can see partition options by running <code>sinfo</code>.
 
You might see an error like <code>groups: cannot find name for group ID 1506</code>, but things should still run fine. Check GPU access with <code>nvidia-smi</code>.
 
==== Useful Commands ====
 
Set a node state back to normal:
 
<syntaxhighlight lang="bash">
sudo scontrol update nodename='nodename' state=resume
</syntaxhighlight>
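
To check a node's current state (and the reason it was drained, if any) before resuming it, you can query it with <code>scontrol</code>:

<syntaxhighlight lang="bash">
scontrol show node 'nodename'
</syntaxhighlight>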
 
[[Category:K-Scale]]