Frequently Asked Questions
Accounts¶
How do I get an account?¶
See the CCI projects article for information on creating a project and associated accounts. All forms for new accounts should be sent to accounts[at]ccni.rpi.edu only.
How do I change my password?¶
(You must change your password before you can log into the landing pads.)
Use the Password Change form. If you have multiple accounts for different projects, this form will change the password for all of your accounts together.
What is the meaning of the user name format?¶
Resource usage is tracked and regulated by the user ID associated with a job or with files on disk. A person involved with multiple projects is given a separate user ID for each project. Each ID is formatted to include an identifier of both the project and the user.
File System¶
How do I check my GPFS quota usage?¶
Executing "df -h ." in a directory will display usage based on the quota enforced for that particular directory tree.
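The check above can be wrapped in a small helper. This is a minimal sketch, assuming a GNU or POSIX-style df; the function name quota_usage is hypothetical and is not a CCI-provided command.

```shell
# quota_usage is a hypothetical helper name used only for illustration.
# It prints size, used, and available space for the file system (or quota
# tree) that contains the given directory, defaulting to the current one.
quota_usage() {
    # -P keeps each file system on one line so the second line is always the data row
    df -hP "${1:-.}" | awk 'NR == 2 { print "size=" $2, "used=" $3, "avail=" $4, "use%=" $5 }'
}

quota_usage .
```

Run it from inside the directory tree whose quota you want to inspect, or pass that directory as the argument.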
How do I increase my quota?¶
A quota increase request for barn space must be sent to support with an explanation by the project PI/sponsor or, for a non-RPI project, by the organization manager. Please include an explanation of why the quota-free scratch space is insufficient for your needs.
Usage¶
How do I access the systems?¶
First, use ssh to connect to a landing pad. Then connect from a landing pad to a cluster front-end such as dcsfen01 to access the DCS Supercomputer. Please check our List of Available Systems for more information.
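The two hops described above might look like the following sketch. The user ID is a placeholder, the full landing pad hostname is an assumption (only the short names blp01-04 and dcsfen01 appear in this FAQ), and the function names are hypothetical.

```shell
# All values below are placeholders or assumptions; substitute your own.
CCI_USER="your_user_id"            # placeholder user ID
LANDING_PAD="blp01.ccni.rpi.edu"   # assumed full hostname of a landing pad
FRONT_END="dcsfen01"               # cluster front-end named in this FAQ

# Step 1: run on your own machine to reach a landing pad.
connect_landing_pad() {
    ssh "${CCI_USER}@${LANDING_PAD}"
}

# Step 2: run on the landing pad to reach the cluster front-end.
connect_front_end() {
    ssh "${FRONT_END}"
}
```

Call connect_landing_pad locally, then connect_front_end once you are logged in to the landing pad.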
Why can't I connect to a landing pad?¶
Why do I receive this error when connecting to a landing pad: Operation timed out?
You must use two-factor authentication and connect to the two-factor (bravo) landing pads, blp01-04.
(More information is available in the landing pads article.)
Why can't I download XYZ from the Internet?¶
CCI systems, specifically the landing pads, do not allow general outbound access to other resources or sites on the Internet. Some common remote repositories are available via a proxy.
How do I get data onto or off of the system from off campus?¶
Use scp to transfer data to the landing pad systems. The file systems mounted there match those present on the compute systems. See this page for large file transfers.
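As an illustration of the scp transfers described above, here is a minimal sketch. The user ID is a placeholder, the landing pad hostname is an assumption, and the helper names are hypothetical.

```shell
# Placeholders and assumptions; substitute your own values.
CCI_USER="your_user_id"            # placeholder user ID
LANDING_PAD="blp01.ccni.rpi.edu"   # assumed full hostname of a landing pad

# Copy a local file to a directory on the CCI file systems.
push_to_cci() {   # usage: push_to_cci LOCAL_PATH REMOTE_DIR
    scp "$1" "${CCI_USER}@${LANDING_PAD}:$2"
}

# Copy a file from the CCI file systems to a local directory.
pull_from_cci() { # usage: pull_from_cci REMOTE_PATH LOCAL_DIR
    scp "${CCI_USER}@${LANDING_PAD}:$1" "$2"
}
```

For example, push_to_cci input.dat some/remote/dir uploads one file, where some/remote/dir stands in for a directory that already exists under your account. Because the landing pads mount the same file systems as the compute systems, the file is then visible from the cluster as well.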
Scheduler / Slurm¶
What is the time limit for running jobs?¶
The default wall clock time limit varies between systems. Please consult the wiki page for each system for details. If you have a different time limit it will be enforced automatically (you do not have to do anything).
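In addition to the wiki pages, Slurm itself can report partition time limits. The sketch below uses the standard sinfo and sbatch commands; the helper names are hypothetical, and the partition limits that sinfo reports may not reflect a project-specific limit.

```shell
# Hypothetical helper names; sinfo and sbatch are standard Slurm commands.

# List each partition (%P) with its maximum wall clock time (%l),
# printed as days-hours:minutes:seconds.
show_time_limits() {
    sinfo --format="%P %l"
}

# Submit a job script with an explicit wall clock request, e.g. 00:30:00.
submit_with_time() {   # usage: submit_with_time HH:MM:SS SCRIPT
    sbatch --time="$1" "$2"
}
```

Requesting no more wall clock time than the job needs generally gives the scheduler more opportunities to start it.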
There are enough free nodes - why isn't my job running? / Why are my jobs waiting in the queue for a long time?¶
Jobs are automatically prioritized based on a number of parameters such as size, project usage, project classification, and, once in the queue, age (time in queue). The queue is not a simple FIFO queue. Jobs may be inserted into the middle of the queue based on initial priority and may even move toward the end of the queue if a project's usage increases significantly while the job is pending. A job will begin once it reaches the head of the queue.
When a large job that requires many nodes is in the queue, the scheduler holds nodes free as other jobs complete, in anticipation of the large job. Smaller or shorter jobs will fill in those nodes (backfill) if they have priority and do not delay the large job's expected start time. The scheduler will also hold nodes free in anticipation of a maintenance outage.
On some systems, partitioning has been done to reduce fragmentation and improve overall system throughput. It is important to check the number of nodes in the partition before assuming there is a problem with the scheduler.
A special reservation may also be in place for system diagnostics. Such a reservation results in nodes being listed as idle even though they are unavailable to users and will not have jobs scheduled on them.
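The situations above can usually be told apart with standard Slurm query commands. The following is a sketch, with hypothetical helper names, that assumes the USER environment variable holds your CCI user ID.

```shell
# Hypothetical helper names; sinfo, squeue, and scontrol are standard Slurm commands.

# Node count (%D) and node state (%t) for each partition (%P).
partition_nodes() {
    sinfo --format="%P %D %t"
}

# Your pending jobs: job ID (%i), partition (%P), nodes requested (%D),
# and the scheduler's stated reason for waiting (%r).
pending_reasons() {
    squeue --user="${USER}" --states=PENDING --format="%i %P %D %r"
}

# The scheduler's current estimate of when your pending jobs will start.
estimated_start() {
    squeue --user="${USER}" --start
}

# Active and upcoming reservations, such as diagnostics or maintenance.
list_reservations() {
    scontrol show reservation
}
```

A reason such as Priority or Resources in the pending_reasons output normally means the job is waiting its turn rather than indicating a scheduler problem.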
My job has a different state than normal/isn't starting or running/has the wrong parameters, should I cancel and resubmit it?¶
No, do not cancel and resubmit your job. A job's accumulated wait time is factored into its priority, so the longer it waits, the more likely it is to run. If you cancel and resubmit, that accumulated priority is lost. If there is a real system issue, removing the job also makes the problem harder to diagnose. Contact support and they will assist with any issues; problems can often be resolved without losing priority.
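Before contacting support, it can help to capture the job's current state without modifying it. A minimal sketch using the standard scontrol and sprio commands follows; the helper names are hypothetical.

```shell
# Hypothetical helper names; scontrol and sprio are standard Slurm commands.
# Both helpers only read information and do not change the job.

# Full record for one job: state, reason, requested resources, time limit.
job_details() {   # usage: job_details JOB_ID
    scontrol show job "$1"
}

# Breakdown of the factors contributing to a pending job's priority.
job_priority() {  # usage: job_priority JOB_ID
    sprio --jobs="$1"
}
```

Including this output in a support request gives staff the job's state while it is still in the queue.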
Why does Slurm give me this error: error: Unable to allocate resources: Invalid account or account/partition combination specified?¶
Your account is not authorized to submit to this partition. There may be another partition that your account is authorized to use or your account may not have authorization at all for that particular cluster/system.
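One way to see which account and partition combinations Slurm has on record for you is to query the accounting database. This is a sketch that assumes sacctmgr is available to regular users on the system and that USER holds your CCI user ID; the helper names are hypothetical.

```shell
# Hypothetical helper names; sacctmgr and sbatch are standard Slurm commands.

# List cluster, account, and partition for each of your associations,
# one per line, separated by "|". An empty partition field means the
# association is not tied to a specific partition.
my_associations() {
    sacctmgr --noheader --parsable2 show associations \
        user="${USER}" format=Cluster,Account,Partition
}

# Submit with an explicit account and partition taken from that list.
submit_to() {   # usage: submit_to ACCOUNT PARTITION SCRIPT
    sbatch --account="$1" --partition="$2" "$3"
}
```

If no suitable combination appears, contact support instead of guessing partition names.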
Software¶
Why isn't library or tool foo installed on bar?¶
The systems have a base set of common libraries and tools installed for their native architecture and operating system. A reasonable effort is made to provide libraries and tools required by most users, but there will always be some library or tool that someone needs that we do not provide. In these cases we can render advice on how to obtain or set up the given package but we will not install it for you, globally or locally. This applies to both free/open source and commercial software.
What MPI implementations are available? How do I use them?¶
There are several different implementations of MPI available on the clusters including OpenMPI and MVAPICH2. We use modules to simplify the process of making these libraries available to users.
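A typical module workflow might look like the sketch below. The helper names are hypothetical, and the exact module names differ between systems, so list what is available before loading anything.

```shell
# Hypothetical helper names; "module" is the standard environment modules command.

# Show every module the system offers; look for the MPI entries.
list_modules() {
    module avail
}

# Load one MPI module by the name shown in the listing, then show what is loaded.
load_mpi() {   # usage: load_mpi MODULE_NAME
    module load "$1" && module list
}

# Confirm that an MPI compiler wrapper is now on the PATH.
check_mpi_wrapper() {
    command -v mpicc
}
```

Load the same module in your batch script as in the shell where you compiled, so the job runs against the MPI library it was built with.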