Sun Center of Excellence Galaxy Cluster - Frequently Asked Questions


HPCC Team <help@hpcc.uh.edu>

Abstract 

This document presents some questions and answers about the Sun Microsystems "Galaxy" cluster located in the High Performance Computing Center at the University of Houston.

Your input is appreciated to make this FAQ more useful.

Copyright © The University of Houston 2003-2005

Contents

  1. How do I report a bug or problem or ask a question?
  2. How do I acknowledge the Center in a publication?
  3. Where are mpicc, mpif77 (the MPI compilers), mpirun and so on?
  4. Is BLAS or LAPACK installed?
  5. Can I login with telnet or rlogin? So how do I login?
  6. How do I find out what's going on?
  7. Is "xyz" installed?
  8. Could you install "xyz" for me?
  9. The "df" output is horrible
  10. "make" doesn't behave like it does under Linux?
  11. "tar" doesn't understand the "z" option?
  12. How do I ftp files onto the cluster?
  13. ftp from the login nodes doesn't work?
  14. I got an error message trying to run "xterm" and other visual programs on the cluster?
  15. My .forward file is being ignored (do I need a .forward?)
  16. How do I run my jobs?
  17. Where do my jobs run?
  18. How do I talk to the job submission system?
  19. qmon dies on Linux systems
  20. a.out: command not found
  21. whoami: command not found
  22. I need accurate system clocks for my programs
  23. Why are the login nodes called "kang" and "kodos"?
  24. So is there a difference between "kang" and "kodos"?
  25. My program crashes when I open large files or try to allocate lots of memory
  26. Can I have a login shell that's not listed?
  27. What file systems are available?
  28. How do I choose which interconnects to use?
  29. Oops, how do I get back a deleted file?

 

1. How do I report a bug or problem or ask a question?

In general, email us at help@hpcc.uh.edu. (It's just us HPC guys, by the way; you won't end up going in circles around a help desk somewhere.) This feeds into a trouble-ticket system which helps us track problems through to resolution.

Please don't report problems to individual members of HPCC. The trouble-ticket system allows all of us to track what's going on and to help with problem resolution. If one person is away on vacation, then there'll always be someone else around to field questions. The trouble-tickets also allow us to keep a record of previous problems to help with future enquiries. The database of tickets is private.

And please:

  • use a meaningful subject line that succinctly describes the issue or question.
  • reply to our responses so that we can maintain the thread within our database (otherwise you will probably open a new ticket each time).
  • tell us if the problem has been resolved or your question has been answered. If you don't, we end up with open tickets lying around, and we'll have to come and bug you about them.

[top]

2. How do I acknowledge the Center in a publication?

If you publish work that used the resources of the Center of Excellence in some way, please acknowledge this.

There's no set text, but something along the lines of

This work was performed using the computational resources of the
Sun Microsystems Center of Excellence in the Geosciences
at the University of Houston. See http://www.suncoe.uh.edu/
for information about the Center.

would be suitable.


[top]

3. Where are mpicc, mpif77 (the MPI compilers), mpirun and so on?

I'm used to using MPICH for MPI programs on Linux Beowulf clusters, but I can't find MPICH on the Galaxy cluster. Where is it?

The MPI on this machine is Sun's own implementation, optimized for their architecture.

In case you weren't aware, MPI is a specification; it is not a language or a particular implementation. MPICH, LAM and ClusterTools, among others, are implementations of that specification.

MPICH is not installed, and you cannot use it as an alternative, as the queuing system is integrated with SunMPI / ClusterTools 5.

Here's a quick translation, although:

  • take a look at the online examples for how to perform optimized compilation, especially for Fortran and -dalign,
  • notice that ClusterTools requires explicit mention of the MPI library, as there are different libraries available depending on whether you use threading

 

MPICH                 SunMPI / ClusterTools
mpicc -O3 file.c      mpcc -fast -xarch=v8plusb file.c -lmpi
mpif77 -O3 file.f     mpf77 -fast -xarch=v8plusb file.f -lmpi
mpif90 -O3 file.f     mpf90 -fast -xarch=v8plusb file.f -lmpi
mpirun -np 4 ./foo    submit job through Grid Engine with mprun

Dummy scripts for the most common MPICH commands are in place on the system; they merely explain that MPI is not MPICH here, and point to this FAQ entry (and exit with a non-zero status so that configuration/compilation runs don't use the wrong commands).

 

[top]

4. Is BLAS or LAPACK installed?

Not explicitly as separate libraries as you might find on some machines, but the functionality of BLAS and LAPACK is bundled with the sunperf library that is part of the Sun Studio compilers. ScaLAPACK, the parallel version, is available through the ClusterTools package.

See the Center of Excellence articles about BLAS and LAPACK and Interval Arithmetic.

 

[top]

5. Can I login with telnet or rlogin? So how do I login?

"Telnet" and "rlogin" (one of the Berkeley "r*" commands) pass all data, including passwords, over the network in clear text, where it can be snooped, so these tools are just asking for trouble. We only allow login with Secure Shell. There are various implementations of Secure Shell, e.g.

UNIX                  Windows
OpenSSH (free)        PuTTY (free)
SSH Communications    SSH Communications
lsh (free)            Bitvise

Please tell us if you know of other clients that can be added to this list (such as for Mac OS).

 

[top]

6. How do I find out what's going on?

 

  • We make all announcements about HPC, our systems and the Center of Excellence on the web site news service;
  • The message-of-the-day you see at login refers you to this web site;
  • In addition, important items are also sent to the Galaxy mailing list, to which everyone is subscribed by default when they get an account;
  • There is also a news program on the cluster that provides a command-line, or other, browser interface to the news section of the Center of Excellence web site but it's probably more useful to just visit the web site directly from your desktop.


 

[top]

7. Is "xyz" installed?

The most obvious thing is to try running the command you're looking for. If it's installed permanently, it will be made available as part of the standard environment. On the Center's web site is a database of locally installed software.

Some tools are not part of the standard environment, but are nonetheless installed. These will include things like beta-versions of compilers, new tools, etc. that need to be tested before they replace the current versions of standard tools. Relevant news items on the Center's web site will detail how to access these versions in each case.

 

[top]

8. Could you install "xyz" for me?

Ask us at help@hpcc.uh.edu.

We make a distinction between globally installed software that is part of our service provision, and software that particular groups might want to make available to their members.

We support the former, and will provide assistance with the latter (with only a small group, we can't do everything!).

 

[top]

9. The "df" output is horrible

Use

    df -k
or
    df -h
to make it look nicer; use a shell alias or function for permanence. See the df man page for more information.
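
As a minimal sketch of the "shell function" suggestion, something like the following could go in your shell startup file (e.g. ~/.profile). The name dfh is purely illustrative; it is not a command installed on the cluster.

```shell
# Hypothetical helper: "dfh" is an illustrative name, not an installed command.
# Any arguments (e.g. a directory) are passed straight through to df.
dfh() {
    df -h "$@"
}

# Example: show usage for the file system holding the current directory.
dfh .
```

A function is used rather than an alias so that the same definition works in any Bourne-style shell.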

 

[top]

10. "make" doesn't behave like it does under Linux?

The make program is the standard Solaris version (SUNWsprot package).

The GNU version is installed as gmake. The same "g" prefix applies to other GNU versions of standard tools.

 

[top]

11. "tar" doesn't understand the "z" option?

The "tar" program is the standard Solaris version (SUNWcsu package), which only understands the standard UNIX "compress" algorithm.

The GNU version is installed as gtar. The same "g" prefix applies to other GNU versions of standard tools.

Jörg Schilling's star is also available. This version will attempt to auto-detect compressed archives (bzip2, gzip and Lempel-Ziv) and behave accordingly.
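
If you would rather stay with the standard tar, a gzip-compressed archive can also be unpacked by letting gzip do the decompression and piping the result into tar. The sketch below builds a small throwaway archive first so that it is self-contained; the directory and file names are made up for the demonstration.

```shell
# Scratch area for the demonstration (hypothetical names throughout).
demo="${TMPDIR:-/tmp}/tar-demo.$$"
mkdir -p "$demo/src" "$demo/out"
echo "hello" > "$demo/src/hello.txt"

# Create a gzip-compressed archive, like one you might bring over from Linux.
(cd "$demo" && tar cf - src | gzip -c > archive.tar.gz)

# Extract without any "z" option: gzip writes the decompressed archive to
# stdout, and tar reads it from stdin ("f -").
(cd "$demo/out" && gzip -dc ../archive.tar.gz | tar xf -)

# Check the result, then tidy up.
content=`cat "$demo/out/src/hello.txt"`
echo "$content"
rm -rf "$demo"
```

The only line you need in practice is the "gzip -dc ... | tar xf -" pipeline; everything else is scaffolding.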

 

[top]

12. How do I ftp files onto the cluster?

Although you can get/put files through FTP from/to any FTP server (to which you have access), we don't support FTP onto the cluster's login nodes. To transfer files there, use the Secure Shell.

In other words, from somewhere else, you can't do:

    ftp kodos.hpcc.uh.edu
(or to kang.hpcc.uh.edu).

We maintain an aggressive security posture and tend to shy away from clear-text network services. By "aggressive" we mean that this is how we protect your accounts and data.

 

[top]

13. ftp from the login nodes doesn't work?

Passive ftp is necessary to get through our firewall configuration.

 

Command   Usage
ftp       The standard Solaris 9 ftp client can be told to go passive via the "-p" command-line option,
          or via the command "passive" once you've started an ftp session (or through the .netrc file).
ncftp     Automatically detects that it needs to use passive mode.
wget      Handles passive ftp automatically (local configuration has passive-ftp enabled).

 

[top]

14. I got an error message trying to run "xterm" and other visual programs on the cluster?

 

    $ xterm
    xterm Xt error: Can't open display:

Both the Secure Shell server and client must be configured to allow X11Forwarding. The servers on our login nodes are configured to allow X11-over-ssh so check your local client configuration (ssh_config) on the machine you're logging in from to see if X11Forwarding is enabled there. You can also enable it via the command-line and in personal ssh configuration files.

You also must have an X server running on your desktop!
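
A quick way to tell whether forwarding is working is to look at the DISPLAY variable once you are logged in. The sketch below assumes an OpenSSH client on your desktop (where -X requests X11 forwarding); check_x11 is a hypothetical helper name and "yourname" is a placeholder.

```shell
# Run this on the login node after connecting with, for example:
#     ssh -X yourname@kang.hpcc.uh.edu
# ("yourname" is a placeholder; -X is the OpenSSH option for X11 forwarding.)

# Hypothetical helper: reports whether this session has an X display set.
check_x11() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "X11 forwarding looks active: DISPLAY=$DISPLAY"
    else
        echo "DISPLAY is not set: X11 forwarding is not enabled for this session"
        return 1
    fi
}

check_x11 || true
```

If DISPLAY is empty, the problem is on the client side of the connection (or there is no X server on your desktop) rather than with xterm itself.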


[top]

15. My .forward file is being ignored (do I need a .forward?)

The system automatically forwards your mail to the (preferred) address you gave us when you got your account. Mail also appears to come from that address, rather than from the cluster, because we don't run a mail service.

If your contact email address changes, please let us know at help@hpcc.uh.edu.

 

[top]

 

16. How do I run my jobs?

Grid Engine Enterprise Edition is currently used as the job submission agent.

We've put together some sample programs for various job types for you on the cluster under the directory

    /galaxy/local/examples
(NB this directory is not on the web server!)

NB We are not omniscient: these examples reflect our best understanding at any given time, and they're subject to change if we find a better way to do something shown there or find an error.

Feel free to contribute code if you have interesting examples you'd like to share with the HPC community on campus.
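
To give a flavour of what a job file looks like, here is a minimal sketch of a serial job script. It is not one of the installed examples: the job name is made up, and only generic Grid Engine options are used. Anything specific to this cluster (pools, parallel environments, mprun usage) should be taken from the examples directory instead.

```shell
#!/bin/sh
# Lines beginning with "#$" are read by Grid Engine as qsub options;
# the shell itself treats them as ordinary comments.

# Job name ("hello_job" is a hypothetical name).
#$ -N hello_job
# Run the job in the directory it was submitted from.
#$ -cwd
# Merge standard error into standard output.
#$ -j y
# Interpret the script with the Bourne shell.
#$ -S /bin/sh

# The actual work: report which node the job landed on.
node=`uname -n`
echo "hello_job running on $node"
```

Saved as, say, hello.sh, this would be submitted with "qsub hello.sh". Because the options are comments as far as the shell is concerned, the same file can also be run by hand for a quick check.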

 

[top]

 

17. Where do my jobs run?

The cluster consists of various machines, and is split into a number of processor pools. At the time of writing there are 4 pools.

 

Pool / Description

admin
    This pool includes the login nodes and is not used for running jobs. Programs run on the login nodes are not accounted for and should not last longer than 30 minutes. Actual runs of jobs for research and parallel programs must be submitted to Grid Engine. Why? Well,
      1. if everyone ran large codes on the login nodes they would grind to a halt and everyone would lose;
      2. by submitting jobs you enable us to collect job statistics across the cluster; these statistics allow us to go and ask for more funding. And that means bigger and faster machines for you to use!

shortpool
    This pool consists of 2 machines each with 2 CPUs and 2GB of memory. Jobs here are restricted to these 4 CPUs.
    Think of this as a test area (at the time of writing, we are testing Solaris 10 and new Myrinet drivers). In general, do not expect to be able to do timing runs in the shortpool.
    These machines may be rebooted at short notice, but we won't just pull them out from under you if that can be avoided.

hpc
    The general compute pool. There are 92 CPUs spread over the machines in this pool. Jobs can run for up to 1 week (that's 7 actual days of computation no matter how many CPUs you request).

core
    This pool overlaps the hpc pool, but is only available to specific users working directly within the Center of Excellence.

 

[top]

18. How do I talk to the job submission system?

 

Function                            Command example       Extra comments
Graphical User Interface            qmon                  Most people just use a text editor to write job files
                                                          and submit jobs from the command-line
Submit a job                        qsub [file]           Takes input from either stdin, or a file
Examine the queues                  qstat                 Standard Grid Engine command, rather verbose output
Examine the queues (local)          qmaster               Quick abbreviated output
                                    showq                 Provides more statistics
                                                          Locally written commands that provide an easier-to-read
                                                          summary of running and queued jobs.
                                                          Also see graphic of processor usage.
Why is a job not running?           qstat -j <jobnumber>  See the last line of the output
List the machines in the cluster    qhost                 All the nodes of the cluster, their contents and current state
Find machines of a particular type  qhost -l num_proc=8   Shows which machines in the cluster have ≥ 8 CPUs
                                                          (N.B. num_proc=N means machines with at least N CPUs)
Remove a job from a queue           qdel <jobnumber>      This works for both running and queued jobs
Forcibly remove a job from a queue  qdel -f <jobnumber>   In case a simple qdel doesn't delete the job

 

[top]

19. qmon dies on Linux systems

If the graphical interface to Grid Engine, qmon, refuses to run, instead generating a nasty error on a Linux machine, this is because qmon only supports a 16-bit display depth. You need to change your X11 display depth to 16 bits. (Annoying, but c'est la vie.) Most people write their own job scripts or adapt the online examples.

[top]

20. a.out: command not found

I just compiled a program and tried to run it but the system said that the resultant executable couldn't be found?

In order to run programs in the current directory you need to indicate how to find that directory.

The safe way is to say

  $ ./a.out

Adding "." to the PATH in your environment is another way but that carries security risks. We do not set this by default, and we do not recommend it.

See the entry in the comp.unix.shell FAQ for a more detailed discussion.

[top]

 

21. whoami: command not found

Commands such as whoami are in the /usr/ucb directory.

See the Solaris FAQ for more information.

/usr/ucb is deprecated and it is generally not recommended to rely on commands there being available in the future. However, if you want to add /usr/ucb to your PATH, feel free. The safest place to put it would be at the end, otherwise you will override standard Solaris commands and weird things may well happen.
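
If you do decide to add it, a sketch of the "at the end" approach for a Bourne-style shell startup file looks like this:

```shell
# Append (never prepend) /usr/ucb so that the standard Solaris commands
# are still found first.
PATH="$PATH:/usr/ucb"
export PATH

# Show the resulting search order, one directory per line.
echo "$PATH" | tr ':' '\012'
```

Directories are searched left to right, so anything that exists both in the standard locations and in /usr/ucb will still resolve to the standard version.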

 

[top]

 

22. I need accurate system clocks for my programs

We use NTP to synchronize 2 administrative machines of the cluster from

  • the main UH time server, and Texas A&M
  • a known pool of public servers (secondary)
These local machines then broadcast into the cluster on the administrative 100Mbit network (not normally used for file I/O or MPI) to provide local reference clocks. Some auxiliary equipment, like the managed RAID arrays (/galaxy), synchronizes in this way too.

The Sun clocks on SPARC machines tend to be very accurate; any drift is on the order of a few microseconds, and often much less.

If you want to see the synchronisation for whatever reason, use the command:

    $ ntpq -c peers

[top]


23. Why are the login nodes called "kang" and "kodos"?

We're glad you asked about our cartoon obsession. They are the aliens on The Simpsons.

This might help explain things, but then again it might not:

Kang and Kodos

Worst login nodes. Ever. You have to vote for one of us!

[top]

 

24. So is there a difference between "kang" and "kodos"?

Basically, no difference at all. They have the same amount of memory, the same number of CPUs, the same network connections, and see the same applications and home directories.

We should make it clear, kang and kodos are merely the machines through which you access the cluster. There's much more inside, where you run your jobs.

But be aware that some files will be local to an individual node, such as under /tmp and /var/tmp. "Cron" and "at" jobs are also local to the machine where they were created. Further, /tmp is volatile; anything under there will disappear after a reboot.

[top]

 

25. My program crashes when I open large files or try to allocate lots of memory

Basically, your compiled code is 32-bit and cannot handle such large objects (more than about 2GB). The simplest solution is to make the code 64-bit as all machines are headed that way now.

See this article on our news site for more information.

[top]

 

26. Can I have a login shell that's not listed?

Yes, if it's installed already. But you need to be aware that you won't get any support at all with regard to shell issues.

[top]

 

27. What file systems are available?

Each node has local file systems for the O.S. and temporary files.

But in a wider sense, there are currently 2 file systems available from the Sun cluster:

  1. /galaxy which is visible throughout the Sun cluster. It's used for applications and home directories.
  2. /cxfs which is available on all HPCC machines. Use this for sharing project data between the various machines.
    You can create your own project directory under /cxfs/work, e.g. /cxfs/work/quantum if you're working on a project called "quantum".

[top]

28. How do I choose which interconnects to use?

There are 4 possible interconnects on the cluster:

  1. Shared Memory
  2. Myrinet
  3. Gigabit Ethernet
  4. 100 Mbit Ethernet

The default behaviour (we're talking about MPI here) is as follows:

  • inside a single node, use shared memory (shm)
  • between nodes, prefer Myrinet (myr) → Gigabit (tcp) → 100 Mbit (tcp)

So the preference is: shm → myr → tcp.

ClusterTools will fail over through these possibilities. If Myrinet is down for whatever reason, Gigabit will be tried, then 100 Mbit (although the test order can be configured at the system level).

You can override this setting with the environment variable MPI_PMODULES. In the job script, you can say

    #$ -v MPI_PMODULES="shm,tcp"

to make jobs ignore Myrinet. If MPI_PMODULES is not set (default) then it is the same as

    #$ -v MPI_PMODULES="shm,myr,tcp"

There's also a news article about how to control SunMPI.
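
Put together, a job script that overrides the interconnect choice might start as in the sketch below. It only reports the setting; launching the MPI program itself should follow the examples under /galaxy/local/examples. The variable name "modules" is illustrative.

```shell
#!/bin/sh
# Run in the directory the job was submitted from.
#$ -cwd
# Ignore Myrinet for this job: shared memory inside a node, TCP between nodes.
#$ -v MPI_PMODULES="shm,tcp"

# Report the effective setting. An unset variable means the default
# preference order described above.
modules="${MPI_PMODULES:-shm,myr,tcp}"
echo "MPI protocol modules in use: $modules"

# ... launch your MPI program here, as shown in the installed examples ...
```

Echoing the value at the top of the job output makes it easy to confirm later which interconnects a particular run was allowed to use.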

[top]

 

29. Oops, how do I get back a deleted file?

We do incremental backups of the shared file systems every night. If the file was created before the last backup we should have it, and 3 previous versions. Please ask to get it restored. Requests can of course be for single or multiple files, or whole directory structures.

In the request, please make sure you tell us

  • exactly where the file is (include the path);
  • if it's the most recent copy you want or from a specific date;
  • how to restore; in place (overwrite any existing files) or somewhere temporary for you to grab later.
In the future, we hope to automate all of this so you can simply archive and/or restore yourself from the backup server without having to ask someone here.

[top]


 