Sun Center of Excellence Galaxy Cluster - Frequently Asked Questions
HPCC Team <help@hpcc.uh.edu>
Abstract
This document presents some questions and answers
about the Sun Microsystems "Galaxy" cluster located
in the High Performance Computing Center at the
University of Houston.
Your input is appreciated to make this FAQ more useful.
Copyright © The University of Houston 2003-2005
Contents
1. How do I report a bug or problem or ask a question?
In general, email us at help@hpcc.uh.edu. (It's just us HPC guys, by the way; you won't end up going in circles around a help desk somewhere.) This feeds into a trouble-ticket system, which helps us track problems through to resolution. Please don't report problems to individual members of HPCC: the trouble-ticket system lets all of us see what's going on and help with problem resolution, so if one person is away on vacation there will always be someone else around to field questions. The tickets also give us a record of previous problems to help with future enquiries. The database of tickets is private. And please:
2. How do I acknowledge the Center in a publication?
If you publish work that used the resources of the Center of Excellence in some way, please acknowledge this. There's no set text, but something along the lines of
This work was performed using the computational resources of the Sun Microsystems Center of Excellence in the Geosciences at the University of Houston. See http://www.suncoe.uh.edu/ for information about the Center.
would be suitable.
3. Where are mpicc, mpif77 (the MPI compilers), mpirun and so on?
I'm used to using MPICH for MPI programs on Linux Beowulf clusters, but I can't find MPICH on the Galaxy cluster. Where is it?
The MPI on this machine is Sun's own implementation, optimized for their architecture. In case you weren't aware, MPI is a specification; it is not a language or a particular implementation. MPICH, LAM and ClusterTools, among others, are implementations of that specification. MPICH is not installed, and you cannot use it as an alternative, as the queuing system is integrated with SunMPI / ClusterTools 5. Here's a quick translation, although:
Dummy scripts for the most common MPICH commands are in place on the system; they merely explain that MPI is not MPICH here, and point to this FAQ entry (and exit with a non-zero status so that configuration/compilation runs don't use the wrong commands).
4. Is BLAS or LAPACK installed?
Not as separate libraries, as you might find on some machines, but the functionality of BLAS and LAPACK is bundled with the sunperf library that is part of the Sun Studio compilers. ScaLAPACK, the parallel version, is available through the ClusterTools package. See the Center of Excellence articles about BLAS and LAPACK and about Interval Arithmetic.
5. Can I log in with telnet or rlogin? So how do I log in?
Telnet and rlogin (one of the Berkeley "r*" commands) pass all data, including passwords, over the network in clear text, where it can be snooped, so these tools are just asking for trouble. We only allow login with Secure Shell. There are various implementations of Secure Shell, e.g.
Please tell us if you know of other clients that can be added to this list (such as for Mac OS).
6. How do I find out what's going on?
7. Is "xyz" installed?
The most obvious thing is to try running the command you're looking for. If it's installed permanently, it will be available as part of the standard environment. The Center's web site has a database of locally installed software. Some tools are installed but are not part of the standard environment. These include beta versions of compilers, new tools and so on that need to be tested before they replace the current versions of standard tools. Relevant news items on the Center's web site detail how to access these versions in each case.
8. Could you install "xyz" for me?
Ask us at help@hpcc.uh.edu. We make a distinction between globally installed software that is part of our service provision, and software that particular groups might want to make available to their members. We support the former, and will provide assistance with the latter (we are only a small group, so we can't do everything!).
9. The "df" output is horrible
Use df -k or df -h to make it look nicer; use a shell alias or function for permanence. See the df man page for more information.
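As a minimal sketch of the shell-function approach, something like the following could go in your shell start-up file. The function name dfh is only an illustration, and a Bourne-compatible shell is assumed.

```shell
# Illustrative wrapper (the name "dfh" is an assumption, pick your own).
# Any arguments are passed through, so "dfh /tmp" still works.
dfh() {
    df -h "$@"
}

# Example: show usage of the root file system in human-readable units.
dfh /
```

A function is used here rather than an alias because it works the same way in scripts and in interactive shells.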
10. "make" doesn't behave like it does under Linux?
The make program is the standard Solaris version (SUNWsprot package). The GNU version is installed as gmake. The same "g" prefix applies to other GNU versions of standard tools.
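If you build the same code on Linux and on the cluster, a build script can choose the right command at run time. This is only a sketch: the variable name MAKE is an illustration, and it assumes a POSIX shell (ksh or bash, say) that provides "command -v".

```shell
# Prefer GNU make where it is installed under the "g" prefix (as on this
# cluster); otherwise fall back to plain "make" (typical on Linux).
if command -v gmake >/dev/null 2>&1; then
    MAKE=gmake
else
    MAKE=make
fi
echo "Using: $MAKE"

# Then build with, for example:  "$MAKE" all
```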
11. "tar" doesn't understand the "z" option?
The "tar" program is the standard Solaris version (SUNWcsu package), which only understands the standard UNIX "compress" algorithm. The GNU version is installed as gtar. The same "g" prefix applies to other GNU versions of standard tools. Jörg Schilling's star is also available. This version will attempt to auto-detect compressed archives (bzip2, gzip and Lempel-Ziv) and behave accordingly.
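A portable alternative that needs no "z" option is to let gzip do the decompression and pipe the result into tar. The sketch below is self-contained: it builds a small throw-away archive first, and the file and directory names are made up for the demonstration. It assumes gzip and mktemp are on your PATH.

```shell
# Self-contained demo: build a small gzip-compressed archive, then unpack
# it without relying on tar's "z" option.
workdir=$(mktemp -d)
mkdir -p "$workdir/demo" "$workdir/out"
echo "hello" > "$workdir/demo/hello.txt"

# Create demo.tar.gz: tar writes to stdout, gzip compresses the stream.
(cd "$workdir" && tar cf - demo | gzip -c > demo.tar.gz)

# Unpack: gzip decompresses to stdout, tar reads the archive from stdin.
(cd "$workdir/out" && gzip -dc ../demo.tar.gz | tar xf -)

cat "$workdir/out/demo/hello.txt"   # prints "hello"
```

With the GNU version the same unpacking is a single step, for example gtar xzf demo.tar.gz.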
12. How do I ftp files onto the cluster?
Although you can get/put files through FTP from/to any FTP server (to which you have access), we don't support FTP onto the cluster's login nodes. To transfer files there, use the Secure Shell. In other words, from somewhere else, you can't do: ftp kodos.hpcc.uh.edu (or to kang.hpcc.uh.edu). We maintain an aggressive security posture and tend to shy away from clear-text network services. By "aggressive" we mean that this is how we protect your accounts and data.
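For example, if your local Secure Shell installation provides the scp command (OpenSSH does), a transfer could be wrapped as below. This is a sketch only: the helper name push_to_cluster, the username and the file name are all hypothetical.

```shell
# Hypothetical helper: copy one local file into your home directory on a
# login node. Assumes "scp" is provided by your local Secure Shell client.
push_to_cluster() {
    # $1 = your cluster username, $2 = local file to copy
    scp "$2" "$1@kodos.hpcc.uh.edu:"
}

# Usage (not run here, because it needs a real account):
#   push_to_cluster your_username results.tar
```

The trailing colon after the host name tells scp that the target is remote; with nothing after it, the file lands in your home directory.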
13. ftp from the login nodes doesn't work?
Passive ftp is necessary to get through our firewall configuration.
14. I got an error message trying to run "xterm" and other visual programs on the cluster?
$ xterm
Both the Secure Shell server and client must be configured to allow X11 forwarding. The servers on our login nodes are configured to allow X11-over-ssh, so check the client configuration (ssh_config) on the machine you're logging in from to see whether forwarding is enabled there. (The server option is X11Forwarding; in OpenSSH the corresponding client option is ForwardX11.) You can also enable it on the command line (ssh -X with OpenSSH) and in personal ssh configuration files. You also must have an X server running on your desktop!
15. My .forward file is being ignored (do I need a .forward?)
The system automatically forwards your mail to the (preferred) address you gave us when you got your account. Mail also appears to come from that address, rather than from the cluster, because we don't run a mail service. If your contact email address changes, please let us know at help@hpcc.uh.edu.
16. How do I run my jobs?
GridEngine Enterprise Edition is currently used as the job submission agent. We've put together some sample programs for various job types for you on the cluster under the directory /galaxy/local/examples (NB this directory is not on the web server!).
Feel free to contribute code if you have interesting examples you'd like to share with the HPC community on campus.
17. Where do my jobs run?
The cluster consists of various machines, and is split into a number of processor pools. At the time of writing there are 4 pools.
18. How do I talk to the job submission system?
19. qmon dies on Linux systems
If the graphical interface to Grid Engine, qmon, refuses to run on a Linux machine and generates a nasty error instead, this is because qmon only supports a 16-bit display depth. You need to change your X11 display depth to 16 bits. (Annoying, but c'est la vie.) Most people write their own job scripts or adapt the online examples.
20. a.out: command not found
I just compiled a program and tried to run it, but the system said that the resultant executable couldn't be found?
In order to run programs in the current directory you need to indicate how to find that directory. The safe way is to say
$ ./a.out
Adding "." to the PATH in your environment is another way, but that carries security risks. We do not set this by default, and we do not recommend it. See the entry in the comp.unix.shell FAQ for a more detailed discussion.
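The difference between the two lookups can be seen with a small experiment. In this sketch a tiny shell script stands in for a compiled a.out, so no compiler is needed. The temporary directory and the variable names are illustrative, and a POSIX shell with mktemp is assumed.

```shell
# Build a stand-in "a.out" in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '#!/bin/sh\necho "hello from a.out"\n' > a.out
chmod +x a.out

# Lookup by bare name searches PATH only, so this succeeds only if "."
# (or the scratch directory itself) happens to be in PATH.
if command -v a.out >/dev/null 2>&1; then
    by_name="found via PATH"
else
    by_name="not found via PATH"
fi
echo "a.out by name: $by_name"

# An explicit relative path bypasses the PATH search entirely.
by_path=$(./a.out)
echo "a.out by path: $by_path"
```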
21. whoami: command not found
Commands such as whoami are in the /usr/ucb directory. See the Solaris FAQ for more information. /usr/ucb is deprecated, and we do not recommend relying on the commands there being available in the future. However, if you want to add /usr/ucb to your PATH, feel free. The safest place to put it is at the end; otherwise you will override standard Solaris commands and weird things may well happen.
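A minimal sketch of adding the directory at the end of PATH, suitable for a Bourne-compatible start-up file, is shown below. The guard against adding it twice is an extra, not something the system requires.

```shell
# Append /usr/ucb at the END of PATH so that the standard Solaris
# commands are still found first.
case ":$PATH:" in
    *:/usr/ucb:*) ;;                 # already present, nothing to do
    *) PATH="$PATH:/usr/ucb" ;;
esac
export PATH
```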
22. I need accurate system clocks for my programs
We use NTP to synchronize 2 administrative machines of the cluster from
The Sun clocks on SPARC machines tend to be very accurate; any drift is on the order of a few microseconds, and often much less. If you want to see the synchronisation for whatever reason, use the command:
$ ntpq -c peers
23. Why are the login nodes called "kang" and "kodos"?
We're glad you asked about our cartoon obsession. They are the aliens on The Simpsons. This might help explain things, but then again it might not: Worst login nodes. Ever. You have to vote for one of us!
24. So is there a difference between "kang" and "kodos"?
Basically, no difference at all. They have the same amount of memory, the same number of CPUs, the same network connections, and see the same applications and home directories. To be clear, kang and kodos are merely the machines through which you access the cluster. There's much more inside, where you run your jobs. But be aware that some files are local to an individual node, such as those under /tmp and /var/tmp. "Cron" and "at" jobs are also local to the machine where they were created. Further, /tmp is volatile; anything under there will disappear after a reboot.
25. My program crashes when I open large files or try to allocate lots of memory
Basically, your compiled code is 32-bit and cannot handle such large objects (more than about 2GB). The simplest solution is to build the code as 64-bit, as all machines are headed that way now. See this article on our news site for more information.
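The "about 2GB" figure matches the largest value a signed 32-bit integer can hold. That such an integer is what stores the size or offset in your program is an assumption here, though it is the usual cause. The arithmetic can be checked in any shell with 64-bit integer arithmetic, bash for instance:

```shell
# Largest value a signed 32-bit integer can hold.
max32=$(( (1 << 31) - 1 ))
echo "$max32 bytes"                           # 2147483647 bytes

# The same limit expressed in GiB: (max32 + 1) / 2^30
echo "$(( (max32 + 1) / (1 << 30) )) GiB"     # 2 GiB
```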
26. Can I have a login shell that's not listed?
Yes, if it's installed already. But you need to be aware that you won't get any support at all with regard to shell issues.
27. What file systems are available?
Each node has local file systems for the O.S. and temporary files. But in a wider sense, there are currently 2 file systems available from the Sun cluster:
28. How do I choose which interconnects to use?
There are 4 possible interconnects on the cluster:
The default behaviour (we're talking about MPI here) is as follows:
So the preference is: shm → myr → tcp. ClusterTools will fail over through these possibilities. If Myrinet is down for whatever reason, Gigabit will be tried, then 100 Mbit (although the test order can be configured at the system level). You can override this setting with the environment variable MPI_PMODULES. In the job script, you can say
#$ -v MPI_PMODULES="shm,tcp"
to make jobs ignore Myrinet. If MPI_PMODULES is not set (the default), the effect is the same as
#$ -v MPI_PMODULES="shm,myr,tcp"
There's also a news article about how to control SunMPI.
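Put together, a job script that ignores Myrinet might start like the sketch below. The job name and the variable pmodules_report are hypothetical, and -N and -cwd are ordinary Grid Engine options added only for illustration. Comments sit on their own lines rather than after the "#$" options.

```shell
#!/bin/sh
# Lines starting with "#$" are read by Grid Engine as submission options;
# to the shell they are ordinary comments.

# Hypothetical job name.
#$ -N pmodules_demo
# Run the job from the directory it was submitted from.
#$ -cwd
# Ignore Myrinet: try shared memory first, then TCP.
#$ -v MPI_PMODULES="shm,tcp"

# Record the setting in the job output so it can be checked afterwards.
pmodules_report="MPI_PMODULES=${MPI_PMODULES:-<not set>}"
echo "$pmodules_report"

# Launch your MPI program here, following the samples in
# /galaxy/local/examples.
```

Run by hand outside the queuing system, the script reports the variable as not set, because only Grid Engine acts on the "#$" lines.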
29. Oops, how do I get back a deleted file?
We do incremental backups of the shared file systems every night. If the file was created before the last backup, we should have it, and 3 previous versions. Please ask to get it restored. Requests can of course be for single or multiple files, or whole directory structures. In the request, please make sure you tell us
