Condor Grid at UH
Wednesday, 20 August 2008 06:44

About the Condor Grid at UH


Condor is a software framework for distributing computing workload across a variety of platforms. It can schedule and run jobs on dedicated compute clusters as well as "scavenge" resources from otherwise idle desktop computers.

The grid at UH has access to dedicated and scavenged cycles from servers at the RCC as well as from selected desktop computers on the UH campus.

The grid is available to all UH researchers and to graduate and undergraduate students.

At this time, 6 pools form a grid through a feature known as "flocking". A user with local access to any of the individual pools can submit jobs to the entire grid.

3 Universes in 6 Pools

In Condor the execution environment is called a universe. The UH Condor grid supports three universes: standard, vanilla and java. The standard universe is the default, but any one can be selected when a job is submitted.

The standard universe supports checkpointing. Condor jobs may be temporarily interrupted by jobs with higher priority. In the standard universe, an interrupted job resumes from its last checkpoint rather than restarting from the beginning. To run in the standard universe, you must relink your program with the Condor support libraries using the condor_compile command; no code changes are required.
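The benefit of checkpointing can be illustrated with a toy model. The sketch below is not how Condor implements checkpoints; it assumes a job that saves its progress every fixed number of work units, and the function name and parameters are hypothetical. It counts how many run intervals a job needs when each interval is cut short by a higher-priority job.

```python
def run_with_preemptions(total_work, run_lengths, checkpoint_every=None):
    """Return how many run intervals the job needs to finish, or None.

    total_work: units of work the job must complete.
    run_lengths: work units the job can do in each interval before it
        is interrupted by a higher-priority job.
    checkpoint_every: progress is saved every N units (an assumption of
        this toy model); None means no checkpointing, so every
        interruption sends the job back to the beginning.
    """
    saved = 0  # progress that survives an interruption
    for attempt, length in enumerate(run_lengths, start=1):
        progress = saved + length
        if progress >= total_work:
            return attempt
        if checkpoint_every:
            # Keep only the work done up to the most recent checkpoint.
            saved = (progress // checkpoint_every) * checkpoint_every
        else:
            saved = 0
    return None  # the job never got a long enough interval


# A 10-unit job that is interrupted after every 4 units of work.
with_ckpt = run_with_preemptions(10, [4, 4, 4, 4], checkpoint_every=2)
without_ckpt = run_with_preemptions(10, [4, 4, 4, 4])
print(with_ckpt, without_ckpt)
```

In this model the checkpointed job finishes in its third interval, while the job without checkpoints never finishes because no single interval is long enough.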

The vanilla universe lacks support for checkpointing, but will run programs that cannot be relinked. It also runs scripts written for interpreters such as command shells, Perl or Python.

The java universe provides a JVM with an appropriate classpath for executing Java applications.
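The universe is chosen in the submit description file. As a minimal sketch, the helper below builds the text of such a file for one of the three universes listed above. The helper name, the file names, and the job name are hypothetical; the keywords it writes (universe, executable, arguments, output, error, log, queue) are standard submit description commands.

```python
SUPPORTED_UNIVERSES = ("standard", "vanilla", "java")


def render_submit(universe, executable, arguments="", name="job"):
    """Return the text of a simple Condor submit description file."""
    if universe not in SUPPORTED_UNIVERSES:
        raise ValueError(f"unsupported universe: {universe!r}")
    lines = [
        f"universe   = {universe}",
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments  = {arguments}")
    lines += [
        f"output     = {name}.out",  # stdout of the job
        f"error      = {name}.err",  # stderr of the job
        f"log        = {name}.log",  # Condor's event log for the job
        "queue",                     # submit one instance of the job
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: a Python script run in the vanilla universe.
print(render_submit("vanilla", "analyze.py", "input.dat", name="analyze"))
```

The resulting text can be saved to a file and passed to condor_submit from the frontend node of a pool.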

In the RCC, each compute cluster participating in the Condor grid appears as a separate pool. Use the frontend or "login" node in the cluster to submit and monitor jobs. The compute nodes are dedicated to job execution. Jobs submitted in a particular pool will generally start on the execution nodes of that cluster, but if no resources are available there, the job is automatically assigned to another pool in the grid.
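The fallback from the local pool to the rest of the grid can be pictured with a simplified model. The sketch below is only an illustration: real Condor matchmaking considers job requirements and priorities, and the pool names, the function name, and the slot counts here are hypothetical.

```python
def choose_pool(local_pool, flock_order, idle_slots):
    """Pick the pool where a job would start in a simplified flocking model.

    local_pool: the pool the job was submitted to.
    flock_order: the other pools, in the order they are tried.
    idle_slots: mapping of pool name to number of idle execution slots.
    """
    candidates = [local_pool] + [p for p in flock_order if p != local_pool]
    for pool in candidates:
        if idle_slots.get(pool, 0) > 0:
            return pool
    return None  # no idle slot anywhere; the job waits in the queue


# The local pool is full, so the job moves to the first pool with a free slot.
slots = {"pool_a": 0, "pool_b": 0, "pool_c": 3}
print(choose_pool("pool_a", ["pool_b", "pool_c"], slots))
```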

Suggested Reading

Basic Condor Commands

Condor Submit Description Files

Condor Version 7.4 Manual

Wikipedia: the Free Encyclopedia "Condor High-Throughput Computing System"

Last Updated on Tuesday, 24 August 2010 11:51
 
BOINC at the RCC
Wednesday, 28 April 2010 10:30

BOINC in the RCC

Cycle-Scavenging

Within the clusters that HPC manages in the RCC, absolute priority is given to jobs submitted by local cluster users. They use TORQUE to submit and manage their jobs. TORQUE jobs run with a reserved set of resources for a guaranteed amount of time. They run until they finish, or until the requested amount of time is exceeded. Most of the compute jobs on Maxwell, our flagship cluster, run under TORQUE. For various reasons, there are times when one or more compute nodes in a cluster are not in use by TORQUE jobs. The period of time a node is idle may be short or relatively long, but it is always unpredictable, so TORQUE cannot schedule work into it in advance.

Rather than allow compute cycles to go to waste, we decided to use a cycle-scavenging system, Condor, alongside TORQUE.

Condor is useful for jobs that can run intermittently, suspending and resuming, perhaps repeatedly, over a long period of time. The jobs use checkpointing to maintain state if they are interrupted. When they resume, they restore their previous state and continue from where they left off. As we use it, Condor also supports BOINC.

BOINC jobs also run intermittently, saving their results by checkpointing. At times, BOINC will report results to a central project server and ask for more work. It is up to the project server to manage the progress of the computational problem. Usually a project server will hand out just enough work to keep a BOINC instance busy for a few hours, expecting to get the results within a few days at most. A particular BOINC installation will typically run computations for more than one research project from one run to the next.

Condor and BOINC jobs are scheduled opportunistically. If a compute node is left idle by TORQUE for even a few minutes, cycle scavenging starts. The moment TORQUE needs the node again, Condor / BOINC will back off. Even at this rate, the accumulated computing time is significant. Over time, in a grid of several thousand cores, it has amounted to hundreds of CPU-years.
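A rough calculation shows how short idle periods can add up to that order of magnitude. The figures below are hypothetical and chosen only for illustration; they are not measurements from the RCC.

```python
def scavenged_cpu_years(cores, idle_fraction, years):
    """Estimate CPU-years recovered by cycle scavenging.

    cores: number of cores in the grid.
    idle_fraction: share of time a core is idle and available (0 to 1).
    years: length of the period considered, in years.
    """
    if not 0 <= idle_fraction <= 1:
        raise ValueError("idle_fraction must be between 0 and 1")
    return cores * idle_fraction * years


# Hypothetical inputs: 3000 cores, idle 2% of the time, over 3 years.
print(round(scavenged_cpu_years(3000, 0.02, 3)))
```

With these assumed inputs the estimate is about 180 CPU-years, even though each core is available to scavenged jobs for only a small share of the time.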

BOINC Projects

The UH IT HPC team is the primary contributor of cycles for research directed in the Department of Computer Science at the University of Houston: Virtual Prairie (VIP) powered by BOINC. We also contribute support to several projects at the World Community Grid (WCG). Two major current projects are Discovering Dengue Drugs and Help Fight Childhood Cancer.

Contributing Your Cycles

There are many projects in the world of volunteer computing. We believe that VIP and the projects in the WCG are very worthy of support. Should you decide to contribute your own cycles, you are welcome to join our team, UH IT HPC. Of course, there are many other teams to choose from, or you could decide to create your own team. It all benefits the same cause.

Join the VIP Project

Join the World Community Grid

A Green Hint

UH staff, faculty, and students are encouraged to allow the software to run during weekdays, rather than leave University computers on at night or over weekends. Basically, use your computer as you normally would. BOINC is designed to run in the background.

Run BOINC Only On Authorized Computers

You should run BOINC only on computers which you own, or for which you have obtained the owner's permission.

Last Updated on Tuesday, 11 May 2010 08:51
 