The twin Sun Ultra 10 servers we are using for general logins and statistical processing are known as "wjh.harvard.edu".
The new systems have 512 megabytes of RAM and when fully configured will be running two processors. The method for running statistical packages is somewhat new: check out general information and also specific sections on Stata, SPSS and SAS.
Disk space is provided through a RAID 5 disk array called a Viper. The Viper provides 20GB of storage space to be used both for common datasets and for private storage of datasets for statistical work. This very fast drive is mounted directly on ISR2 and provides the most efficient access to data for statistical processing. Space for storing data online has been markedly increased, and you will find some new directory names and guidelines for use. Read "Storing and Accessing Data" below.
We will continue to provide some locally developed facilities for
extracting data: check out "Local
Access Programs" below. And welcome to a faster and more spacious
environment!
General to all packages:
While we have a fast machine with ample memory, it may still be possible for many users executing large statistics jobs simultaneously to run the machine ragged, "sharing time" rather than processing. We ask your cooperation in observing the following guideline -- whether your process is interactive or background, please confine your personal use of a statistical package to one process at a time. This means you will not submit five jobs to SPSS simultaneously, log out and go home. However, you are welcome to use the queuing facility described below to submit those five jobs in a single command to run sequentially, and then log out and go home. Benchmarks on the new system have shown a significant speed increase, so more often you will find the machine keeping pace with your requests.
On our prior system, a locally written utility intervened when statistical packages were called and placed these jobs in system-wide grocery-store-style queues -- we had an express "check" line, a regular line and a big job line. There was a definite limit to the number of stat jobs that could run simultaneously. Behind this queuing were a need to protect the machine from overloading and, in some cases, licensing issues with the stat packs.
The queuing mechanism for our new machine is much simpler:
licensing is no longer a factor and our increased capability lessens
the risk of overloading this machine. For each package, you will have
a q____ command (qstata, qspss, qsas) for submitting one or more
background jobs in sequence -- see specific package writeups for
detail. Some general examples will give the flavor:
qspss myjob &
Suppose you have four jobs ready to go at once (a Stata example
this time):
qstata job1 job2 whatever whatelse &
Note: "qstata job1 job2 whatever whatelse &" is entirely different from four separate commands issued one immediately after the other: "qstata job1 &", "qstata job2 &", "qstata whatever &", "qstata whatelse &". The first method runs "job1.do" to completion, then "job2.do" and so on; it follows the one statistical process per user guideline. It also means that any dependency job2 might have on job1's actions would be okay -- provided job1.do ran correctly. Submitting these four background jobs separately would place them in the system nearly simultaneously. There would be no guarantee that job1 would complete before job2 started and, more important from other users' perspective, one user would have 4 jobs in the system at once.
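The run-to-completion behavior described in the note above can be sketched in plain shell. This is only an illustration, not the actual qstata implementation: the function name run_in_sequence and the echo lines standing in for a statistics run are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical stand-in for a q____ command: take job names in order
# and finish each one before starting the next.
run_in_sequence() {
    for job in "$@"; do
        echo "start $job.do"
        # A real queue would run the statistics package here and wait for it.
        echo "finish $job.do"
    done
}

# One command, four jobs, strictly one after another.
order=$(run_in_sequence job1 job2 whatever whatelse)
echo "$order"
```

Putting the whole call in the background, as in "run_in_sequence job1 job2 &", would keep the jobs sequential while freeing the terminal, which mirrors a single qstata command ending in "&".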
If need be we will monitor and prevent this type of use through
local utilities, but we will first see if a simple "one user: one
process" guideline can work. That way we will have no "monitoring
process" using system resources and users can employ the system
differentially depending on load -- i.e. are there ten users running
Stata at 3 p.m., or one user alone on the system at 4 a.m.?
Monitoring Jobs
1) Your Own: you may often find jobs running so quickly that there is no need to monitor, but here's how, just in case.
2) General: We have a new flavor of Unix and new arguments for the "ps" command, so don't be surprised if you have difficulty initially -- type "man ps" for detail. Meanwhile, you may find the following two aliases useful to add to your .cshrc file for job and usage monitoring:
The odd string "\!$" allows you to put an argument in your request to the alias; it picks up the last term in your request as the argument.
Play around with them until you see what combination gets you the
most useful information. Do be careful anytime you edit your ".cshrc"
file -- a mistake here can cause your login sessions to behave
peculiarly. Ask for help the first time or two if you haven't created
new aliases before.
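As a rough illustration of checking on your own processes, the sketch below starts a harmless "sleep" as a stand-in for a statistics job and then looks at it with "ps". The flags shown are common ones and are an assumption here; they may differ on this flavor of Unix, so "man ps" remains the authority.

```shell
#!/bin/sh
# Start a harmless background process so there is something to look at.
sleep 30 &
pid=$!

# Show that one process by its id.
ps -p "$pid"

# Show every process owned by the current user (numeric user id).
ps -u "$(id -u)"

# "kill -0" sends no signal; it only reports whether the process exists.
if kill -0 "$pid" 2>/dev/null; then
    state=running
else
    state=gone
fi
echo "process $pid is $state"

# Clean up the example process.
kill "$pid"
```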
Killing a job.
If you know a job has an error, you may want to remove it from the system rather than wait for it to complete. Once you have its process id from a "ps" command, you can use the command "kill -9 pid#". Or, if you have gotten a job # from the "jobs" command, you can kill that background job with the command "kill %jobnumber".
Note that you can only kill your own job. If you see a process that seems to have run amok, you could write "help" with the process id and your thought -- a system manager might well look into the situation and concur.
N.B. If you are running a statistical job with very large datasets
and wind up killing the job, it will be a courtesy to let a system
manager know in case there are large temporary files left to be dealt
with.
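The kill-by-process-id step described above can be tried safely on a throwaway process. In this sketch a long "sleep" stands in for a statistics job; the variable names are illustrative only.

```shell
#!/bin/sh
# Start a long-running stand-in for a statistics job.
sleep 300 &
pid=$!

# Remove it from the system with an unconditional kill.
kill -9 "$pid"

# "wait" reports how the process ended; 137 means 128 + signal 9.
status=0
wait "$pid" 2>/dev/null || status=$?
echo "exit status: $status"
```

In an interactive session the same job could instead be addressed by its job number from the "jobs" command, e.g. "kill %1".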
Specifics on Running the Basic Packages Available on ISR
Stata
As of October 22, 2002, Stata is available only on the server WJH2 -- this is due to a licensing restriction. Also new as of October 22, 2002, we are running Stata 7.0 as the default version, and we have Stata "special edition" available on WJH2 -- see note below.
Interactive: Still a new kid on the block, Stata is designed for interactive
use. You must have the following line in your .cshrc file to access Stata
interactively:
set path = ( $path /usr/local/stata )
Just type "stata" and you'll have your interactive session. Keep in mind
that you can return to a Unix shell to edit a .do file or execute other
Unix commands. From within Stata, type "!" on the command line. You'll have
your Unix shell; do what you like and then type "exit" to return to your
Stata session.
Background: Background processing is an extension of the .do facility in
Stata. You prepare your Stata program as "whatever.do". This must be
an ordinary text file, nothing fancy like an uploaded Word or WordPerfect
document. ".do" files can be executed from within an
interactive Stata session and can save a lot of retyping in
interactive work. However, they can also be run from the Unix
prompt with the "qstata" command described under "General to all
packages" above.
It may be good work style to have a big do file executing and be
setting up new statements in an interactive session. Be sensitive as
to system load on this -- as you see in monitoring jobs above, you
can tell how many stats processes are going and can govern your own
use accordingly.
Special Edition Stata Version 7.0 is called with the commands
SPSS
Interactive: While a limited interactive session via regular terminal
emulations is available with SPSS, this facility is not SPSS for
Windows with drop-down menus. It is a modified command line session
which could be useful for data exploration. If you are interested,
check the manual in 544/548 -- we'll have a few copies of the manual
relevant section available. There are commands to learn in order to
use this facility effectively, but if SPSS is your primary tool and
ISR your main access, you may want to get comfortable with this
mode.
Background: Program files for SPSSx background processing should be given a
file type of .sps for easy file management and to fit easily with PC
naming conventions. Note that the names are arbitrary and can be long and include
subdirectory paths. These program names are checked by "qspss" and
may either be the actual full name, or "pgmname.sps" or
"pgmname.spx". If more than one file in the directory fills
the bill, qspss will let you know and ask you to specify the file
completely.
Given the popularity of SPSS for Windows, we strongly recommend
that you adopt the custom of naming program files with a filetype of
.sps. These files downloaded to a PC on which SPSS is
installed (all our lab NT machines) will be immediately
recognized as an SPSS program. (Note that if you also get in the
habit of ending each command in your program file with a period, the
programs written on ISR2 will execute fine in a syntax window of SPSS
on PC.)
Temporary files: On our previous machine, queued SPSS automatically placed any
files which had no directory path in the directory /tmp. This is no
longer true. Now files with no path will be stored in your current
directory, i.e. wherever you are located when you issue the qspss
command. If you want to place files somewhere "temporary" so that you
needn't worry about erasing them, use the directory "/datatmp",
e.g.
xsave outfile='/datatmp/mytmp1.sav'
Files in "/datatmp" will be available until they are found to have
gone unused for 24 during the nightly system cleaning. The area
"/tmp" is used for a wide variety of system functions and users are
expressly requested not to write files there as they may interfere
with system activity and adversely affect the work of other
users.
Memory: The qspss facility provides no mechanism for increasing the memory
available to your job. Your default is high and so far the issue has
not arisen. If you get a failed job due to insufficient memory,
please check with Cheri Minton -- there may be a specification
problem we can work through. If you do have an unavoidably giant job,
we'll set up a way to increase memory for a specific job without
increasing the general default.
SAS
Interactive: There is no facility for using SAS interactively on ISR2. Check
with "manager@wjh" if SAS is your primary package and you want to
explore possibilities for interactive use on some public
machine.
Background: Program files for background use of SAS must have ".sas" as the
final 4 characters. The remainder of the name can be long, contain
dots or underscores or a directory path.
Storing and Accessing Data
ISR will provide several options for online data storage and
access. Because the access to the Viper disk is direct and very fast,
users are strongly encouraged to use this space for datasets they are
working with, rather than working on data in their home directories.
The increased speed in performance will be a benefit to all. All the
disk areas on the Viper begin with the directory
"/local/data".
Regardless of where you store your data, you should keep
your sets of programs stored in your home directory
area.
Depending on the size of your data and whether it is a public use
dataset or one you have developed yourself, you have choices as to
where you will keep it while you do your statistical runs.
1) Publicly readable storage areas -- /local/data/lib-a, /local/data/lib-b
On the previous system we kept some high use or complex datasets
online, e.g. General Social Survey and Panel Study of
Income Dynamics -- the former because of frequent use and the
latter because of the need for random access to household records in
the abstracting process. We will continue this policy and hope to
expand it in useful ways. Look for the subdirectories GSS, PSID,
ICPSR, PUMS and CPS in the public library area, along with the
familiar directory of information files, INF. Note that this area is
"read-only" for users. You won't be able to place datasets there
yourself.
With data readily downloadable from source locations like the
Harvard Data Center, we want to provide a common storage location for
public access data which will help users to avoid storing multiple
copies of the same dataset. To do this effectively we will need your
help. When you download public access data, let Cheri Minton know.
She can store your downloaded data in a location designed to make
duplication obvious. You can then draw extracts from this data, make
your variables and store your "value added" system files in more
private space. Only documented public data is kept in the publicly
readable area. If we run low on space, the datasets not used recently
will rotate offline.
2) Public read/write storage -- /local/data/public
This area is intended to hold datasets in active use. In general,
establish a directory here with your username and store datasets
within this area. Keep your program files in your home directory area
and from within them point to this directory, e.g.
file handle myfile /name='/local/data/public/crawford/income.sys'
Only you and the system manager will have the ability/right to
modify or delete the files you keep in this directory. Because this
is shared space, and finite, we ask you to remove files you do not
expect to use within the next six months. If space gets tight,
impeding the work of current users, we will scan the disk for the
following:
1) files unused in 3 months and not in compressed format. We will
compress them. Files are compressed with the command "gzip filename"
and once compressed, the original filename is terminated with ".gz".
The process is reversed with the command "gunzip filename.gz".
2) files unused in 6 months, in a user directory whose total space
use exceeds 100 megabytes. We will remove the files to a backup area,
leaving a readme in the directory as to where the files have been
moved and method for restoring or requesting restoration.
If your work involves large files, so that your total space need
exceeds 100 megabytes, try "gzipping" the files in this area. You can
then expand them into a rotating area with the "zcat" command -- see
details under "File Compression" below. If compressing files still
leaves you needing more than 100 megabytes of online storage, check
with staff about setting up a larger block of storage on a different
disk partition. (See /local/data/research below).
Monitoring space in a partition
If you are working with large datasets, you will need to become
comfortable assessing your use of space and space available on the
disk area you plan to be using. There are two commands you may use to
monitor space in a particular area of interest. The first will check
on your own use of a particular area. Sitting at the top of the area
you want to check -- e.g. your home directory, or your subdirectory
on /local/data/public, type
The second command is useful for learning about space remaining on
a physical device -- i.e. is /local/data/public in danger of filling
altogether? Position yourself on the device of interest, e.g.
cd /local/data/public
and then issue the command:
NB If you do notice that any particular disk area is approaching
100% capacity, take a moment to write "manager@wjh" so that system
staff can deal with the situation as soon as possible. A disk area
filling is taken quite seriously and your alert will be
appreciated.
3) Rotating Storage, for big datasets in immediate use -- /local/data/rotate/day7
Public read/write areas can get crowded so that there is not space
for really large additions. For this reason we provide a large area
in which data not being used after a specified length of time is
automatically removed. This keeps large blocks of space available for
current work. If you have large datasets, consider storing them in
zipped or compressed format in a secure spot and then expanding them
into a rotating area when you are going to be actively using
them.
File compression and expansion:
Note that there are choices about compression utilities -- gzip
stores the files with original names and ".gz" appended. There also
are Unix versions of "zip" and "unzip", which read zip archives
created on PC and write zip archives which can be read on PC. For
storing collections of files in compressed format, zip may be a good
option. Selective unzipping of files in the archive could give you
access to the data when you are ready to use it in statistical runs.
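The compress-then-expand routine recommended in this section can be sketched as follows. The directory and file names are placeholders created under a scratch directory for the example, not the real /local/data areas, and "gzip -dc" is used as an assumed portable equivalent of "zcat".

```shell
#!/bin/sh
# Scratch directories standing in for a secure spot and a rotating area.
work=$(mktemp -d)
mkdir "$work/secure" "$work/rotate"

# A tiny placeholder dataset.
printf 'id,income\n1,100\n' > "$work/secure/income.dat"

# Compress in place: income.dat is replaced by income.dat.gz.
gzip "$work/secure/income.dat"

# Expand a working copy into the rotating area; the .gz copy stays put.
gzip -dc "$work/secure/income.dat.gz" > "$work/rotate/income.dat"

ls "$work/secure" "$work/rotate"
```

Because the compressed copy is never removed, a working copy that rotates off the disk can be recreated by repeating the last expansion step.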
(A caveat here -- files zipped on Unix and then unzipped on a PC will
still have Unix end of line formatting, so regular text files may not
transfer as smoothly as you would like).
/local/data/rotate actually has two major directories which will
share the same physical partition, that is, the size and number of
files in one directory impact directly on space available in the
other. The difference in the two directories is in the amount of time
which is allowed to elapse before an unused file rotates off the
partition. Note that neither of these areas should be considered
"storage". The area will be backed up so that your data are
protected against disk failure, but you are responsible for keeping a
secure copy of your data, either through a compressed version or
through programs which can recreate the dataset. System staff should
not be expected to restore data which has rotated off the disk due to
running out of time since last access. Either through "unzipping" or
running a program, you should have an easy path to restoring any data
which has run out of time in a rotating area.
The recommended style for using /local/data/rotate will be to have
a compressed or zipped copy of your source data in a secure location
and then to expand the data for use with a stat pack. Secure spots
might include:
4) /local/data/research -- Large data areas for private datasets, held for a limited time
New to ISR is the idea of providing users with unusually large
amounts of data a specific area for storing this data for the crunch
period of the project. If you find a combination of compressing data
in /local/data/public and available space in /local/data/rotate is
not meeting your needs, talk with system personnel, either Cheri
Minton or Tom Raich. If there is space available you may be assigned
a directory in /local/data/research. The directory would be private
but the general area will be shared among several users, and your
usage monitored so that your work is not impacting on others. You
will be expected to give an expected date for releasing the space you
are using. This is a new arrangement and we hope it will be helpful
to the big dataset users. As we gain experience, we'll be working on
ways to make its use effective and fair.
Local Access Programs
The transition to our new platform is still in progress. Basic
local extracting utilities, i.e. abstr, habstr, nlabstr and spxabs
have been recompiled. The PSID (Panel Study of Income Dynamics)
extracting facility has been brought over and recompiled, but newer
waves have not been added. This is a significant investment for a
dataset currently in very low use. If you need to make an extract
from the PSID, check with Cheri Minton. You'll likely be best off
with a combination of local facilities and the PSID web system. Check also a listing of the locally
developed programs to see what might fit your needs. Development
is ongoing, so if you see a program that is close, but not what you
need, contact soc-help -- we may be able to adjust or expand the facility for you.
A comparison table from Stata Corp:
+-------------------------------------------------------------------+
|           | -- Intercooled Stata --   | ------- Stata/SE ------   |
| Parameter | Default     min     max   | Default     min     max   |
|-----------+---------------------------+---------------------------|
| maxvar    |   2,047   2,047   2,047   |   5,000   2,047  32,766   |
| matsize   |      40      10     800   |     400      10  11,000   |
| memory    |      1M    500K     ...   |     10M    500K     ...   |
|           |                           |                           |
| str#      |       .       1      80   |       .       1     244   |
+-------------------------------------------------------------------+
NB Because SE Stata automatically uses more memory, it will be easier on the system and your fellow users if you only use it when your data will require it.
NB There are some compatibility wrinkles to using Stata SE, so be certain to read the help file carefully as you get into it -- "help SpecialEdition"