Using WJH for Statistical Processing

The twin Sun Ultra 10 servers we are using for general logins and statistical processing are addressed as "wjh.harvard.edu".

The new systems have 512 megabytes of RAM and when fully configured will be running two processors. The method for running statistical packages is somewhat new: check out general information and also specific sections on Stata, SPSS and SAS.

Disk space is provided through a RAID 5 disk array called a Viper. The Viper provides 20GB of storage space to be used both for common datasets and for private storage of datasets for statistical work. This very fast drive is mounted directly on ISR2 and provides the most efficient access to data for statistical processing. Space for storing data online has been markedly increased, and you will find some new directory names and guidelines for use. Read "Storing and Accessing Data" below.

We will continue to provide some locally developed facilities for extracting data: check out "Local Access Programs" below. And welcome to a faster and more spacious environment!

Running the Statistical Programs

General to all packages:

While we have a fast machine with ample memory, it may still be possible for many users executing large statistics jobs simultaneously to run the machine ragged "sharing time" rather than processing. We ask your cooperation in observing the following guideline -- whether your process is interactive or background, please confine your personal use of a statistical package to one process at a time. This means you will not submit five jobs to SPSS simultaneously, log out and go home. However, you are welcome to use the queuing facility described below to submit those five jobs in a single command to run sequentially, and then log out and go home. Benchmarks on the new system have shown a significant speed increase, so more often you will find the machine keeping pace with your requests.

On our prior system, a locally written utility intervened when statistical packages were called and placed these jobs in system-wide grocery-store-style queues -- we had an express "check" line, a regular line and a big-job line. There was a definite limit to the number of stat jobs that could run simultaneously. Behind this queuing were a need to protect the machine from overloading and, in some cases, licensing issues with the stat packs.

The queuing mechanism for our new machine is much simpler: licensing is no longer a factor and our increased capability lessens the risk of overloading this machine. For each package, you will have a q____ command (qstata, qspss, qsas) for submitting one or more background jobs in sequence -- see specific package writeups for detail. Some general examples will give the flavor:

qspss myjob &

Suppose you have four jobs ready to go at once (a Stata example this time)

qstata job1 job2 whatever whatelse &

 

Note: "qstata job1 job2 whatever whatelse &" is entirely different from four separate commands issued one immediately after the other: "qstata job1 &", "qstata job2 &", "qstata whatever &", "qstata whatelse &". The first method runs "job1.do" to completion, then "job2.do" and so on; it follows the one statistical process per user guideline. It also means that any dependency job2 might have on job1's actions would be okay -- provided job1.do ran correctly. Submitting these four background jobs separately would place them in the system nearly simultaneously. There would be no guarantee that job1 would complete before job2 started and, more important from other users' perspective, one user would have 4 jobs in the system at once.

If need be we will monitor and prevent this type of use through local utilities, but we will first see if a simple "one user: one process" guideline can work. That way we will have no "monitoring process" using system resources and users can employ the system differentially depending on load -- i.e. are there ten users running Stata at 3 p.m., or one user alone on the system at 4 a.m.?
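
To make the difference between sequential and simultaneous submission concrete, here is a minimal sketch in plain sh. It is not the code of qstata, qspss or qsas; "run_one_job" and "run_in_sequence" are hypothetical names, and the stand-in job only prints messages instead of calling a statistical package.

```shell
#!/bin/sh
# Hypothetical stand-in for one statistical job. A real q____ command
# would start the package here and wait for it to finish.
run_one_job() {
    echo "start $1"
    echo "done $1"
}

# Run each named job to completion before starting the next one,
# so that only one statistical process is active at any time.
run_in_sequence() {
    for job in "$@"; do
        run_one_job "$job"
    done
}

# One command, four jobs, strictly one after the other.
run_in_sequence job1 job2 whatever whatelse
```

By contrast, starting each job separately with a trailing "&" would launch all four at nearly the same moment, with no ordering between them.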

Monitoring Jobs

1) Your Own: you may often find jobs running so quickly that there is no need to monitor, but here's how, just in case.

2) General: We have a new flavor of Unix and new arguments for the "ps" command, so don't be surprised if you have difficulty initially -- type "man ps" for detail. Meanwhile, you may find the following two aliases useful to add to your .cshrc file for job and usage monitoring:

The odd string "\!$" allows you to put an argument in your request to the alias; it picks up the last term in your request as the argument, so in the examples above:

Play around with them until you see what combination gets you the most useful information. Do be careful anytime you edit your ".cshrc" file -- a mistake here can cause your login sessions to behave peculiarly. Ask for help the first time or two if you haven't created new aliases before.

Killing a job.

If you know a job has an error, you may want to remove it from the system rather than wait for it to complete. Once you have its process id from a "ps" command, you can use the command "kill -9 pid#". Or, if you have gotten a job # from the "jobs" command, you can kill that background job with the command "kill %jobnumber".

Note that you can only kill your own job. If you see a process that seems to have run amok, you could write "help" with the process id and your thought -- a system manager might well look into the situation and concur.

N.B. If you are running a statistical job with very large datasets and wind up killing the job, it will be a courtesy to let a system manager know in case there are large temporary files left to be dealt with.
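
The kill-by-process-id procedure can be tried safely on a harmless process. The sketch below is only an illustration under some assumptions: "sleep 300" stands in for a statistical job, and it tries a plain "kill" before resorting to "kill -9", which is a common precaution rather than something this page requires.

```shell
#!/bin/sh
# Start a harmless stand-in for a long statistical job in the background.
sleep 300 &
pid=$!                          # process id of the job just started

# Ask the process to stop. Plain "kill" sends the default termination signal.
kill "$pid"
wait "$pid" 2>/dev/null || true # collect the exit status of the stopped job

# "kill -0" sends no signal; it only tests whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid"              # last resort, as described above
    echo "process $pid needed kill -9"
else
    echo "process $pid is gone"
fi
```

In an interactive session you would normally get the process id from "ps" or the job number from "jobs" rather than from "$!".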

Specifics on Running the Basic Packages Available on ISR

1) Stata

As of October 22, 2002, Stata is available only on the server WJH2 -- this is due to a licensing restriction. Also new as of October 22, 2002, we are running Stata 7.0 as the default version, and we have Stata "special edition" available on WJH2 -- see note below.

Interactive:

Still a new kid on the block, Stata is designed for interactive use. You must have the line

set path = ( $path /usr/local/stata )

in your .cshrc file to access Stata interactively. Just type "stata" and you'll have your interactive session.

Keep in mind that you can return to a Unix shell to edit a .do file or execute other Unix commands. From within Stata, type ! on the command line. You'll have your Unix shell; do what you like and then type "exit" to return to your Stata session.

Background:

Background processing is an extension of the .do facility in Stata. You prepare your Stata program as "whatever.do". This must be an ordinary text file, nothing fancy like an uploaded Word or WordPerfect document. ".do" files can be executed from within an interactive Stata session and can save a lot of retyping in interactive work. However, they can also be run from the Unix prompt:

  • qstata myjob & goes looking for "myjob.do"
    • runs it and stores results in "myjob.log"
  • qstata myjob1 hisjob2 &
    • runs "myjob1.do" and "hisjob2.do" sequentially and stores the output in correspondingly named .log files

It may be good work style to have a big do file executing and be setting up new statements in an interactive session. Be sensitive as to system load on this -- as you see in monitoring jobs above, you can tell how many stats processes are going and can govern your own use accordingly.

Special Edition Stata Version 7.0 is called with the commands

  • "stata-se" for interactive Stata
  • "qstata-se" for batch mode Stata

A comparison table from Stata Corp:
     +-------------------------------------------------------------------+
     |           |  -- Intercooled Stata --  |  ------- Stata/SE ------  |
     | Parameter |  Default     min     max  |  Default     min     max  |
     |-----------+---------------------------+---------------------------|
     | maxvar    |    2,047   2,047   2,047  |    5,000   2,047  32,766  |
     | matsize   |       40      10     800  |      400      10  11,000  |
     | memory    |       1M    500K     ...  |      10M    500K     ...  |
     |           |                           |                           |
     | str#      |        .       1      80  |        .       1     244  |
     +-------------------------------------------------------------------+


NB Because SE Stata automatically uses more memory, it will be easier on the system and your fellow users if you only use it when your data will require it.
NB There are some compatibility wrinkles to using Stata SE, so be certain to read the help file carefully as you get into it -- "help SpecialEdition"

2) SPSSx

Interactive:

While a limited interactive session via regular terminal emulations is available with SPSS, this facility is not SPSS for Windows with drop-down menus. It is a modified command line session which could be useful for data exploration. If you are interested, check the manual in 544/548 -- we'll have a few copies of the relevant manual section available. There are commands to learn in order to use this facility effectively, but if SPSS is your primary tool and ISR your main access, you may want to get comfortable with this mode.

Background:

Program files for SPSSx background processing should be given a file type of .sps for easy file management and to fit easily with PC naming conventions.

  • qspss pgmname &
  • qspss pgm1 pgm2 whatever &

Note that the names are arbitrary and can be long and include subdirectory paths. These program names are checked by "qspss" and may either be the actual full name, or "pgmname.sps" or "pgmname.spx". If more than one file in the directory fills the bill, qspss will let you know and ask you to specify the file completely.
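
The name lookup just described can be pictured with a short sh sketch. This is not the actual qspss code; "resolve_program" is a hypothetical function that only mirrors the rule stated above (try the exact name, then ".sps", then ".spx", and complain if more than one file matches).

```shell
#!/bin/sh
# Hypothetical sketch of the lookup rule; not the real qspss implementation.
# Prints the single matching file, or reports an error on stderr.
resolve_program() {
    name=$1
    matches=""
    count=0
    for candidate in "$name" "$name.sps" "$name.spx"; do
        if [ -f "$candidate" ]; then
            matches="$matches $candidate"
            count=$((count + 1))
        fi
    done
    case $count in
        0) echo "no program file found for '$name'" >&2; return 1 ;;
        1) echo "${matches# }" ;;
        *) echo "ambiguous name '$name':$matches" >&2; return 2 ;;
    esac
}

# Try it in a scratch directory with made-up program files.
work=$(mktemp -d)
touch "$work/pgm1.sps" "$work/pgm2.sps" "$work/pgm2.spx"
resolve_program "$work/pgm1"                      # one match: prints .../pgm1.sps
resolve_program "$work/pgm2" || echo "status $?"  # two matches: reports ambiguity
```

The second call shows why it pays to keep a single file type per program name in a directory.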

Given the popularity of SPSS for Windows, we strongly recommend that you adopt the custom of naming program files with a filetype of .sps. These files downloaded to a PC on which SPSS is installed (all our lab NT machines) will be immediately recognized as an SPSS program. (Note that if you also get in the habit of ending each command in your program file with a period, the programs written on ISR2 will execute fine in a syntax window of SPSS on PC.)

Temporary files:

On our previous machine, queued SPSS automatically placed any files which had no directory path in the directory /tmp. This is no longer true. Now files with no path will be stored in your current directory, i.e. wherever you are located when you issue the qspss command. If you want to place files somewhere "temporary" so that you needn't worry about erasing them, use the directory "/datatmp", e.g.

xsave outfile='/datatmp/mytmp1.sav'

Files in "/datatmp" will be available until they are found to have gone unused for 24 hours during the nightly system cleaning. The area "/tmp" is used for a wide variety of system functions and users are expressly requested not to write files there as they may interfere with system activity and adversely affect the work of other users.

Memory:

The qspss facility provides no mechanism for increasing the memory available to your job. Your default is high and so far the issue has not arisen. If you get a failed job due to insufficient memory, please check with Cheri Minton -- there may be a specification problem we can work through. If you do have an unavoidably giant job, we'll set up a way to increase memory for a specific job without increasing the general default.

3) SAS

Interactive

There is no facility for using SAS interactively on ISR2. Check with "manager@wjh" if SAS is your primary package and you want to explore possibilities for interactive use on some public machine.

Background

Program files for background use of SAS must have ".sas" as the final 4 characters. The remainder of the name can be long, contain dots or underscores or a directory path.

  • qsas job1 &
    • will execute "job1.sas" and create "job1.log" and "job1.lst"
  • qsas job1 job2 whatever &
    • will execute "job1.sas", "job2.sas" and "whatever.sas" sequentially, producing correspondingly named .log and .lst files

Data Access and Storage on ISR

ISR will provide several options for online data storage and access. Because the access to the Viper disk is direct and very fast, users are strongly encouraged to use this space for datasets they are working with, rather than working on data in their home directories. The increased speed in performance will be a benefit to all. All the disk areas on the Viper begin with the directory "/local/data". Regardless of where you store your data, you should keep your sets of programs stored in your home directory area.

Depending on the size of your data and whether it is a public use dataset or one you have developed yourself, you have choices as to where you will keep it while you do your statistical runs.

1) Publicly readable storage areas -- /local/data/lib-a, /local/data/lib-b

On the previous system we kept some high use or complex datasets online, e.g. General Social Survey and Panel Study of Income Dynamics -- the former because of frequent use and the latter because of the need for random access to household records in the abstracting process. We will continue this policy and hope to expand it in useful ways. Look for the subdirectories GSS, PSID, ICPSR, PUMS and CPS in the public library area, along with the familiar directory of information files, INF. Note that this area is "read-only" for users. You won't be able to place datasets there yourself.

With data readily downloadable from source locations like the Harvard Data Center, we want to provide a common storage location for public access data which will help users to avoid storing multiple copies of the same dataset. To do this effectively we will need your help. When you download public access data, let Cheri Minton know. She can store your downloaded data in a location designed to make duplication obvious. You can then draw extracts from this data, make your variables and store your "value added" system files in more private space. Only documented public data is kept in the publicly readable area. If we run low on space, the datasets not used recently will rotate offline.

2) Public read/write storage -- /local/data/public

This area is intended to hold datasets in active use. In general, establish a directory here with your username and store datasets within this area. Keep your program files in your home directory area and from within them point to this directory, e.g.

File handle myfile/ name='/local/data/public/crawford/income.sys'

Only you and the system manager will have the ability/right to modify or delete the files you keep in this directory. Because this is shared space, and finite, we ask you to remove files you do not expect to use within the next six months. If space gets tight, impeding the work of current users, we will scan the disk for the following:

1) files unused in 3 months and not in compressed format. We will compress them. Files are compressed with the command "gzip filename" and once compressed, the original filename is terminated with ".gz". The process is reversed with the command "gunzip filename.gz".

2) files unused in 6 months, in a user directory whose total space use exceeds 100 megabytes. We will remove the files to a backup area, leaving a readme in the directory as to where the files have been moved and method for restoring or requesting restoration.

If your work involves large files, so that your total space need exceeds 100 megabytes, try "gzipping" the files in this area. You can then expand them into a rotating area with the "zcat" command -- see details under "File Compression" below. If compressing files still leaves you needing more than 100 megabytes of online storage, check with staff about setting up a larger block of storage on a different disk partition. (See /local/data/research below).

Monitoring space in a partition

If you are working with large datasets, you will need to become comfortable assessing your use of space and the space available on the disk area you plan to use. There are two commands you may use to monitor space in a particular area of interest. The first will check on your own use of a particular area. Sitting at the top of the area you want to check -- e.g. your home directory, or your subdirectory on /local/data/public -- type

du .
This will outline the directory you are in plus any subdirectories, showing storage in multiples of 1024. (You can only do this in areas that are your own.)
du -s
skips the subdirectory detail and simply shows total space use in the current area.

The second command is useful for learning about space remaining on a physical device -- i.e. is /local/data/public in danger of filling altogether? Position yourself on the device of interest, e.g. cd /local/data/public and then issue the command:

df .
This will show space available and total space on the current physical partition. It will be particularly useful in planning runs involving the rotating space described next.
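
If you check space often, the two commands can be combined in a small sh function. This is a sketch, not a local utility: "space_report" is a hypothetical name, and the "-k" and "-P" options (kilobyte units and the portable one-line-per-filesystem layout) are assumptions about the du and df installed on your system.

```shell
#!/bin/sh
# Report space used under a directory and space left on its partition.
# $1 = directory to check (defaults to the current directory).
space_report() {
    dir=${1:-.}
    # du -s: total only; -k: report in units of 1024 bytes.
    used_kb=$(du -sk "$dir" | awk '{print $1}')
    # df -P: portable layout, one line per filesystem; field 4 is space available.
    free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    echo "used_kb=$used_kb free_kb=$free_kb"
}

space_report .
```

Comparing the two numbers before a big run tells you whether the expanded dataset is likely to fit.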

NB If you do notice that any particular disk area is approaching 100% capacity, take a moment to write "manager@wjh" so that system staff can deal with the situation as soon as possible. A disk area filling is taken quite seriously and your alert will be appreciated.

3) Rotating Storage, for big datasets in immediate use -- /local/data/rotate/day7

Public read/write areas can get crowded so that there is not space for really large additions. For this reason we provide a large area in which data not being used after a specified length of time is automatically removed. This keeps large blocks of space available for current work. If you have large datasets, consider storing them in zipped or compressed format in a secure spot and then expanding them into a rotating area when you are going to be actively using them.

File compression and expansion:

  • To compress a file:
    • gzip filename[s]
    • files compressed in this fashion are stored individually with their original names with a .gz appended
  • To expand a "gzipped" file in place:
    • gunzip filename[s]
  • To expand a "gzipped" file in a different location, leaving the "gzipped" file intact:
    • zcat filename.gz > /local/data/rotate/day7/filename
    • (Don't forget the > -- the same command with the > left out will kindly expand your file to your terminal screen!)
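
The compress-then-expand routine can be rehearsed in a scratch directory before you try it on real data. In this sketch the directories "secure" and "rotate" are made-up stand-ins for a secure spot and /local/data/rotate/day7, and "income.dat" is an invented file.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is touched.
work=$(mktemp -d)
mkdir "$work/secure" "$work/rotate"

# A tiny made-up dataset, plus a copy kept only to verify the round trip.
printf 'id,income\n1,100\n2,250\n' > "$work/secure/income.dat"
cp "$work/secure/income.dat" "$work/original.dat"

# Compress in place: leaves income.dat.gz and removes income.dat.
gzip "$work/secure/income.dat"

# Expand into the "rotating" area; the .gz file stays intact.
zcat "$work/secure/income.dat.gz" > "$work/rotate/income.dat"

# Confirm the expanded file is identical to the original.
cmp "$work/original.dat" "$work/rotate/income.dat" && echo "round trip OK"

# When finished, remove the scratch directory with: rm -rf "$work"
```

The compressed copy in the secure spot is what lets you recreate the working file after it rotates off.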

 

Note that there are choices about compression utilities -- gzip stores the files with original names and ".gz" appended. There also are Unix versions of "zip" and "unzip", which read zip archives created on PC and write zip archives which can be read on PC. For storing collections of files in compressed format, zip may be a good option. Selective unzipping of files in the archive could give you access to the data when you are ready to use it in statistical runs. (A caveat here -- files zipped on Unix and then unzipped on a PC will still have Unix end of line formatting, so regular text files may not transfer as smoothly as you would like).

/local/data/rotate actually has two major directories which will share the same physical partition, that is, the size and number of files in one directory impact directly on space available in the other. The difference in the two directories is in the amount of time which is allowed to elapse before an unused file rotates off the partition.

  • /local/data/rotate/day7 -- an expanded version of the old /databases/sevenday. Files are removed if they have not been accessed for seven days.
    • (This area replaces and expands /databases/sevenday and you can still reference it by that name if you prefer.)
  • /local/data/rotate/day1 -- files are removed if they have not been accessed in 24 hours. The check is run around midnight, so tenancy might approach 2 days of non-use. This area is also usable via a symbolic link as "/datatmp". It is intended for datasets you create as intermediate, temporary files and that you would like to "go away" without your having to track and delete them. This area would also be a candidate for really outrageous space demands -- fellow users would be reassured that the data could not sit fallow for very long in this area.

Note that neither of these areas should be considered "storage". The area will be backed up so that your data are protected against disk failure, but you are responsible for keeping a secure copy of your data, either through a compressed version or through programs which can recreate the dataset. System staff should not be expected to restore data which has rotated off the disk due to running out of time since last access. Either through "unzipping" or running a program, you should have an easy path to restoring any data which has run out of time in a rotating area.
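
To see which of your files are close to rotating off, you can ask "find" for files by last access time. The sketch below is an assumption about how such a check could be written, not the nightly cleaning script itself; "list_stale" is a hypothetical name, and the demonstration files are created in a scratch directory.

```shell
#!/bin/sh
# List regular files under a directory whose last access is more than N days old.
# $1 = directory, $2 = age in days.
list_stale() {
    find "$1" -type f -atime +"$2" -print
}

# Demonstration with made-up files in a scratch directory.
work=$(mktemp -d)
touch "$work/fresh.dat"
touch "$work/old.dat"
# Push the access time of old.dat far into the past (-a: access time only).
touch -a -t 200001010000 "$work/old.dat"

# Only old.dat has gone unaccessed for more than seven days.
list_stale "$work" 7
```

Pointing the same function at your own subdirectory of a rotating area gives an early warning before files disappear.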

The recommended style for using /local/data/rotate will be to have a compressed or zipped copy of your source data in a secure location and then to expand the data for use with a stat pack. Secure spots might include:

  1. your home directory as long as your total storage is less than 25 megabytes.
  2. a publicly readable area, e.g. "/local/data/lib-a", where the source data are stored, e.g. a zip file of the ICPSR dataset you are using. Note that you will not be able to write files into this directory yourself -- check with Cheri Minton to have your zipped dataset stored here.
  3. /local/data/public -- good for six months of storage of up to 100 megabytes of data. This limit is flexible, but will be enforced if space gets tight and current work is being impeded.
  4. a separate, dedicated directory in /local/data/research -- see section below
  5. your own hard drive, especially if you have a high speed connection for uploading
  6. a zip disk -- 100 megabytes capacity and lots of zip drives available for uploading.
  7. your own CD of data for a project -- 660 megabytes of storage and a newly available CD burner in the WJHCS facility on 13.

4) /local/data/research -- Large data areas for private datasets, held for a limited time

New to ISR is the idea of providing users who have unusually large amounts of data with a specific area for storing this data during the crunch period of a project. If you find that a combination of compressing data in /local/data/public and using available space in /local/data/rotate is not meeting your needs, talk with system personnel, either Cheri Minton or Tom Raich. If there is space available you may be assigned a directory in /local/data/research. The directory would be private but the general area will be shared among several users, and your usage will be monitored so that your work does not affect others. You will be asked to give an expected date for releasing the space you are using. This is a new arrangement and we hope it will be helpful to the big dataset users. As we gain experience, we'll be working on ways to make its use effective and fair.

Local Access Programs

The transition to our new platform is still in progress. Basic local extracting utilities, i.e. abstr, habstr, nlabstr and spxabs, have been recompiled. The PSID (Panel Study of Income Dynamics) extracting facility has been brought over and recompiled, but newer waves have not been added. Adding them would be a significant investment for a dataset currently in very low use. If you need to make an extract from the PSID, check with Cheri Minton. You'll likely be best off with a combination of local facilities and the PSID web system.

Check also a listing of the locally developed programs to see what might fit your needs. Development is ongoing, so if you see a program that is close, but not what you need, contact soc-help -- we may be able to adjust or expand the facility for you.

  

 


Comments or questions? Write soc-help! or phone 5-4751 or drop by 544/548.