The General Inquirer User's Guide
Vanja Buvac and Philip Stone
(buvac,pjstone)@fas.harvard.edu
December 1, 2001
Abstract:
The General Inquirer is a content analysis program described in [3]
and [1]. This document explains how to use the Java language
implementation of the General Inquirer (version j1.0). Familiarity
with the web version of the General Inquirer
(http://inquirer.wjh.harvard.edu/GI) is a prerequisite to this
document.
The General Inquirer is distributed as a zip archive. The contents of
the archive, together with links to tools for unpacking it
on various platforms, are described in Appendix A. Uncompress
the archive in a directory of your choice.
The current version of the General Inquirer is written in Java, which
makes it usable on any operating system with a Java Virtual Machine
(JVM). If a JVM is not pre-installed on your computer,
instructions on obtaining and
installing one are available at Sun's website (http://www.java.sun.com). The General Inquirer has been used
successfully on the following operating systems: Linux, Windows9x, MacOs,
Solaris, MkLinux, and a wide variety of other Unix-based systems.
Starting the program from the command prompt
Once you have successfully installed the JVM on your computer, you
need to start the giGui class. If a command prompt is available
on your operating system you can start this class with the following
command.
java -mx128m giGui
The option -mx128m sets the maximum amount of memory that the JVM may
allocate to the General Inquirer to 128 megabytes. Depending on the size
of your text files and your computer's memory you might want to change
this number.
On Windows the General Inquirer can be started by executing the
batch file clickme.bat. In case the JVM is not pre-installed you
will need to follow the instructions in the previous subsection
(1.1).
On the MacOs the General Inquirer can be started by executing the
batch file clickmemac. In case the JVM is not pre-installed or
you get some other error please consult Apple's web site for alternative
ways of starting the program (http://developer.apple.com/java/).
Once you have successfully started the program you will see a window
similar to the one shown in Figure 1 displayed on
your screen.
Figure 1:
The General Inquirer main window.
To test the program, simply click on the Run button without
changing any arguments. This will execute the General Inquirer on the
directory testdir, which is included in the distribution. The
results will be printed to the standard output, which is most probably
your command prompt or shell window, and the dictionary used will be
inquirerbasicttabsclean. First you will have to wait for
the dictionary to load. The status bar will show the message
``Loading dictionary ... please wait.'' In a minute or two, once the
dictionary is loaded, the files in the directory will be processed one
by one and the output will probably race by your command prompt window.
The status bar will tell you what the General Inquirer is doing at
every point.
The basic function of the General Inquirer is to generate a count of
words falling into a given semantic category. There are a few options
that enable you to tailor this process to your particular needs.
All the options are available through the main window. Here is an
explanation of all the items on the General Inquirer's control window.
- Input
- is the name of the text file that you want content
analyzed. If the name refers to a directory, all the files in the
directory are analyzed. The output in this case is a matrix where
files are represented by row entries. If the name refers to a single
file then only one file is analyzed, and the output is a single row.
The input files should be in plain text format; no special formatting
is necessary.
- Output
- is the name of the file where you want the output to
go. If this field is left blank, the output goes to standard output
which is probably the command prompt.
- Dictionary
- is the name of the file containing the dictionary.
If you are not interested in changing the tags and entries in the
dictionary, you won't need to change this text field.
- Browse
- The browse buttons located to the right of the text
areas let you browse through your files to choose the one you want.
- Tags
- If this option is selected the output of the analysis
will be a summary of the tags assigned to the file(s) processed.
- Words
- With this option you can get a count of the words
appearing in the processed file(s). In this output format the rows in
the output matrix contain the words, and the columns contain the
file(s). In other words, the output matrix is a ``transposed''
version of the tags matrix.
- Parse Filenames
- When processing multiple files it might be
useful to encode some variable information directly into the
filename. Selecting this option will parse the filenames and
include the variable names it finds in the output matrix. Details of
how the filenames are parsed are covered in section 2.5.
- Run
- This button is used to start executing the General Inquirer.
- Stop
- This button halts the execution.
- Quit
- This button exits the General Inquirer.
- Status
- This area is used to tell you what the current status
of the General Inquirer is. Specifically, it will tell you when it
is loading the dictionary, processing a given file, generating output,
or ready for the next task.
When the tag option is selected the General Inquirer outputs a matrix
containing the counts and percentages of words tagged with a given
semantic category. Table 1 shows the results from the
texts in the testdir directory with the filenames parsed for
variables. The first four columns of the matrix contain the variables
extracted from the filenames of the processed files. The fifth, format,
column contains the ``format'' of the numbers contained in the given
row. Each file is represented by two rows. The first is the raw count output, which
is simply the count of words in a given semantic category. This format is
labeled with an r in the format column. The second is a scaled count,
which represents the percentage of words appearing in a given category. This
number is computed as
$\text{scaled count} = 100 \times \text{raw count} / \text{wordcount}$.
This format is labeled with an s in the format column.
The sixth, wordcount, column shows the total number of words
present in the given document. The seventh, leftovers, column
shows the number of words that were not found in the dictionary.
The remaining columns contain the counts (or percentages) of words in
a given tag. For a full description of the tags used in the current
dictionary see http://www.wjh.harvard.edu/~inquirer.
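The scaled rows in Table 1 are consistent with dividing the raw count by the wordcount and multiplying by 100. The following Python sketch reproduces that arithmetic; the function name scaled_count is ours and is not part of the General Inquirer.

```python
def scaled_count(raw_count, wordcount):
    """Return a raw count as a percentage of the document's word count."""
    if wordcount <= 0:
        raise ValueError("wordcount must be positive")
    return 100.0 * raw_count / wordcount


# Values taken from the first file in Table 1 (wordcount 1239).
leftovers_pct = scaled_count(52, 1239)   # Table 1 lists 4.196933
positiv_pct = scaled_count(96, 1239)     # Table 1 lists 7.748184
```

The results agree with the s rows of the table to the precision printed there.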
Table 1:
Tag output matrix for the files in testdir.
| file1 | file2 | file3 | file4 | format | wordcount | leftovers | Positiv | ... |
| gore | speech | announce | 1 | r | 1239 | 52 | 96 | ... |
| gore | speech | announce | 1 | s | 1239 | 4.196933 | 7.748184 | ... |
| gore | speech | announce | 2 | r | 1505 | 41 | 112 | ... |
| gore | speech | announce | 2 | s | 1505 | 2.7242525 | 7.4418607 | ... |
| bush | speech | announce | 1 | r | 2036 | 53 | 187 | ... |
| bush | speech | announce | 1 | s | 2036 | 2.6031435 | 9.184676 | ... |
| bush | speech | announce | 2 | r | 1048 | 13 | 92 | ... |
| bush | speech | announce | 2 | s | 1048 | 1.240458 | 8.7786255 | ... |
| gore | speech | education | 1 | r | 4009 | 149 | 316 | ... |
| gore | speech | education | 1 | s | 4009 | 3.7166376 | 7.882265 | ... |
| gore | speech | education | 2 | r | 1689 | 63 | 166 | ... |
| gore | speech | education | 2 | s | 1689 | 3.7300177 | 9.8283 | ... |
When the word option is selected the General Inquirer outputs a matrix
containing the counts of the words contained in a given document.
Table 2 shows the word output matrix for the texts
found in the testdir directory. In contrast to the tag output
format, the columns of the matrix represent the files, and the rows
contain the given words. In this example the parse filenames option
was selected, so the variables extracted from the filenames are found
in the first four rows. The fifth row, TOTAL, contains the total
count of words in the document. The following rows contain the
word counts for the words in the dictionary. Once this list is
exhausted, a count of the words not in the dictionary (leftover words)
is displayed.
Table 2:
Word output matrix for the files in testdir.
| word | gore | gore | bush | bush | gore |
| | speech | speech | speech | speech | speech |
| | announce | announce | announce | announce | education |
| | 1 | 2 | 1 | 2 | 1 |
| TOTAL | 1239 | 1505 | 2036 | 1048 | 4009 |
| A | 22 | 39 | 55 | 28 | 82 |
| ABILITY | 0 | 1 | 0 | 0 | 1 |
| ABLE | 0 | 0 | 0 | 1 | 1 |
| ABOUT | 3 | 3 | 3 | 2 | 9 |
| ABOUT#1 | 3 | 3 | 3 | 2 | 9 |
| ABOUT#2 | 0 | 0 | 0 | 0 | 0 |
| ABUNDANT | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| appliance | 0 | 0 | 0 | 0 | 0 |
| up-to-the-minute | 0 | 0 | 0 | 0 | 0 |
| $60 | 0 | 0 | 0 | 0 | 0 |
| website | 0 | 0 | 0 | 0 | 0 |
| providers | 0 | 0 | 0 | 0 | 0 |
| enrollees | 0 | 0 | 0 | 0 | 0 |
| labor-management | 0 | 0 | 0 | 0 | 0 |
Parsing variables out of the filenames
When the Parse Filenames option is selected, the General Inquirer
creates columns in the output spreadsheet from the identifying
information in each file name, according to the following
procedure:
- 1.
- All the characters in a file name starting with the first period
are removed. For example, .txt and .doc will be removed.
- 2.
- Each word in a file name (separated by spaces) is given a
separate ID field.
- 3.
- The ID fields are labeled ID1, ID2, etc. It may be helpful to
rename these columns with more descriptive labels later on the
statistical spreadsheet.
- 4.
- For the last word in the file name (which may be the only word):
the program tests whether it begins with a letter. If it does, it then
looks for a digit in the word. If a digit is found, all characters up
to the digit are made into one ID field and the characters starting
with the digit are made into a second ID field.
Here are some examples of how variables are parsed out of the filenames.
- The filename bush speech defense1 is made into 4 ID fields
for the candidate name, the type of document, the topic, and the
serial number within that group: (1) bush, (2) speech, (3) defense,
(4) 1.
- The filename UMIN 0225.txt will have the .txt removed
and be made into two fields, one for Univ. of Minnesota, the second
for the newspaper date (February 25). The date field may be further
recoded into groupings by the statistical software.
- The filename DH134.TXT will have the .TXT removed
and separated into two fields, DH for a high performer and 134 for the respondent's ID number.
- The filename C87 will similarly be two fields, with the
C for conservative party and 87 indicating the year of the party
manifesto.
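The procedure above can be sketched in Python as follows. This is our reading of the four steps, not the General Inquirer's actual code: the function names are hypothetical, and we assume that "begins with a character" means "begins with a letter".

```python
import re


def parse_filename(filename):
    """Split a file name into ID field values, following the four steps."""
    # Step 1: drop everything from the first period onward.
    stem = filename.split(".", 1)[0]
    # Step 2: each space-separated word becomes one ID field.
    fields = stem.split()
    if not fields:
        return []
    # Step 4: if the last word starts with a letter (our assumption) and
    # contains a digit, split it in two at the first digit.
    last = fields[-1]
    if last[0].isalpha():
        digit = re.search(r"\d", last)
        if digit:
            fields[-1:] = [last[:digit.start()], last[digit.start():]]
    return fields


def label_fields(fields):
    """Step 3: label the fields ID1, ID2, and so on."""
    return {"ID%d" % i: value for i, value in enumerate(fields, start=1)}
```

Under these assumptions, parse_filename("bush speech defense1") gives the four fields bush, speech, defense, and 1, matching the first example in the list.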
This section describes the output format used to generate a detailed
output of the analyzed text. Regular expressions can be used on this
output in order to code parts of the text or the whole text in case of
short responses [2]. These regular expressions, or other
programs, which enable the computer to code the text are called production rules.
Here is an example of the output from a single word -- swam.
<swam~SWIM>1`DAV`SUPV`Travl`ED`Exprs`Actv`+`ROOT/
The above output contains all the useful information from the
General Inquirer separated by various metacharacters that make
the data easily parsed with regular expressions.
Here is an explanation of the above output.
< The less than symbol serves as
the word separator. The word immediately following this symbol
is the original word from the text. Swam is the original word in
the example above.
~ The tilde symbol immediately precedes the root of the
original word. In our example, swim is the root of the word swam.
> The greater than symbol precedes the sense number of
the word. In this case the sense number is one, which is the only
sense of the word swim.
` The backquote symbol precedes each instance of
a semantic tag of the word.
<< Two less than symbols serve as a sentence boundary.
There is no sentence boundary in the examples above.
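As an illustration of how these metacharacters make the output easy to parse, here is a minimal Python sketch that pulls the original word, root, sense number, and tags out of the swam example. The pattern and the function name parse_token are ours, and only the form shown in that example is handled. The guide does not explain the trailing slash, so the sketch leaves it attached to the last tag.

```python
import re

# One tagged word: <word~ROOT>sense`tag`tag...
TOKEN = re.compile(
    r"<(?P<word>[^~<>]+)"        # original word follows the < separator
    r"~(?P<root>[^>]+)"          # root follows the tilde
    r">(?P<sense>\d+)"           # sense number follows the > symbol
    r"(?P<tags>(?:`[^`<]*)*)"    # each tag is introduced by a backquote
)


def parse_token(token):
    """Return the parts of one tagged word, or None if it does not match."""
    match = TOKEN.match(token)
    if match is None:
        return None
    # Splitting on the backquote leaves an empty first piece; drop it.
    tags = match.group("tags").split("`")[1:]
    return {
        "word": match.group("word"),
        "root": match.group("root"),
        "sense": int(match.group("sense")),
        "tags": tags,
    }


EXAMPLE = "<swam~SWIM>1`DAV`SUPV`Travl`ED`Exprs`Actv`+`ROOT/"
parsed = parse_token(EXAMPLE)
```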
This detailed output can only be generated when starting the
General Inquirer from the command line; the graphical interface
does not support this feature. The input file in this case is
a tab delimited file. By default the text in the
first column is processed. A different value for the column
number containing the text to be processed can be specified when
starting the General Inquirer; the number of the column is specified
after the command.
Here is an example. The original file, tagtest, contains the following
two lines where the two fields are separated by a tab sign.
3 Strategies that work for discipline or curriculum.
4 Whatever I have to offer. I'd make a great wealthy person!
To process this small table of text we can use the following
command. In this case we will have to specify that the text is
located in the second column of the table.
java GeneralInquirer 2 < tagtest > tagtest.gi
The new file, tagtest.gi, would simply add the processed
column to the original file. This file would then be processed
by the production rules and the appropriate code would be appended
to the output.
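If your responses are not yet in a tab-delimited file, a small script can produce one. The Python sketch below builds the text of the tagtest example; the helper name to_tab_delimited is ours, and the check for embedded tabs and newlines is a precaution we added rather than a documented requirement.

```python
def to_tab_delimited(rows):
    """Join each row's fields with tabs, one row per line."""
    lines = []
    for fields in rows:
        for field in fields:
            # A tab or newline inside a field would shift the columns.
            if "\t" in field or "\n" in field:
                raise ValueError("fields must not contain tabs or newlines")
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


rows = [
    ("3", "Strategies that work for discipline or curriculum."),
    ("4", "Whatever I have to offer. I'd make a great wealthy person!"),
]
text = to_tab_delimited(rows)
```

Writing text to a file named tagtest gives an input suitable for the command shown above, with the text to be processed in the second column.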
Here are some examples of how to use the above output for
coding tasks using Perl regular expressions and programs.
-
To match within or across sentences, we would use a reference to a whole
response (here the 4th field in the input spreadsheet). For example:
if ( $a[3] =~ /~SMILE.*~FACE/)
matches any co-occurrence in the response of the entries SMILE (includes
smiling, etc) and FACE (includes faces, etc.).
-
To look for a word sense anywhere in response:
if ( $a[3] =~ /~PRODUCE>2/)
finds occurrences of sense two of dictionary entry ``PRODUCE''.
The General Inquirer separates sentences by a
<<. Within-sentence checks are then part of a ``while'' loop
that scans all sentences separately, with $ct variable incremented
from zero with each loop. In perl code:
@b = split(/<</,$a[3]);
$ct = 0;
while (defined($b[$ct])) { if-match-then-action ; $ct++; }
The following matches are then all within the same sentence:
-
if ( $b[$ct] =~ /~SMILE.*~FACE/)
matches dictionary entries SMILE and FACE, so should match "smiles on the
children's faces" etc. even if not adjacent words but in same sentence.
-
if ( $b[$ct] =~ /`Positiv.*<.*`ComForm/)
matches Positive and ComForm tags if not on same word, such as "kind
words" occurring in the text.
-
if ( $b[$ct] =~ /`Positiv[^<]*`ComForm/)
matches Positive and ComForm tags only if on same word (such as occurs for
words "thanks" and "praise").
-
Test types can be combined:
if ( $b[$ct] =~ /~ACE.*`Academ/)
Looks for entry word "ACE" followed by a word with an "Academ" tag, as in
"aces the test".
Here are a few general rules that might come in handy when constructing production
rules.
- The regular expression
[^~]*~ is used to test the occurrence of two
adjacent words. For example, ~COME>[^~]*~BACK matches on any string
that contains the word ``come'' followed directly by the word ``back.''
- The regular expression
([^~]*~){0,3} is used to test the occurrence
of two words within a given range. For example, ~MAKE([^~]*~){0,2}DIFFERENCE
matches on any string that contains the word ``make'' followed within up to
two words by the word ``difference.''
- The regular expression
([^~]*~){0,3}[^~]* is used to test if some words
are followed by a word with a specified tag within a given range.
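The sentence loop and the adjacency patterns described above carry over to other regular expression engines unchanged. The following Python sketch is one way to try them out. The helper names are ours, and the sample strings are hand-built in the style of the swam example with placeholder tags (TagA, TagB, and so on); they are not real General Inquirer output.

```python
import re


def sentences_matching(tagged_text, pattern):
    """Return the indexes of the sentences in which the pattern occurs."""
    rule = re.compile(pattern)
    # Sentences are separated by <<, as in the Perl split(/<</, ...).
    pieces = tagged_text.split("<<")
    return [i for i, piece in enumerate(pieces) if rule.search(piece)]


def adjacent(first, second):
    """Entry `first` followed directly by entry `second`."""
    return re.compile("~" + re.escape(first) + ">[^~]*~" + re.escape(second))


def within(first, second, max_steps):
    """Entry `first` followed by `second` at most `max_steps` words later."""
    return re.compile(
        "~" + re.escape(first)
        + "([^~]*~){0," + str(max_steps) + "}"
        + re.escape(second)
    )


# Hypothetical three-sentence response; the tags are placeholders.
RESPONSE = (
    "<smiles~SMILE>1`TagA/<on~ON>1`TagB/<faces~FACE>1`TagC/"
    "<<"
    "<she~SHE>1`TagD/<smiled~SMILE>1`TagA/"
    "<<"
    "<his~HIS>1`TagE/<face~FACE>1`TagC/"
)
PHRASE = "<make~MAKE>1`TagA/<a~A>1`TagB/<difference~DIFFERENCE>1`TagC/"

same_sentence = sentences_matching(RESPONSE, r"~SMILE.*~FACE")
anywhere = re.search(r"~SMILE.*~FACE", RESPONSE) is not None
```

Here same_sentence contains only the index of the first sentence, while anywhere is true because the whole-response test also matches across sentence boundaries. With PHRASE, within("MAKE", "DIFFERENCE", 2) matches and within("MAKE", "DIFFERENCE", 1) does not, since one word intervenes.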
This version of the General Inquirer is made available exclusively for
educational and research purposes. In publications please reference
this document and [3]. Harvard University and The
Gallup Organization have supported the development of this version of
the General Inquirer; please consider acknowledging their support.
Please do not distribute the General Inquirer on your own. We are
more than happy to give copies to other researchers, but we would like
to know who is using it.
This program is provided ``as is'' and carries no warranties of any
kind. Comments and questions are always welcome.
Description of the distribution
The General Inquirer is currently distributed as a zip archive containing
the following files and directories.
| filename | size | description |
| clickme.bat | | Startup file for Windows9x. |
| clickmemac | | Startup file for MacOs. |
| manual.pdf | | This document. |
| inquirerbasicttabsclean | 2911237 | The dictionary. |
| DisambigRules | 277167 | The disambiguation rules. |
| GeneralInquirer.class | 61520 | Main Java class. |
| giGui.class | 7943 | Java class for the user interface. |
| giThread.class | 6187 | Java class for concurrency. |
| testdir | directory | Contains a few text files. |
To uncompress this archive you will need a zip archiving
program on your computer. On Unix-based computers the unzip command will uncompress a zip archive into the current
directory. On the Windows platforms the winzip utility (http://www.winzip.com) is widely used for this purpose. The free
evaluation copy is likely to be sufficient. On the MacOs platforms
the maczip utility (http://www.sitec.net/maczip/) will do the
job.
[1] Edward F. Kelly and Philip J. Stone.
Computer Recognition of English Word Senses.
North-Holland Publishing, 1975.
[2] Andrew Perrin.
Coderead: A multiplatform coding engine for text-based data.
In American Sociological Association annual meeting, August 2000.
[3] Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, Daniel M. Ogilvie,
and associates.
The General Inquirer: A Computer Approach to Content Analysis.
MIT Press, 1966.