Note Introducing Server Version of General Inquirer -- from inquirer blog
The note below was extracted from a blog set up at the time the General Inquirer server was being introduced. The blog is no longer in use and this major content has been put here instead.
Philip Stone's Weblog
Since its original development in the early 1960’s, the General Inquirer content analysis software has been adapted again and again to the changing landscape of computing resources. The earlier stages in this adaptation saga are roughly as follows:
1) Originally programmed in several languages (COMIT, BALGOL) for the IBM 704-709-7094 mainframe series, the system was reprogrammed in the more powerful PL/I that became available with the large IBM 370 mainframes. These mainframe programs were then supplemented with AUTOCODER programs operating on the smaller, more accessible and hands-on IBM 1401 computer for searching and retrieving text stored on magnetic computer tapes.
2) When time-shared resources first became available at MIT’s Project MAC and then on commercial time-shared systems in the late 1960’s, the PL/I General Inquirer programs were adapted to support real-time content analyses. Time-shared computing, usually by connecting a teletype or Selectric typewriter to a mainframe computer over a phone line, enabled us, for example, to perform content analyses of TAT stories as soon as they were typed, giving immediate scores for such categories as “need achievement.”
3) In the 1970’s and 1980’s, the General Inquirer might have been adapted for the increasingly popular Berkeley UNIX platforms, which at the time ran mainly on DEC and Data General computers. However, IBM mainframes continued to be available and offered continuously improving computing capabilities in speed and available memory size, while the skimpy PL/I versions eventually available on UNIX machines never did facilitate a straightforward implementation of the General Inquirer on a UNIX platform.
4) When PC and Mac computers increased their RAM and hard-disk storage capacities enough by the mid-1990’s to handle the General Inquirer content-analysis procedures, the Inquirer was reprogrammed for them using the Dartmouth College “TrueBasic” language with its pioneering capabilities in effectively handling very long text strings. A “run-time” PC or Mac compilation of the TrueBasic General Inquirer was distributed that did not require users to have the TrueBasic language on their computers. General Inquirer users could then complete an entire content-analysis project on their personal computer, often also enlisting desktop statistical software packages that by then had become available on personal computers, such as SPSS and JMP, to complete their analyses.
5) By the late 1990’s, Java’s widespread distribution on personal computers supported large Java programs being run on them. The General Inquirer was therefore reprogrammed in Java for personal computers as well as other platforms. This adaptation of the General Inquirer was made available without cost to academic users by downloading a zipped package containing the Inquirer’s Java procedures, its dictionaries, and its disambiguation rules. In addition, a General Inquirer website was created to provide detailed information that might be helpful in using the system.
Although the TrueBasic and Java programs proved stable and practical, continuing to have the Inquirer as a client program operating on the user’s computer by now seems less than optimal. As long as people were using modems, data transfer was so slow as to make it impractical to transmit large amounts of textual data back and forth. Therefore, it made sense to do an entire content analysis project on one’s computer, analyzing text files located on that computer. By now, however, broadband connections have become so widespread in offices, student dorms, and many homes that the computing environment has once again “tipped” to open new and better options. A recent grant from Harvard University’s technology program makes it possible for us to adapt the General Inquirer yet once again to take advantage of new circumstances.
Disadvantages of stand-alone content analysis on a personal computer.
Although scores of users have successfully been able to download and use our Java software on their own computers, this strategy has several disadvantages.
First, it yields very little collaboration and data sharing. Each person is doing his or her own thing on his or her own computer. This becomes especially inappropriate for team projects, whether at school or work, for it makes sharing unnecessarily burdensome when it should be seamless.
Second, it has yielded little towards the development of publicly available textual archives or of publicly available enhanced content-analysis capacities such as expanded dictionary vocabularies, additional tag categories and additional disambiguation routines. An exciting climate of cumulative capabilities and resource sharing that many “open source” software projects have realized has not characterized General Inquirer users isolated from one another.
Third, many students today have become dependent on using their browser as their computer interface to the world and have Microsoft Office as their main independent program suite on their machine, supplemented perhaps with some games. The consequence has been that many students are decidedly uncomfortable about downloading and using our Java implementation. For them, explicitly downloading and running a Java program is an unfamiliar task, not like anything else they do. Indeed, Sun Microsystems’ early vision of having a wide variety of client Java software running on each person’s computer was never in this sense fully realized. The fact that downloading and running our Java-based system is quite a unique experience for most students today is stark testimony to that.
Fourth, many people, including students, maintain minimal, if any, statistical processing capabilities other than Excel on their personal computers, and many of them are inexperienced in using even Excel’s rudimentary statistical capabilities. After they finish content-analyzing texts on their computers, they then encounter the problem, while isolated on their own computer, of how to analyze the count spreadsheets that the Inquirer produces.
Fifth, most texts that students as well as researchers now want to content-analyze are not on their computers. They are out there on the web. Whether college newspaper editorials, country and western song lyrics, protest web sites, or media information on Nexis, there has been an explosion in how much textual information is web-available, not only current information but increasingly also information archived in historical depth.
Textual data-gathering options for a server-based Inquirer.
We envision once again adapting the General Inquirer by having it accessible on servers, rather than client computers, and accessing it in two ways. Path A will be access to the full General Inquirer resources through a browser, with many checks provided by the server to help the user avoid making mistakes and running into problems. Path B will be access using a server’s Unix or Linux line editor. Using Path B, a General Inquirer user has more options, but can easily make mistakes.
Most students would use Path A exclusively. Researchers, as well as enterprising students who want to put in the extra necessary effort, could also enlist Path B when they want.
General Inquirer users would have several options for creating text files on a server for content analyses:
One Path-A option is to upload a file or a folder of files from one’s personal computer to a server. A browser-based version of this procedure, for example, is available to course instructors at Harvard. To do this upload, they bring up a browser window that features a pull-down menu of files available on their computer and another pull-down menu of folders available on the server. Simple clicks enable the instructor (or the teaching fellow, if the instructor is really computer challenged) to specify which files from the computer should go where on the server that supports the course web site.
A second Path-A option is to cut and paste text from websites or personal computer files directly into a browser window’s scrolled text box. The browser window can then also feature radio buttons as well as a title text field and a URL-source text field to specify further information about the file. Several MIT artificial-intelligence colleagues have acquired considerable experience in developing this browser-based, cut-and-paste approach for a “clipping” service used by volunteers for political candidates. They have found that human inspection of the text, as well as using pull-down menus for volunteers to describe information about the text source, is much preferable for this task to using autonomous web crawlers.
The use of such a browser-based data-gathering approach would have been an appreciated improvement over a cut-and-paste approach we employed in several seminars at Harvard as well as at the Essex European consortium summer-school program. For those projects, students copied and pasted text files from the web pages into MS Word text files, with information about the text source identified in the file names. These files were then manually assembled into folders for Inquirer processing. The results were similar, but the process was more tedious. Also, there was a limit to how much descriptive information could be put into a file name. The MIT browser-based approach is not only much better, but also lets us draw upon their extensive implementation experience from having volunteers cut and paste thousands of documents.
Rather than assume that one format fits all, a project leader might create special webpage formats for gathering information for the project. Template web pages and instructions could be provided so a project leader could easily create descriptor variables and modify the web page HTML. Of course, those who choose the Path-B, line-editor access to Unix on the server can readily select from various open-source web crawlers to port text files to a server for content analyses. Vanja Buvac has successfully used such an approach in searching such sites as “E-pinions” for suitable evaluation commentaries for a research project.
In addition, there probably also are, or soon will be, at least some browser-based procedures available for instructing a web crawler. If appropriate, these can then be added to our Path A package of procedures as they become available.
Whatever method of gathering textual data is used, pointers to the text files would be maintained in an SQL relational database, probably by utilizing an open-source version such as the popular “MySQL.” Each Path-A user would begin sessions by completing a log-in on the first screen, which would then initiate a tracing of the user’s activities for the session in the database. As part of the login webpage, the user would find radio buttons for each “project” for which he or she is an authorized member. Added text files might be noted not only in the user’s database, but also in the “project” database if the user is participating in a collaborative activity. Any Path B user, of course, would need to have a regular Unix user account and would have access to the usual Unix file-sharing options.
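The bookkeeping just described might be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and every table, column, and function name (users, projects, memberships, text_files, projects_for, register_text) is hypothetical rather than part of any General Inquirer release.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative only.
SCHEMA = """
CREATE TABLE users    (id INTEGER PRIMARY KEY, login TEXT UNIQUE NOT NULL);
CREATE TABLE projects (id INTEGER PRIMARY KEY, name  TEXT UNIQUE NOT NULL);
CREATE TABLE memberships (
    user_id    INTEGER NOT NULL REFERENCES users(id),
    project_id INTEGER NOT NULL REFERENCES projects(id),
    PRIMARY KEY (user_id, project_id)
);
CREATE TABLE text_files (
    id         INTEGER PRIMARY KEY,
    path       TEXT NOT NULL,          -- pointer to the file on the server
    title      TEXT,
    source_url TEXT,
    owner_id   INTEGER NOT NULL REFERENCES users(id),
    project_id INTEGER REFERENCES projects(id)  -- NULL for private files
);
"""


def open_db(path=":memory:"):
    """Create a fresh database holding the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def projects_for(conn, login):
    """Names of the projects a user may pick at log-in (the radio buttons)."""
    rows = conn.execute(
        "SELECT p.name FROM projects p "
        "JOIN memberships m ON m.project_id = p.id "
        "JOIN users u ON u.id = m.user_id "
        "WHERE u.login = ? ORDER BY p.name",
        (login,),
    )
    return [row[0] for row in rows]


def register_text(conn, login, path, title=None, source_url=None, project=None):
    """Store a pointer to a text file, shared with a project only if authorized."""
    user = conn.execute(
        "SELECT id FROM users WHERE login = ?", (login,)
    ).fetchone()
    if user is None:
        raise ValueError("unknown user: " + login)
    project_id = None
    if project is not None:
        if project not in projects_for(conn, login):
            raise PermissionError(login + " is not a member of " + project)
        project_id = conn.execute(
            "SELECT id FROM projects WHERE name = ?", (project,)
        ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO text_files (path, title, source_url, owner_id, project_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (path, title, source_url, user[0], project_id),
    )
    conn.commit()
    return cur.lastrowid
```

Keeping only a pointer to each file, not its text, follows the description above. Session tracing could be added as one more table keyed on the user.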
Editing dictionaries and disambiguation rules.
Special browser-based interfaces will be developed, probably mostly using Dreamweaver/Flash, to provide for (1) the addition of entries to dictionaries, (2) the development of new tag categories, (3) the editing of tag assignments, (4) the editing and amplification of existing disambiguation rules, and (5) the development of new sets of disambiguation rules for entry words not currently disambiguated.
Users who want to modify a dictionary and/or disambiguation rules will be given a copy of the master file in their SQL database. If participating in a project and authorized, they may be given access to a copy of the master file that project members share.
The tag category assignments for any entry word or word sense will be represented as a list. Tags will be added from a pull-down menu. A tag assignment will be removed by clicking a radio button next to that tag name in the list. A special web page will be used to create new tag names or to remove a tag name from that copy of the dictionary.
Disambiguation rules can become quite complicated, branching to several rules ahead and incorporating several different types of tests. If an “or” test, the match of any of the test tags or words satisfies the match. If an “and” test, then all of the tests must match. For some test types, all the tests must not only match, but match in the specified order.
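The three kinds of test just described can be illustrated with a short sketch. It does not reproduce the General Inquirer's actual rule notation: the representation of a token as a set holding its word and tags, the function names, and the tag names used in the usage example are all assumptions made for illustration.

```python
def item_matches(token, test):
    """A token is a set holding a word and its tags; a test is one word or tag."""
    return test in token


def window_matches(window, tests, kind):
    """Apply an "or", "and", or "ordered" test to a window of tokens."""
    if kind == "or":
        # Any single test matching any token satisfies the rule.
        return any(item_matches(tok, t) for tok in window for t in tests)
    if kind == "and":
        # Every test must match some token, in any order.
        return all(any(item_matches(tok, t) for tok in window) for t in tests)
    if kind == "ordered":
        # Every test must match, each at a later position than the one before.
        pos = 0
        for t in tests:
            while pos < len(window) and not item_matches(window[pos], t):
                pos += 1
            if pos == len(window):
                return False
            pos += 1
        return True
    raise ValueError("unknown test kind: " + kind)
```

For a window such as `[{"look", "VERB"}, {"up", "PREP"}, {"the", "DET"}]`, the tests `["VERB", "PREP"]` satisfy both the "and" and the "ordered" kinds, while `["PREP", "VERB"]` satisfies "and" but fails "ordered".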
The web-browser page would provide a scrollable listing of all the rules of a specified entry word. By clicking on a sense number, a pop-up window might show the meaning of that sense and the tags currently assigned to it. Clicking a box would also allow the contents of that box to be modified. Further assists may be provided to make changes easier. For example, a clickable “step” button could appear to increase or decrease the range of the “to” and “from” windows. In addition, a pull-down menu of legal tests might be used to specify the test type. As new rules are inserted or added, the arrows linking the rules would be redrawn accordingly.
Finally, the web page might feature some macro options that would facilitate the addition of common verb phrasals or idioms. By clicking on these options, a template rule might be inserted that would check for a preposition following the target word. The user might then just select the preposition from a pull-down menu and not even have to type it.
By making additions to the dictionary and disambiguation rules as easy and fool-proof as possible, it may become much more feasible to increase significantly the comprehensiveness of the General Inquirer’s coverage of the English language. Compared to when the General Inquirer was first created, there are now extensive published collections of verb phrasals and idioms that could easily be incorporated into our existing disambiguation rules. The current General Inquirer dictionaries and disambiguation rules indeed reflect the limits of what once could fit on a readily available partition of a mainframe computer. The current General Inquirer’s 10,000+ dictionary entries and 6,600 disambiguation rules could each be increased several fold without creating any processing problems on today’s servers.
Still another new resource for improving disambiguation quality would be to run the text against a full-syntax parser as a first pass, making part of speech assignments available to the disambiguation routines. In earlier times such complex syntax parsers had so-so accuracy and the mainframe computing costs to do the parsing were too high to justify our using them for large content-analysis projects. Today, of course, the computing costs are near zero. Disambiguation rules could then be modified to take advantage of more specific as well as more accurately assigned syntax markers.
At the end of the editing session for any word, the server computer would check to make sure that the logic is complete. If errors were detected, the user would be asked to address and fix them before the server converts the edited version back to the master file notation and inserts it into the copy of the master file. By enlisting an easily comprehended visual layout like that proposed here, we anticipate that disambiguation rule editing will be greatly facilitated compared to editing the General Inquirer’s disambiguation notation.
By simplifying the editing and expansion of dictionary entries and disambiguation rules as well as the addition of new tag categories, and by making these procedures publicly available, we hope to encourage language buffs to add to the system, creating “open system” cumulative products that are both more comprehensive in covering the English language and sophisticatedly targeted to identify specific themes in texts.
Batch-mode processing.
Having gathered the texts to be analyzed and edited the dictionary and disambiguation rules, the user is ready to process a folder of text files. This can be invoked using screens with pull-down menus, much as the current Java version does on the client machine. These menus would specify which dictionary is to be applied to which folders of files and where the results are to be stored. Menus might also be used to specify what background information about each file should appear as variables on the output spreadsheets. For output, some users might only want counts of tag assignments. Some might also want to know how many times each word or word sense in the dictionary was matched. Others might want a marked-up version of the text showing where all the tags have been assigned. If there is much text to be processed, the Path A user could check back later to a “job status” web page.
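A minimal sketch of the simplest output option, a spreadsheet of tag counts with one row per file, might look like the following. It is not the Inquirer's own procedure: it skips disambiguation entirely, the three-word dictionary and its tag names are toy data, and the function names are hypothetical.

```python
import csv
import re
from collections import Counter
from pathlib import Path

# Toy dictionary for illustration only; words and tag names are made up.
TOY_DICTIONARY = {
    "good": ["Positive"],
    "win": ["Positive", "Strong"],
    "bad": ["Negative"],
}


def count_tags(text, dictionary):
    """Count tag assignments for one text, with no disambiguation."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        counts.update(dictionary.get(word, []))
    return counts


def process_folder(folder, dictionary, out_csv):
    """Write one row of tag counts per .txt file found in the folder."""
    tags = sorted({tag for tag_list in dictionary.values() for tag in tag_list})
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file"] + tags)
        for path in sorted(Path(folder).glob("*.txt")):
            counts = count_tags(path.read_text(encoding="utf-8"), dictionary)
            writer.writerow([path.name] + [counts[tag] for tag in tags])
```

Background variables about each file, such as a title or source URL, could be added as extra columns before the tag counts.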
General Inquirer background processing can then take place in Java or any other language operating on a Unix or Linux platform. One factor in the choice of language, in addition to appropriateness and availability, is what fits best with what others in our community are using. Here we find little consensus. At Harvard, the various programming groups that are developing software for classroom applications divide into advocates of Perl, Java, and, at the medical campus, Visual Basic. At MIT, our colleagues developed the clipping service background programs in a version of LISP that operates on a Linux platform.
Key to the success of the system, especially for successful project collaboration that supports concurrent users, will be the database management system. This system will have to coordinate changes and processing requests from multiple members of a project team. Some projects will be using standard dictionaries and disambiguation rules. At the other extreme, some projects may be developing their content analysis categories and disambiguation rules from scratch, perhaps with the aid of inductive procedures such as correspondence analysis or latent semantic structure tools.
After batch processing is completed, the user should be able to download the General Inquirer word count spreadsheets or draw upon statistical analysis software on the server to complete the analysis. Some users might want to switch from Path A to a Path B mode at this point, depending on which statistical-analysis strategy is chosen.
Yet another realm for further General Inquirer development is the creation of better “visualization” tools for seeing relationships in content analysis results. In this regard, we have been particularly interested in the various visualization tools for information in spreadsheets developed by Professor Jeffrey Huang’s group at Harvard, using Dreamweaver/Flash technology.
By having as one option the creation of a marked-up file showing all tag assignments, it is possible to go back and retrieve instances of any assignment, much as was originally accomplished using the IBM 1401. This file also can serve as input for the use of regular expressions to score higher order themes such as need achievement or need affiliation, as described in the original MIT Press General Inquirer book.
As part of our investigations in using regular expressions for projects at Gallup, we devised a mark-up notation for content-analyzed text to facilitate the unique identification of inflected text word forms, root text word forms, and all tags assigned to them. This notation has been used both for retrieval purposes and as input to scoring for the presence and absence of particular themes by using regular expressions written in Perl. Our experience in scoring some of Gallup’s large collections of open-ended responses showed that for many questions the computer scoring of themes based on patterns found by a set of regular expression tests (what we call “production rules”) agreed with Gallup’s professional scorers as well as those scorers agreed with each other. Such automatic scoring, of course, opens up the possibility of making much greater use of open-ended questions in web-based interviews, for it greatly lowers the costs of analyzing responses. The choice between the use of open-ended and closed-ended questions is then determined not by the analysis cost, but by the amount of time the respondents need to write out their thoughts, compared to checking a box.
For Path B users, specifying complex retrieval searches is straightforward: they can write regular expressions in Perl, PHP or other Unix/Linux-based languages that support regular expressions. Path A, browser-based basic retrieval procedures probably could be easily developed. Developing Path A procedures for specifying the equivalent of regular expressions, however, is a more daunting challenge. In our experience, many people do not readily become fluent in writing regular expressions in currently available notations. If a Path A procedure were created, it probably also should feature a representation that would allow users to write production rules that are the equivalent of regular expressions in a less daunting way. Perhaps a suitable visualization somewhat similar to what we propose for the representation of disambiguation rules could be part of a browser-based representation of production rules. To our knowledge, such a more user-friendly representation for developing production rules is yet to be created. Vanja Buvac has considered further some of the issues in creating such capabilities. The availability of such an easily used tool, of course, could be very useful in taking content analysis for many users to a new stage of thematic-scoring sophistication.
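To give a rough idea of what a production rule over marked-up text could look like, here is a small sketch in Python rather than Perl. The mark-up format is invented for this example, since the actual notation is not given here: each token is written as inflected/root{Tag1,Tag2}. The tag names and function names are hypothetical as well.

```python
import re

# Assumed mark-up, invented for illustration: inflected/root{Tag1,Tag2}
TOKEN = r"[^\s/{}]+/[^\s/{}]+\{[^}\s]*\}"


def has_tag(tag):
    """Regex fragment matching one marked-up token that carries the tag."""
    return (
        r"(?<!\S)[^\s/{}]+/[^\s/{}]+\{(?:[^}\s]*,)?"
        + re.escape(tag)
        + r"(?:,[^}\s]*)?\}"
    )


def followed_within(first_tag, second_tag, max_gap):
    """Rule: a first_tag token, then a second_tag token after at most max_gap others."""
    gap = r"(?:\s+" + TOKEN + r"){0," + str(max_gap) + r"}"
    return re.compile(has_tag(first_tag) + gap + r"\s+" + has_tag(second_tag))


def theme_present(marked_text, rule):
    """Score presence (True) or absence (False) of the pattern in the text."""
    return rule.search(marked_text) is not None


MARKED = (
    "she/she{Person} worked/work{Effort} very/very{} "
    "hard/hard{Strong} to/to{} succeed/succeed{Goal,Achieve}"
)
```

With this sample, `theme_present(MARKED, followed_within("Effort", "Achieve", 3))` is true, because three tokens separate the two tagged words. The same rule with a gap of 2 is false. A full scoring scheme would combine many such rules per theme.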
Real-time processing.
Much as we once scored themes on stories typed on a teletype connected to Project MAC’s mainframe and gave immediate feedback to the person at the terminal, it now becomes possible to score immediately and give feedback regarding the many kinds of text that people type into web browsers. This has two important purposes. One is to evaluate the adequacy of what is typed and to make instant probes for further information if there is too little information to score. The second is to provide useful feedback to the person doing the typing, an option that may be particularly important for exercises designed to provide self-insight. Immediate feedback, as any behaviorist knows, is much more effective than feedback much later. Such immediate feedback also could be of use in helping a person write documents that do not have unintended messages, such as a latent hostility. A content analysis could offer more sophisticated analyses and feedback than some systems now provide when one employs flagrant terms commonly associated with flaming.
We anticipate that the development of real-time procedures usually will be a “second step” that follows extensive testing of scoring adequacy on a lot of data using batch processing. Because of this, we anticipate the management of real-time operations usually being a Path B procedure, at least for some time to come.
Applications to other languages.
While all the developments described here focus on English, we anticipate that these strategies would also be useful for content-analysis capabilities in other languages. Current projects at Gallup-Europe provide some insight about the problems of providing an equivalent to the General Inquirer in languages such as Hungarian, where issues such as stemming or disambiguation take very different forms.
An additional problem we have addressed elsewhere is that of calibrating sentiment intensities across the languages of the European Union, a challenge that we propose may be best addressed in part inductively. This, however, is an additional agenda entirely beyond that of our current project support and will require significant resources to address it effectively.
Philip Stone March, 2004
Comment On This Page
I am currently senior lecturer at the Department of Mass Media and Communication Studies, University of Madras. My current research includes how news media covers issues pertaining to development and environment. As such, I intend to carry out content analysis of Indian newspapers in English. I have read about your software and have also learnt a little about it from your site. I think the software would meet a lot of my requirements as well as the Department's research program in journalism and news media. I would like to know how we could get a copy of General Inquirer for academic use. Are there any schemes or special arrangements for educational institutions? Can we download the program? Looking forward to your reply.