Origins of the General Inquirer marker categories.


The following text from the Kelly/Stone book, beginning on page 16, provides general background about the marker origins and their use. Further detailed information is supplied in the book. Note that this marker system was developed to operate under the computer-memory constraints at the time, limiting searches to a narrow text window (the sentence) for word meaning. Some of our semantic markers might be better assigned depending upon the characteristics of a wider window (such as the whole document), which is now quite feasible. Indeed, such a wider window disambiguation strategy is used by CL Research.

Disambiguation rules were created from a KWIC database ("Key word in context") showing instances of each word surrounded by four or five words on each side as they occurred in a text corpus. This corpus was drawn from the texts of various early General Inquirer projects. The accuracy of the rules (both type one and type two errors) was assessed by testing the rules on a second corpus drawn from the same source and is reported by a "Kappa" score for each word in the large appendix to the Kelly/Stone book.
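
As a rough illustration of the KWIC format described above, the following Python sketch lists each occurrence of a keyword with a few words of context on each side. It is not the original General Inquirer tooling: the function name `kwic`, the `width` parameter, and the sample sentence are all assumptions made for this example.

```python
def kwic(tokens, keyword, width=4):
    """Return (left, keyword, right) context triples for each hit.

    `width` is the number of context words kept on each side; the
    passage mentions four or five, so 4 is used as a default here.
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, tok, right))
    return rows


# Invented sample text, not drawn from the General Inquirer corpus.
text = "the judge will charge the jury before the bank can charge a fee"
rows = kwic(text.split(), "charge", width=4)

for left, key, right in rows:
    # Right-align the left context so the keyword column lines up.
    print(f"{' '.join(left):>30} | {key} | {' '.join(right)}")
```

Aligning the keyword in a single column is what lets a reader scan the contexts of one word quickly, which is the property the rule writers relied on.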

The basic structure was set up at the beginning of the project, essentially by shrewd guesswork. Although there have been numerous minor revisions, the system as a whole has fortunately proved rather serviceable, and just manageably complex for our purpose. We found it to be intractable in the sense that the only changes we could conveniently make were subdivisions of existing categories and whole-cloth additions. Rearrangements are hideously complex, from a purely clerical standpoint, the more so as the collection of entries grows. Our experience makes it abundantly clear that sustained growth of this system or any system like it (e.g., that contemplated by Katz and Fodor) will require careful attention to mechanization of many of these clerical functions. At the very least there should be a marker directory showing which entries have/use which markers. This is one of several points at which our dictionary threatens to defy control through sheer bulk -- yet the current system can only be regarded as impoverished relative to anything resembling realistic dimensions for a general and powerful language-processing system...

As mentioned earlier, the bulk of our sense distinctions follow part-of-speech breaks. Such distinctions naturally correspond to divergent syntactic environments. The set of syntactic markers, crude though it is, gives us enough power to mark these differences quite sharply. They implicitly supply rudimentary constituent analysis, telling us simply where clause boundaries lie in relation to the keyword. For example, a PRON is a one word noun phrase; a DET is likely to be the leftmost element of a noun phrase; and an immediately preceding MOD, TO or NEG probably implies verb. Such clues are of course not perfectly reliable, but they are usually reliable enough for our purposes, and where not, can always be supplemented by further (conditional) analysis. Within-part-of-speech distinctions of course fall within syntactically comparable environments, and therefore depend heavily on the use of semantic markers for disambiguation. This part of the system is much less satisfactory, reflecting the general chaos in semantic theory. Many plausible candidate areas are not represented at all in the system, and some of the marker categories we do include are broad and ill-defined (particularly ABS). Nevertheless, these categories have proved useful, when liberally supplemented with tests for specific words. If we were to redesign the system, however, this is the part that would merit the most effort.
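
The syntactic cues mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rule language actually used in the project. The marker names PRON, MOD, TO, NEG and DET come from the passage; the toy lexicon `TOY_MARKERS`, the function `guess_category`, and the reading that a keyword directly after a DET sits inside a noun phrase are assumptions introduced here.

```python
# Toy lexicon mapping a few words to syntactic markers.
# Hypothetical and far smaller than any real marker dictionary.
TOY_MARKERS = {
    "they": "PRON",
    "will": "MOD",
    "to": "TO",
    "not": "NEG",
    "the": "DET",
    "a": "DET",
}

# Markers that, when immediately preceding the keyword, suggest a verb.
VERB_CUES = {"MOD", "TO", "NEG"}


def guess_category(words, i):
    """Guess the category of words[i] from the preceding word's marker.

    Returns "verb", "noun-phrase", or None when no cue applies.
    Only the single preceding word is inspected; a real rule set
    would add further conditional tests.
    """
    if i == 0:
        return None
    prev_marker = TOY_MARKERS.get(words[i - 1].lower())
    if prev_marker in VERB_CUES:
        return "verb"
    if prev_marker == "DET":
        # Assumption: directly after a determiner the keyword lies
        # inside a noun phrase (it could still be a modifier).
        return "noun-phrase"
    return None


print(guess_category("they will charge a fee".split(), 2))
print(guess_category("the charge was dropped".split(), 1))
```

As the passage notes, such clues are not perfectly reliable, which is why the sketch returns `None` rather than forcing a decision when no cue is present.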

This then is the basic weaponry of our system. There are no hard-and-fast principles governing its deployment in the construction of rule sets; the process is severely underdetermined by the data. In a general way we tried to give efficiency its due -- for example, high frequency rules tend to be ordered first, other things being equal, and idioms are normally identified from the leftmost content word -- but the logic of the rules is our principal concern. There is an interesting parallel with the problem of curve-fitting. Just as any set of data points can be fit arbitrarily closely by constructing a polynomial of appropriate degree, so any set of KWIC tokens can be handled perfectly by allowing sufficiently cumbersome rule sets, for example consisting vacuously of one rule for each distinct environment of the entry-word. Such a solution, however, would possess little generality; the craft in writing rules is learning to pitch them at a level which will optimize transfer to new text. Among our disambiguators there was considerable and stable individual variation in this "feel" for regularities in word usage. There are other stylistic differences as well -- for example, some regularly produced rule sets with great logical depth, whereas others used branches relatively rarely; a few people delighted in tests for the absence of items, whereas most of us bowed to the cognitive psychologists in shunning negative information; and so on. In general there seemed to be almost as many ways of attacking an entry as there were disambiguators; and there is no dependable way of gauging the quality of a construction in advance of a test based on new data.
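
The curve-fitting parallel can be made concrete with a small Python sketch. Everything in it is invented for illustration: the toy tokens, the sense labels, and the two rule sets are assumptions, not data or rules from the Kelly/Stone book. It contrasts a vacuous rule set (one rule per distinct environment seen in the first corpus) with a more general one, scoring both on held-out tokens.

```python
# Toy KWIC tokens for one entry word: (preceding word, sense label).
# All of this data is invented for the example.
train = [("will", "verb"), ("to", "verb"), ("the", "noun"), ("a", "noun")]
held_out = [("not", "verb"), ("will", "verb"), ("this", "noun"), ("the", "noun")]

# Vacuous rule set: memorize one rule per environment seen in training.
memorized = {prev: sense for prev, sense in train}


def memorizing_rules(prev):
    # Returns None for any environment not seen in training.
    return memorized.get(prev)


# General rule set: pitched at a class of cue words rather than at
# individual environments (word list is a stand-in for MOD, TO, NEG).
VERB_CUE_WORDS = {"will", "to", "not", "can"}


def general_rules(prev):
    return "verb" if prev in VERB_CUE_WORDS else "noun"


def accuracy(rules, tokens):
    """Fraction of tokens whose predicted sense matches the label."""
    return sum(rules(prev) == sense for prev, sense in tokens) / len(tokens)


for name, rules in [("memorizing", memorizing_rules), ("general", general_rules)]:
    print(name, accuracy(rules, train), accuracy(rules, held_out))
```

In this contrived case both rule sets handle the first corpus perfectly, but only the general one transfers to the unseen environments, which is the point of testing on a second corpus.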

In practice, we worked as follows: A disambiguator (there have been some 25 in all, of whom about 8 did the bulk of the useful work) would pick an entry from the master list and check it off as "in progress." Then, working directly on the corresponding section of the KWIC, in consultation with a dictionary as described above, he would write in next to each token of the entry the appropriate sense number. This allowed the accumulation of sense totals and set the stage for rule writing. As rules were successively devised, the corresponding totals were accumulated and the cases thereby handled stricken from the listings with colored markers, using different colors for different rules to facilitate possible recount. The output of this process was a recording sheet in standard format containing all the basic information about the entry. These sheets were then reviewed for glaring problems with senses and/or rules, plus clerical errors. Otherwise we necessarily relied on the competence of the disambiguator.

