New Inquirer Feature — Multiword Dictionary Entries

Students often want to add idioms (like "freak out") and multiword names (like "Burger King") to the dictionary without getting involved in writing disambiguation rules. We find this is especially important for research on political or organizational documents where multiword noun strings (for example, "Sierra Club", "Millennium Dome") may be grouped into categories, as well as for adding the idioms of a particular topic area (for example, "hard drive") or idiomatic language of a subculture ("Give me five").

In order to give more attention to the logic of content-analysis research and the use of the General Inquirer in studying text samples, our workshops usually have not covered how to write disambiguation rules. These rules must be written exactly, because a single missing comma can create havoc. When we once tried to teach the writing of disambiguation rules in a semester seminar, we found that most students do not relate well to having to produce perfectly correct complex syntactic forms, although a few do thrive on it.

To provide an easier way to add simple multiword strings, including names, idioms, and phrasal verbs, we are modifying the Inquirer to support multiword strings in the dictionary entry column of the Excel spreadsheet and to assign the tags specified in the row for that entry. An initial version of the new software is now in operation.

These additions to the Excel "entry" column should always be filed under the first word of the multiword string, and their placement should conform to the Excel sort order. This means that if a word is not disambiguated, the multiword string comes after the word. For example, the word "freak" is not currently disambiguated. The dictionary entry column should thus list "freak" first, followed directly by "freak out".

If a word is not in the dictionary and appears only in a multiword form, then any occurrence of the word apart from that multiword form will appear in the leftover list. For example, the word "burger" is not currently in the dictionary. Therefore, an entry for "burger king" alone would match "burger king" but would not recognize the word "burger" by itself.

If a multiword entry is being added to a word that is disambiguated, it should come before the first sense number in order to conform to the sort order. For example, there are four senses of "easy", but none of them represents "easy going". Therefore, the "easy going" entry should be placed immediately before the first of the four numbered senses of "easy".

There is no limit to the number of words in a string. A longer entry like "GO WITH THE FLOW" works just fine.

Note that the first word will be suffix-chopped, if necessary, so that the entry for "freak out", for example, will also match "freaked out" and "freaking out". The system also handles most strong verbs, so that the entry "go with the flow" matches not only "goes with the flow" but also "went with the flow". The entry should use the present-tense form of the verb.
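The first-word behavior described above can be pictured with a minimal Python sketch. It is not the Inquirer's actual code: the suffix list, the strong-verb table, and the function name first_word_matches are all assumptions made for illustration, and the real suffix-chopping rules are certainly more extensive.

```python
# Hypothetical suffix list and strong-verb table, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")
STRONG_VERBS = {"went": "go", "gone": "go"}


def first_word_matches(entry_word, text_word):
    """Return True if text_word is a form of the entry's first word."""
    entry_word = entry_word.lower()
    text_word = text_word.lower()
    if text_word == entry_word:
        return True
    # Strong (irregular) verbs are looked up rather than suffix-chopped.
    if STRONG_VERBS.get(text_word) == entry_word:
        return True
    for suffix in SUFFIXES:
        if text_word.endswith(suffix):
            stem = text_word[: -len(suffix)]
            # Also allow for a final "e" dropped before the suffix
            # ("change" -> "changing").
            if stem == entry_word or stem + "e" == entry_word:
                return True
    return False
```

With these assumed tables, first_word_matches("freak", "freaking") and first_word_matches("go", "went") both succeed, while an unrelated word such as "fries" does not match "freak".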

At present, the matching of the words in the entry does not allow for extra intervening words. There is no "wildcard" option for intervening words as offered in some systems like ISYS. Thus, "change direction" will match "changing direction", but not "changing my direction". A wildcard option could be added if there proves to be a strong demand for it.

At present, the words after the first word are not suffix-chopped. The comparison is not strictly exact, however: a text word matches when it forms the initial part of the corresponding entry word. An entry word may therefore carry a suffix and still match the root form occurring in the text. Thus, the entry "change directions" will match "changing directions" and "changing direction". But the entry "change direction" will not match "changing directions".
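The two rules for the words after the first word (no intervening words, and a match when the text word is the initial part of the entry word) can be sketched as follows. This is an illustrative reading of the examples above, not the Inquirer's implementation; the function name rest_matches is hypothetical, and a literal prefix test like this one is more permissive than the real software is likely to be.

```python
def rest_matches(entry_words, text_words):
    """Compare an entry's words after the first with the text words that follow.

    entry_words: the entry's words after its first word.
    text_words:  the text words immediately after the matched first word.
    """
    if len(text_words) < len(entry_words):
        return False
    # Positions line up one for one, so no intervening text word is skipped.
    # A text word matches when it is the initial part of the entry word.
    return all(
        text != "" and entry.lower().startswith(text.lower())
        for entry, text in zip(entry_words, text_words)
    )
```

Under this reading, the entry word "directions" accepts the text word "direction", the entry word "direction" rejects the text word "directions", and "my direction" fails against "direction" because "my" occupies the position where "direction" is expected.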

Note too that after a multiword string is matched, the remaining words in the string are not tagged, but instead the string is treated as an idiom. No tags will be assigned to "king", "going", "out" or "with the flow" in the above examples. Content analysis resumes with the next text word after the occurrence of the string. Often, a user may want to include a string in order to have the words in the string not counted. For example, an annual report on "Burger King" should not score high on royalty.
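The effect of treating a matched string as a single idiom can be sketched with a small scanning loop. Everything here is an assumption for illustration: the miniature dictionary, the tag names, and the function tag_text are hypothetical, words are compared exactly in lowercase (suffix chopping is left out to keep the sketch short), and trying the longest candidate first is one plausible way to give multiword entries priority.

```python
# Hypothetical miniature dictionary: entry (tuple of lowercase words) -> tags.
DICTIONARY = {
    ("burger", "king"): ["Company"],
    ("king",): ["Royalty"],
    ("report",): ["Communication"],
}


def tag_text(tokens):
    """Return (tagged, leftover): tagged entries with their tags, and unmatched words."""
    tagged, leftover = [], []
    longest = max(len(entry) for entry in DICTIONARY)
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so a multiword entry wins
        # over a single-word entry starting at the same position.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            candidate = tuple(word.lower() for word in tokens[i:i + n])
            if candidate in DICTIONARY:
                tagged.append((" ".join(candidate), DICTIONARY[candidate]))
                i += n  # resume with the next text word after the string
                break
        else:
            # No entry starts here: the word goes to the leftover list.
            leftover.append(tokens[i].lower())
            i += 1
    return tagged, leftover
```

In this sketch, "king" inside "Burger King" never receives the hypothetical Royalty tag, because the whole string is consumed before "king" is examined on its own. A lone "burger" ends up in the leftover list.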

For further capabilities in developing General Inquirer dictionaries, it is necessary to understand the disambiguation rule writing procedures in detail. Their logic is detailed in the Kelly/Stone book. Now that the Inquirer is no longer constrained by memory storage, we intend to develop a less compact notational scheme that is less prone to error, as well as to offer several new options based on windows wider than the sentence.

Until we gain more experience with testing these new multiword options, including the possible necessity for further programming changes, these capabilities are not included in the Java software that is distributed. We appreciate receiving Excel spreadsheets with multiword entries as illustrated here so we can use them to test this new software, including applications to the user's data.
