New Inquirer feature: multiword dictionary entries
Students often want to add idioms (like
"freak out") and multiword names (like "Burger King") to the
dictionary without getting involved in writing disambiguation rules.
We find this is especially important for research on political or
organizational documents where multiword noun strings (for example,
"Sierra Club", "Millennium Dome") may be grouped into categories, as
well as for adding the idioms of a particular topic area (for
example, "hard drive") or idiomatic language of a subculture ("Give
me five").
In order to give more attention to the
logic of content-analysis research and the employment of the General
Inquirer in studying text samples, our workshops usually have not
covered how to write disambiguation rules. These rules must be
written exactly, because even a single missing comma can create havoc. When
we once tried to teach writing disambiguation rules in a semester
seminar, we found that most students do not relate well to having to
produce perfectly correct complex syntactic forms, although a few do
thrive on it.
To provide an easier way to add simple
multiword strings, including names, idioms, and verb phrasals, we are
modifying the Inquirer to support multiple-word strings in the
dictionary entry column of the Excel spreadsheet and to assign the
appropriate tags specified in the row for that entry. An initial
version of the new software is now in operation.
Each addition to the Excel "entry"
column should be filed under the first word of the multiword string. The
placement should conform to the Excel sort order. This means that if
a word is not disambiguated, then the multiword string would come
after the word. For example, the word "freak" is not currently
disambiguated. The dictionary entry column should thus
read:
FREAK
FREAK OUT
FREE#1
If a word is not in the dictionary and only
appears in the multiword form, then the occurrence of the word apart
from that multiword form will appear in the leftover list. For
example, the word "burger" is not currently in the dictionary.
Therefore
BUREAUCRATIC
BURGER KING
BURGLAR
would match on "burger king" but not
recognize the word "burger" by itself.
If a multiword entry is being added to a
word that is disambiguated, then it should come before the first
sense number in order to conform to the sort ordering. For example,
there are four senses of "easy", but none of them represents "easy
going". Therefore, the addition should be in this order:
EASTERNER
EASY GOING
EASY#1
EASY#2
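The placement rule can be checked with a short script. The following is only a sketch: it assumes that plain character-code ordering of the upper-cased entries agrees with the Excel sort order for entries like these (a space sorts before "#"), which may not hold for every character. The function name insert_entry is made up for this illustration and is not part of the Inquirer.

```python
import bisect

def insert_entry(entries: list[str], new_entry: str) -> list[str]:
    """Return a copy of an already sorted entry column with new_entry added."""
    # Assumption: character-code order on upper-cased text stands in for
    # the spreadsheet's sort order. Excel's own collation may differ.
    new_entry = new_entry.upper()
    position = bisect.bisect_left(entries, new_entry)
    return entries[:position] + [new_entry] + entries[position:]

# A disambiguated word: the multiword entry lands before the first sense.
print(insert_entry(["EASTERNER", "EASY#1", "EASY#2"], "easy going"))

# A word that is not disambiguated: the multiword entry lands after it.
print(insert_entry(["FREAK", "FREE#1"], "freak out"))
```

Under this assumption the script reproduces both orderings shown in the examples: "EASY GOING" falls before "EASY#1", and "FREAK OUT" falls between "FREAK" and "FREE#1".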
There is no limit to the number of words in
a string. A longer entry like "GO WITH THE FLOW" works just fine.
Note that the first word will be
suffix-chopped, if necessary, so that the entry for "freak out", for
example, will also match "freaked out" and "freaking out". The system
also handles most strong verbs, so that the entry "go with the flow"
not only matches "goes with the flow", but also "went with the flow".
The entry should use the present-tense form of the verb.
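The first-word behavior can be illustrated with a small sketch. The Inquirer's actual suffix-chopping procedure and its list of strong verbs are not described here, so the suffix list, the STRONG_VERBS table, and the function name first_word_matches below are all hypothetical stand-ins chosen to reproduce the examples.

```python
# Hypothetical suffix list and strong-verb table, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")
STRONG_VERBS = {"went": "go", "gone": "go", "gave": "give", "given": "give"}

def first_word_matches(entry_word: str, text_word: str) -> bool:
    """Decide whether a text word counts as the first word of an entry."""
    entry_word, text_word = entry_word.lower(), text_word.lower()
    if text_word == entry_word:
        return True
    # Strong verbs are looked up by their present-tense form.
    if STRONG_VERBS.get(text_word) == entry_word:
        return True
    for suffix in SUFFIXES:
        if text_word.endswith(suffix):
            stem = text_word[:-len(suffix)]
            # The stem may have lost a final "e" ("changing" -> "chang").
            if entry_word in (stem, stem + "e"):
                return True
    return False

print(first_word_matches("freak", "freaking"))
print(first_word_matches("go", "went"))
```

Because the table is keyed on the inflected form and points to the present tense, the entry itself has to be written in the present tense, as noted above.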
At present, the matching of the words in
the entry does not allow for extra intervening words. There is no
"wildcard" option for intervening words as offered in some systems
like ISYS. Thus, "change direction" will match "changing
direction", but not "changing my direction". A wildcard option could
be added if there proves to be a strong demand for it.
At present, the software is implemented to
perform an exact match on words after the first word, with one
relaxation: a word in the text matches if it forms the initial part of
the corresponding word in the entry. The entry word may therefore
carry a suffix and will still match the root form occurring in the
text. Thus, the entry "change directions" will match both "changing
directions" and "changing direction". But the entry "change
direction" will not match "changing directions".
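One reading of these examples is that each text word after the first must be the initial part of the entry word in the same position, with no words skipped. The sketch below encodes that reading only. The function name later_words_match is invented for the illustration, and the actual implementation may differ in details.

```python
def later_words_match(entry_words: list[str], text_words: list[str]) -> bool:
    """Compare the words after the first word, position by position."""
    if len(text_words) < len(entry_words):
        return False
    for entry_word, text_word in zip(entry_words, text_words):
        # The text word must be the initial part of the entry word,
        # so an entry word with a suffix also accepts its root form.
        if not entry_word.lower().startswith(text_word.lower()):
            return False
    return True

# Entry "change directions" against the text "changing direction".
print(later_words_match(["directions"], ["direction"]))

# Entry "change direction" against the text "changing directions".
print(later_words_match(["direction"], ["directions"]))

# No wildcard: an intervening word ("my") breaks the match.
print(later_words_match(["direction"], ["my", "direction"]))
```

On this reading, entering the longer form ("directions") is the safer choice, since it covers both the suffixed and the root form in the text.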
Note too that after a multiword string is
matched, the remaining words in the string are not tagged, but
instead the string is treated as an idiom. No tags will be assigned
to "king", "going", "out" or "with the flow" in the above examples.
Content analysis resumes with the next text word after the occurrence
of the string. Often, a user may want to include a string in order to
have the words in the string not counted. For example, an
annual report on "Burger King" should not score high on royalty.
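The effect of treating a matched string as a single idiom can be sketched as follows. This is a toy illustration, not the Inquirer's code: the miniature dictionary and its tag names are invented, matching is reduced to exact lower-case comparison, and trying the longest candidate string first is an assumption about how competing entries would be resolved.

```python
# Hypothetical miniature dictionary: entry -> tags. The tag names are
# invented for this illustration.
DICTIONARY = {
    "burger king": ["COMPANY"],
    "king": ["ROYALTY"],
    "report": ["COMMUNICATION"],
}

def tag_text(text: str) -> tuple[list[tuple[str, list[str]]], list[str]]:
    """Return (tagged entries, leftover words) for a piece of text."""
    tokens = text.lower().split()
    longest = max(len(entry.split()) for entry in DICTIONARY)
    tagged, leftover = [], []
    i = 0
    while i < len(tokens):
        # Assumption: try the longest candidate string first.
        for size in range(min(longest, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in DICTIONARY:
                tagged.append((candidate, DICTIONARY[candidate]))
                i += size  # analysis resumes after the whole string
                break
        else:
            # No entry starts here: the word goes to the leftover list.
            leftover.append(tokens[i])
            i += 1
    return tagged, leftover

print(tag_text("Burger King report"))
print(tag_text("the king ate a burger"))
```

In the first call "king" receives no royalty tag because it is consumed as part of "burger king". In the second call "king" is tagged on its own, and "burger", which has no entry by itself, ends up on the leftover list.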
For further capabilities in developing
General Inquirer dictionaries, it is necessary to understand the
disambiguation rule writing procedures in detail. Their logic is
detailed in the Kelly/Stone book. Now that the Inquirer is no longer
constrained by memory storage, we intend to develop a less compact
notational scheme that is less prone to error, as well as to offer
several new options based on windows wider than the sentence.
Until we gain more experience with testing
these new multiword options, including the possible necessity for
further programming changes, these capabilities are not included in
the Java software that is distributed. We appreciate receiving Excel
spreadsheets with multiword entries as illustrated here so we can use
them to test this new software, including applications to the user's
data.