US4862408A - Paradigm-based morphological text analysis for natural languages

Publication number
US4862408A
Authority
US
United States
Prior art keywords
paradigm
computer system
word
lemma
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
US07/028,437
Inventor
Antonio Zamora
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US07/028,437 (US4862408A)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, A CORP OF NY. Assignment of assignors interest. Assignors: ZAMORA, ANTONIO
Priority to JP63008602A (JPH0724056B2)
Priority to DE3853894T (DE3853894T2)
Priority to EP88101694A (EP0282721B1)
Application granted
Publication of US4862408A
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99943 Generating database or data structure, e.g. via user interface

Definitions

  • This problem can be corrected by: (1) recognizing the situation, and (2) defining a new paradigm with longer substrings in the paradigm. These are created by adding letters to the desinences from the end of the lemma for which the problem occurs.
  • a paradigm that resolves this problem is illustrated in FIG. 9. This paradigm applies to verbs whose lemma ends in "ed” such as "speed,” “seed,” “need,” etc.
  • a list of lemmas and their corresponding paradigms is a compact way of representing a dictionary. This is particularly advantageous for languages that have verbs with a large number of declensions.
  • This invention represents a dictionary as a set of lemmas and their corresponding paradigms.
  • the information obtained as part of the basic paradigm process contains grammatical information that not only identifies the part of speech, but also gives the grammatical roles of the words. This makes it possible to use the paradigm procedure for grammar checking tasks such as subject/verb agreement, article/noun agreement, and other gender, number, and verb-form agreement tasks.
  • the lemma obtained from the basic paradigm process can be used as an index point for natural language text which is independent of the word forms encountered in the text. Similarly, retrieval becomes easier when only the lemmas need to be searched without needing to cope with the uncertainties in vocabulary in unrestricted natural language text.
  • a data base query may be expanded by generating the plural forms of the query terms (through the paradigm process); this will improve recall when searching data bases.
  • FIG. 10 illustrates three levels of synonym support which are possible using the basic and generative paradigm processes.
  • Level-1 which is the most fundamental level is a traditional look-up based on the lemma. This is how a manual or unassisted synonym support process works. A person has to provide the lemma.
  • Level-2 uses the basic paradigm process to convert text words automatically to the lemma. This makes it possible to reference a synonym dictionary automatically in a text-processing task by simply positioning a cursor on a word and pressing a function key. The synonyms retrieved are those that would be found in a synonym dictionary in lemma form.
  • Level-3 refines the output of the synonym dictionary by generating the forms of the synonym that correspond to the input word. At this stage the user of a text processing system only needs to select a word form to replace it in the text.
  • FIG. 7 presented a method for associating paradigm numbers with the entries of a dictionary in alphabetical sequence.
  • strict alphabetical sequence is not necessary.
  • the program that compares the input word against the dictionary scans only the section of the dictionary which has the same first three letters.
  • the lemmas are associated with their corresponding paradigms or "0" if they have none.
  • the entry for "went” is placed in alphabetical sequence with its associated paradigm number, but the "@” indicates that it is a cross-reference rather than a lemma.
  • the paradigms are given here in their symbolic representation. In the actual dictionary, they would be encoded as binary numbers identifying the paradigm tables. There may be more than one paradigm per entry and in some cases it will be necessary to continue scanning after a match has been found to obtain all relevant matches. For example, the word "types" will be found as the third person form of the verb and as the plural of the noun; two different paradigms have to be scanned.
  • the dictionary can be further compacted by front encoding. This is a technique which specifies a count indicating how many leading characters are the same as the previous entry. Thus, “goad” would be coded as “2ad” since the first two characters are the same as the word “go” which precedes it; “goat” would be coded as "3t” since three leading characters are the same as the preceding word. If there are no characters in common with the preceding word, the count is omitted.
  • the front-encoded dictionary might look like:
  • a paradigm has been described as consisting of a set of desinences (endings) associated with grammatical categories.
  • the paradigm reference number itself provides information about part of speech since it identifies a verb model, noun model, etc. Additional information may be added to the paradigms to speed up access to a compact dictionary based on paradigms.
  • Length screens and content screens are particularly useful.
  • a length screen indicates the minimum and maximum length of the words generated by a paradigm and can be used to avoid scanning the paradigm for possible matches when there can be none. As an example, let us say that we are looking for the word "gobbles" in the dictionary.
  • a content screen gives an indication of the characters contained in the desinences. If the word to be matched contains some characters which are not in the content screen, the match will fail and there is no need to go further. Both length and content screens can be combined for efficient use of the paradigms as a compact dictionary representation.
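The front encoding described earlier in this list can be sketched as follows. This is a minimal illustration and not the encoding actually used by the invention: the function names `front_encode` and `front_decode` are hypothetical, the count is written as a decimal number, and dictionary words are assumed to contain no digits so that the count can be separated from the remaining letters.

```python
import re
from typing import List


def front_encode(words: List[str]) -> List[str]:
    """Front-encode a sorted word list (words are assumed to contain no digits)."""
    encoded = []
    previous = ""
    for word in words:
        # Count the leading characters shared with the preceding entry.
        shared = 0
        for a, b in zip(previous, word):
            if a != b:
                break
            shared += 1
        # The count is omitted when nothing is shared, as in the text.
        encoded.append((str(shared) if shared else "") + word[shared:])
        previous = word
    return encoded


def front_decode(encoded: List[str]) -> List[str]:
    """Rebuild the word list from its front-encoded form."""
    words = []
    previous = ""
    for item in encoded:
        match = re.match(r"(\d*)(.*)", item)
        shared = int(match.group(1)) if match.group(1) else 0
        word = previous[:shared] + match.group(2)
        words.append(word)
        previous = word
    return words


print(front_encode(["go", "goad", "goat"]))  # ['go', '2ad', '3t']
```

Decoding has to proceed sequentially from an entry whose count is omitted, because each entry is expressed relative to the one before it.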

Abstract

A computer method is disclosed for analyzing text by employing a model known as a paradigm, that provides all the inflectional forms of a word. A file structure is created consisting of two components, a list of words (a dictionary), each word of which is associated with a set of paradigm references, and the file of paradigms consisting of grammatical categories paired with their corresponding ending or affix portions (known as the desinence) specifying tense, mood, number, gender or other linguistic attribute. A computer method is disclosed for generating the file structure of the dictionary by generating all forms of the words from a list of standard forms of the words (known as the lemma) which is generally the infinitive of a verb or the singular form of a noun, the lemmas being generated with their corresponding paradigms.

Description

BACKGROUND OF THE INVENTION
1. Technical Field
The invention disclosed broadly relates to data processing and more particularly relates to linguistic applications in data processing.
2. Related Patent Applications
The following related patents and applications are incorporated herein by reference:
U.S. patent application by B. Knystautas, et al. entitled "Linguistic Analysis Method and Apparatus," Ser. No. 853,490, filed Apr. 18, 1986, abandoned, assigned to IBM Corporation.
U.S. Pat. No. 4,328,561 by D. B. Convis, et al. entitled "Alpha Content Match Prescan Method for Automatic Spelling Error Correction," assigned to the IBM Corporation.
U.S. Pat. No. 4,355,371 by D. B. Convis, et al. entitled "Instantaneous Alpha Content Prescan Method for Automatic Spelling Error Correction," assigned to the IBM Corporation.
3. Background Art
Text processing and word processing systems have been developed for both stand-alone applications and distributed processing applications. The terms text processing and word processing will be used interchangeably herein to refer to data processing systems primarily used for the creation, editing, communication, and/or printing of alphanumeric character strings composing written text. A particular distributed processing system for word processing is disclosed in the copending U.S. patent application Ser. No. 781,862 filed Sept. 30, 1985 entitled "Multilingual Processing for Screen Image Build and Command Decode in a Word Processor, with Full Command, Message and Help Support," by K. W. Borgendale, et al., now U.S. Pat. No. 4,731,735, assigned to IBM Corporation. The figures and specification of the Borgendale, et al. patent application are incorporated herein by reference, as an example of a host system within which the subject invention herein can be applied.
The morphological text analysis invention disclosed herein finds application in the parser for natural language text which is described in the copending U.S. patent application Ser. No. 924,670 filed Oct. 29, 1986, by Antonio Zamora, et al., assigned to IBM Corporation. The figures and specification of the Zamora, et al. patent application are incorporated herein by reference as an example of a larger scale language processing application to which the subject invention herein can be applied.
Glossary:
This description uses specialized linguistic terminology; the most common terms are defined here:
DESINENCE--An ending or affix which is added to a word and which specifies tense, mood, number, gender, or other linguistic attribute.
LEMMA--The standard form of a word which is used in a dictionary (generally the infinitive of a verb or the singular form of a noun).
MORPHOLOGY--The study of word formation in a language including inflections, derivations, and formation of compounds.
PARADIGM--A model that provides all the inflectional forms of a word (conjugations or declension).
STEM--A portion of a word that does not change and to which affixes are added to form words. The stem itself is not necessarily a word.
Morphology is an important aspect of the study of languages which has many practical applications in data processing. Of particular relevance to this invention is the study of the alteration of words by the use of regular changes. For many centuries the study of Romance languages (Spanish, French, Italian, etc.) and other Indo-European languages has placed emphasis on the "stem" and "desinence" of verbs. The "stem" is the portion of the verb that remains invariant during conjugation whereas the "desinence" is the portion that changes according to some paradigm or pattern. In English, the number of desinences of a verb is relatively small ("-ed," "-s," and "-ing" for the past tense, the third person, and the present participle, respectively), but in other European languages there may be 50 or more verb forms. Although classification schemes have been devised for natural languages, most are incomplete and not systematic enough for use in computer-based systems for a variety of languages.
A prior art technique associates ending table references with the entries of an existing dictionary and is able to identify the grammatical form of an input text word and its invariant root form. However, the technique does not provide a mechanism for using the ending tables as part of the dictionary structure, and this results in unnecessary redundancy of data which does not yield a very compact representation. Another consequence of the organization of the data is that an additional access is required to find the ending table associated with each dictionary word.
A second prior art technique, U.S. Pat. No. 4,342,085 entitled "Stem Processing for Data Reduction in a Dictionary Storage File," assigned to IBM Corporation, provides a compact dictionary representation by encoding common prefixes and suffixes, but the encoding mechanism is only a compacting scheme and does not provide any linguistic information.
OBJECTS OF THE INVENTION
It is therefore an object of the invention to provide a mechanism for classifying natural language text and for generating word forms that can be used in a variety of text processing applications and which can be applied to many natural languages.
It is another object of the invention to provide an improved morphological analysis system which is more compact than the prior art approaches and which can generate the "lemma" of a word (that is, its base form) as well as all the conjugations or linguistic forms that can be derived from the lemma.
It is a further object of the invention to provide an improved computer method for associating grammatical information with word forms, whether input or generated, so that they can be classified.
It is another object of the invention to encode the ending tables in such a way that they are an integral part of the representation of the dictionary and provide significant compaction and speed over the prior art.
SUMMARY OF THE INVENTION
A computer method is disclosed for analyzing text by employing a model known as a paradigm, that provides all the inflectional forms of a word. A file structure is created consisting of two components, a list of words (a dictionary), each word of which is associated with a set of paradigm references, and the file of paradigms consisting of grammatical categories paired with their corresponding ending or affix portions (known as the desinence) specifying tense, mood, number, gender or other linguistic attribute.
A computer method is disclosed for generating the file structure of the dictionary by generating all forms of the words from a list of standard forms of the words (known as the lemma) which is generally the infinitive of a verb or the singular form of a noun, the lemmas being generated with their corresponding paradigms. The method sorts and organizes the resulting word list into a dictionary.
An input data stream of natural language words can then be processed by generating a lemma for each input word. This is done by matching the input word against the dictionary and using the resulting paradigm references to access a set of paradigms. Then the ending or affix (desinence) of the paradigm is matched against the input word and the corresponding grammatical category for each matched desinence is recorded and the standard form of the word (the lemma) is generated by replacing the matching desinence of the input word with the desinence of the lemma.
The specific grammatical form of an input word can be generated from the standard form of the word (the lemma) and the grammatical category, by matching the lemma against the dictionary and using its paradigm references to access a set of paradigms. Then the desinences of the paradigms are matched against the lemma and the desinence corresponding to the specified grammatical category is selected. The specific grammatical form is generated by replacing the desinence of the lemma with the desinence of the desired grammatical form.
The computer method disclosed can be applied to compact dictionary representation, grammatical analysis, automatic indexing, synonym retrieval, and other computational linguistic applications. It is shown that incorporation of the paradigms as an integral representation of the dictionary results in efficient access to the grammatical information associated with the desinences, since traversal of the paradigms is equivalent to scanning the dictionary.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other objects, features and advantages of the invention will be more fully appreciated with reference to the accompanying figures.
FIG. 1 is a functional block diagram of the basic paradigm process.
FIG. 2 is a representation of an English regular verb paradigm.
FIG. 3 is a representation of an English regular noun paradigm.
FIG. 4 is a representation of an English irregular verb paradigm.
FIG. 5 is a representation of a Spanish regular verb paradigm.
FIG. 6 is a representation of a Spanish regular noun paradigm.
FIG. 7 is a functional block diagram illustrating the association of paradigm numbers with word forms.
FIG. 8 is a representation of a regular verb paradigm.
FIG. 9 is a representation of a paradigm for regular verbs ending in "ed."
FIG. 10 illustrates the levels of synonym support.
DESCRIPTION OF THE BEST MODE FOR CARRYING OUT THE INVENTION
A paradigm is a model that shows all the forms that are possible for a word. This may include all forms that can be generated by suffixation as well as those that can be generated by prefixation. In this document a distinction is made between a "basic paradigm process" that may be used to generate a standard word form (or lemma) from any form of the word, and a "generative paradigm process" that can be used to generate all word forms from the lemma and the paradigm. The basic paradigm process is illustrated in FIG. 1. An input word is processed against a reference paradigm to produce a standard word form (or lemma) and a list of grammatical categories.
FIG. 2 is a paradigm used for many regular English verbs and FIG. 3 is a paradigm for regular English nouns. The grammatical classes are given in the left column, and the morphological characteristics (desinences) are indicated in the right column. In addition, the heading of the paradigm contains an identifying number and an example. The basic paradigm process consists of matching the desinences of the paradigm against an input word (to which the particular paradigm is applicable).
The basic paradigm process is applied as follows: Given an input word such as "books" and a reference paradigm number "N27" (FIG. 3), we match the desinences of the paradigm against the word. In this example, the final "s" matches, indicating that the input word is a plural noun; this information is placed in the list of grammatical classes. The lemma is generated by replacing the "s" with a blank (represented in the paradigm table as an underscore). The resulting lemma is "book" which is the singular form of the noun.
If we apply paradigm number "V45" (FIG. 2) to the same input word, the basic paradigm process indicates that the word is the present tense of a verb applicable to the third person. This is illustrated in the sentence "He books a flight to Boston." The lemma is generated as for the previous example, but it corresponds to the infinitive form of the verb.
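The two examples above can be sketched in code. This is only an illustrative sketch under stated assumptions: the tables below are hypothetical stand-ins for FIGS. 2 and 3 (which are not reproduced here), the first entry of each table is assumed to describe the lemma, and the rule that the longest matching desinence wins is an assumption made so that the blank desinence does not mask the "s" ending.

```python
from typing import Dict, List, Tuple

# Hypothetical paradigm tables: (grammatical category, desinence) pairs.
# Assumption: the first pair of each table describes the lemma.
PARADIGMS: Dict[str, List[Tuple[str, str]]] = {
    "N27": [("noun, singular", ""), ("noun, plural", "s")],
    "V45": [
        ("verb, infinitive", ""),
        ("verb, present, third person", "s"),
        ("verb, past", "ed"),
        ("verb, present participle", "ing"),
    ],
}


def basic_paradigm_process(word: str, paradigm_id: str) -> Tuple[str, List[str]]:
    """Return (lemma, grammatical categories) for a word under one paradigm."""
    entries = PARADIGMS[paradigm_id]
    lemma_desinence = entries[0][1]
    matches = [(cat, des) for cat, des in entries if word.endswith(des)]
    if not matches:
        raise ValueError(f"{word!r} does not match paradigm {paradigm_id}")
    # Assumption: when several desinences match, the longest one is used.
    longest = max(len(des) for _, des in matches)
    categories = [cat for cat, des in matches if len(des) == longest]
    # Replace the matched desinence with the desinence of the lemma.
    stem = word[: len(word) - longest]
    return stem + lemma_desinence, categories


print(basic_paradigm_process("books", "N27"))  # ('book', ['noun, plural'])
print(basic_paradigm_process("books", "V45"))  # ('book', ['verb, present, third person'])
```

The same input word yields the same lemma string under both paradigms, but the recorded grammatical category differs, which is the distinction drawn in the text.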
FIG. 4 illustrates a paradigm for an English irregular verb form. FIGS. 5 and 6 are paradigms for examples of Spanish regular verbs and Spanish regular nouns. Although the morphology for Spanish is more complicated, the same basic procedure described above is used.
The replacement mechanism used to generate the lemma is very general. It is applicable even for cases where complete word replacement is required because no morphological features are shared (as in the forms of the verb "be": be, am, is, etc.).
ASSOCIATION OF PARADIGM REFERENCES WITH WORDS
The preceding section illustrated the basic paradigm process to produce the lemma. This section presents the methodology of the generative paradigm process which is used to produce all word forms from a lemma and a paradigm; this is the basis for building a dictionary of word forms and their corresponding paradigm references.
FIG. 7 illustrates how a file of lemmas and their corresponding paradigm numbers (references) can be processed against a file containing the paradigms to produce a file of words and paradigm numbers. After sorting and removing duplicates, this file can be used as a reference dictionary against which vocabulary from any text can be matched to retrieve the paradigm numbers. It is then possible to obtain the lemma for each text word by application of the basic paradigm procedure.
The details of the generative procedures are the same as for the basic procedure, but rather than having to scan all the morphological entries in the paradigm for a match, only the morphology of the lemma is examined to identify the linguistic stem to which the rest of the morphological features apply to generate the word forms. As an example, the lemma "grind" and the paradigm number "V4c" (illustrated in FIG. 4) result in the linguistic stem "gr" from which the unique entries "grind," "grinding," "ground," and "grinds" can be generated. Processing of the lemma "ground" and the paradigm "N27" (FIG. 3) results in the entries "ground" and "grounds."
The resulting dictionary with paradigm numbers would have the following entries, where the asterisk (*) marks the lemma. Although it carries this additional information, the dictionary can also be used for other word processing functions such as spelling error detection and correction.
grind V4c*
grinding V4c
grinds V4c
ground V4c, N27*
grounds N27
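The generative process and the dictionary-building step of FIG. 7 can be sketched as follows. The desinences listed for "V4c" and "N27" are assumptions chosen only to reproduce the word forms named in the text (FIG. 3 and FIG. 4 are not reproduced here), and the function names generate and build_dictionary are hypothetical.

```python
from collections import defaultdict

# Hypothetical paradigm contents, chosen to reproduce the forms named in the
# text; the actual tables of FIG. 3 and FIG. 4 are not available here.
PARADIGMS = {
    "V4c": {"lemma": "ind", "desinences": ["ind", "inds", "inding", "ound", "ound"]},
    "N27": {"lemma": "", "desinences": ["", "s"]},
}


def generate(lemma, paradigm_id):
    """Generate the unique word forms of a lemma under a paradigm."""
    paradigm = PARADIGMS[paradigm_id]
    if not lemma.endswith(paradigm["lemma"]):
        raise ValueError("lemma does not fit paradigm " + paradigm_id)
    # Only the lemma's own desinence is examined to isolate the stem.
    stem = lemma[: len(lemma) - len(paradigm["lemma"])]
    return sorted({stem + d for d in paradigm["desinences"]})


def build_dictionary(lemma_file):
    """Map each word form to its (paradigm reference, is_lemma) pairs."""
    entries = defaultdict(set)
    for lemma, paradigm_id in lemma_file:
        for form in generate(lemma, paradigm_id):
            entries[form].add((paradigm_id, form == lemma))
    # Sorting and the use of sets remove duplicates, as in FIG. 7.
    return {word: sorted(refs) for word, refs in sorted(entries.items())}


dictionary = build_dictionary([("grind", "V4c"), ("ground", "N27")])
for word, refs in dictionary.items():
    print(word, ", ".join(p + ("*" if is_lemma else "") for p, is_lemma in refs))
```

The printed entries correspond to the five dictionary lines listed above, except that the references for "ground" appear in sorted order.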
DEFINITION OF PARADIGMS TO AVOID DECODING AMBIGUITIES
It does not follow that because all the word forms are generated properly from a lemma and a paradigm, the lemma can be generated from any word form. It is important to define judiciously the paradigms that apply to specific lemmas to obtain this symmetry of function.
FIG. 8 illustrates a paradigm applicable to many regular English verbs such as "talk," "paint," and "remind" which end in consonants. On the left side of the paradigm is the definition of the endings (underscore indicates blank), and on the right side the corresponding grammatical categories are listed (where the numbers 1 through 3 apply to the singular person of the present tense and 4 through 6 to the corresponding plural person).
Applying this paradigm to lemmas whose endings coincide with one of the desinences of the paradigm can cause problems during the decoding stage. Suppose we apply this paradigm to a word like "speed." There are no problems during generation of the words from the lemma, but during decoding both "speeded" and "speed" match the desinence "ed," giving the correct lemma "speed" in the first case and an incorrect stem "spe" in the second.
This problem can be corrected by: (1) recognizing the situation, and (2) defining a new paradigm with longer substrings in the paradigm. These are created by adding letters to the desinences from the end of the lemma for which the problem occurs. A paradigm that resolves this problem is illustrated in FIG. 9. This paradigm applies to verbs whose lemma ends in "ed" such as "speed," "seed," "need," etc.
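The asymmetry between generation and decoding can be checked mechanically by generating every form of a lemma and decoding it again. The sketch below assumes longest-match decoding and uses hypothetical stand-ins for the paradigms of FIG. 8 and FIG. 9 (named REGULAR and ED_FINAL here), since the figures themselves are not reproduced in this text.

```python
# Hypothetical stand-ins for FIG. 8 (regular verbs ending in a consonant) and
# FIG. 9 (verbs whose lemma ends in "ed"); the real tables are not shown here.
REGULAR = {"lemma": "", "desinences": ["", "s", "ed", "ing"]}
ED_FINAL = {"lemma": "ed", "desinences": ["ed", "eds", "eded", "eding"]}


def generate(lemma, paradigm):
    """Generate all word forms of a lemma."""
    stem = lemma[: len(lemma) - len(paradigm["lemma"])]
    return [stem + d for d in paradigm["desinences"]]


def decode(word, paradigm):
    """Recover the lemma by replacing the longest matching desinence."""
    matches = [d for d in paradigm["desinences"] if word.endswith(d)]
    if not matches:
        return None
    best = max(matches, key=len)
    return word[: len(word) - len(best)] + paradigm["lemma"]


def undecodable_forms(lemma, paradigm):
    """List generated forms that do not decode back to the lemma."""
    return [w for w in generate(lemma, paradigm) if decode(w, paradigm) != lemma]


print(undecodable_forms("talk", REGULAR))    # []
print(undecodable_forms("speed", REGULAR))   # ['speed']
print(undecodable_forms("speed", ED_FINAL))  # []
```

A check of this kind could be run over a lemma file to recognize the lemmas for which a paradigm with longer desinences needs to be defined.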
APPLICATIONS OF PARADIGM-BASED TEXT ANALYSIS
Compact Dictionary Representation.
A list of lemmas and their corresponding paradigms is a compact way of representing a dictionary. This is particularly advantageous for languages that have verbs with a large number of inflected forms. This invention represents a dictionary as a set of lemmas and their corresponding paradigms.
Grammatical Analysis.
The information obtained as part of the basic paradigm process contains grammatical information that not only identifies the part of speech, but also gives the grammatical roles of the words. This makes it possible to use the paradigm procedure for grammar checking tasks such as subject/verb agreement, article/noun agreement, and other gender, number, and verb-form agreement tasks.
Automatic Indexing.
The lemma obtained from the basic paradigm process can be used as an index point for natural language text which is independent of the word forms encountered in the text. Similarly, retrieval becomes easier when only the lemmas need to be searched without needing to cope with the uncertainties in vocabulary in unrestricted natural language text.
As an additional option, a data base query may be expanded by generating the plural forms of the query terms (through the paradigm process); this will improve recall when searching data bases.
Synonym Retrieval.
FIG. 10 illustrates three levels of synonym support which are possible using the basic and generative paradigm processes. Level-1, which is the most fundamental level, is a traditional look-up based on the lemma. This is how a manual or unassisted synonym support process works. A person has to provide the lemma.
Level-2 uses the basic paradigm process to convert text words automatically to the lemma. This makes it possible to reference a synonym dictionary automatically in a text-processing task by simply positioning a cursor on a word and pressing a function key. The synonyms retrieved are those that would be found in a synonym dictionary in lemma form.
Level-3 refines the output of the synonym dictionary by generating the forms of the synonym that correspond to the input word. At this stage the user of a text processing system only needs to select a word form to replace it in the text.
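The three levels can be illustrated with a small sketch that combines the basic and generative processes. Everything in it is an assumption made for illustration: the single regular verb paradigm "V_REG", the tiny synonym dictionary, and the function names to_lemma, inflect, and synonyms are hypothetical and are not taken from FIG. 10.

```python
# Hypothetical data: one regular verb paradigm and a tiny synonym dictionary.
PARADIGMS = {
    "V_REG": {
        "lemma": "",
        "forms": {
            "infinitive": "",
            "present, third person singular": "s",
            "past": "ed",
            "present participle": "ing",
        },
    },
}
LEMMA_PARADIGM = {"walk": "V_REG", "stroll": "V_REG", "saunter": "V_REG"}
SYNONYMS = {"walk": ["stroll", "saunter"]}  # Level-1: look-up keyed by lemma


def to_lemma(word):
    """Level-2: convert a text word to (lemma, category), or None."""
    for paradigm_id, paradigm in PARADIGMS.items():
        # Try longer desinences first so the blank desinence is a last resort.
        ordered = sorted(paradigm["forms"].items(), key=lambda kv: -len(kv[1]))
        for category, desinence in ordered:
            if word.endswith(desinence):
                lemma = word[: len(word) - len(desinence)] + paradigm["lemma"]
                if LEMMA_PARADIGM.get(lemma) == paradigm_id:
                    return lemma, category
    return None


def inflect(lemma, category):
    """Generate the form of a lemma for one grammatical category."""
    paradigm = PARADIGMS[LEMMA_PARADIGM[lemma]]
    stem = lemma[: len(lemma) - len(paradigm["lemma"])]
    return stem + paradigm["forms"][category]


def synonyms(word):
    """Level-3: return synonyms in the same grammatical form as the word."""
    analysis = to_lemma(word)
    if analysis is None:
        return []
    lemma, category = analysis
    return [inflect(s, category) for s in SYNONYMS.get(lemma, [])]


print(synonyms("walked"))   # ['strolled', 'sauntered']
print(synonyms("walking"))  # ['strolling', 'sauntering']
```

In a fuller system each synonym would carry its own paradigm reference, so that irregular synonyms are inflected with their own tables rather than with the paradigm of the input word.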
THE PARADIGM TABLES AS A COMPACT DICTIONARY
FIG. 7 presented a method for associating paradigm numbers with the entries of a dictionary in alphabetical sequence. However, there are many applications for which strict alphabetical sequence is not necessary. In spelling verification, for example, it is sufficient if a dictionary is ordered in roughly alphabetic sequence, that is, so that the first three letters (or any arbitrary number of letters) of the words are in alphabetic sequence. The program that compares the input word against the dictionary scans only the section of the dictionary which has the same first three letters. By increasing the number of initial letters which are in alphabetic sequence the portion of the dictionary which has to be scanned can be reduced so that the time for scanning meets any desired requirements.
It is evident that all the words of a dictionary can be represented by a list of lemmas and their associated paradigms. By adding the words which have no paradigms (e.g., the prepositions "at," "for," and other function words) the word list can be made complete. If we order the lemmas in alphabetical order, the "roughly alphabetic" sequence criterion is not met in many cases. For example, the word "go" is associated with "goes," "going," "gone" and "went." Clearly, "went" is alphabetically very far from "go" and it is necessary to create a cross-reference entry to maintain an alphabetical sequence with the desired degree of order. Thus, a compact dictionary where the first two letters of all the words represented are in alphabetic sequence, might look like:
______________________________________
Word     Paradigm  Words Represented
______________________________________
at       0         at
go       V4        go, went, gone, goes, going
goad     V56       goad, goaded, goads, goading
goat     N27       goat, goats
gobble   V71       gobble, gobbled, gobbles, gobbling
went     @V4       went (cross-reference entry for paradigm V4)
______________________________________
In this example, the lemmas are associated with their corresponding paradigms or "0" if they have none. The entry for "went" is placed in alphabetical sequence with its associated paradigm number, but the "@" indicates that it is a cross-reference rather than a lemma. The paradigms are given here in their symbolic representation. In the actual dictionary, they would be encoded as binary numbers identifying the paradigm tables. There may be more than one paradigm per entry and in some cases it will be necessary to continue scanning after a match has been found to obtain all relevant matches. For example, the word "types" will be found as the third person form of the verb and as the plural of the noun; two different paradigms have to be scanned.
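Matching an input word against such a compact dictionary might be sketched as follows. The desinences of the paradigms, the added entry for "type", the treatment of the cross-reference (the lemma is taken to be the lemma desinence of a whole-word replacement paradigm), and the function names are assumptions made for this example; the actual dictionary would hold binary paradigm numbers rather than strings.

```python
# Hypothetical paradigm contents; the patent's tables are not reproduced here.
PARADIGMS = {
    "V4": {"lemma": "go", "desinences": ["go", "goes", "going", "gone", "went"]},
    "V56": {"lemma": "", "desinences": ["", "s", "ed", "ing"]},
    "N27": {"lemma": "", "desinences": ["", "s"]},
    "V71": {"lemma": "e", "desinences": ["e", "es", "ed", "ing"]},
}

# Compact dictionary: (word, paradigm references). An empty list plays the
# role of "0"; "@" marks a cross-reference entry. "type" is an added example.
DICTIONARY = [
    ("at", []),
    ("go", ["V4"]),
    ("goad", ["V56"]),
    ("goat", ["N27"]),
    ("gobble", ["V71"]),
    ("type", ["V71", "N27"]),
    ("went", ["@V4"]),
]


def forms(lemma, paradigm):
    """All word forms represented by a lemma and one paradigm."""
    stem = lemma[: len(lemma) - len(paradigm["lemma"])]
    return {stem + d for d in paradigm["desinences"]}


def lookup(word, prefix_len=2):
    """Return every (lemma, paradigm) pair that represents the word."""
    hits = []
    for entry, refs in DICTIONARY:
        # Scan only the section that shares the leading letters.
        if entry[:prefix_len] != word[:prefix_len]:
            continue
        if not refs and entry == word:
            hits.append((entry, "0"))
        for ref in refs:
            if ref.startswith("@"):
                # Assumption: a cross-referenced form belongs to a whole-word
                # replacement paradigm, so its lemma desinence is the lemma.
                if entry == word:
                    hits.append((PARADIGMS[ref[1:]]["lemma"], ref[1:]))
            elif word in forms(entry, PARADIGMS[ref]):
                hits.append((entry, ref))  # keep scanning for more matches
    return hits


print(lookup("went"))   # [('go', 'V4')]
print(lookup("types"))  # [('type', 'V71'), ('type', 'N27')]
```

The second call shows why scanning continues after the first match: "types" is found under both a verb paradigm and a noun paradigm.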
The dictionary can be further compacted by front encoding. This is a technique which specifies a count indicating how many leading characters are the same as the previous entry. Thus, "goad" would be coded as "2ad" since the first two characters are the same as the word "go" which precedes it; "goat" would be coded as "3t" since three leading characters are the same as the preceding word. If there are no characters in common with the preceding word, the count is omitted. The front-encoded dictionary might look like:
at 0
go V4
2ad V56
3t N27
2bble V71
went @V4
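Front encoding and its inverse can be sketched as below. The function names are hypothetical, and the sketch assumes that dictionary words contain no digits, so that a leading run of digits can always be read as the count.

```python
def front_encode(words):
    """Replace the prefix shared with the previous word by its length."""
    encoded, previous = [], ""
    for word in words:
        shared = 0
        for a, b in zip(previous, word):
            if a != b:
                break
            shared += 1
        # The count is omitted when nothing is shared with the previous word.
        encoded.append((str(shared) if shared else "") + word[shared:])
        previous = word
    return encoded


def front_decode(encoded):
    """Rebuild the original words from a front-encoded list."""
    words, previous = [], ""
    for item in encoded:
        digits = len(item) - len(item.lstrip("0123456789"))
        shared = int(item[:digits]) if digits else 0
        word = previous[:shared] + item[digits:]
        words.append(word)
        previous = word
    return words


words = ["at", "go", "goad", "goat", "gobble", "went"]
print(front_encode(words))  # ['at', 'go', '2ad', '3t', '2bble', 'went']
```

Decoding must proceed sequentially from the start of a section, because each entry depends on the one before it.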
Features to Speed up Matching
Thus far, a paradigm has been described as consisting of a set of desinences (endings) associated with grammatical categories. The paradigm reference number itself provides information about part of speech since it identifies a verb model, noun model, etc. Additional information may be added to the paradigms to speed up access to a compact dictionary based on paradigms. Length screens and content screens are particularly useful. A length screen indicates the minimum and maximum length of the words generated by a paradigm and can be used to avoid scanning the paradigm for possible matches when there can be none. As an example, let us say that we are looking for the word "gobbles" in the dictionary. Since "gobbles" starts with the substring "go," it would be reasonable to check paradigm V4 associated with "go" to see if it generates the word "gobbles." A length screen would prevent this futile exploration. All that is needed is to indicate in each paradigm the minimum and maximum length of the words generated by the paradigm and the length of the longest desinence. Since the longest word generated by paradigm V4 contains five characters and since "gobbles" is longer, we can deduce that the word cannot possibly match and there is no need to examine the paradigm further.
Similarly, a content screen gives an indication of the characters contained in the desinences. If the word to be matched contains some characters which are not in the content screen, the match will fail and there is no need to go further. Both length and content screens can be combined for efficient use of the paradigms as a compact dictionary representation.
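A length screen and a content screen could be combined as in the following sketch. The desinences of "V4" and "V56" are assumptions, as are the function names; the choice to measure lengths and characters on the part of the word that follows the stem of the dictionary entry is also an assumption, made so that the same screens work for paradigms shared by lemmas of different lengths.

```python
def make_paradigm(lemma_desinence, desinences):
    """Build a paradigm together with its precomputed screens."""
    lengths = [len(d) for d in desinences]
    return {
        "lemma": lemma_desinence,
        "desinences": desinences,
        "min_len": min(lengths),               # length screen
        "max_len": max(lengths),
        "chars": set("".join(desinences)),     # content screen
    }


# Hypothetical paradigm contents; the patent's tables are not reproduced here.
V4 = make_paradigm("go", ["go", "goes", "going", "gone", "went"])
V56 = make_paradigm("", ["", "s", "ed", "ing"])


def passes_screens(word, lemma, paradigm):
    """Return False when the paradigm cannot possibly generate the word."""
    stem_len = len(lemma) - len(paradigm["lemma"])
    tail_len = len(word) - stem_len
    if not paradigm["min_len"] <= tail_len <= paradigm["max_len"]:
        return False  # rejected by the length screen
    # Every character after the stem must occur in some desinence.
    return set(word[stem_len:]) <= paradigm["chars"]


print(passes_screens("gobbles", "go", V4))  # False: longer than five characters
print(passes_screens("goats", "go", V4))    # False: "a" is in no desinence
print(passes_screens("going", "go", V4))    # True: a full comparison is needed
```

Passing both screens does not prove a match; it only means the word must still be compared against the forms the paradigm generates.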
Although a specific embodiment of the invention has been disclosed, it will be understood by those having skill in the art that minor changes can be made to the disclosed embodiment without departing from the spirit and the scope of the invention.

Claims (5)

What is claimed is:
1. In a computer system, a method for generating a file structure for enabling paradigm-based morphological text analysis upon words in an input word stream of natural language text, the computer system including a memory, comprising the steps of:
compiling in said computer system from a starting list of lemmas associated with their corresponding paradigm references, a list of all the word forms generated by each lemma and its paradigm and associating each word form with its paradigm reference;
ordering in said computer system said compiled list of word forms and paradigm references in any desired collating sequence;
consolidating in said computer system duplicate word entries generated from multiple lemma and paradigm combinations and compiling a file structure of unique word forms, each of which is associated with a list of paradigm references, and a list of paradigm tables which may be accessed by these references.
2. A computer method for matching in a computer system an input word against a dictionary structured as a list of lemmas and their associated paradigms, comprising the steps of:
selecting in said computer system a set of entries in said dictionary to scan based on the collating sequence of the input word;
testing in said computer system the input word against paradigm screens to avoid useless comparisons if a dictionary entry has an associated paradigm;
comparing in said computer system the input word against the words generated by application of the paradigm to the dictionary entry and retrieving associated grammatical information when there is a match.
3. A computer method for matching in a computer system an input word against a dictionary structured as a list of lemmas and their associated paradigms, comprising the steps of:
selecting in said computer system a set of lemmas to scan;
testing in said computer system the input word against its associated paradigm;
comparing in said computer system the input word against the words generated by application of the paradigm to the lemma and retrieving associated grammatical information when there is a match.
4. A computer method for generating in a computer system the lemma of a word from a file structure consisting of two components, the first component being a dictionary list of words, each word of which is associated with a set of paradigm references and the second component being a file of paradigms consisting of grammatical categories paired with their corresponding desinences, comprising:
matching in said computer system the input word against said dictionary;
accessing in said computer system from said file of paradigms a set of paradigms using the paradigm references from the matched word found in said dictionary;
matching in said computer system the desinences of the paradigm corresponding to the matched word, against the input word;
recording in said computer system the grammatical categories corresponding to each matching desinence and generating the lemma by replacing the matching desinence of the input word with the desinence of the lemma.
5. A method for generating in a computer system a specific grammatical form of a word from the lemma and the grammatical category, using a file structure consisting of two components, the first component being a dictionary list of words, each word of which is associated with a set of paradigm references and the second component being a file of paradigms consisting of grammatical categories paired with their corresponding desinences, comprising:
matching in said computer system the lemma against said dictionary;
accessing in said computer system a set of paradigms from said file of paradigms corresponding to the paradigm reference of the matched lemma;
matching in said computer system the desinences of the paradigm against the lemma;
selecting in said computer system the desinence corresponding to the specified grammatical category and generating the specific grammatical form by replacing the desinence of the lemma with the desinence of the desired grammatical form.
US07/028,437 1987-03-20 1987-03-20 Paradigm-based morphological text analysis for natural languages Expired - Fee Related US4862408A (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US07/028,437 US4862408A (en) 1987-03-20 1987-03-20 Paradigm-based morphological text analysis for natural languages
JP63008602A JPH0724056B2 (en) 1987-03-20 1988-01-20 Computer-based morphological text analysis method
DE3853894T DE3853894T2 (en) 1987-03-20 1988-02-05 Paradigm-based morphological text analysis for natural languages.
EP88101694A EP0282721B1 (en) 1987-03-20 1988-02-05 Paradigm-based morphological text analysis for natural languages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US07/028,437 US4862408A (en) 1987-03-20 1987-03-20 Paradigm-based morphological text analysis for natural languages

Publications (1)

Publication Number Publication Date
US4862408A true US4862408A (en) 1989-08-29

Family

ID=21843442

Family Applications (1)

Application Number Title Priority Date Filing Date
US07/028,437 Expired - Fee Related US4862408A (en) 1987-03-20 1987-03-20 Paradigm-based morphological text analysis for natural languages

Country Status (4)

Country Link
US (1) US4862408A (en)
EP (1) EP0282721B1 (en)
JP (1) JPH0724056B2 (en)
DE (1) DE3853894T2 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5551026A (en) * 1987-05-26 1996-08-27 Xerox Corporation Stored mapping data with information for skipping branches while keeping count of suffix endings
US5754847A (en) * 1987-05-26 1998-05-19 Xerox Corporation Word/number and number/word mapping
US5551049A (en) * 1987-05-26 1996-08-27 Xerox Corporation Thesaurus with compactly stored word groups
US5488719A (en) * 1991-12-30 1996-01-30 Xerox Corporation System for categorizing character strings using acceptability and category information contained in ending substrings
DE4209280C2 (en) * 1992-03-21 1995-12-07 Ibm Process and computer system for automated analysis of texts
US5412567A (en) * 1992-12-31 1995-05-02 Xerox Corporation Augmenting a lexical transducer by analogy
ATE203604T1 (en) * 1993-02-23 2001-08-15 Xerox Corp CATEGORIZING STRINGS IN CHARACTER RECOGNITION.
EP0856851B1 (en) * 1997-01-30 2004-03-24 Motorola, Inc. Circuit and method of latching a bit line in a non-volatile memory
GB0006721D0 (en) * 2000-03-20 2000-05-10 Mitchell Thomas A Assessment methods and systems
AUPQ811500A0 (en) * 2000-06-09 2000-07-06 Educational And Computing Software Pty Ltd System for program source code conversion
DE202018107421U1 (en) 2018-12-21 2019-06-18 Elbmind Gmbh System for automated communication with allowable blurring for human and machine source and target systems

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4706212A (en) * 1971-08-31 1987-11-10 Toma Peter P Method using a programmed digital computer system for translation between natural languages
FR2569883B1 (en) * 1984-09-05 1989-11-17 Sharp Kk ELECTRONIC DICTIONARY FOR USE IN A FRENCH TEXT PROCESSING SYSTEM
JPS6165361A (en) * 1984-09-05 1986-04-03 Sharp Corp Electronic french word dictionary
JPS62251876A (en) * 1986-04-18 1987-11-02 インタ−ナショナル ビジネス マシ−ンズ コ−ポレ−ション Language processing system
US4887212A (en) * 1986-10-29 1989-12-12 International Business Machines Corporation Parser for natural language text

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4420816A (en) * 1978-10-31 1983-12-13 Sharp Kabushiki Kaisha Electronic word retrieval device for searching and displaying one of different forms of a word entered
US4339806A (en) * 1978-11-20 1982-07-13 Kunio Yoshida Electronic dictionary and language interpreter with faculties of examining a full-length word based on a partial word entered and of displaying the total word and a translation corresponding thereto
US4328562A (en) * 1979-01-20 1982-05-04 Sharp Kabushiki Kaisha Means for displaying at least two different words and for providing independent pronunciations of these words
US4420817A (en) * 1979-05-25 1983-12-13 Sharp Kabushiki Kaisha Word endings inflection means for use with electronic translation device
US4594686A (en) * 1979-08-30 1986-06-10 Sharp Kabushiki Kaisha Language interpreter for inflecting words from their uninflected forms
US4541069A (en) * 1979-09-13 1985-09-10 Sharp Kabushiki Kaisha Storing address codes with words for alphabetical accessing in an electronic translator
US4439836A (en) * 1979-10-24 1984-03-27 Sharp Kabushiki Kaisha Electronic translator
US4641264A (en) * 1981-09-04 1987-02-03 Hitachi, Ltd. Method for automatic translation between natural languages
US4495566A (en) * 1981-09-30 1985-01-22 System Development Corporation Method and means using digital data processing means for locating representations in a stored textual data base
US4499553A (en) * 1981-09-30 1985-02-12 Dickinson Robert V Locating digital coded words which are both acceptable misspellings and acceptable inflections of digital coded query words

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Patent Applic. Ser. No. 853,490, filed 4/18/86, pp. 1-37, "Linguistic Analysis Method and Apparatus", B. W. Knystautas et al. *

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5490061A (en) * 1987-02-05 1996-02-06 Toltran, Ltd. Improved translation system utilizing a morphological stripping process to reduce words to their root configuration to produce reduction of database size
US5068789A (en) * 1988-09-15 1991-11-26 Oce-Nederland B.V. Method and means for grammatically processing a natural language sentence
US5099426A (en) * 1989-01-19 1992-03-24 International Business Machines Corporation Method for use of morphological information to cross reference keywords used for information retrieval
US5161105A (en) * 1989-06-30 1992-11-03 Sharp Corporation Machine translation apparatus having a process function for proper nouns with acronyms
US5146406A (en) * 1989-08-16 1992-09-08 International Business Machines Corporation Computer method for identifying predicate-argument structures in natural language text
US5151857A (en) * 1989-12-18 1992-09-29 Fujitsu Limited Dictionary linked text base apparatus
US5267165A (en) * 1990-03-20 1993-11-30 U.S. Philips Corporation Data processing device and method for selecting data words contained in a dictionary
US5241674A (en) * 1990-03-22 1993-08-31 Kabushiki Kaisha Toshiba Electronic dictionary system with automatic extraction and recognition of letter pattern series to speed up the dictionary lookup operation
US5229936A (en) * 1991-01-04 1993-07-20 Franklin Electronic Publishers, Incorporated Device and method for the storage and retrieval of inflection information for electronic reference products
US5369577A (en) * 1991-02-01 1994-11-29 Wang Laboratories, Inc. Text searching system
US5708829A (en) * 1991-02-01 1998-01-13 Wang Laboratories, Inc. Text indexing system
US5475587A (en) * 1991-06-28 1995-12-12 Digital Equipment Corporation Method and apparatus for efficient morphological text analysis using a high-level language for compact specification of inflectional paradigms
US5559693A (en) * 1991-06-28 1996-09-24 Digital Equipment Corporation Method and apparatus for efficient morphological text analysis using a high-level language for compact specification of inflectional paradigms
US5369576A (en) * 1991-07-23 1994-11-29 Oce-Nederland, B.V. Method of inflecting words and a data processing unit for performing such method
US5265065A (en) * 1991-10-08 1993-11-23 West Publishing Company Method and apparatus for information retrieval from a database by replacing domain specific stemmed phases in a natural language to create a search query
US5418948A (en) * 1991-10-08 1995-05-23 West Publishing Company Concept matching of natural language queries with a database of document concepts
US5787386A (en) * 1992-02-11 1998-07-28 Xerox Corporation Compact encoding of multi-lingual translation dictionaries
US5634086A (en) * 1993-03-12 1997-05-27 Sri International Method and apparatus for voice-interactive language instruction
US5546573A (en) * 1993-10-27 1996-08-13 International Business Machines Corporation Specification of cultural bias in database manager
US5692176A (en) * 1993-11-22 1997-11-25 Reed Elsevier Inc. Associative text search and retrieval system
US5771378A (en) * 1993-11-22 1998-06-23 Reed Elsevier, Inc. Associative text search and retrieval system having a table indicating word position in phrases
US5761497A (en) * 1993-11-22 1998-06-02 Reed Elsevier, Inc. Associative text search and retrieval system that calculates ranking scores and window scores
US5508718A (en) * 1994-04-25 1996-04-16 Canon Information Systems, Inc. Objective-based color selection system
US5615320A (en) * 1994-04-25 1997-03-25 Canon Information Systems, Inc. Computer-aided color selection and colorizing system using objective-based coloring criteria
US5521816A (en) * 1994-06-01 1996-05-28 Mitsubishi Electric Research Laboratories, Inc. Word inflection correction system
US5890103A (en) * 1995-07-19 1999-03-30 Lernout & Hauspie Speech Products N.V. Method and apparatus for improved tokenization of natural language text
US5794177A (en) * 1995-07-19 1998-08-11 Inso Corporation Method and apparatus for morphological analysis and generation of natural language text
US5680628A (en) * 1995-07-19 1997-10-21 Inso Corporation Method and apparatus for automated search and retrieval process
US5752025A (en) * 1996-07-12 1998-05-12 Microsoft Corporation Method, computer program product, and system for creating and displaying a categorization table
US5966710A (en) * 1996-08-09 1999-10-12 Digital Equipment Corporation Method for searching an index
US6185576B1 (en) 1996-09-23 2001-02-06 Mcintosh Lowrie Defining a uniform subject classification system incorporating document management/records retention functions
US6055498A (en) * 1996-10-02 2000-04-25 Sri International Method and apparatus for automatic text-independent grading of pronunciation for language instruction
US7725422B2 (en) * 1998-03-16 2010-05-25 S.L.I. Systems, Inc. Search engine
US20030088554A1 (en) * 1998-03-16 2003-05-08 S.L.I. Systems, Inc. Search engine
US6192333B1 (en) * 1998-05-12 2001-02-20 Microsoft Corporation System for creating a dictionary
US7072827B1 (en) * 2000-06-29 2006-07-04 International Business Machines Corporation Morphological disambiguation
US20040243396A1 (en) * 2002-12-30 2004-12-02 International Business Machines Corporation User-oriented electronic dictionary, electronic dictionary system and method for creating same
US9652529B1 (en) * 2004-09-30 2017-05-16 Google Inc. Methods and systems for augmenting a token lexicon
US8051096B1 (en) * 2004-09-30 2011-11-01 Google Inc. Methods and systems for augmenting a token lexicon
US20090150140A1 (en) * 2007-12-06 2009-06-11 International Business Machines Corporation Efficient stemming of semitic languages
US8438010B2 (en) * 2007-12-06 2013-05-07 International Business Machines Corporation Efficient stemming of semitic languages
US8311808B2 (en) * 2009-12-15 2012-11-13 Thinkmap, Inc. System and method for advancement of vocabulary skills and for identifying subject matter of a document
US20110144978A1 (en) * 2009-12-15 2011-06-16 Marc Tinkler System and method for advancement of vocabulary skills and for identifying subject matter of a document
US9384678B2 (en) 2010-04-14 2016-07-05 Thinkmap, Inc. System and method for generating questions and multiple choice answers to adaptively aid in word comprehension
US9384265B2 (en) 2011-03-30 2016-07-05 Thinkmap, Inc. System and method for enhanced lookup in an online dictionary
US9235566B2 (en) 2011-03-30 2016-01-12 Thinkmap, Inc. System and method for enhanced lookup in an online dictionary
US20130149681A1 (en) * 2011-12-12 2013-06-13 Marc Tinkler System and method for automatically generating document specific vocabulary questions
US9442916B2 (en) 2012-05-14 2016-09-13 International Business Machines Corporation Management of language usage to facilitate effective communication
US9460082B2 (en) 2012-05-14 2016-10-04 International Business Machines Corporation Management of language usage to facilitate effective communication
US9372924B2 (en) * 2012-06-12 2016-06-21 International Business Machines Corporation Ontology driven dictionary generation and ambiguity resolution for natural language processing
US20130332145A1 (en) * 2012-06-12 2013-12-12 International Business Machines Corporation Ontology driven dictionary generation and ambiguity resolution for natural language processing
US9922024B2 (en) 2012-06-12 2018-03-20 International Business Machines Corporation Ontology driven dictionary generation and ambiguity resolution for natural language processing
US10268673B2 (en) 2012-06-12 2019-04-23 International Business Machines Corporation Ontology driven dictionary generation and ambiguity resolution for natural language processing
US9336195B2 (en) * 2013-08-27 2016-05-10 Nuance Communications, Inc. Method and system for dictionary noise removal
US20150066485A1 (en) * 2013-08-27 2015-03-05 Nuance Communications, Inc. Method and System for Dictionary Noise Removal
CN106782516A (en) * 2016-11-17 2017-05-31 北京云知声信息技术有限公司 Language material sorting technique and device
CN106782516B (en) * 2016-11-17 2020-02-07 北京云知声信息技术有限公司 Corpus classification method and apparatus

Also Published As

Publication number Publication date
JPS63231674A (en) 1988-09-27
EP0282721A2 (en) 1988-09-21
DE3853894D1 (en) 1995-07-06
EP0282721B1 (en) 1995-05-31
EP0282721A3 (en) 1990-06-27
DE3853894T2 (en) 1995-12-14
JPH0724056B2 (en) 1995-03-15

Similar Documents

Publication Title
US4862408A (en) Paradigm-based morphological text analysis for natural languages
US5680628A (en) Method and apparatus for automated search and retrieval process
EP0423683B1 (en) Apparatus for automatically generating index
US5794177A (en) Method and apparatus for morphological analysis and generation of natural language text
KR101157693B1 (en) Multi-stage query processing system and method for use with tokenspace repository
US4773039A (en) Information processing system for compaction and replacement of phrases
CA1300272C (en) Word annotation system
US4903206A (en) Spelling error correcting system
US20050203900A1 (en) Associative retrieval system and associative retrieval method
JP2742115B2 (en) Similar document search device
WO1997004405A9 (en) Method and apparatus for automated search and retrieval processing
JP2012248210A (en) System and method for retrieving content of complicated language such as japanese
JP2009266244A (en) System and method of creating and using compact linguistic data
JPH11110416A (en) Method and device for retrieving document from data base
US20070011160A1 (en) Literacy automation software
US20100185438A1 (en) Method of creating a dictionary
EP0316743B1 (en) Method for removing enclitic endings from verbs in romance languages
Bell et al. Towards everyday language information retrieval systems via minicomputers
JP4057681B2 (en) Document information storage device, document information storage method, document information search device, document information search method, recording medium on which document information storage program is recorded, and recording medium on which document information search program is recorded
JPH04160473A (en) Method and device for example reuse type translation
US20020065794A1 (en) Phonetic method of retrieving and presenting electronic information from large information sources, an apparatus for performing the method, a computer-readable medium, and a computer program element
Pantelia ‘Noûs, INTO CHAOS’: THE CREATION OF THE THESAURUS OF THE GREEK LANGUAGE
JPH07182354A (en) Method for generating electronic document
JP3693734B2 (en) Information retrieval apparatus and information retrieval method thereof
JP4206266B2 (en) Full-text search device, processing method, processing program, and recording medium

Legal Events

Code Title Description
AS Assignment
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, ARMON
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST.;ASSIGNOR:ZAMORA, ANTONIO;REEL/FRAME:004681/0732
Effective date: 19870313

FPAY Fee payment
Year of fee payment: 4

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
FP Lapsed due to failure to pay maintenance fee
Effective date: 19970903

STCH Information on status: patent discontinuation
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362