SYSTEM AND METHOD FOR DETECTING TEXT SIMILARITY OVER SHORT PASSAGES
FIELD OF THE INVENTION The present invention relates generally to natural language processing and more particularly relates to a system and method for determining the similarity of text in short passages.
BACKGROUND OF THE INVENTION
With the growing volume of textual information, such as newspaper articles, magazines, Internet articles, and the like, there is a growing need to automatically cluster and/or classify such documents and determine whether groups of documents express similarities or not. For the most part, research in this area has focused on detecting similarity between documents and large segments of text or between a short query phrase and one or more documents.
While effective techniques have been developed for document clustering and classification which depend on inter-document similarity measures, these techniques generally rely only on shared words, or occasionally on collocation of words. Such techniques are applicable when large units of text, such as full documents, are compared. In this case, there is generally sufficient overlap to detect similarity in the documents and/or document segments. However, when the units of text are small, for example a paragraph or abstract, such simple surface matching of words and phrases is far more prone to error. In the case of small text units, the sample size is reduced and the number of potential matches is reduced accordingly. Thus, there remains a need for improved techniques for detecting similarities between small text units.
A further problem with known techniques for detecting similarity is that the conventional notions of similarity which are applicable to large text samples, such as documents and large text segments, do not provide sufficient measures of similarity for measuring similarity in small text segments. Standard notions of similarity generally involve the creation of a vector or profile of characteristics of a text
fragment and the determination of a conceptual distance between vectors on the basis of frequencies. Features typically include stemmed words, although multi-word units and collocations have also been used. Typological characteristics, such as thesaural features, have also been used as features. The difference between vectors for one text unit (usually a query) and another text unit (usually a document) then determines the closeness or similarity of the text units.
In some cases, the text units are represented as vectors of sparse n-grams of word occurrences and learning is applied over those vectors. Though effective in the context of large document comparisons, a more fine-grained distinction for similarity measures is required to properly characterize the similarity of two small text segments.
SUMMARY OF THE INVENTION It is an object of the present invention to provide systems and methods for detecting similarity between two or more small text segments. A method for determining similarity in short text segments in accordance with the present invention includes the steps of determining common primitive features in the text segments, determining common composite features in the text segments and then calculating a similarity measure based upon the primitive and composite features. The primitive features can be selected from the group including common single words, common noun phrases, synonyms, common semantic classes of verbs, and common proper nouns. The composite features, which represent relationships between and among the primitive features, can be selected from the group including primitive feature order restrictions, primitive feature distance restrictions, and primitive type restrictions. Preferably, the step of determining common primitive features can include the further steps of identifying common primitive features, assigning a value to the primitive features, and normalizing the feature values. Normalizing the values can include normalizing for text segment length and normalizing for the frequency of primitive feature occurrence. Similarly, determining composite features generally includes identifying the composite features, assigning a value to the composite
features, and normalizing the feature values. Again, normalization of the feature values can include normalizing for text segment length and normalizing for the frequency of feature occurrence.
BRIEF DESCRIPTION OF THE DRAWING Further objects, features and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying figures showing illustrative embodiments of the invention, in which
Figure 1 is a flow chart illustrating an overview of a present method for comparing small text segments; Figure 2 is a flow chart illustrating the step of defining similarity for small text segments in accordance with the present methods;
Figure 3 is a flow chart illustrating the process of computing primitive features for use in detecting similarity in small text segments;
Figure 4 is a flow chart illustrating the process of calculating composite features for use in detecting similarity of small text segments in accordance with the present methods;
Figure 5 is a block diagram of a software system topology for determining similarity in small text segments in accordance with the present methods; Figure 6 is an illustration of exemplary short text segments; Figure 7 is a diagram illustrating a composite feature match between two of the short text segments provided in Figure 6 using a "same order" rule;
Figure 8 is a diagram illustrating a composite feature match between two of the short text segments provided in Figure 6 using a "within distance" rule; and
Figure 9 is a diagram illustrating a composite feature match between two of the short text segments provided in Figure 6 using a "primitive type" rule.
Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the subject invention will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments. It is intended that changes and modifications can be made
to the described embodiments without departing from the true scope and spirit of the subject invention as defined by the appended claims.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS Figure 1 is a flow chart illustrating an overview of the process used in the present invention for detecting similarity in small text segments. As previously noted, a problem in the prior art is that the definition of similarity commonly used for large text segments, such as documents, is not sufficiently refined to provide an adequate measure of similarity when comparing small text segments. Generally, small text segments refer to sentences, phrases and short paragraphs. Referring to Figure 1, in step 100 a definition of similarity for small text segments is provided. From this definition, the method proceeds to identify primitive features of the small text segments and determine feature values for the primitive features (step 105). Primitive features are those which generally compare simple parts of the text, such as single words, word categories, noun phrases, synonyms, verb classes and proper nouns. In addition to primitive features, the process can identify composite features of the short text segments and determine composite feature values (step 110). Composite features are those which compare relationships among two or more primitive features. Once primitive features and composite features have been identified and given an appropriate value, a machine learning algorithm is applied to classify small text segments as similar or not similar (step 115).
Figure 2 is a flow chart which illustrates the process of establishing an appropriate definition of similarity for small text segments. In general, two text units can be considered similar if they share the same focus on a common concept, actor, object or action. In addition, the common concept, actor, or object must perform or be subjected to the same action, or be the subject of the same description. This is exemplified in the flow chart of Figure 2, where two small text segments are selected from a body of text and are analyzed. If the two text segments relate to a common concept (step 205), then further analysis is performed to see if the common concept relates to the same action (step 210) or relates to the same description (step 215).
Similar tests are performed to determine if the two text segments relate to a common actor (step 220) or to a common object (step 225). If there is no common concept, actor or object, the text segments are considered not similar (step 235). Similarly, for those text segments which do refer or relate to a common concept, actor or object, those segments will still be found not similar unless they also relate to a common action or involve the same description. Thus, for short text segments to be similar, they must contain a common concept, actor, or object which is also the subject of a common action or description. The comparisons in steps 205, 220 and 225 can be the basis for primitive features 240. Those relationships between primitive features which are identified in steps 210, 215 can be referred to as composite features 245.
While Figure 2 is illustrated as a sequential process, it represents a decision tree involved in a definition of similarity of two short text segments as applied in the present invention which can also be performed in a largely parallel manner. For example, decisions 205, 220 and 225 can be performed concurrently as can decisions 210 and 215. Using this definition of similarity for small text segments, a feature- based process can be employed which compares primitive and composite features of short text segments to determine if the definition is satisfied for two or more given input text segments.
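The decision tree of Figure 2 can be sketched as a small predicate (a minimal sketch; the boolean inputs stand in for the outcomes of steps 205 through 225, which in practice are produced by the feature extraction routines described in the remainder of this description):

```python
def is_similar(common_concept: bool, common_actor: bool, common_object: bool,
               same_action: bool, same_description: bool) -> bool:
    """Decision logic of Figure 2: two short text segments are similar only
    if they share a focus (concept, actor, or object) AND that shared focus
    is tied to a common action or a common description."""
    shared_focus = common_concept or common_actor or common_object   # steps 205, 220, 225
    shared_predicate = same_action or same_description               # steps 210, 215
    return shared_focus and shared_predicate
```

Note that, as the text observes, the three focus checks (and the two predicate checks) are independent and can be evaluated concurrently.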
Figure 3 is a flow chart which illustrates a method for extracting and scaling primitive features in accordance with the present invention. The text segments are compared for a level of commonality, including determining whether there is a common single word (step 305), a common noun phrase (step 310), whether two words in the phrases are synonyms (step 315), whether the phrases include verbs having a common semantic class (step 320), and whether a common proper noun can be found in the two phrases (step 325). If none of these conditions is satisfied for the applied small text segments, there is no primitive feature common to the two text segments (step 327). When a primitive feature has been identified, i.e., one of the conditions in steps 305 through 325 is satisfied, a feature value is assigned to that primitive feature. Preferably, the values which are assigned to the features are determined by a machine learning algorithm, such as RIPPER, which is trained using a suitable training corpus. RIPPER is a widely-used and effective rule induction system which is available from AT&T Laboratories and is described by Cohen in "Learning Trees and Rules with Set-Valued Features," Proceedings of the Fourteenth National Conference on Artificial Intelligence, American Association for Artificial Intelligence, 1996, which is incorporated by reference. It has been found that a subset of a corpus of 264 paragraphs which have been manually tagged by human readers as similar or not similar can be used to establish a feature rule set for RIPPER which is then suitable for assigning values to the features identified in the text segments. The particular training corpus and learned rule set will generally vary depending on the desired application, and the values assigned will vary based on properties of the machine learning algorithm and training corpus. After feature values are assigned in step 330, these values can be normalized based on text length (step 335) and/or frequency of occurrence (step 340). Though normalization is optional, it is a desirable step for providing uniform and accurate results across varying types of text and lengths of text segments.

Primitive features provide a baseline indication of similarity. To further refine the notion of similarity in small text segments, relationships among primitive features, referred to as composite features, can also be evaluated. Referring to Figure 4, a method of evaluating composite features is illustrated. Composite features are those features which identify relationships among primitive feature pairs. Generally, composite features are defined by placing different forms of restrictions on participating primitive feature pairs. The primitive features identified in each of the small text segments are applied to a test layer 400 where various feature relationships are evaluated. The relationships illustrated in test layer 400 are exemplary in nature and are not intended to be an exhaustive list of possible relationships.
It will be appreciated that a large number of relationships between and among primitive features can be used to establish composite features. For example, one type of feature relationship for composite features can be that the primitives occur in the same order in each of the text samples (step 405). This is illustrated by example in Figure 7. Figure 6 provides three short text segments to be compared. Figure 7 illustrates a match according to the "same order" composite feature rule. In Figures 7-9, primitive features are identified by shading and the
relationships which form the composite features are illustrated by connecting lines. In the case illustrated in Figure 7, the primitive features {two, contact} appear in the same order in text segments (a) and (b) of Figure 6.
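The "same order" rule can be sketched as follows (a minimal sketch operating on token lists; using the first occurrence of each primitive is a simplifying assumption):

```python
def same_order(tokens_a, tokens_b, feat1, feat2):
    """'Same order' composite rule (step 405): the matched primitive pair
    must appear in the same relative order in both segments."""
    try:
        return ((tokens_a.index(feat1) < tokens_a.index(feat2)) ==
                (tokens_b.index(feat1) < tokens_b.index(feat2)))
    except ValueError:  # one of the primitives is absent from a segment
        return False
```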
Another possible relationship requires that a pair of primitive elements occur within a certain distance of each other in both text segments. The maximum distance between the primitive elements which satisfies the relationship can be a variable or a predetermined constant (step 410). Referring to Figure 8, an example of a positive match for the "within distance" composite feature rule is provided, given that the distance, n, is set to a value less than three. In Figure 8, although the primitive features {contact, lost} do not appear in the same order, they occur within n words of each other (n < 3 in this case).
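The "within distance" rule can be sketched as follows (a minimal sketch; a strict inequality, distance < n, is assumed here, consistent with the n < 3 example above):

```python
def within_distance(tokens_a, tokens_b, feat1, feat2, n):
    """'Within distance' composite rule (step 410): the two primitives must
    occur within n words of each other in each of the two segments."""
    def close(tokens):
        if feat1 not in tokens or feat2 not in tokens:
            return False
        return abs(tokens.index(feat1) - tokens.index(feat2)) < n
    return close(tokens_a) and close(tokens_b)
```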
Yet another exemplary relationship can be that the two text segments include the same primitive feature types. For example, one primitive feature can be restricted to a simplex noun phrase while the other is restricted to a verb. In such a case, two noun phrases, one from each text unit, must match according to the rule for matching simplex noun phrases, and two verbs must match according to the applied rules for verb primitives (e.g., sharing the same semantic class). This is illustrated in Figure 9, where the primitive feature "An OH-58 helicopter" is deemed a simplex noun phrase match with "the helicopter" and both phrases include a common verb, "lost". By matching primitive feature types, a simple grammatical relationship is determined in the text segments.

Returning to Figure 4, for each condition that is satisfied in test layer 400, feature values are assigned to the composite features identified (step 420). The feature values are assigned by a machine learning algorithm, such as RIPPER, which has been trained on a suitable training corpus. As in the case of primitive features, the feature values assigned to the composite features can optionally be normalized for text length and relative occurrence of the primitive or composite feature (steps 425, 430, respectively). Once both primitive features and composite features of the small text segments have been identified, a machine learning algorithm is applied to determine a similarity value between the text segments (step 435). The machine learning algorithm can perform a rule-based analysis to determine similarity. Alternatively, a simpler algorithm can be used to determine similarity by comparing the total feature value of the text segments being compared to a predetermined threshold value.
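The "primitive type" rule of Figure 9 can be sketched as follows (a minimal sketch; the `matches` mapping from primitive type to the set of cross-segment matched pairs is a hypothetical data structure, not one prescribed by the text):

```python
def primitive_type_rule(matches):
    """'Primitive type' composite rule (Figure 9): the rule fires when the
    segment pair yields both a simplex noun phrase match and a verb match
    (e.g., verbs sharing the same semantic class)."""
    return bool(matches.get("noun_phrase")) and bool(matches.get("verb"))
```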
Figure 5 is a block diagram of an exemplary software system for conducting the method described in connection with Figures 1-4. The system is generally implemented in software for a general purpose computer, such as a personal computer or workstation. The system includes a main processing section 500. One or more interface modules 510 are included for receiving text input for the text segments to be compared and for providing the text segments to the main processing section 500. The text input can be provided by a number of sources, including but not limited to, computer readable memory, hard disks, optical disks, network databases, on-line sources, manually keyed input and the like. Based on the desired text source and input mechanism, one skilled in the art can provide appropriate text input interface module 510 hardware and software.
The main processing section 500 is also operatively coupled to a training corpus 515, which is generally stored in computer readable storage media. The main processing section 500 is generally programmed in a structured manner which calls various subprograms, library routines, and the like to perform the various functions described in accordance with Figures 1-4. The main processing section 500 can invoke the various subroutines sequentially (serially) or in a parallel or batched processing mode. The received text is generally passed to a preprocessing routine 520. The preprocessing routine cleans up the received text, such as by removing control characters from the text. The preprocessing routine also performs part-of-speech (POS) tagging, using known techniques, such as are available in the ALEMBIC tool set, described by Aberdeen et al. in "MITRE: Description of the Alembic System as used for MUC-6," Proceedings of the Sixth Message
Understanding Conference, 1995, which is hereby incorporated by reference. ALEMBIC provides a set of data and language processing tools which identify the various parts of speech present in the small text segments.
Following text preprocessing, control is returned to the main processing section 500, which then preferably invokes a noun phrase comparison subroutine 525, such as LinkIT, to perform the noun phrase comparison of step 310. LinkIT can be
employed to determine whether a common noun phrase is present in the applied text segments and for identifying simplex noun phrases and matching those that share the same noun head. The LinkIT tool is described by N. Wacholder in "Simplex NPs Clustered by Head: A Method for Identifying Significant Topics in a Document," Proceedings of the Workshop on the Computational Treatment of Nominals, October 1998, which is hereby incorporated by reference in its entirety.
To determine if two segments include common proper nouns as required in step 325, the noun comparison algorithm can also be used to match those nouns identified using the ALEMBIC toolset using various predetermined matching criteria. Variations on proper noun matching can include restricting the proper noun type to a person, place or organization. Such subcategories can also be extracted using ALEMBIC's named entity finder.
Following noun phrase identification and matching, other routines for detecting primitive features can be employed. For example, to perform step 305 and determine whether common single word primitive features exist between two text segments, a word co-occurrence detection subroutine 540 can be called by the main program 500. Variations of the word co-occurrence operation can restrict matching to cases where the parts of speech of the words also match, or relax the comparison to cases where only the word stems of the two words are identical. Similarly, to determine if two text segments include words which are synonyms, a synonym detection algorithm 530 can be called by the main processing routine 500. In this regard, a lexical database such as WordNet®, as described by G. Miller in "WordNet, An On-Line Lexical Database," International Journal of Lexicography, Vol. 3, No. 4 (1990), can be employed. WordNet provides sense information and places words in sets of synonyms (synsets). Words that appear in the same synset are generally considered matches. Variations on this feature can be used to restrict the words being compared to a specific part-of-speech class.
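The synset membership test can be sketched as follows (a minimal sketch; the tiny stand-in synset table is hypothetical, and a real system would query the WordNet database itself as the text describes):

```python
# Tiny stand-in synset table (hypothetical); a real system would consult
# WordNet's actual synsets.
SYNSETS = [
    {"crash", "collision", "wreck"},
    {"lose", "misplace"},
]

def are_synonyms(word1, word2, synsets=SYNSETS):
    """Synonym primitive (step 315): two words match if they appear in the
    same synset, mirroring WordNet's synset membership test."""
    return any(word1 in s and word2 in s for s in synsets)
```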
To determine if two verbs present in the short text segments are of the same semantic class as set forth in step 320, a verb classifier and comparator algorithm 535 can be operatively coupled to the main processing section 500 and called by the main program. Semantic classes for verbs have been found to be useful for determining
document types and text similarity. This is discussed, for example, in "The Role of Verbs in Document Analysis" by J. Klavans et al., Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, 1998, which is hereby incorporated by reference in its entirety. For those verbs which are found to have a common semantic class, e.g., communication, motion, agreement, argument, etc., those verbs are considered to match.
The program operating in main processing section 500 can also provide algorithms to normalize feature values for text lengths and relative occurrence of the primitive. To normalize feature values for text length, as set forth in step 335, each feature value can be normalized by the size of the textual segments in the pair. For example, for a pair of textual segments A and B, the feature values assigned are divided by a normalization value, N:
N = √(Length(A) × Length(B))     (1)
This operation removes any potential bias in favor of longer text segments. It is noted that the lengths of A and B are generally measured in words.
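The length normalization of equation (1) can be sketched as follows (a minimal sketch; segment length is taken as a simple word count, per the text):

```python
import math

def length_normalize(feature_value, seg_a, seg_b):
    """Equation (1): divide the feature value by
    N = sqrt(Length(A) * Length(B)), with lengths measured in words,
    removing the bias toward longer segments."""
    n = math.sqrt(len(seg_a.split()) * len(seg_b.split()))
    return feature_value / n
```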
Normalization of feature values can also be based on the relative frequency of occurrence of each primitive feature. Such normalization is motivated by the general observation that infrequently matching primitive elements are likely to have a higher impact on similarity than primitives which match more frequently. Such normalization is similar to the document frequency component of the commonly employed TF*IDF calculation. In this case, each primitive feature is associated with a value which is equal to the number of textual units in which the primitive appeared in the corpus. For a primitive element which compares single words, this is the number of text segments which contain that word in the corpus; for a noun phrase, this is the number of textual units that contain noun phrases that share the same head; and similarly for other primitive types. We multiply each feature's value by:
log(T / N)     (2)
where T is the total number of textual segments in the corpus and N is the number of textual segments containing the primitive. It is noted that since normalization for text length and normalization for frequency of occurrence are both optional operations, when these two techniques are selectively applied, there are up to four variations of normalization for each primitive feature. Of course, other normalization techniques may be added to, or substituted for, the two methods discussed herein.
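The frequency-based weight of equation (2) can be sketched as follows (a minimal sketch of the IDF-like component described above):

```python
import math

def frequency_weight(total_segments, segments_with_primitive):
    """Equation (2): log(T / N). Rare primitives receive a higher weight;
    a primitive occurring in every segment receives weight 0."""
    return math.log(total_segments / segments_with_primitive)
```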
The program in main processing section 500 generally employs a machine learning algorithm 545 to determine whether the text units match overall. A suitable machine learning algorithm is RIPPER, as disclosed by Cohen in "Learning Trees and Rules with Set-Valued Features," Proceedings of the Fourteenth National Conference on Artificial Intelligence, American Association for Artificial Intelligence, 1996, which is incorporated by reference. RIPPER is a widely-used and effective rule induction system. The RIPPER algorithm is trained over a corpus of manually marked pairs of text units contained in the training corpus 515. A suitable corpus was constructed using a subset of the Topic Detection and Tracking (TDT) corpus developed by NIST and DARPA. The TDT corpus is a collection of over 16,000 news articles from Reuters and CNN, where many of the articles have been manually grouped into 25 categories, each of which corresponds to a single event. The selected corpus was formed using the Reuters articles in five of the twenty-five categories from randomly selected days. The resulting training corpus 515 contained 30 related articles. The 30 articles provided 264 paragraphs which were selected as the small text segments and resulted in 10,345 comparisons between segments.
Although use of a machine learning algorithm is preferred, other algorithms can also be used. For example, an algorithm can add the total value of composite features found in the text segments and compare this value against a similarity threshold. Similarly, although it is preferred to determine feature values based on the use of a machine learning algorithm, feature values can be predetermined based on human experience through the use of a look-up table. Alternatively, all features can be given a binary value and the similarity comparison can be determined based on a simple accumulated count of detected primitive and composite features.
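The binary-valued alternative described above can be sketched as follows (a minimal sketch; the threshold value is application-specific and hypothetical):

```python
def count_based_similarity(primitive_hits, composite_hits, threshold):
    """Simplest alternative to machine learning: every detected primitive
    and composite feature counts as 1, and the accumulated count is
    compared to a predetermined threshold."""
    return primitive_hits + composite_hits >= threshold
```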
The present methods, while evaluated on a corpus of English language documents, are not language specific and are generally applicable to any language. Of course, the individual subroutines may require some alteration to accommodate the varied constructions found in different languages. The methods for determining similarity in small text segments described herein form an important component in larger systems, such as document archiving systems and multi-document summarization systems.
Although the present invention has been described in connection with specific exemplary embodiments, it should be understood that various changes, substitutions and alterations can be made to the disclosed embodiments without departing from the spirit and scope of the invention as set forth in the appended claims.