WO2014049186A1

WO2014049186A1 - Method for generating semantic patterns

Info

Publication number: WO2014049186A1
Application number: PCT/ES2013/070638
Authority: WO
Inventors: Valentín Miguel MORENO PELAYO; Pablo Miguel SUÁREZ LÓPEZ; Anabel FRAGA VÁZQUEZ; Juan Bautista LLORENS MORILLO; Eugenio PARRA CORREDOR
Original assignee: Universidad Carlos Iii De Madrid
Priority date: 2012-09-26
Filing date: 2013-09-16
Publication date: 2014-04-03

Abstract

The invention relates to the methods for recognising natural language. More concretely, the invention relates to the methods for generating semantic patterns that enable the organisation of the information. The invention includes the steps of: determining the grammatical category of each term of a text, assigning the grammatical categories into groups, counting the frequency of appearance of each group, establishing a pattern candidate if the frequency of appearance of a group is sufficiently high, determining the semantic category of the pattern candidate using a pre-defined taxonomy, and identifying a pattern when the pattern candidate has an associated semantic category.

Description

METHOD OF GENERATION OF SEMANTIC PATTERNS

Technical Field of the Invention

The present invention relates to natural language recognition methods. More specifically, it falls within the methods for generating semantic patterns that enable the organization of information.

State of the Art

In the field of natural language recognition, a tool that automatically generates semantic patterns is necessary.

Although there are different ways and techniques to relate the concepts semantically, the extraction of relations through semantic patterns is one of the most used. The application of this technique requires making a list of patterns for different types of semantic relationships. These patterns must be relatively frequent in the documents. Among the background related to the invention, the following documents are worth mentioning.

Llorens J., Morato J., Genova G. RSHP: An information representation model based on relationships. In: Ernesto Damiani, Lakhmi C. Jain, Mauro Madravio (Eds.), Soft Computing in Software Engineering (Studies in Fuzziness and Soft Computing Series, Vol. 159), Springer, pp. 221-253. 2004. This document proposes a graph-based representation model for relating concepts. It differs from the present invention in that, although it states the need to feed that model with patterns, it does not include a method for obtaining them automatically.

Alshawi H. Processing Dictionary Definitions with Phrasal Pattern Hierarchies. Computational Linguistics. July-December 1987, 13 (3-4), pp. 195-202. This document proposes extracting taxonomic relationships through patterns from the definitions of words in a dictionary. Its scope is limited compared to the present proposal, since it starts from documents with a certain structure to obtain a single specific type of relationship between pairs of terms.

R.A. Amsler. A taxonomy for English nouns and verbs. Proceedings of the 19th Annual Meeting of the Association for Computational Linguistics. Stanford, California, 1981, pp. 133-138. This document proposes applying pattern distributions, semi-automatically and on structured documents, in order to identify hierarchical structures between dictionary terms. It differs from the present proposal in that the working documents are structured; in addition, the patterns are not obtained automatically and serve a very specific purpose.

In these and other works, the patterns obtained are intended to establish relations between two concepts, and they are obtained from structured sources such as dictionaries, taxonomies, thesauri or ontologies, in many cases not fully automatically. When applied, these traditional patterns identify two related concepts within a sentence, generally without paying attention to its overall meaning and structure.

Unlike the above, the methodology proposed here yields complex patterns at the phrase level, which can be expressed in terms of the simpler, commonly used ones, with the advantage of identifying larger areas of the texts from which to extract semantics with greater precision. Consequently, they have greater semantic richness. In addition, they are obtained fully automatically from documents in natural language.

Brief Description of the Invention

In view of the problems identified in the state of the art, it would therefore be desirable to have a method that overcomes these drawbacks; in particular, one that provides for the automatic identification of patterns from a corpus and the implementation of functions that organize the generated data.

As a solution, a method is developed whose steps allow the automatic generation of indexing patterns, taking a corpus as input and producing a list of patterns ordered by frequency. Optionally, it also includes several special functions that organize and represent the intermediate information obtained, as well as the capacity to expand the generated patterns to other hierarchical formats.

The method of generating semantic patterns includes at least the following stages:

- Determine the grammatical category of each term in a sequence of terms in a text.

- Group the grammatical categories of the terms in the previous sequence into groups. These groups are formed following the order of the sequence terms.

- Count the frequency of occurrence of each group.

- Establish a group as a pattern candidate when the frequency of the group's appearance exceeds a threshold.

- Determine the semantic category of the pattern candidate according to a predefined taxonomy for a plurality of groups based on the grammatical category of the terms that make up said pattern candidate.

- Identify that a pattern candidate is really a pattern when said pattern candidate has an associated semantic category.

Optionally, the groups are tuples of grammatical categories of terms.

Optionally, for grouping into grammatical categories, the terms of the text include at least one punctuation mark and/or one word.

Optionally, to establish a pattern candidate, the grouping of categories is made from the categories of adjacent terms in a first iteration. This type of pattern candidate is called basic.

As an alternative to the previous case, in order to establish a pattern candidate, the grouping of categories is based on the categories of terms separated from each other by at least one intermediate term whose specific category has an optional presence in the pattern candidate. This type of candidate is called a pattern candidate with optional element(s).

Optionally, after a first iteration, for a subsequent grouping into grammatical categories, at least one of the components of the pattern is in turn a pattern candidate from a previous iteration. This type of pattern candidate is called compound.

Optionally, the step of determining the semantic category is performed on the groups only when one of their components is a grammatical category of verb.

Optionally, according to the previous case, one of its components is a pattern from a previous iteration from which it acquires (inherits) its semantic category.

Brief description of the figures

In order to complement the description and to aid a better understanding of the features of the invention, the present specification is accompanied, as an integral part thereof, by the following figure:

FIG. 1: shows in a diagram the main steps according to a possible embodiment.

Detailed description of the invention

The present invention is further illustrated by the following example, which is not intended to limit its scope.

FIG. 1 shows the sequence of steps for obtaining a semantic pattern. Starting from a text, the grammatical category of the words, word groups and punctuation marks that compose it is determined 11.

With the grammatical categories identified, the categories are grouped 12 under different criteria. Groups (tuples) are thus obtained that contain the grammar codes associated with the words and/or punctuation marks of the text that have been grouped. There are different ways of grouping and, therefore, different types of pattern candidates (basic, compound or with optional terms).

With the above information, it is possible to find out which groups or tuples are most common by counting their frequencies of occurrence 13 in each iteration. By defining a frequency threshold, the pattern candidates that meet the condition of having an occurrence frequency above the threshold 14 can be searched for iteratively.

Those that meet the required condition are established as pattern candidates 15. To verify whether or not they are patterns, such candidates must have associated semantics. To this end, a taxonomy is consulted and the semantic category of the components of the possible pattern is determined 16. If the pattern contains a specific semantic category, preferably provided by a verb, it is inferred that said candidate is indeed a semantic pattern 17.

The process that is the object of the present invention will be better understood with the following practical example.

A possible pattern, or pattern candidate, is defined as a tuple that can contain both grammatical categories (C) and other patterns (P), that is, subpatterns. Since they have the same structure, the text will for the moment speak of a pattern in general, on the understanding that it is a possible pattern, that is, a candidate that has to be verified in later stages.

Examples of pattern types:

Basic pattern (C1 C2): composed only of terms with grammatical categories. The tuples can be binary or n-ary. In the present example, without loss of generality, binary tuples are chosen.

Compound pattern, otherwise: (P1 C2), (C1 P2), (P1 P2).

Indexing patterns contribute to identifying texts through grammatical and semantic categories. These patterns will be generated and stored through three defined data structures:

Terms or tokens, which contain sequentially, that is, in order of appearance, the categories or patterns of the text;

Map, which contains sequentially, that is, in order of appearance, the patterns identified from the terms or tokens; and

Patterns, which contains and groups the generated patterns in order of frequency.

For example, if the token structure is represented by:

T = (C138, C22, C11, C22, C137, C22, C5, C22, C11, C13, C22, C29, C111, C132, C22, C11, C67), then the Map is generated as:

PM = (PM1, PM2, PM3, PM4, PM5, PM6, PM7, ..., PMn),

where:

PM1 = (C138, C22), PM2 = (C22, C11), PM3 = (C11, C22), ...

and the Pattern table is as shown below:

P = (P1 -> 83 times, P2 -> 55 times, P3 -> 40 times, ..., Pn -> 1 time).

The procedure described uses different independent modules capable of analyzing texts and generating sentences, tokens, maps and patterns. Below is a brief description of the most important steps of the methodology.

Generation of sentences and tokens

The working corpus is formed by texts that can include, together with their terms, the information on their grammatical category. It is possible to work directly on them or to preprocess them so as to represent the morphological information of the words with other sets of grammar labels. In the example, a corpus created for the English language from numerous representative sources is used. However, the method described here should not be considered limited to a specific corpus.

The codes that appear next to each term, whether a word or a punctuation mark, indicate the grammatical category to which it belongs. This information will also be used for subsequent semantic analysis.

Text extracted from Brown corpus:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd

Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj

primary/nn election/nn produced/vbd

The grammar labels shown and their corresponding categories are:

It is also possible to work without the grammatical information the corpus provides and to obtain it term by term using specialized morphological analysis tools. These tools usually have specific grammatical label sets according to the information needs pre-established for a specific domain. The following shows the grammar information assigned to the previous text by a specialized tool using a specific set of grammar codes:

The/1879 Fulton/1801 County/1793 Grand/1850 Jury/1793 said/1828

Friday/1814 an/1880 investigation/1792 of/1857 Atlanta's/1802

recent/1850 primary/1793 election/1793 produced/1944

The grammatical categories shown and their corresponding grammatical codes associated with these categories are:

Sentences are extracted from the corpus taking into account a set of delimiting characters. Once the sentences have been identified, their terms are normalized and this information is stored in the Tokens data structure.
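The sentence and token extraction described above could be sketched as follows. This is only an illustrative sketch, not the implementation of the invention: the function names parse_tagged and split_sentences are hypothetical, lowercasing is assumed as the normalization step (the text does not say what normalization is applied), and the set of delimiter labels is an assumption that would have to be adapted to the label set in use.

```python
def parse_tagged(text):
    """Split 'word/label' items into (word, label) pairs, in order of appearance."""
    pairs = []
    for item in text.split():
        # rpartition keeps any '/' inside the word and cuts at the last one
        word, _, label = item.rpartition("/")
        pairs.append((word.lower(), label))  # lowercasing: assumed normalization
    return pairs


def split_sentences(pairs, delimiter_labels=frozenset({"."})):
    """Cut the tagged stream into sentences; the delimiter closes its sentence."""
    sentences, current = [], []
    for word, label in pairs:
        current.append((word, label))
        if label in delimiter_labels:  # hypothetical delimiter set
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


sample = ("The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd "
          "Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj "
          "primary/nn election/nn produced/vbd")

# Tokens structure: the grammatical labels in order of appearance
tokens = [label for _, label in parse_tagged(sample)]
sentences = split_sentences(parse_tagged("a/at b/nn ./. c/nn"))
```

Here tokens holds only the grammatical labels, since the later steps group categories rather than words.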

Example:

In the following example, you can see how sentences from the Brown Corpus have been split and the Tokens structures have been extracted, taking into account that commas and end-of-sentence symbols are filtered for use in natural language processing.

1) The jury further said in term-end presentments that the City Executive Committee, 2) which had over-all charge of the election, 3) "deserves the praise and thanks of the City of Atlanta" for the manner in which the election was conducted.

The grammatical category Ck of each term of the text is determined. This yields the token structure T.

T = (C138, C22, C55, C11, C127, C22, C111, C138, C22, C50, C111, C147, C22, C11, C151, C138, C22, C50, C13, C11, C138, C22, C162, C82, C151, C138, C22, C29, C142, C138, C22, C127, C111, C138, C22, C144, C11, C54);

Where: C138 = definite article; C22 = noun; C55 = adverb; C11 = verb;

C127 = preposition; C111 = relative pronoun; C50 = comma [,]; C147 = verb to have;

C151 = preposition of; C13 = symbol; C162 = conjunction and; C82 = absolute verb;

C29 = apostrophe; C142 = preposition for; C144 = verb to be; C54 = full stop [.]

Basic Pattern Generation

For the generation of basic patterns, the terms of the set T, with their grammatical categories, are grouped into groups MPk, following their order of appearance in the text.

The group size can be chosen so as to create groups of several terms. In the present example, grouping is done in pairs (only the first one has been underlined).

T = (C138, C22, C55, C11, C127, C22, C111, C138, C22, C50, C111, C147, C22, C11, C151,

C138, C22, C50, C13, C11, C138, C22, C162, C82, C151, C138, C22, C29, C142, C138, C22, C127,

C111, C138, C22, C144, C11, C54);

MP = (MP1, MP2, ..., MPn);

Where: MP1 = (C138, C22); MP2 = (C22, C55); MP3 = (C55, C11); ... MPn = (C11, C54);

The frequency of occurrence of each group (pair) Pk is counted to generate the basic patterns P.

P = (P1, P2, ..., Pn);

Where: P1 = (C138, C22), 7 times; P2 = (C22, C55), 2 times; ... Pn = (C11, C54), 1 time;

From this analysis, a valid pattern candidate can be established when the frequency exceeds a threshold, for example more than 3 times.
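The grouping, counting and thresholding steps above can be illustrated with a short sketch. It is not the implementation of the invention: the function names build_map, pattern_table and candidates are hypothetical, and the threshold value of 3 is simply the one used as an example in the text. The list T reproduces the token structure of the example.

```python
from collections import Counter

T = ["C138", "C22", "C55", "C11", "C127", "C22", "C111", "C138", "C22", "C50",
     "C111", "C147", "C22", "C11", "C151", "C138", "C22", "C50", "C13", "C11",
     "C138", "C22", "C162", "C82", "C151", "C138", "C22", "C29", "C142", "C138",
     "C22", "C127", "C111", "C138", "C22", "C144", "C11", "C54"]


def build_map(tokens, n=2):
    """Map: groups of n adjacent categories, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pattern_table(groups):
    """Patterns: each distinct group with its frequency, most frequent first."""
    return Counter(groups).most_common()


def candidates(table, threshold):
    """Keep the groups whose frequency exceeds the threshold."""
    return [(group, freq) for group, freq in table if freq > threshold]


MP = build_map(T)                   # MP1 = (C138, C22), ..., MPn = (C11, C54)
P = pattern_table(MP)               # (C138, C22) comes first, with 7 occurrences
basic = candidates(P, threshold=3)  # only (C138, C22) exceeds the threshold
```

With n greater than 2 the same functions produce n-ary groups, which the text allows although the example restricts itself to pairs.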

Generation of composite patterns

This process can be repeated iteratively to generate composite patterns. A compound pattern contains at least one other pattern as one of its terms. The process would be similar to the previous one except that instead of categories the terms would be replaced by their equivalent subpattern. Thus, new patterns can be located that will be by their nature compound patterns.

In the previous example, from the grammatical categories that make up T and the basic patterns, a substitution is made replacing the most frequent pattern, P1 as follows:

P1 = (C138, C22)

T = (C138, C22, C55, C11, C127, C22, C111, C138, C22, C50, C111, C147, C22, C11, C151,

C138, C22, C50, C13, C11, C138, C22, C162, C82, C151, C138, C22, C29, C142, C138, C22, C127, C111, C138, C22, C144, C11, C54);

When replacing with the P1 pattern, T looks like:

T = (P1, C55, C11, C127, C22, C111, P1, C50, C111, C147, C22, C11, C151, P1, C50, C13,

C11, P1, C162, C82, C151, P1, C29, C142, C138, C22, C127, P1, C22, C144, C11, C54);

Then, from the new tuples formed by the elements of the substituted pattern, new patterns are obtained that are saved in the Map. These new patterns are in this case:

MP1-1 = (P1, C55); MP2-1 = (C111, P1); MP3-1 = (P1, C50); MPn-1 = (P1, C22);

As with the basic patterns, the frequency of occurrence of each group (pair) Pk is counted to generate the composite patterns that are stored in the Patterns table.

P = (P1-1, P2-1, Pn-1);

Where: P1-1 = (P1, C55), 1 time; P2-1 = (C111, P1), 1 time; P3-1 = (P1, C50), 2 times;

Pn-1 = (P1, C22), 1 time;

This process runs either up to a maximum number of substitution levels or until no more substitutions are possible. However, it is advantageous to have a configurable maximum stop level, for cases where the domain knowledge is extensive or where, beyond a certain level, the patterns are no longer useful and their extraction is therefore unnecessary.
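The iterative substitution can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation of the invention: the names substitute, compound_groups and generate_compound are hypothetical, only the single most frequent pair is substituted at each level, and labels are numbered by level (P1, P2, ...). For brevity the example uses only the first ten tokens of T.

```python
from collections import Counter


def substitute(tokens, pattern, label):
    """Replace every non-overlapping occurrence of pattern in tokens with label."""
    out, i, n = [], 0, len(pattern)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == tuple(pattern):
            out.append(label)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def compound_groups(tokens, label):
    """Adjacent pairs that contain the substituted pattern, in order of appearance."""
    return [pair for pair in zip(tokens, tokens[1:]) if label in pair]


def generate_compound(tokens, threshold, max_levels):
    """Substitute the most frequent pair level by level (assumed strategy)."""
    patterns = {}
    for level in range(1, max_levels + 1):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq <= threshold:  # nothing left above the threshold: stop early
            break
        label = f"P{level}"
        patterns[label] = pair
        tokens = substitute(tokens, pair, label)
    return tokens, patterns


head = ["C138", "C22", "C55", "C11", "C127", "C22", "C111", "C138", "C22", "C50"]
replaced = substitute(head, ("C138", "C22"), "P1")
new_groups = compound_groups(replaced, "P1")
final_tokens, found = generate_compound(head, threshold=1, max_levels=5)
```

On this fragment new_groups contains (P1, C55), (C111, P1) and (P1, C50), and the loop stops after one level because no remaining pair occurs more than once. The max_levels argument plays the role of the configurable stop level.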

Generation of patterns with optional terms

Patterns with optional elements are patterns formed by a tuple that can contain N optional intermediate elements (either grammatical categories or subpatterns). To generate these patterns, the components of each pattern are searched for, in order, in the token list, admitting the presence of intermediate elements between them. The maximum number of consecutive intermediate elements allowed is configurable; its value is usually two (2). Subsequently, these patterns are stored in the Map and added to the Patterns structure, ordered by their frequency of appearance.

For example, P99 = (C11, C22) is defined; when applied to T, MOk is obtained.

T = (C138, C22, C55, C11, C127, C22, C111, C138, C22, C50, C111, C147, C22,

C11, C151, C138, C22, C50, C13, C11, C138, C22, C162, C82, C151, C138, C22, C29, C142, C138, C22, C127, C111, C138, C22, C144, C11, C54);

MO1 = (C11, [C127], C22); O1 = (C11, [C127], C22), 1 time;

MO1' = (C11, [C138], C22); O1' = (C11, [C138], C22), 1 time;

MO2 = (C11, [C151, C138], C22); O2 = (C11, [C151, C138], C22), 1 time;

MO = (MO1, MO1', MO2);

O = (O1, O1', O2)
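The search for patterns with optional elements can be sketched as follows. It is a minimal sketch, not the implementation of the invention: the function name optional_matches is hypothetical, only binary patterns are handled, and for each occurrence of the first component the nearest occurrence of the second component within the allowed gap is taken, which is an assumption about how ties are resolved.

```python
from collections import Counter

T = ["C138", "C22", "C55", "C11", "C127", "C22", "C111", "C138", "C22", "C50",
     "C111", "C147", "C22", "C11", "C151", "C138", "C22", "C50", "C13", "C11",
     "C138", "C22", "C162", "C82", "C151", "C138", "C22", "C29", "C142", "C138",
     "C22", "C127", "C111", "C138", "C22", "C144", "C11", "C54"]


def optional_matches(tokens, pattern, max_gap=2):
    """Find (first, optional elements, second) with 1..max_gap elements in between."""
    first, second = pattern
    found = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        for gap in range(1, max_gap + 1):
            j = i + gap + 1
            if j < len(tokens) and tokens[j] == second:
                found.append((first, tuple(tokens[i + 1:j]), second))
                break  # keep the nearest match only (assumed)
    return found


MO = optional_matches(T, ("C11", "C22"))  # Map of patterns with optional elements
O = Counter(MO).most_common()             # the same patterns with their frequencies
```

Applied to T with P99 = (C11, C22), this finds the three matches of the example, with optional elements [C127], [C151, C138] and [C138], each occurring once.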

Semantic characteristics

Once the patterns are available, it is possible to determine the semantic category to which they belong. To do this, a predefined taxonomy is used to determine to which semantic category the previously generated groups belong, according to the terms that compose them.

The semantics are found, in the present example of embodiment, in terms with grammatical categories of the verb type.

Once the grammatical categories have been identified, the corresponding semantic code can be associated with the help of a taxonomy.

To include the semantics in the generated patterns, the elements that form each pattern are checked to see whether one of them corresponds to a verb-type grammatical category, in which case its corresponding semantic code is saved. Thus, four scenarios arise when defining the semantics of a pattern:

Case 1: Pattern with semantics: it has a verb category with an associated semantic code. For example, "feed," "eat," and "drink" belong to the semantic code "Feed."

Example:

P_K = (C_A, C_B) = (NOUN, VERB), where VERB has a semantic code with value x, for example.

Case 2: Pattern that obtains its semantics directly from the verb, because the verb is not contained in a semantic group.

Example:

P_S = (P_T, C_B) = (P_T, VERB), where VERB does not belong to a semantic group but carries the semantics of the verb that it intrinsically represents.

If P_T contributes semantics (it contains at least one verb, directly or indirectly), its semantics will be associated with the patterns that contain it (in this example, P_S).

Case 3: Pattern without semantics for not having a verb category.

Example:

P_V = (C_J, C_A) = (DEFINITE ARTICLE, NOUN).

Case 4: Pattern without direct semantics, because it is composed of two (2) subpatterns.

Example:

P_F = (P_G, P_H). Although it is categorized as having no direct semantics, because this ignores the essence of the patterns it contains, the pattern P_F will have associated with it the semantics (if any) contained in the patterns P_G and P_H.
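The four cases can be condensed into one recursive check, sketched below. This is an illustration under stated assumptions, not the implementation of the invention: the Term and Pattern classes, the TAXONOMY dictionary and the category name "VERB" are hypothetical, each term is assumed to carry the verb it stands for, and a verb missing from the taxonomy is assumed to contribute itself as its semantics (case 2). Only the "Feed" group is taken from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

# Semantic group taken from the example in the text; a real taxonomy is far larger
TAXONOMY = {"feed": "Feed", "eat": "Feed", "drink": "Feed"}
VERB_CATEGORIES = {"VERB"}  # hypothetical label for verb-type categories


@dataclass(frozen=True)
class Term:
    category: str
    lemma: Optional[str] = None  # the verb a VERB term stands for (assumed field)


@dataclass(frozen=True)
class Pattern:
    components: Tuple[Union[Term, "Pattern"], ...]


def semantics(pattern):
    """Semantic codes of a pattern; an empty set means no semantics (case 3)."""
    codes = set()
    for component in pattern.components:
        if isinstance(component, Pattern):
            codes |= semantics(component)  # inherited from subpatterns (cases 2, 4)
        elif component.category in VERB_CATEGORIES and component.lemma:
            # case 1: code from the taxonomy; case 2: the verb itself
            codes.add(TAXONOMY.get(component.lemma, component.lemma))
    return codes


def is_pattern(candidate):
    """A candidate is confirmed as a pattern when it has associated semantics."""
    return bool(semantics(candidate))


p_k = Pattern((Term("NOUN"), Term("VERB", "eat")))    # case 1
p_v = Pattern((Term("DEFINITE ARTICLE"), Term("NOUN")))  # case 3
p_s = Pattern((p_v, Term("VERB", "conduct")))          # case 2
p_f = Pattern((p_k, p_s))                              # case 4
```

In this sketch p_f has no verb of its own but inherits the semantics of both of its subpatterns, while p_v is rejected because nothing in it carries semantics.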

Claims

1. - Method of generating semantic patterns characterized by comprising the following stages:

- determine the grammatical category (11) Ck of each term of a sequence T of terms in a text,

- group the grammatical categories of the terms of the sequence T into groups (12) MPk, where the groups are formed following the order of the terms in the sequence,

- count the frequency (13) of occurrence of each group P,

- establish a pattern candidate (15) when the frequency of occurrence of a group is greater than a threshold,

- determine the semantic category of the pattern candidate (16) according to a predefined taxonomy for a plurality of groups based on the grammatical category of the terms that make up said pattern candidate,

- identify a pattern (17) when the pattern candidate has an associated semantic category.

2. - Method according to claim 1, characterized in that the groups are tuples of grammatical categories of terms.

3. - Method according to claim 2, characterized in that, for grouping into grammatical categories, the terms of the text comprise at least one of the following elements:

- a punctuation mark,

- a word.

4. - Method according to claim 2 or 3, characterized in that, to establish a pattern candidate, the grouping of categories is made from the categories of adjacent terms in a first iteration.

5. - Method according to claim 2 or 3, characterized in that, to establish a pattern candidate, the grouping of categories is made from the categories of terms separated from each other by at least one intermediate term whose specific category has an optional presence in the pattern candidate.

6. - Method according to claim 4 or 5, characterized in that, after a first iteration, for a subsequent grouping into grammatical categories, at least one of the components of the pattern is in turn a pattern candidate from a previous iteration.

7. - Method according to any one of the preceding claims, characterized in that the step of determining the semantic category is performed on the groups when one of its components is a grammatical category of verb.

8. - Method according to claim 7, characterized in that the component, whose grammatical category is verb, is a pattern of a previous iteration.