WO2002027536A1 - Extended functionality for an inverse inference engine based web search - Google Patents
- Publication number
- WO2002027536A1 (PCT/US2001/029943)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- term
- matrix
- documents
- weights
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
Definitions
- the present invention relates generally to systems for searching document sets, and more specifically to an advanced system for cross language document retrieval.
- LSA Latent Semantic Analysis
- LSI Latent Semantic Indexing
- LSI is based on Singular Value Decomposition (SVD). Bartell et al. (1996), No. 3 in Appendix A, have shown that LSI is an optimal special case of multidimensional scaling.
- the aim of all indexing schemes which are based on multivariate analysis or unsupervised classification methods is to automate the process of clustering and categorizing documents by topic.
- An expensive precursor was the method of repertory hypergrids, which requires expert rating of knowledge chunks against a number of discriminant traits (Boose, 1985, No. 6 in Appendix A; Waltz and Pollack, 1985, No.
- the Internet is a multilingual universe where travel is limited by the speed of indexing.
- existing search portals do not equalize the accessibility of information across languages.
- No existing search engine indexes more than 30% of the Web. This results, at least in part, from technological limitations, which have to do with the speed and scalability of existing Web crawling technology, and the availability of network bandwidth.
- many existing sites cannot maintain up-to-date indices because indexing technology has not been fully integrated with a database management system. Whenever possible, existing Web robots and crawlers limit indexing to pages in the language that is most likely the language of a regional audience.
- any algorithm applied to cross language document retrieval should be scalable to very large information matrices.
- An effective system could power the first truly international search portal. Multilingual search provided through such a portal could change the overall dynamics and structure of the Internet, upset its cultural imbalance, and open new markets.
- Machine Translators Automatic translation engines, referred to as Machine Translators (MT)
- Examples of existing Machine Translators include Babelfish™ as provided by the AltaVista Company, and NeuroTran™ provided by Translation Experts, Ltd.
- Multilingual search technology could also improve monolingual search in more than one way. The omission of many foreign language pages from the relevant indices destroys the integrity of the link structure of the Web. As a result, for example, the HTML page of a foreign researcher or a foreign institution may never be found, even if it points to a publication in the English language.
- multilingual search capabilities could resolve keyword and concept ambiguities across languages.
- a direct approach to multilingual interrogation is to use existing Machine Translation (MT) systems to automatically translate an entire textual database from every single language into the language of the user.
- This approach is clearly unrealistic for the Internet, due to the size of the target search space.
- MT syntax errors, and, more significantly, errors in translating concepts make it technically unsuitable for other multilingual database collections in general.
- a variation on this approach is multilingual interrogation.
- multilingual interrogation the idea is to translate the query from a source language to multiple target languages, for example, using inter-lingual dictionaries and knowledge bases.
- translation into different languages must account for the fact that concepts expressed by a single term in one language sometimes are expressed by multiple distinct terms in another. For example, the term "tempo" in Italian corresponds to two different concepts in English: time and weather.
- Still another existing approach consists of combining machine translation methods with information retrieval methods.
- This approach has been developed by the European ESPRIT consortium in the project EMIR (European Multilingual Information Retrieval) (EMIR, 1994, No. 15 in Appendix A).
- This system uses three main tools: 1) linguistic processors (morphological and syntactic analysis) which perform grammatical tagging, identify dependency relations and normalize the representation of uniterms and compounds; 2) a statistical model which is used to weight the query-document intersection; 3) a monolingual and multilingual reformulation system whose aim is to infer, from the original natural language query words, all possible expressions of the same concept that can occur in the document, whatever the language.
- An inverse inference engine for high performance Web searching includes a superior method for performing Latent Semantic Analysis, in which the underlying search problem is cast as a Backus-Gilbert (B-G) inverse problem (Press et al., 1997, No. 32 in Appendix A).
- Improved efficiency is provided by the inverse inference engine as a result of solving an optimization problem for the distance between a transformed query vector and document clusters directly in a transform space.
- Semantic bases approximate the query in this transform space. Bases with negative coefficients contain the latent semantic information.
- the inverse inference engine may be applied to a search tool that returns a list of direct document hits and a list of latent document hits in response to a query.
- the Inverse Inference approach of the disclosed system is a new approach to Latent Semantic Analysis (LSA) that, unlike LSI, is fast and scalable, and therefore applicable to the task of cross language semantic analysis.
- An extension of the inverse inference engine provides cross language document retrieval in a way that is scalable to very large information matrices.
- CL-LSI cross-language LSI
- the disclosed system for cross language document retrieval uses the much faster inverse inference engine, instead of SVD, to perform matrix reduction.
- the list of direct document hits may contain local language document hits, while the list of latent document hits may contain foreign language document hits .
- the disclosed search technology also provides automatic tools for accelerating the construction of a multilingual lexicon, and for extracting terminology from multilingual corpora of texts.
- the information matrix used as input to the inverse inference engine is organized into blocks of rows corresponding to languages within a predetermined set of natural languages. For example, using a predetermined language set consisting of English, French and Italian, an illustrative information matrix would consist of 3 sections of rows, a first of which is associated with English keywords, a second of which is associated with French keywords, and a third of which is associated with Italian keywords. Columns of entries within the first section of rows in the information matrix represent documents in English, columns of entries within the second section of rows represent documents in French, and columns of entries within the third section of rows represent documents in Italian. The information matrix is further organized columnwise into two main partitions.
- the first partition is a left-hand side column vector of blocks of entries representing fully translated documents, which may be referred to as the "reference documents" or "training set."
- the second partition is a matrix of blocks of entries representing documents for which translations are not available in all of the predetermined languages, including a number of sets of columns corresponding to the languages in the predetermined language set. Further in the second partition, entries in blocks outside the main diagonal of blocks contain zero values. In other words, those entries in blocks along the main diagonal within the second partition represent the contents of those documents for which full translations are not available, and which make up the target search space.
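The two-partition block layout described above can be sketched as follows. This is a minimal illustration with hypothetical block sizes and dummy counts, not code from the patent; the point is the placement of the reference (R) blocks, the target (T) blocks on the diagonal, and the zero blocks elsewhere.

```python
import numpy as np

# Hypothetical sizes: n_terms keyword rows per language, n_ref reference
# documents, and n_e/n_f/n_i untranslated target documents per language.
n_terms, n_ref, n_e, n_f, n_i = 4, 3, 2, 2, 2

# Reference blocks: every reference document has term counts in all three
# languages (English, French, Italian), so all three R blocks are populated.
R_E = np.ones((n_terms, n_ref))
R_F = np.ones((n_terms, n_ref))
R_I = np.ones((n_terms, n_ref))

# Target blocks: an untranslated document has counts only in its own language.
T_E = np.ones((n_terms, n_e))
T_F = np.ones((n_terms, n_f))
T_I = np.ones((n_terms, n_i))

Z = np.zeros
A = np.block([
    [R_E, T_E,              Z((n_terms, n_f)), Z((n_terms, n_i))],
    [R_F, Z((n_terms, n_e)), T_F,              Z((n_terms, n_i))],
    [R_I, Z((n_terms, n_e)), Z((n_terms, n_f)), T_I             ],
])
print(A.shape)  # 3 language row sections x (reference + target columns)
```

Entries in blocks outside the main diagonal of the target partition are zero, exactly as the text requires.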
- Fig. 1 is a flow chart showing a series of steps for processing documents and processing user queries
- Fig. 2 shows an architectural view of components in an illustrative embodiment
- Fig. 3 shows steps performed during feature extraction and information matrix (term-document matrix) formation
- Figs. 4a and 4b show example information (or term-document) matrices used for cross-language document retrieval
- Fig. 5 illustrates a solution of the inverse optimization problem for a number of single term queries in a cross-language document retrieval system
- Fig. 6 illustrates cross language retrieval using an inverse inference engine
- Fig. 7 illustrates a solution of the inverse optimization problem for a number of single term queries in an automatic, knowledge based training embodiment.
- Information retrieval is the process of comparing document content with information need.
- information retrieval engines are based on two simple but robust metrics: exact matching or the vector space model.
- exact-match systems partition the set of documents in the collection into those documents that match the query and those that do not.
- the logic used in exact-match systems typically involves Boolean operators, and accordingly is very rigid: the presence or absence of a single term in a document is sufficient for retrieval or rejection of that document.
- the exact-match model does not incorporate term weights.
- the exact-match model generally assumes that all documents containing the exact term(s) found in the query are equally useful. Information retrieval researchers have proposed various revisions and extensions to the basic exact-match model.
- the "fuzzy-set" retrieval model introduces term weights so that documents can be ranked in decreasing order relative to the frequency of occurrence of those weighted terms.
- the vector space model (Salton et al., 1983, No. 41 in Appendix A) views documents and queries as vectors in a high-dimensional vector space, where each dimension corresponds to a possible document feature.
- the vector elements may be binary, as in the exact-match model, but they are usually taken to be term weights which assign "importance" values to the terms within the query or document.
- the term weights are usually normalized.
- the similarity between a given query and a document to which it is compared is considered to be the distance between the query and document vectors.
- the cosine similarity measure is used most frequently for this purpose. It is the normalized inner product of the query and document vectors: sim(q, d) = (q · d) / (|q| |d|)
- weights w are an expression of some statistical measure, like the absolute frequency of occurrence of each term within a document, whereas the weights in the query vector reflect the relative importance of the terms in the query, as perceived by the user.
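The cosine measure described above can be sketched as follows. This is a minimal Python illustration, not code from the patent; the vectors are hypothetical term-weight vectors over a three-term vocabulary.

```python
import math

def cosine_similarity(query, doc):
    """Cosine similarity: the normalized inner product of two weight vectors."""
    dot = sum(q * d for q, d in zip(query, doc))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_d = math.sqrt(sum(d * d for d in doc))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # a zero vector matches nothing
    return dot / (norm_q * norm_d)

# Query and document share one weighted term out of three.
query = [1.0, 0.0, 1.0]
doc = [0.5, 0.2, 0.0]
print(round(cosine_similarity(query, doc), 3))
```

Because both vectors are normalized, a long document is not favored over a short one merely for repeating terms.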
- the disclosed system computes a constrained measure of the similarity between a query vector and all documents in a term-document matrix. More specifically, at step 5 of Fig. 1, the disclosed information retrieval system parses a number of electronic information files containing text.
- the parsing of the electronic text at step 5 of Fig. 1 may include recognizing acronyms, recording word positions, and extracting word roots.
- the parsing of step 5 may include processing of tag information associated with HTML and XML files, in the case where any of the electronic information files are in HTML or XML format.
- the parsing of the electronic information files performed at step 5 may further include generating a number of concept identification numbers (concept IDs) corresponding to respective terms (also referred to as "keywords") to be associated with the rows of the term-document matrix formed at step 6.
- the disclosed system may also count the occurrences of individual terms in each of the electronic information files at step 5.
- the disclosed system generates a term-document matrix (also referred to as the "information matrix" ) based on the contents of the electronic document files parsed at step 5.
- the value of each cell (or "entry") in the term-document matrix generated at step 6 indicates the number of occurrences of the respective term indicated by the row of the cell, within the respective one of the electronic information files indicated by the column of the cell.
- the values of the cells in the term-document matrix may reflect the presence or absence of the respective term in the respective electronic information file.
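The term-document matrix construction described in steps 5 and 6 can be sketched as follows. The toy corpus and term lists are hypothetical; rows correspond to terms and columns to files, with each cell holding an occurrence count, as the text specifies.

```python
from collections import Counter

# Toy corpus: each "file" is a parsed list of terms (stopwords already removed).
files = [
    ["matrix", "inverse", "matrix"],
    ["inverse", "query"],
    ["query", "matrix", "query"],
]

# Rows = terms (in first-seen order, standing in for concept IDs),
# columns = files; each cell = occurrence count of the term in the file.
terms = []
for f in files:
    for t in f:
        if t not in terms:
            terms.append(t)

counts = [Counter(f) for f in files]
term_document = [[c[t] for c in counts] for t in terms]

for t, row in zip(terms, term_document):
    print(t, row)
```

For a presence/absence variant, each cell would simply be clipped to 0 or 1.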
- the information matrix used as input to the inverse inference engine takes one of the following two forms, (a) or (b): [matrices (a) and (b) not reproduced; (a) is a single trilingual term-document matrix, and (b) is a pair of bilingual term-document matrices]
- the superscripts identify the language of document blocks in the term document matrix.
- E stands for English, F for French, and I for Italian.
- the left-hand partition is referred to as the reference partition, and includes blocks (R) of entries representing the contents of reference documents.
- the reference documents (R) are documents for which there is a translation in every language of a predetermined set of languages. However, in practice it may be easier to find bilingual translations than trilingual translations.
- the term document matrix may be split into multiple matrices in which the reference documents used are those for which a translation is available from a first language in the set of languages to a second language in the set of languages. Accordingly, separate matrices linking English to French and English to Italian are used in embodiment (b) above, and the reference documents or translations linking English to French may be different from the reference documents or translations linking English to Italian.
- the predetermined language set in examples (a) and (b) above includes English, French and Italian.
- the right-hand partition in each matrix includes blocks (T) of entries representing the contents of documents to be searched.
- the diagonal blocks (T) include entries representing the contents of all "target" multilingual documents to be searched.
- embodiment (a) above is used as the term document matrix
- a single trilingual search is performed across the single matrix.
- embodiment (b) above is used as the term document matrix
- two bilingual searches are performed.
- the first bilingual search is performed from English to French using the top matrix, which represents the contents of those reference documents available in both English and French, as well as target documents in English and French for which translations between English and French are not available.
- the second bilingual search is performed from English to Italian using the bottom matrix, which represents the contents of those reference documents available in both English and Italian, as well as target documents in Italian and English for which translations between English and Italian are not available.
- when the R blocks are relatively large with respect to the T blocks, searching by the disclosed system using the information matrix would potentially yield relatively more accurate results.
- when the R blocks are relatively small with respect to the T blocks, searching by the disclosed system using the information matrix would potentially be performed more quickly, but without the gains in accuracy obtained in the case where the R blocks are relatively larger than the T blocks. Accordingly, making the R blocks as large as possible may be done in order to optimize search accuracy, while making the R blocks smaller may optimize performance in terms of search time.
- the R blocks may also be referred to as the full translation blocks or training corpus.
- the search space over which the information matrix is compiled is application specific and/or user specified.
- the T blocks of the term document matrix are not necessarily equal in size.
- the number of columns in each T block reflects the number of target documents in the associated language.
- the number of rows in each block need not be equal, since the number of rows in each block may reflect in part the flexibility of the translation of keywords between languages.
- although the documents represented by the R blocks are described as full translations, this is not a requirement of the disclosed system.
- corresponding documents represented by the information matrix entries in the R blocks may be equivalent across the relevant languages in that they cover common topics.
- while documents sharing a single column of the R blocks need not be exact translations, they do need to be equivalent in terms of covering the same topics in the respective different languages.
- multiple news articles describing the same event, such as an election, may be written in different languages by different authors.
- Such semantically related articles, in which a common topic is being discussed may be considered translations for purposes of the R blocks in the information matrix.
- cross language retrieval is accomplished by extending an English term document matrix to French and Italian.
- the extended term document matrix consisted of a left hand side "reference" partition representing the trilingual translation of the previously employed English keywords for the previous set of target documents.
- the right hand side or "target” partition of the term document matrix represented the contents of three sets of unrelated documents in each of the three languages in the predetermined language set: English, French, and Italian.
- the translation used for the English keywords was, for example, a "noisy" translation, allowing for semantic ambiguities and preferences that may result when translating across languages.
- "tempest" in English may be split into both "tempête" and "orage" in French; "playwright" in English may be split into both "tragediografo" and "drammaturgo" in Italian.
- the keyword theatre has the same spelling in English and French.
- the inverse inference algorithm was applied to the multilingual term document matrix, and searching performed only on the target documents.
- the training set approach for cross language retrieval is applied to the problem of searching databases where information is diluted or not reliable enough to allow the creation of robust semantic links.
- This embodiment could be used to provide an application for searching financial chat rooms or message boards.
- the application would index and accumulate information from multiple chat rooms on an hourly basis.
- a search agent would attempt to convert information that is present in a descriptive form into a quantitative or symbolic form, and provide a sentiment indicator by aligning investor opinions about a stock along some predefined semantic axes.
- the application also is capable of detecting participants who are trying to manipulate investors' opinions.
- the need for such an application is predicated on the fact that the information in the message boards or chat rooms alone is not robust or reliable enough to support intelligent information retrieval.
- the left partition of the term document matrix is loaded with a large amount of concurrent financial news from reliable sources.
- the information matrix accordingly is as follows: [matrix not reproduced; its left-hand reference partition represents the reliable financial news, and its right-hand partition represents the chat room and message board content to be searched]
- the disclosed system generates an auxiliary data structure associated with the previously generated concept identification numbers.
- the elements of the auxiliary data structure generated during step 7 are used to store the relative positions of each term of the term-document matrix within the electronic information files in which the term occurs. Additionally, the auxiliary data structure may be used to store the relative positions of tag information from the electronic information files, such as date information, that may be contained in the headers of any HTML and XML files.
- Weighting of the elements of the term-document matrix performed at step 8 may reflect absolute term frequency count, or any of several other measures of term distributions that combine local weighting of a matrix element with a global entropy weight for a term across the document collection, such as inverse document frequency.
- the disclosed system generates, in response to the term-document matrix generated at step 6, a term-spread matrix.
- the term-spread matrix generated at step 9 is a weighted autocorrelation of the term- document matrix generated at step 6, indicating the amount of variation in term usage, for each term, across the set of electronic information files.
- the term-spread matrix generated at step 9 is also indicative of the extent to which the terms in the electronic information files are correlated.
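The patent does not give an explicit formula for the term-spread matrix; one reading consistent with "weighted autocorrelation of the term-document matrix" is the product D·Dᵀ, sketched below with hypothetical values. This is an assumption for illustration, not the patent's necessarily exact construction.

```python
import numpy as np

# Weighted term-document matrix D (terms x documents), hypothetical values.
D = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 2.0],
])

# One reading of the "term-spread" matrix: the autocorrelation D @ D.T.
# Its diagonal measures the spread of each term's usage across documents,
# and its off-diagonal entries measure how strongly terms co-occur.
S = D @ D.T
print(S)
```

Note that S is square (terms x terms) and symmetric, which is what makes it usable in the quadratic form of the constrained optimization problem of the later steps.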
- the disclosed system receives a user query from a user, consisting of a list of keywords or phrases. The disclosed system parses the electronic text included in the received user query at step 16.
- the parsing of the electronic text performed at step 16 may include, for example, recognizing acronyms, extracting word roots, and looking up those previously generated concept ID numbers corresponding to individual terms in the query.
- step 17 in response to the user query received in step 16, the disclosed system generates a user query vector having as many elements as the number of rows in the term-spread matrix generated at step 9.
- Following creation of the query vector at step 17, at step 18 the disclosed system generates, in response to the user query vector, an error-covariance matrix.
- the error-covariance matrix generated at step 18 reflects an expected degree of uncertainty in the initial choice of terms by the user, and contained within the user query.
- the disclosed system augments the term-document matrix with an additional row for each phrase included in the user query.
- a "phrase" is considered to be a contiguous sequence of terms.
- the disclosed system adds a new row to the term-document matrix, where each cell in the new row contains the frequency of occurrence of the phrase within the respective electronic information file, as determined by the frequencies of occurrence of individual terms composing the phrase and the proximity of such concepts, as determined by their relative positions in the electronic information files, as indicated by the elements of the auxiliary data structure.
- the auxiliary data structure permits reforming of the term-document matrix to include rows corresponding to phrases in the user query for the purposes of processing that query. Rows added to the term-document matrix for handling of phrases in a user query are removed after the user query has been processed.
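The phrase-row augmentation described above can be sketched as follows. The position structure and the adjacency test are simplifying assumptions (a two-term phrase counted only when its terms are strictly adjacent); the patent's proximity criterion may be more permissive.

```python
# Hypothetical auxiliary structure: term -> file index -> list of word positions,
# as recorded by the parser at indexing time.
positions = {
    "inverse":   {0: [3, 10], 1: [5]},
    "inference": {0: [4, 20], 1: [9]},
}

def phrase_row(phrase_terms, positions, n_files):
    """Build the new term-document row for a two-term query phrase by
    counting positions where the second term directly follows the first."""
    row = [0] * n_files
    first, second = phrase_terms
    for f in range(n_files):
        starts = positions.get(first, {}).get(f, [])
        nexts = set(positions.get(second, {}).get(f, []))
        row[f] = sum(1 for p in starts if p + 1 in nexts)
    return row

print(phrase_row(["inverse", "inference"], positions, 2))
```

The returned row would be appended to the term-document matrix for the duration of the query and removed afterwards.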
- the disclosed system formulates, in response to the term spread matrix, error covariance matrix, and user query vector, a constrained optimization problem.
- the lambda value chosen for the constrained optimization problem set up in step 11 is a Lagrange multiplier; its specific value determines a trade-off between the degree of fit and the stability of all possible solutions to the constrained optimization problem.
- the disclosed system computes the similarity between each of the electronic information files and the user query by solving the constrained optimization problem formulated in step 11. Specifically, in an illustrative embodiment, the disclosed system generates a solution vector consisting of a plurality of solution weights ("document weights") .
- the document weights in the solution vector each correspond to a respective one of the electronic information files, and reflect the degree of correlation of the user query to the respective electronic information file.
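The patent does not spell out the closed form of the constrained optimization; a standard regularized least-squares stand-in, with lambda playing the Lagrange-multiplier role described above, is sketched below. The matrices and the ridge formulation are illustrative assumptions, not the patent's exact Backus-Gilbert construction.

```python
import numpy as np

# Term-document matrix (terms x documents) and query vector, hypothetical.
D = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 2.0],
])
q = np.array([1.0, 1.0, 0.0])

lam = 0.1  # Lagrange multiplier: trades degree of fit against stability
# Regularized normal equations for q ~ D w:  (D.T D + lam I) w = D.T q
w = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ q)
print(np.round(w, 3))  # one solution weight ("document weight") per document
```

Larger lambda stabilizes the solution at the cost of fit, which is exactly the trade-off the text attributes to the choice of lambda.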
- the disclosed system sorts the document weights based on a predetermined ordering, such as in decreasing order of similarity to the user query.
- the disclosed system automatically builds a lexical knowledge base responsive to the solution of the constrained optimization problem computed at step 12.
- the original term-document matrix created at step 6 and potentially weighted at step 8, rather than the term spread matrix computed at step 9, is cross-multiplied with the unsorted document weights generated at step 12 (note that the document weights must be unsorted in this step to match the original order of columns in the term-document matrix) to form a plurality of term weights, one for each term.
- term weights reflect the degree of correlation of the terms in the lexical knowledge base to the terms in the user query.
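The cross-multiplication of step 14 can be sketched as follows: the original term-document matrix times the unsorted document-weight vector yields one weight per term. The numeric values are hypothetical.

```python
import numpy as np

# Original (possibly weighted) term-document matrix, terms x documents.
D = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 2.0],
])
# Document weights from step 12, UNSORTED so they match the column order.
doc_weights = np.array([0.9, 0.4, -0.2])

# One term weight per row: how strongly each term correlates with the query.
term_weights = D @ doc_weights
print(term_weights)
```

Terms with large weights form the automatically built lexical knowledge base returned alongside the document list.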
- the disclosed system returns a list of documents corresponding to the sorted document weights generated at step 13, and the lexical knowledge base generated at step 14, to the user.
- the document weights can be positive or negative.
- the positive weights are relevance scores for the source language documents (for example English)
- the negative weights are relevance scores for the target language documents (for example French or Italian) .
- the illustrative embodiment of the disclosed system splits the returned documents by sign, and sorts them in decreasing order by absolute value (e.g. positive weighted documents 0.997, 0.912, 0.843, etc., followed by negative weighted documents -0.897, -0.765, -0.564, etc.).
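The sign-splitting and ranking just described can be sketched directly:

```python
def split_and_rank(weights):
    """Split document weights by sign; sort each list by decreasing magnitude.
    Positive weights rank source language (direct) hits; negative weights
    rank target language (latent) hits."""
    direct = sorted((w for w in weights if w >= 0), key=abs, reverse=True)
    latent = sorted((w for w in weights if w < 0), key=abs, reverse=True)
    return direct, latent

weights = [0.843, -0.564, 0.997, -0.897, 0.912, -0.765]
direct, latent = split_and_rank(weights)
print(direct)  # [0.997, 0.912, 0.843]
print(latent)  # [-0.897, -0.765, -0.564]
```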
- FIG. 2 shows the overall architecture of the distributed information retrieval system.
- the system consists of four modules: Indexing 20, Storage 22, Search 24, and Query 26.
- the modules may run in different address spaces on one computer or on different computers that are linked via a network using CORBA (Common Object Request Broker Architecture) .
- each server is wrapped as a distributed object which can be accessed by remote clients via method invocations.
- Multiple instances of the feature extraction modules 21 can run in parallel on different machines, and database storage can be spread across multiple platforms.
- the disclosed system may be highly modularized, thus allowing a variety of configurations and embodiments.
- the feature extraction modules 21 in the indexing module 20 may be run on inexpensive parallel systems of machines, like Beowulf clusters of Celeron PCs, and Clusters of Workstations (COW) technology consisting of dual processor SUN Ultra 60 systems.
- the entire architecture of Fig. 2 may be deployed across an Intranet, with the "inverse inference" search engine 23 residing on a Sun Ultra 60 server and multiple GUI clients 25 on Unix and Windows platforms.
- the disclosed system may be deployed entirely on a laptop computer executing the Windows operating system of Microsoft Corporation.
- the indexing module 20 performs steps to reduce the original documents 27 and a query received from one of the clients 21 into symbolic form (i.e. a term-document matrix and a query vector, respectively) .
- the steps performed by the indexing module 20 can be run in batch mode (when indexing a large collection of documents for the first time or updating the indices) or on-line (when processing query tokens) .
- the disclosed architecture allows extensibility of the indexing module 20 to media other than electronic text.
- the storage module 22 shown in Fig. 2 includes a Relational DataBase Management System (RDBMS) 29, for storing the term-document matrix.
- a search engine module 23 implements the presently disclosed inverse inference search technique. These functions provide infrastructures to search, cluster data, and establish conceptual links across the entire document database.
- Client GUIs (Graphical User Interfaces) 25 permit users to pose queries, browse query results, and inspect documents.
- GUI components may be written in the Java programming language provided by Sun Microsystems, using the standard JDK 1.1 and accompanying Swing Set.
- Various visual interface modules may be employed in connection with the GUI clients 25, for example executing in connection with the Sun Solaris operating system of Sun Microsystems, or in connection with the Windows NT, Windows 95, or Windows 98 operating systems of Microsoft Corporation.
- a feature extraction module 21 comprises a parser module 31, a stopwording module 33, a stemming module 35, and a module for generating inverted indices 37.
- the output of the indexing process using the feature extraction module 21 includes a number of inverted files (Hartman et al, 1992, No. 38 in Appendix A) , shown as the "term-document" or "information" matrix 39.
- the parser 31 removes punctuation and records relative word order.
- the parser 31 employs a set of rules to detect acronyms before they go through the stopword 33 and stemmer 35 modules.
- the parser 31 can also recognize specific HTML, SGML and XML tags.
- the stopword module 33 uses a list of non-diagnostic English terms.
- the stemmer 35 is based on the Porter algorithm (described in Hartman et al., 1992, No. 38 in Appendix A).
- Those skilled in the art should recognize that alternative embodiments of the disclosed system may employ stemming methods based on successor variety.
- the feature extraction module provides functions 37 that generate the inverted indices by transposing individual document statistics into a term-document matrix 39.
- the indexing performed in the embodiment shown in Fig. 3 also supports indexing of document attributes.
- document attributes are HTML, SGML or XML document tags, like date, author, source.
- Each document attribute is allocated a private row for entry in the term-document matrix.
- weighting of the elements of the term-document matrix 39 may reflect absolute term frequency count, binary count, or any of several other measures of term distributions that combine local weighting of a matrix element with a global entropy weight for a term across the document collection, such as inverse document frequency.
- high precision and recall results are obtained with the following weighting scheme for an element d_ik of the term-document matrix: w_ik = (tf_ik / n_i) · idf_k, where n_i is the total number of words in document i.
- tf_ik is the frequency of term k in a document i, while the inverse document frequency of a term, idf_k, is the log of the ratio of the total number of documents in the collection to the number of documents containing that term.
- w_ik is the weighting applied to the value in cell ik of the term-document matrix. The effect of these weightings is to normalize the statistics of term frequency counts. This step weights the term frequency counts according to: 1) the length of the document in which the term occurs and 2) how common the term is across documents. To illustrate the significance of this weighting step with regard to document length, consider a term equal to the word "Clinton".
- An electronic text document that is a 300 page thesis on Cuban-American relationships may, for example, have 35 counts of this term, while a 2 page biographical article on Bill Clinton may have 15 counts. Normalizing keyword counts by the total number of words in a document prevents the 300 page thesis from being prioritized over the biographical article for the user query "Bill Clinton".
- the weighting step prevents overemphasis of terms that have a high probability of occurring everywhere.
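The length and commonality normalizations described above can be sketched as follows. This is a common tf-idf variant (term frequency divided by document length, times idf_k = log(N/n_k)); the patent's exact formula is not reproduced in the text, so the combination used here is an assumption, and the example documents are illustrative.

```python
import math
from collections import Counter

def weight_matrix(docs):
    """Build a term-document matrix whose elements combine a local
    term-frequency weight (normalized by document length) with a
    global inverse-document-frequency weight across the collection."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # idf_k = log(N / n_k): N documents total, n_k documents containing term k
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    matrix = []
    for term in vocab:
        row = []
        for d in docs:
            tf = d.count(term) / len(d)   # normalize by document length
            row.append(tf * idf[term])
        matrix.append(row)
    return vocab, matrix

docs = [["clinton", "cuba", "policy"] * 100,   # long, thesis-like document
        ["clinton", "biography"] * 5]          # short biographical article
vocab, M = weight_matrix(docs)
```

Note how "clinton", which occurs in every document, receives a zero global weight, while collection-discriminating terms such as "cuba" keep a positive weight.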
- the storage module 22 of Fig. 2 includes a Relational DataBase Management System (RDBMS) 29 for storing the information matrix 39 (also referred to as the "term-document" matrix) output by the indexing module 20.
- RDBMS Relational DataBase Management System
- the interface between the RDBMS and the Indexing and Search modules complies with ODBC standards, making the storage module vendor independent.
- the Enterprise Edition of Oracle 8.1.5 on Sun Solaris may be employed.
- a database management system is not an essential component of the disclosed invention.
- a file system may be employed for this purpose, instead of a RDBMS.
- the concept synchronizer 28 is used by a parallelized implementation of the indexing module. In such an implementation, at indexing time, multiple processors parse and index electronic text files in parallel.
- the concept synchronizer 28 maintains a look up table of concept identification numbers, so that when one processor encounters a keyword which has already been assigned a concept identification number by another processor, the same concept identification number is used, instead of creating a new one. In this way, the concept synchronizer 28 prevents having more than one row for the same term in the term-document matrix.
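The look-up-table behaviour of the concept synchronizer can be sketched as below. This is a minimal thread-based stand-in for the multi-processor setting described above; the class name and locking scheme are illustrative assumptions, not the patent's implementation.

```python
import threading

class ConceptSynchronizer:
    """Maintain a shared keyword -> concept-id lookup table so that
    parallel indexers assign each term exactly one concept id, and
    therefore exactly one row in the term-document matrix."""
    def __init__(self):
        self._ids = {}
        self._lock = threading.Lock()

    def concept_id(self, keyword):
        with self._lock:
            if keyword not in self._ids:
                self._ids[keyword] = len(self._ids)  # next free row number
            return self._ids[keyword]

sync = ConceptSynchronizer()
# two "processors" encountering the same keyword get the same id
assert sync.concept_id("theatre") == sync.concept_id("theatre")
```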
- the search engine 23 is based on a data driven inductive learning model, of which LSI is an example (Berry et al, 1995, No. 5 in Appendix A; Landauer and Dumais, 1997. No. 20 in Appendix A).
- LSI data driven inductive learning model
- the disclosed system provides distinct advantages with regard to: 1) mathematical procedure; 2) precision of the search; 3) speed of computations and 4) scalability to large information matrices.
- the disclosed system attempts to overcome the problems of existing systems related to synonymy and polysemy using a data driven approach. In other words, instead of using a lexical knowledge base built manually by experts, the disclosed system builds one automatically from the observed statistical distribution of terms and word co-occurrences in the document database.
- Fig. 4a shows an example of a term-document matrix 40, used for cross-language document retrieval in the disclosed system.
- the term-document matrix 40 illustrates the embodiment of the disclosed system in which a single matrix is used, and the reference documents (R) are documents for which there is a translation in every language of a predetermined set of languages. Accordingly, the reference documents in the example of Fig. 4a are shown as Rl, R2, R3, R4, R5 and R6.
- the term- document matrix 40 of Fig. 4a consists, for example, of elements storing values representing absolute keyword frequencies.
- Term-document matrix 40 is shown including a set of rows 42 for English keywords, a set of rows 44 for French keywords, and a set of rows 46 for Italian keywords.
- the term-document matrix 40 is further shown including a set of columns 48 describing the contents of the reference documents.
- Each column in the set of columns 48 describes the contents of a document for which there exist translations in each language of the predetermined language set, in this case English, French and Italian.
- the translations used within a single column need not be literal translations, but must at least share semantic content. Accordingly, the contents of the English version of reference document Rl are reflected in the values of column Rl in the set of rows 42, the contents of the French version of the reference document Rl are reflected in the values of column Rl in the set of rows 44, and the contents of the Italian version of the reference document Rl are reflected in the values of column Rl in the set of rows 46.
- the term-document matrix 40 is further shown including a set of columns 50 describing the contents of a number of target documents.
- the columns TE1, TE2, TE3, and TE4 represent the contents of English language target documents
- the columns TF1, TF2, and TF3 represent the contents of French language target documents
- the columns TI1, TI2, TI3 and TI4 represent the contents of Italian language target documents.
- the target documents are those documents for which translations are not available in all of the languages in the predetermined set of languages.
- the column TE1 describes the contents of the target document TE1
- the column TE2 describes the contents of the target document TE2, and so on.
- the keywords present in a given target document are those keywords in the language in which that target document is written.
- the matrix elements for a given one of the columns 50 are zero outside of the set of rows for the language of the specific target document.
- the matrix element values of columns TE1, TE2, TE3, and TE4 are zero outside of the set of rows 42
- the matrix element values of columns TF1, TF2, and TF3 are zero outside of the set of rows 44
- the matrix element values of columns TI1, TI2, TI3 and TI4 are zero outside of the set of rows 46.
- Non-zero matrix element values for keywords in languages other than the source language of a given document may reflect the presence of language invariant keywords. In the example of Fig. 4a, the keyword Shakespeare illustrates such a language invariant keyword.
- the reference document keyword content results in translations of keywords being present in each of the sets of rows 42, 44 and 46.
- the target documents may include keywords not found in the reference documents.
- the keyword content of the target documents would result in one or more keywords existing in only one of the languages in the predetermined set of languages, without translation to the other languages.
- the terms "sail”, “cuir” and “torre” in the term-document matrix of Fig. 4a are additional terms not present in the reference documents.
- Fig. 4b shows two term document matrices, illustrating the embodiment of the disclosed system in which multiple matrices are used, where the reference documents (R) for a given one of the matrices are documents for which versions are available in only two of the languages in the predetermined set of languages.
- the term-document matrix 52 of Fig. 4b is shown including a set of rows 56 for English keywords, and a set of rows 58 for French keywords.
- the matrix 52 further is shown including a set of columns 60 describing the contents of reference documents Rl, R2, R3, R4, R5 and R6.
- the set of columns 62 in matrix 52 describes the contents of English target documents TE1, TE2, TE3 and TE4, as well as French target documents TF1, TF2 and TF3.
- the matrix 54 is shown including a set of rows 64 for English keywords, and a set of rows 66 for Italian keywords.
- the matrix 54 further includes columns 68 for the contents of the reference documents Rl, R2, R3, R4, R5 and R6.
- the columns 70 describe the contents of the English target documents TE1, TE2, TE3, and TE4, and the contents of the Italian target documents TI1, TI2, TI3 and TI4.
- LSI assumes that there is some underlying or latent structure in term usage. This structure is partially obscured through variability in the individual term attributes which are extracted from a document or used in the query.
- a truncated singular value decomposition (SVD) is used to estimate the structure in word usage across documents.
- SVD singular value decomposition
- let D be an m x n term-document or information matrix with m > n, where each element d_ij is some statistical indicator (binary, term frequency, or Inverse Document Frequency (IDF) weights; more complex statistical measures of term distribution could be supported) of the occurrence of term i in a particular document j, and let q be the input query.
- LSI approximates D by its truncated singular value decomposition, D ≈ D_k = U_k S_k V_k^T (1), where U_k and V_k contain the first k left and right singular vectors of D, and S_k the k largest singular values.
- the weighted left orthogonal matrix S_k^{-1} U_k^T provides a transform operator for both documents and queries.
- the cosine metric is then employed to measure the similarity between the transformed query q_hat = S_k^{-1} U_k^T q and the transformed document vectors (rows of V_k) in the reduced k-dimensional space.
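The rank-k approximation and cosine ranking described above can be sketched with numpy. This is the standard LSI procedure (truncated SVD, query fold-in, cosine comparison), not code from the patent; the toy matrix is illustrative.

```python
import numpy as np

def lsi_rank(D, q, k):
    """Rank documents against a query with a rank-k LSI model:
    D ~= U_k S_k V_k^T; the query is folded into the k-dimensional
    space as q_hat = S_k^{-1} U_k^T q and compared to the rows of
    V_k with the cosine metric."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T        # rows of Vk = documents
    q_hat = (Uk.T @ q) / sk                       # fold the query in
    sims = Vk @ q_hat
    sims = sims / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat) + 1e-12)
    return np.argsort(-sims)                      # best documents first

# terms x documents: doc 0 and doc 1 use disjoint vocabularies
D = np.array([[3., 0., 1.],
              [2., 0., 0.],
              [0., 4., 0.],
              [0., 3., 1.]])
q = np.array([3., 2., 0., 0.])   # query matching doc 0's terms
order = lsi_rank(D, q, k=2)
```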
- the SVD employed by the LSI technique of equation (1) above provides a special solution to the overdetermined decomposition problem D ≈ ΨA, q ≈ Ψa
- D is an m x n term-document matrix
- q is a query vector with m elements
- A and a are a k x n matrix and a k-length vector of transform coefficients, respectively.
- the columns of A are document transforms, whereas a is the query transform.
- Ranking a document against a query is a matter of comparing a and the corresponding column of A in a reduced transform space spanned by Ψ.
- the decomposition of an overdetermined system is not unique. This nonuniqueness provides the possibility of adaptation, i.e. of choosing, among the many possible representations or transform spaces, one that is better suited for the purposes of the disclosed system.
- the columns of U k and V k are the first k orthonormal eigenvectors associated with DD T and D T D respectively.
- in the LSI special case of equation (1), Ψ = U_k and A = S_k V_k^T
- the columns of Ψ are a set of norm-preserving, orthonormal basis functions. If we use the cosine metric to measure the distance between the transformed documents and query, we can show that as k -> n the similarities computed in the transform space approach those computed in the original term-document space.
- the present invention is based on the recognition that the measurement of the distance between the transformed documents and query, as stated above, is a special solution to the more general inverse optimization problem c_i = s_i + n_i = ∫ r_i(x) w(x) dx + n_i
- where c_i is a noisy and imprecise datum, consisting of a signal s_i and noise n_i; r_i(x) is a linear response kernel, and w(x) is a model about which information is to be determined.
- in the disclosed system the datum is the user's query: q_i = q~_i + n_i = ∫ D_i(x) w(x) dx + n_i (3). In equation (3), q_i, an element of the query datum, is one of an imprecise collection of terms and term weights input by the user; q~_i is the best choice of terms and term weights that the user could have input to retrieve the documents that are most relevant to a given search, and n_i is the difference between the user's choice and such an ideal set of input terms and term weights.
- D_i(x), a statistical measure of term distribution across the document collection, describes the system response.
- the subscript i is the term number; x is the document dimension (or the document number, when x is discretized).
- the statistical measure of term distribution may be simple binary, frequency, or inverse document frequency indices, or more refined statistical indices.
- the model is an unknown document distance w (x) that satisfies the query datum in a semantic transform space. Equation (3) above is also referred to as the forward model equation.
- the solution of equation (3) is non-unique.
- the optimization principle illustrated by equation (2) above considers two positive functionals of w, one of which, B[w], quantifies a property of the solution, while the other, A[w], quantifies the degree of fit to the input data.
- the present system operates to minimize A[w] subject to the constraint that B[w] has some particular value, by the method of Lagrange multipliers: minimize A[w] + λ B[w]
- λ is a Lagrange multiplier.
- the Backus-Gilbert method "differs from other regularization methods in the nature of its functionals A and B.” (Press et al, 1997, No. 32 in Appendix A). These functionals maximize both the stability (B) and the resolving power (A) of the solution.
- An additional distinguishing feature is that, unlike what happens in conventional methods, the choice of the constant λ, which determines the relative weighting of A versus B, can easily be made before any actual data is processed.
- the document-query distances w_c appear as a linear combination of the transformed documents T_i(x) and the terms of the input query q_i: w_c(x) = Σ_i q_i T_i(x) (5), where i is the term number.
- the inverse response kernels T_i(x) reverse the relationship established by the linear response kernels D_i(x) in the forward model equation (3).
- the D_i(x)'s are binary, frequency, or inverse document frequency distributions.
- the integral of each term distribution D_i(x) is defined in the illustrative embodiment as R_i = ∫ D_i(x) dx
- δ is a resolution kernel, whose width or spread is minimized by the disclosed system in order to maximize the resolving power of the solution. If we substitute equation (5) into equation (3) we arrive at an explicit expression for the resolution kernel δ
- the Backus and Gilbert method chooses to minimize the second moment of the width or spread of δ at each value of x, while requiring it to have unit area.
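The patent's solver uses the Backus-Gilbert functionals. As a runnable illustration of the same fit-versus-stability trade-off governed by a multiplier λ, the sketch below substitutes a simple Tikhonov (ridge) regularizer for the discretized forward model q ≈ D w; it is an analogy, not the disclosed method, and the toy matrix is illustrative.

```python
import numpy as np

def query_relevance(D, q, lam):
    """Solve the discretized forward model q ~= D w for a document
    relevance vector w by minimizing ||D w - q||^2 + lam * ||w||^2.
    Larger lam trades fidelity to the query datum for a more stable
    (smaller-norm, coarser) solution."""
    n = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ q)

# terms x documents: doc 1 shares no keyword with the query,
# but co-occurs with doc 0 through term 1
D = np.array([[1., 0.],
              [1., 1.],
              [0., 1.]])
q = np.array([1., 0., 0.])        # query contains term 0 only
w = query_relevance(D, q, lam=0.5)
```

As the text notes for the regularized solution, every document receives a relevance weight, including doc 1, which contains none of the query keywords; its weight here comes out negative, echoing the sign-based split into direct and latent associations described below.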
- Optional parameters available in an illustrative embodiment are: 1) the dimensionality of the semantic transform space; 2) latent term feedback; 3) latent document list; 4) document feedback.
- the value of the Lagrangian multiplier λ in (7) determines the dimensionality of the transform space. The larger the value of λ, the smaller the number of concepts in transform space, and the coarser the clustering of documents.
- the effect of the regularization is that relevance weights are assigned more uniformly across a document collection. A relevance judgement is forced even for those documents which do not explicitly contain the keywords in the user query. These documents may contain relevant keyword structures in transform space.
- the disclosed system derives latent feedback by sorting the coefficients in the solution to equation (7). Positive coefficients are associated with semantic bases which contain the keywords in the query; negative coefficients are associated with semantic bases which contain latent keywords.
- Fig. 5 shows the inverse optimization problem solved for a number of single keyword queries q 72.
- the output consists of direct concept feedback q'+ 76, which consists of concepts directly related to q in the source language, for example English in Fig. 5.
- the output further includes latent concept feedback q'- 78, which consists of French language concepts never associated with the English language q, but found in similar semantic relations across the two languages.
- This latent concept feedback (q'-) is shown for purposes of illustration as French concepts in Fig. 5.
- Also returned are lists of relevant documents for the two languages, shown as a list 77 of relevant English documents, and a list 79 of relevant French documents.
- FIG. 6 illustrates a list of documents returned by the illustrative embodiment in response to the English language query 200 consisting of "theatre, comedy.”
- Two separate ranked lists are returned: a first list 202 of direct hits, and a second list 204 of latent hits.
- Foreign language documents are found prevalently in the second list 204.
- Some French documents appear in the first list 202 because they contain one of the keywords in the query, "theatre.”
- a by-product of the disclosed system for cross language retrieval is the alignment of semantic axes for the English, French and Italian subspaces, shown as Direct Keyword Suggestion and Relative Weights 206 and Latent Keyword Suggestion and Relative Weights 208.
- the distances between keywords in the three languages are generated as the absolute weights that each keyword should have in a fully multilingual query. That is, in response to the monolingual query theatre, comedy the engine retrieves multilingual documents, and also suggests to the user the foreign language keywords in 206 and 208, as well as the respective relative weights 210 and 212 that a fully multilingual query should have. Note that the keyword theatre is weighted twice as much as the Italian teatro, since it applies to twice as many languages (English and French). The keyword Shakespeare dominates the latent semantic space since it is the same in all languages.
- Fig. 7 illustrates semantic keyword feedback obtained by isolating positive and negative coefficients in the truncated basis function expansion for the query approximation q_c, in the disclosed automatic knowledge based training embodiment.
- the inverse optimization problem is solved for a single keyword query q 172, shown for purposes of illustration as the word "wind".
- the left hand partition of the term-document matrix provided as input consists of training information, for example the contents of the Encarta encyclopedia.
- the disclosed system then operates to form semantic relationships based on the contents of the training information, but returns results to the user only from the target documents described in the right hand side partition of the input term-document matrix, which represents the documents in the search space.
- the automatic knowledge based training embodiment of the disclosed system may be used to find information in the search space that is semantically relevant to the input query.
- the disclosed system returns direct concept feedback q_c+ 176, consisting of concepts in the target documents that are directly related to a term or terms from q 172, and latent concept feedback q_c- 178, consisting of concepts never associated directly with the query term 172 in the target documents, but semantically linked within the reference documents to a term or terms from q 172.
- the list of directly relevant terms q_c+ 176 is shown for purposes of illustration as consisting of the terms "WIND" and "STORM", while the list of indirectly relevant terms q_c- 178 is shown consisting of the terms "hurricane, snow, mph, rain, weather, flood, thunderstorm, tornado".
- in Fig. 7 the disclosed system is shown generating two lists of relevant documents: a list of direct documents 174, and a list of latent documents 175.
- the list of direct documents 174 indicates a number of relevant documents that contain one or more of the input query keywords.
- the list of indirect documents 175 indicates a number of relevant documents that do not contain a keyword from the input query.
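The split between the two lists reduces to a keyword-membership test over the ranked results. A minimal sketch, with illustrative document ids and term sets:

```python
def split_hits(ranked_docs, doc_terms, query_terms):
    """Split a ranked result list into direct hits (documents that
    contain at least one query keyword) and latent hits (relevant
    documents that contain none), as in Figs. 6 and 7."""
    direct = [d for d in ranked_docs if doc_terms[d] & query_terms]
    latent = [d for d in ranked_docs if not doc_terms[d] & query_terms]
    return direct, latent

ranked = ["d1", "d2", "d3"]   # already ordered by relevance
terms = {"d1": {"theatre", "comedy"},
         "d2": {"teatro"},
         "d3": {"commedia"}}
direct, latent = split_hits(ranked, terms, {"theatre"})
```

Both lists preserve the relevance ordering; foreign-language documents, which rarely contain the literal query keywords, fall mostly into the latent list, as the text observes.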
- the programs defining the functions of the present invention can be delivered to a computer in many forms, including, but not limited to: (a) information permanently stored on non-writable storage media (e.g. read only memory devices within a computer such as ROM or CD-ROM disks readable by a computer I/O attachment); (b) information alterably stored on writable storage media (e.g. floppy disks and hard drives); or (c) information conveyed to a computer through communication media, for example using baseband or broadband signaling techniques, including carrier wave signaling techniques, such as over computer or telephone networks via a modem.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP01977165.8A EP1323067A4 (en) | 2000-09-25 | 2001-09-25 | Extended functionality for an inverse inference engine based web search |
AU2001296304A AU2001296304A1 (en) | 2000-09-25 | 2001-09-25 | Extended functionality for an inverse inference engine based web search |
CA2423476A CA2423476C (en) | 2000-09-25 | 2001-09-25 | Extended functionality for an inverse inference engine based web search |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US23525500P | 2000-09-25 | 2000-09-25 | |
US60/235,255 | 2000-09-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2002027536A1 true WO2002027536A1 (en) | 2002-04-04 |
Family
ID=22884742
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2001/029943 WO2002027536A1 (en) | 2000-09-25 | 2001-09-25 | Extended functionality for an inverse inference engine based web search |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1323067A4 (en) |
AU (1) | AU2001296304A1 (en) |
CA (1) | CA2423476C (en) |
WO (1) | WO2002027536A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4839853A (en) * | 1988-09-15 | 1989-06-13 | Bell Communications Research, Inc. | Computer information retrieval using latent semantic structure |
US5301109A (en) | 1990-06-11 | 1994-04-05 | Bell Communications Research, Inc. | Computerized cross-language document retrieval using latent semantic indexing |
US5794178A (en) * | 1993-09-20 | 1998-08-11 | Hnc Software, Inc. | Visualization of information using graphical representations of context vector based relationships and attributes |
US6006221A (en) * | 1995-08-16 | 1999-12-21 | Syracuse University | Multilingual document retrieval system and method using semantic vector matching |
US6064951A (en) * | 1997-12-11 | 2000-05-16 | Electronic And Telecommunications Research Institute | Query transformation system and method enabling retrieval of multilingual web documents |
2001
- 2001-09-25 CA CA2423476A patent/CA2423476C/en not_active Expired - Fee Related
- 2001-09-25 EP EP01977165.8A patent/EP1323067A4/en not_active Ceased
- 2001-09-25 AU AU2001296304A patent/AU2001296304A1/en not_active Abandoned
- 2001-09-25 WO PCT/US2001/029943 patent/WO2002027536A1/en active Application Filing
Non-Patent Citations (4)
Title |
---|
DEERWESTER ET AL., APPENDIX A, 1990 |
DUMAIS ET AL.: "Automatic cross-language information retrieval using latent semantic indexing", October 1996 (1996-10-01), pages 1 - 11, XP002949693 * |
PRESS, APPENDIX A, 1997 |
See also references of EP1323067A4 * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7526425B2 (en) | 2001-08-14 | 2009-04-28 | Evri Inc. | Method and system for extending keyword searching to syntactically and semantically annotated data |
US8131540B2 (en) | 2001-08-14 | 2012-03-06 | Evri, Inc. | Method and system for extending keyword searching to syntactically and semantically annotated data |
US7953593B2 (en) | 2001-08-14 | 2011-05-31 | Evri, Inc. | Method and system for extending keyword searching to syntactically and semantically annotated data |
US7398201B2 (en) | 2001-08-14 | 2008-07-08 | Evri Inc. | Method and system for enhanced data searching |
WO2006068872A2 (en) * | 2004-12-13 | 2006-06-29 | Insightful Corporation | Method and system for extending keyword searching to syntactically and semantically annotated data |
WO2006068872A3 (en) * | 2004-12-13 | 2006-09-28 | Insightful Corp | Method and system for extending keyword searching to syntactically and semantically annotated data |
JP4881878B2 (en) * | 2005-01-04 | 2012-02-22 | トムソン ルーターズ グローバル リソーシーズ | Systems, methods, software, and interfaces for multilingual information retrieval |
WO2006074324A1 (en) * | 2005-01-04 | 2006-07-13 | Thomson Global Resources | Systems, methods, software, and interfaces for multilingual information retrieval |
US9418139B2 (en) | 2005-01-04 | 2016-08-16 | Thomson Reuters Global Resources | Systems, methods, software, and interfaces for multilingual information retrieval |
US8856096B2 (en) | 2005-11-16 | 2014-10-07 | Vcvc Iii Llc | Extending keyword searching to syntactically and semantically annotated data |
US9378285B2 (en) | 2005-11-16 | 2016-06-28 | Vcvc Iii Llc | Extending keyword searching to syntactically and semantically annotated data |
US8954469B2 (en) | 2007-03-14 | 2015-02-10 | Vcvciii Llc | Query templates and labeled search tip system, methods, and techniques |
US9934313B2 (en) | 2007-03-14 | 2018-04-03 | Fiver Llc | Query templates and labeled search tip system, methods and techniques |
US10282389B2 (en) | 2007-10-17 | 2019-05-07 | Fiver Llc | NLP-based entity recognition and disambiguation |
US9471670B2 (en) | 2007-10-17 | 2016-10-18 | Vcvc Iii Llc | NLP-based content recommender |
US9613004B2 (en) | 2007-10-17 | 2017-04-04 | Vcvc Iii Llc | NLP-based entity recognition and disambiguation |
US9710556B2 (en) | 2010-03-01 | 2017-07-18 | Vcvc Iii Llc | Content recommendation based on collections of entities |
US9092416B2 (en) | 2010-03-30 | 2015-07-28 | Vcvc Iii Llc | NLP-based systems and methods for providing quotations |
US10331783B2 (en) | 2010-03-30 | 2019-06-25 | Fiver Llc | NLP-based systems and methods for providing quotations |
US8838633B2 (en) | 2010-08-11 | 2014-09-16 | Vcvc Iii Llc | NLP-based sentiment analysis |
US9405848B2 (en) | 2010-09-15 | 2016-08-02 | Vcvc Iii Llc | Recommending mobile device activities |
US10049150B2 (en) | 2010-11-01 | 2018-08-14 | Fiver Llc | Category-based content recommendation |
US9116995B2 (en) | 2011-03-30 | 2015-08-25 | Vcvc Iii Llc | Cluster-based identification of news stories |
CN108984647A (en) * | 2018-06-26 | 2018-12-11 | 北京工业大学 | A kind of water utilities domain knowledge map construction method based on Chinese text |
Also Published As
Publication number | Publication date |
---|---|
EP1323067A1 (en) | 2003-07-02 |
AU2001296304A1 (en) | 2002-04-08 |
CA2423476C (en) | 2010-07-20 |
CA2423476A1 (en) | 2002-04-04 |
EP1323067A4 (en) | 2013-11-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6757646B2 (en) | Extended functionality for an inverse inference engine based web search | |
US6510406B1 (en) | Inverse inference engine for high performance web search | |
US6862710B1 (en) | Internet navigation using soft hyperlinks | |
US7085771B2 (en) | System and method for automatically discovering a hierarchy of concepts from a corpus of documents | |
Anick et al. | The paraphrase search assistant: terminological feedback for iterative information seeking | |
Korenius et al. | On principal component analysis, cosine and Euclidean measures in information retrieval | |
Moldovan et al. | Using wordnet and lexical operators to improve internet searches | |
Dumais | Latent semantic indexing (LSI) and TREC-2 | |
Karypis et al. | Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval & categorization | |
US6480843B2 (en) | Supporting web-query expansion efficiently using multi-granularity indexing and query processing | |
Pons-Porrata et al. | Topic discovery based on text mining techniques | |
US7831597B2 (en) | Text summarization method and apparatus using a multidimensional subspace | |
Liddy et al. | Text categorization for multiple users based on semantic features from a machine-readable dictionary | |
CA2423476C (en) | Extended functionality for an inverse inference engine based web search | |
Lam et al. | Using contextual analysis for news event detection | |
Dumais | LSA and information retrieval: Getting back to basics | |
Hull | Information retrieval using statistical classification | |
Phadnis et al. | Framework for document retrieval using latent semantic indexing | |
Park et al. | Automatic query-based personalized summarization that uses pseudo relevance feedback with nmf | |
Momin et al. | Web document clustering using document index graph | |
Mehler et al. | Text mining | |
He et al. | Mining a web citation database for document clustering | |
Korenius et al. | Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments | |
Piotrowski | NLP-supported full-text retrieval | |
Wang et al. | Document Clustering using Compound Words. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 2423476 Country of ref document: CA |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2001977165 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 2001977165 Country of ref document: EP |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
NENP | Non-entry into the national phase |
Ref country code: JP |