CN102024026B - Method and system for processing query terms

Info

Publication number: CN102024026B
Application number: CN2010105465806A
Authority: CN
Inventors: 鲁齐拉·S·达特; 法比奥·洛皮亚诺
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2006-04-19
Filing date: 2007-04-19
Publication date: 2013-03-27
Anticipated expiration: 2027-04-19
Also published as: EP2016486A4; CN102024026A; EP2016486A2; WO2007124385A2; WO2007124385A3

Abstract

Methods, systems, and apparatus, including computer program products, to perform operations relating to processing query terms in a search query presented to a search engine. In one aspect, a method includes determining a query language from the query terms and the language of a user interface. In another aspect, a method includes using the interface language to select one or more mappings and using the mappings to simplify each query term; and applying each simplified query term to a synonyms map to identify possible synonyms with which to augment the search query. In another aspect, a synonyms map is generated from a corpus of documents. In another aspect, a method includes identifying one or more potential synonyms for a query term by looking up a simplified query term in a synonyms map, the synonyms map mapping each of a plurality of keys to one or more variants, each variant being a word associated with one or more document languages.

Description

Method and system for processing query terms

This application is a divisional of Chinese patent application No. 200780021902.1, the Chinese national phase of a PCT application filed on April 19, 2007 (date of entry into the Chinese national phase: December 12, 2008).

Background

The present invention relates to handling linguistic ambiguity when processing search queries and when searching a corpus containing documents and other searchable resources, where both the queries and the resources may be expressed in any of a number of different languages.

Search engines index documents and provide a means of searching for documents whose content the search engine has indexed. Documents are written in many different languages, and some documents have content in multiple languages. A variety of characters are used to represent the words of these languages: the Latin alphabet (that is, the 26 unaccented characters from A to Z, in upper and lower case), diacritics (that is, accented characters), ligatures (for example, β), Cyrillic characters, and others.

Regrettably, the ability to produce these characters, and the ease of doing so, vary greatly from device to device. Authors of content and users of search engines may be unable to conveniently produce the characters they would prefer. Instead, users of such devices will often provide a character or string of characters as a close substitute. For example, AE may be provided in place of Æ. Such substitution conventions differ between languages and between users. For example, some users who search for AE may prefer to also see results that contain Æ.

One approach to solving this problem in search engines is to process the indexed content to remove accents and to convert special characters to a standard set of characters. This approach removes information from the index, so that it is no longer possible to search for only a specific accented instance of a word. The approach also suffers from language agnosticism: it is insensitive to users whose expectations are shaped by the conventions of their particular language.

Summary of the invention

This specification discloses various embodiments of technologies for using the terms of a search query. Embodiments feature methods, systems, and apparatus, including computer program product apparatus. Each of these is described in this summary with reference to the methods, for which there are corresponding systems and apparatus.

In general, in one aspect, a method features: receiving from a user, through a user interface, a search query comprising one or more query terms, the user interface having an interface language that is a natural language; and determining a query language for the query from the query terms and the interface language, the query language being a natural language. These and other embodiments can optionally include one or more of the following features. The method includes determining a score for each of multiple languages, the score indicating the likelihood that the query language is that one of the multiple languages. The method includes using the query language to select one or more mappings and using the selected mappings to reduce each query term to a corresponding simplified query term; and applying each simplified query term to a synonym mapping table to identify possible synonyms with which to augment the search query.

In general, in another aspect, a method features: receiving from a user, through a user interface, a search query made up of one or more query terms, the user interface having an interface language that is a natural language; using the interface language to select one or more mappings and using the selected mappings to reduce each query term to a corresponding simplified query term; and applying each simplified query term to a synonym mapping table to identify possible synonyms with which to augment the search query.

In general, in another aspect, a method features: generating a synonym mapping table from a corpus of documents, each document having a document language attributed to it, each document language being a natural language; wherein the synonym mapping table maps each of a plurality of keys to one or more corresponding variants, and each variant is associated with one or more of the document languages. These and other embodiments can optionally include one or more of the following features. The method includes associating each variant, for each associated language, with a score indicating the relative frequency of the variant among all variants for the same key in the associated language. The document language attributed to each document is determined automatically.

In general, in another aspect, a method features: generating a synonym mapping table from a corpus of documents by applying a first set of language-dependent mappings to words in the corpus to generate keys for the mapping table, each document having a document language attributed to it, the document language attributed to each document being used to determine the language-dependent mappings applied to the words in that document. These and other embodiments can optionally include one or more of the following features. The method includes generating a simplified query term from each query term of a search query by applying a second set of language-dependent mappings to each query term, the search query having a query language attributed to it, the attributed query language being used to determine the language-dependent mappings applied to each query term. The first set of language-dependent mappings differs from the second set of language-dependent mappings.

In general, in another aspect, a method features: generating a synonym mapping table from a corpus of documents by applying a first set of language-dependent mappings to words in the corpus to generate keys for the mapping table, each document having a document language attributed to it, the document language attributed to each document being used to determine the language-dependent mappings applied to the words in that document; and generating a simplified query term from a search query by applying a second set of language-dependent mappings to a query term in the search query, the search query having a query language attributed to it, the attributed query language being used to determine the language-dependent mappings applied to the query term; wherein the search query includes a first query term, the language-dependent mapping of the second set that is determined by the query language maps the first query term to a first simplified query term, the language-dependent mapping of the first set that is determined by the query language maps the first query term to a first key, and the first simplified query term differs from the first key. These and other embodiments can optionally include one or more of the following features. The method includes attributing the interface language to the query as the query language.

In general, in another aspect, a method features: receiving from a user, through a user interface, a search query comprising a query term, the search query having a query language attributed to it; deriving a simplified query term from the query term; and identifying one or more potential synonyms for the query term by looking up the simplified query term in a synonym mapping table, the synonym mapping table mapping each of a plurality of keys to one or more corresponding variants, each variant being a word associated with one or more document languages, and each variant being associated, for each associated language, with a variant-language score indicating the relative frequency of the variant among all variants for the same key in the associated language. These and other embodiments can optionally include one or more of the following features. The method includes using the attributed query language and the variant-language scores of one or more variants of the simplified query term to select variants for use in augmenting the search query. The method includes attributing the interface language to the query as the query language. Where the search query has multiple query languages attributed to it, each with its own query-language score, the method further includes using (a) the query-language scores and (b) the variant-language scores of one or more variants of the simplified query term to select variants for use in augmenting the search query. Using the query-language scores and the variant-language scores includes summing, over all languages, the product of the query-language score for each language and the variant-language score for that language.
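The sum of products described in this aspect can be illustrated with a minimal sketch. The function name `variant_score`, the language codes, and every score value here are hypothetical and chosen only for illustration; a language missing from a variant's scores is assumed to contribute zero.

```python
def variant_score(query_language_scores, variant_language_scores):
    """Sum, over all languages, of query-language score times variant-language score."""
    return sum(
        q_score * variant_language_scores.get(language, 0.0)
        for language, q_score in query_language_scores.items()
    )

# Hypothetical query-language scores: the query is probably French.
query_scores = {"en": 0.25, "fr": 0.75}

# Hypothetical variant-language scores for two variants of one key.
variants = {
    "éléphant": {"en": 0.5, "fr": 0.75},
    "eléphant": {"en": 0.5, "fr": 0.25},
}

# Rank the variants by their combined score, best first.
ranked = sorted(
    variants,
    key=lambda v: variant_score(query_scores, variants[v]),
    reverse=True,
)
```

Under these assumed values, the variant that is more common in the likelier query language ranks first.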

In general, in another aspect, a method features: receiving from a user, through a user interface, a search query made up of one or more query terms; and receiving an indication of a user preference for using transliteration in simplifying the query terms of the search query. These and other embodiments can optionally include one or more of the following features. The method includes: if the user preference is to use transliteration, using transliteration in simplifying the query terms of the search query to generate simplified query terms, and otherwise generating the simplified query terms without using transliteration; and using the simplified query terms to identify synonyms for use in augmenting the search query. The indication of the user preference for using transliteration in simplifying a search query is a user selection of one of multiple particular interface languages. The method includes receiving from a user, through a user interface, a search query made up of one or more query terms; using transliteration in simplifying the query terms of the search query to generate simplified query terms; and using the simplified query terms to identify synonyms for use in augmenting the search query.

In general, in another aspect, a method features: receiving from a user, through a user interface, a search query made up of one or more original query terms for searching a collection of documents, the user interface having a user interface language; identifying the user interface language as either a small language or a non-small language, a small language being a natural language that has relatively little representation in the collection of documents; reducing each query term to a simplified form; and, if the user interface language is a small language, then for each original query term whose simplified form differs from the original term, using the original query term itself as a query term without providing any synonyms, and for each original query term that is identical to its simplified form, using the simplified form to identify synonyms for the original query term for use in augmenting the search query. These and other embodiments can optionally include one or more of the following features. Simplifying each query term includes transliteration.
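The small-language rule in this aspect can be sketched as follows. This is only an illustration under stated assumptions: the function name `plan_synonym_lookup` and the `simplify` callable are hypothetical, and the behavior for a non-small interface language (always looking up the simplified form) is an assumption, since this aspect only specifies the small-language case.

```python
def plan_synonym_lookup(query_terms, simplify, interface_is_small_language):
    """Map each original term to the simplified form used for synonym lookup,
    or to None when the original term is used as-is with no synonyms."""
    plan = {}
    for term in query_terms:
        simplified = simplify(term)
        if interface_is_small_language and simplified != term:
            # Small language and the term changed under simplification:
            # keep the user's original term and provide no synonyms.
            plan[term] = None
        else:
            # Otherwise look up synonyms under the simplified form
            # (assumed behavior for non-small interface languages).
            plan[term] = simplified
    return plan

# Hypothetical simplification that only strips one accented character.
def strip_e_acute(term):
    return term.replace("é", "e")

small_plan = plan_synonym_lookup(["éléphant", "zoo"], strip_e_acute, True)
other_plan = plan_synonym_lookup(["éléphant", "zoo"], strip_e_acute, False)
```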

Particular embodiments of the invention can be implemented to realize one or more of the following advantages. A system can correctly add the appropriate accents to a Spanish or Portuguese word, where the accents differ in each language. A system can correctly add accents to a word in a language different from the language of the user interface with which the user is interacting. A system can transliterate where appropriate. A system can avoid adding unnecessary distinct variants to a search query, increasing the likelihood that search results will be in the language the user desires.

The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the invention will become apparent from the description, the drawings, and the claims.

Description of drawings

Fig. 1 is a flowchart of a process for building a synonym mapping table.

Fig. 2 is a flowchart of a process for creating a synonym mapping table from common-form entries.

Fig. 3 is a flowchart of a process for rewriting a query.

Fig. 4 is an illustration of a synonym mapping table.

Figs. 5A, 5B, 5C and 6-34 show sets of conversion mapping tables.

Fig. 35 is a block diagram of a search engine.

Like reference numbers and designations in the various drawings indicate like elements.

Detailed description

As shown in Fig. 1, process 100 creates a synonym mapping table from a corpus of documents. The documents can be HTML (Hypertext Markup Language) documents, PDF (Portable Document Format) documents, text documents, word processing documents (for example, Microsoft Word documents), Usenet articles, or any other kind of document having text content (including metadata content). Process 100 can also be applied to other kinds of text-searchable resources, for example, media resources identified by metadata.

The synonym mapping table contains words in common form as keys, each of which is associated with one or more variants. For example, consider a simple corpus in which only two languages are found: French and English. If "elephant" is a common-form entry in the synonym mapping table, then if the variants "elephant", "éléphant", and "eléphant" are found in the corpus, these variants will be associated with that entry as values. Each value also includes additional information: the languages of the documents in which instances of the variant occur, and the number of times the variant occurs in each language. Continuing the example, "eléphant" may have been found in the corpus 90 times in documents considered to be English and 300 times in documents considered to be French.
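One way to picture the structure just described is a nested mapping from key to variant to per-language count. The sketch below is an assumption about representation only (the source does not prescribe a data structure), and the names `synonyms_map` and `record` and the language codes are hypothetical. The counts for "eléphant" come from the example above.

```python
from collections import defaultdict

# key (common form) -> variant -> language -> occurrence count
synonyms_map = defaultdict(lambda: defaultdict(dict))

def record(key, variant, language, count):
    """Store how often a variant of a key was seen in documents of a language."""
    synonyms_map[key][variant][language] = count

# "eléphant" was found 90 times in English and 300 times in French documents.
record("elephant", "eléphant", "en", 90)
record("elephant", "eléphant", "fr", 300)

# Languages in which a given variant of the key occurs.
languages_for_variant = sorted(synonyms_map["elephant"]["eléphant"])
```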

Process 100 operates on a training corpus of documents (step 110). Ideally, the training corpus is a collection of documents that is representative of the documents contained in the search corpus. Alternatively, the training corpus and the search corpus can be the same corpus, or the training corpus can be a snapshot of, or an extract from, the search corpus. The training corpus should contain documents from all the languages represented in the search corpus. The training corpus should contain a sufficient number of documents in each language so that the documents contain a significant portion of the words found in all documents of that language in the search corpus.

In one embodiment, each document in the training corpus and the search corpus is encoded in a known and consistent character encoding, such as 8-bit Unicode Transformation Format (UTF-8), which can encode any character in the Unicode standard (that is, most known characters and ideographs). Documents whose encoding is inconsistent or unknown must undergo encoding conversion. In one embodiment, the corpus is a collection of documents discovered on the Web by a web crawler.

The language of each document in the training corpus is identified. Determining the language of each document can be an explicit part of process 100 (step 120). Alternatively, the language of a document can be part of the information included in the training corpus. The language of a document or word does not necessarily correspond simply to a natural language. A language can include any distinguishable linguistic system defined by its orthography, grammar, vocabulary, or morphology. For example, Romanized Indic, the romanized transliterated equivalent of a group of languages (for example, Bengali and Hindi), can be considered a language separate from Bengali and Hindi in their traditional scripts.

The document language detection process uses statistical learning theory. In one embodiment, it uses a Naive Bayes classification model to calculate the likelihoods of the possible classes and predicts the class with the maximum likelihood. A class is a language/encoding pair in which a document can be expressed, for example English/ASCII, Japanese/Shift-JIS, or Russian/UTF-8. Some languages correspond to multiple classes because they can be encoded in multiple encodings, and some encodings correspond to multiple classes because they can be used to express multiple languages.

The Naive Bayes model is used to determine the most likely class for a page of text based on the text of the page and, optionally, its Uniform Resource Locator (URL).

The encoding of a page of text is determined with a Naive Bayes model that predicts the maximum-likelihood encoding based on pairs of bytes representing the text. If the URL of the page is available, the model also takes into account the probability of a particular encoding given that the text comes from a certain top-level domain (that is, the last part of the Internet domain name).

When language detection is performed, the text is converted from its original encoding to Unicode, and the detection is carried out using features. Typically, natural-language words are the best features to use, so the text is segmented into words. The Naive Bayes model calculates the probability of each word given a language and predicts the most likely language for the text based on these probabilities.

The Naive Bayes model can be trained and tested using a large sample of electronic documents in various encodings and languages. Training the Naive Bayes model is essentially computing the probability of each feature given a language.
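As a rough illustration of word-based Naive Bayes language detection, the following sketch trains on labeled text and predicts the most likely language. It is not the classifier described in the source: the class name, the whitespace tokenization, and the add-one (Laplace) smoothing are all assumptions made to keep the example small, and encoding detection is left out entirely.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesLanguageGuesser:
    """Toy word-feature Naive Bayes model (hypothetical, for illustration)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing               # assumed Laplace smoothing
        self.word_counts = defaultdict(Counter)  # language -> word -> count
        self.doc_counts = Counter()              # language -> number of documents
        self.vocabulary = set()

    def train(self, language, text):
        words = text.lower().split()
        self.word_counts[language].update(words)
        self.doc_counts[language] += 1
        self.vocabulary.update(words)

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for language, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # log prior plus the sum of log P(word | language)
            score = math.log(self.doc_counts[language] / total_docs)
            for word in text.lower().split():
                p = (counts[word] + self.smoothing) / (
                    total_words + self.smoothing * vocab_size
                )
                score += math.log(p)
            scores[language] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

Training amounts to counting words per language, which matches the observation that training is essentially computing the probability of each feature given a language.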

Process 100 creates a dictionary containing each unique word found in all the documents of the training corpus (step 125). Each instance of a given word found in the corpus is counted according to the identified language of the document in which it is found. The frequency of each word in each document language is recorded in the dictionary. For example, if "hello" is encountered 200 times, 150 times in documents identified as English and 50 times in documents identified as German, then the dictionary entry for "hello" records that "hello" was found in English and German documents 150 and 50 times, respectively.
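The per-language counting in step 125 can be sketched in a few lines. The function name `build_dictionary`, the `(language, text)` input shape, and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter, defaultdict

def build_dictionary(documents):
    """Count each word per identified document language.

    documents: iterable of (language, text) pairs.
    Returns: word -> Counter mapping language -> occurrence count.
    """
    dictionary = defaultdict(Counter)
    for language, text in documents:
        for word in text.split():
            dictionary[word][language] += 1
    return dictionary

# Tiny hypothetical corpus: two English documents and one German document.
corpus = [
    ("en", "hello world"),
    ("en", "hello again"),
    ("de", "hello welt"),
]
dictionary = build_dictionary(corpus)
```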

For each language, a predetermined character blacklist can be defined. A character blacklist is a list of characters that do not commonly occur in documents of that language. A character blacklist does not necessarily reflect strictly intrinsic characteristics of the language. For example, 'w' does not occur in French words, so it could be added to the French blacklist. However, if foreign words containing 'w' are used and occur often enough in French documents, 'w' can be excluded from the French blacklist. The list can be determined entirely or partly by hand. Alternatively, the number of occurrences of characters in documents known to be in a particular language can be analyzed statistically, either to inform the manual process or to produce the character blacklist automatically.

Process 100 can use the character blacklist to determine whether a word found in the training corpus appears to violate the conventions of its language. Such words are ignored, that is, they are not inserted into the dictionary. For example, if "QqWwXxYy" is the character blacklist for Hungarian, then "xylophone" is ignored when it is found in a Hungarian document.
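The blacklist check can be sketched as a simple membership test. The Hungarian blacklist string is taken from the example above; the names `CHAR_BLACKLISTS` and `violates_blacklist` and the language codes are hypothetical.

```python
# Per-language character blacklists; only the Hungarian entry comes from the text.
CHAR_BLACKLISTS = {
    "hu": set("QqWwXxYy"),
}

def violates_blacklist(word, language):
    """Return True if the word contains a character blacklisted for the language."""
    blacklist = CHAR_BLACKLISTS.get(language, set())
    return any(ch in blacklist for ch in word)

# Words that violate the blacklist would be skipped when building the dictionary.
words = ["xylophone", "alma"]
kept = [w for w in words if not violates_blacklist(w, "hu")]
```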

Process 100 maps each word entry in the dictionary to a common form for each language in which the word appears (step 130). In general, a common form is a simplified, standardized, canonical, or otherwise consistent spelling of a word, for example, the word expressed without accented characters. Process 100 maps each word according to predefined, language-specific mappings. For example, the mapping converts "éléphant" found in a document identified as French to "elephant".

Words are mapped to common forms according to language-specific mappings. Each language-specific mapping is a set of one or more character conversion mapping tables. Each conversion mapping table specifies one or more input characters and the one or more output characters to which the input characters are mapped. Process 100 replaces the longest sequence (or prefix) of characters that matches the input of a conversion mapping table with the output characters of that table. Other characters are copied unchanged. For any given word, the result of this character conversion process is the common form of the word. A data structure designed to facilitate longest-prefix matching (for example, a trie or prefix tree) can be used to store the language-specific mappings.
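The longest-match replacement can be sketched with a plain dictionary scan; a trie, as mentioned above, would serve the same purpose more efficiently for large mappings. The function name `to_common_form` and the one-entry French mapping are assumptions for illustration, not the mappings shown in the figures.

```python
def to_common_form(word, conversion_map):
    """Replace the longest matching input sequence at each position;
    copy all other characters unchanged."""
    max_len = max((len(key) for key in conversion_map), default=0)
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in conversion_map:
                out.append(conversion_map[chunk])
                i += length
                break
        else:
            # No conversion applies at this position.
            out.append(word[i])
            i += 1
    return "".join(out)

# Hypothetical fragment of a French mapping.
FRENCH = {"é": "e"}
common = to_common_form("éléphant", FRENCH)
```

Because longer inputs are tried first, a two-character input takes precedence over a one-character input that starts at the same position.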

For example, "водка" from a Russian document is mapped to "водка" (unchanged), whereas "водка" in a Serbian document is mapped to "vodka". Language-specific conversions are intended to capture the expectations of authors in those languages. This reflects the fact that, while a Russian writer is likely to provide "водка" in a search query, Serbian conventions suggest that Cyrillic words are more often provided as their romanized transliterated equivalents.

Conversion mapping tables that specify more than one input character are a special case of conversion, covering words that contain collapsible digraphs. A collapsible digraph is a two-character combination that, in some languages, can be expressed as a single, usually accented, character. For example, German conventions suggest that if 'Ü' cannot be typeset, 'Ue' or 'UE' is an appropriate substitute. A German document may therefore spell the word "über" as "ueber". During mapping to common form, a two-character conversion mapping table will typically collapse the collapsible digraph, and the result will have its accent removed. For example, in one embodiment, the German conversion mapping tables convert both "ueber" and "über" to "uber".

Process 100 creates the synonym mapping table from the common-form mappings, the dictionary entries, and the language statistics associated with the entries (step 150). Each distinct common form obtained as described above becomes a key in the synonym mapping table. Each dictionary entry that maps to a given key, using the mapping for a language of the entry, becomes a value for that key. In the synonym mapping table, dictionary entries are referred to as variants. In general, each key is associated with multiple variants, and each variant is associated with its language statistics. Given the mappings in the example above, "водка" is a key whose values include at least the variant "водка" associated with Russian (but not Serbian). In addition, "vodka" is another key, whose values include at least the variant "водка" associated with Serbian (but not Russian).
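The following sketch shows how one dictionary entry can end up under two different keys when its languages use different mappings. The function names, the nested-dictionary layout, the counts, and the stand-in `simplify` function (which hard-codes a single Serbian transliteration) are all hypothetical.

```python
from collections import defaultdict

def build_synonyms_map(dictionary, simplify):
    """dictionary: word -> {language: count}
    simplify: function (word, language) -> common form for that language
    Returns: key -> variant -> {language: count}
    """
    synonyms = defaultdict(lambda: defaultdict(dict))
    for word, per_language in dictionary.items():
        for language, count in per_language.items():
            key = simplify(word, language)
            synonyms[key][word][language] = count
    return synonyms

def simplify(word, language):
    """Stand-in for the language-specific mappings (illustration only)."""
    if language == "sr" and word == "водка":
        return "vodka"
    return word

# Hypothetical counts for one word seen in Russian and Serbian documents.
dictionary = {"водка": {"ru": 5, "sr": 3}}
synonyms = build_synonyms_map(dictionary, simplify)
```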

Fig. 2 shows one embodiment of a process 200 for creating the synonym mapping table (step 150 of Fig. 1). Process 200 includes receiving the common-form entries described above (step 210). Any common-form entry that contains only a single variant identical to its common form is omitted from the synonym mapping table (step 220). Such entries provide no synonyms for the common form.
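Step 220 can be sketched as a filter over the entries. The function name `drop_trivial_entries`, the nested-dictionary layout, and the sample counts are assumptions for illustration.

```python
def drop_trivial_entries(synonyms_map):
    """Omit entries whose only variant is identical to the key itself."""
    return {
        key: variants
        for key, variants in synonyms_map.items()
        if set(variants) != {key}
    }

# Hypothetical entries: "cat" has no variant other than itself.
entries = {
    "cat": {"cat": {"en": 500}},
    "elephant": {"elephant": {"en": 400}, "éléphant": {"fr": 1000}},
}
useful = drop_trivial_entries(entries)
```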

Process 200 also removes, from each variant, any associated language in which the variant's frequency does not exceed a predefined absolute threshold (step 230). Absolute thresholds are predetermined and specified on a per-language basis. The threshold is used to remove variants in the training corpus that are likely to be misspellings or mistakes. For languages that are well represented in the training corpus, a large threshold (for example, 40 for English) will generally omit occasional misspellings. The threshold for small languages that are not well represented will be set lower (for example, 10) to retain legitimate but rare words. For languages that are poorly represented in the corpus, the threshold can be turned off (or set to 0).
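A minimal sketch of step 230 follows. The threshold of 40 for English comes from the text; the other language codes, their thresholds, the sample counts, and the function name are hypothetical. A language is kept only if the count strictly exceeds its threshold, and languages without a configured threshold are assumed to use 0.

```python
def apply_absolute_threshold(variants, thresholds, default=0):
    """Remove, from each variant, languages whose count does not exceed
    that language's absolute threshold.

    variants: variant -> {language: count}
    thresholds: language -> absolute threshold
    """
    return {
        variant: {
            language: count
            for language, count in per_language.items()
            if count > thresholds.get(language, default)
        }
        for variant, per_language in variants.items()
    }

# "ri" (Romanized Indic) and "bn" (Bengali) thresholds are assumed values.
thresholds = {"en": 40, "ri": 10, "bn": 10}
variants = {
    "nity.a-nanda": {"ri": 6, "bn": 6},
    "nityananda": {"en": 120, "ri": 75},
}
pruned = apply_absolute_threshold(variants, thresholds)
```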

For particular languages, if a variant contains a collapsible digraph and its accented equivalent is not also a variant for the key, process 200 omits the variant for that key (step 240).

Some variants can have different meanings depending only on their accents. To avoid undesirable pollution of the synonym mapping table by such variants, language-specific word blacklists can be defined. Each blacklist contains a list of words that should not be variants associated with a given language. If a variant is on the blacklist of a language, that language is disassociated from the variant. For example, if "the" is on the French blacklist, then variants whose common form is "the" cannot be associated with French. This prevents confusion between the English "the" and the French "thé".

For each key, the relative frequency of each variant among all variants for a particular language is calculated (step 250). To calculate the relative frequency of any given variant in a given language, the number of times the variant occurs in that language is divided by the total number of occurrences, in the same language, of all variants for the same key. For example, if the key is "elephant", "éléphant" occurs 100 and 1000 times in English and French, respectively, and "eléphant" occurs 90 and 300 times in English and French, respectively, then the relative frequency of "éléphant" in English is approximately 53% (that is, 100/(100+90)). In one embodiment, the relative frequency of each variant for each language is stored in the synonym mapping table.
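The relative-frequency calculation for a single key can be sketched as below, using the counts from the example above. The function name and the language codes are hypothetical.

```python
def relative_frequencies(variants):
    """For one key, divide each variant's count in a language by the total
    count of all the key's variants in that language.

    variants: variant -> {language: count}
    """
    totals = {}
    for per_language in variants.values():
        for language, count in per_language.items():
            totals[language] = totals.get(language, 0) + count
    return {
        variant: {
            language: count / totals[language]
            for language, count in per_language.items()
        }
        for variant, per_language in variants.items()
    }

# Counts for the key "elephant", taken from the example in the text.
elephant_variants = {
    "éléphant": {"en": 100, "fr": 1000},
    "eléphant": {"en": 90, "fr": 300},
}
freqs = relative_frequencies(elephant_variants)
```

Within each language, the relative frequencies of all variants of a key sum to 1.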

Process 200 removes a language from a variant in the synonym mapping table if the variant's relative frequency in that language does not satisfy a predefined relative threshold (for example, 10%) (step 260). The same threshold applies to all variants and all languages. Any variant that is not associated with at least one language is also removed from the synonym mapping table (step 270).
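Steps 260 and 270 can be sketched together. The 10% threshold is the example value from the text; treating "satisfies the threshold" as "greater than or equal to" is an assumption, as are the function name and the sample frequencies.

```python
def apply_relative_threshold(rel_freqs, threshold=0.10):
    """Drop languages whose relative frequency is below the threshold (step 260),
    then drop variants left with no language at all (step 270).

    rel_freqs: variant -> {language: relative frequency}
    """
    pruned = {}
    for variant, per_language in rel_freqs.items():
        kept = {
            language: freq
            for language, freq in per_language.items()
            if freq >= threshold  # assumed meaning of "satisfies"
        }
        if kept:
            pruned[variant] = kept
    return pruned

# Hypothetical relative frequencies for three variants of one key.
rel_freqs = {
    "variant_a": {"en": 0.05, "fr": 0.5},
    "variant_b": {"en": 0.02},
    "variant_c": {"en": 0.93, "fr": 0.5},
}
pruned = apply_relative_threshold(rel_freqs)
```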

For illustrative purposes, process 200 has been described as a process that modifies an existing synonym mapping table, for example by removing entries or variants from it. Alternatively, the same effect can be achieved by not including certain entries or variants in the first place during the initial construction of the synonym mapping table.

Fig. 4 shows an illustrative example of a synonym mapping table. The illustration assumes a corpus represented by four languages: English, French, Romanized Indic, and Bengali. The mapping table contains three keys: "elephant", "liberte", and "nityananda". Each key is associated with multiple variants. In particular, the variant "nity.a-nanda" (410) occurs in documents from the corpus identified as Romanized Indic and as Bengali. However, the variant occurs only 6 times in each language. If an absolute threshold greater than 6 has been specified for each of these languages, then these languages, and with them the variant, will be removed from the synonym mapping table.

Another variant (430) occurs in three languages but, by relative frequency, is rare compared with the other variants in each of those languages. If a relative frequency threshold of 10% is applied, these languages and the whole variant will be removed from the synonym mapping table. Assuming the same relative threshold is applied to the "nityAnanda" variant, its association with Bengali (420) will also be removed. The variant itself and its associations with the remaining languages will be kept, because the variant occurs in each of those languages frequently enough to exceed the assumed relative and absolute thresholds.

One useful thing that can be done with the synonym mapping table is to use it to augment queries submitted to a search engine.

As shown in Figure 3, process 300 can be used to increase inquiry to merge the synonym from the synonym mapping table.In fact, receive the usually not perfect user's of description the inquiry of wanting of inquiry of (step 310).The user is subjected to the limitation of input media and accurately indicates the not toilet of the language of inquiry to retrain.Desirable synonym is reflection user those words with the content that provides under desirable environment.Process 300 is intended to by to marking to approach desirable synonym with respect to the variant in the synonym mapping table of the word in the inquiry and the language that means of user, and the language that described user means is approached by the language of inquiry.

Process 300 determines the language of the interface that received the query (step 315). The user submits the query to an interface. The interface has an interface language, that is, the language in which the interface presents information to the user, for example English, French, or Esperanto. However, the terms in a query are not necessarily in the same language as the interface to which the query is submitted.

Process 300 identifies each term in the query (step 320). How terms are identified depends on the conventions of the query language. For example, in Latin-script languages, terms are delimited by spaces or other punctuation (for example, '-').

Process 300 determines which language the query is likely to be in (step 325). In one embodiment, the query language is determined in two parts: determining a likelihood, for example a probability, that the query is in the interface language; and determining, for each term in the query, a likelihood, for example a probability, that the term is in a particular language.

The likelihood that a query is in the same language as the interface can be determined from past queries. If a past query returned search results, the past query can be automatically classified as being in a particular language based on the language of the results that the user subsequently selected. It is reasonable to assume that the language of a query is the same as the language of the documents the user chooses to view, especially if each choice is presented with a snippet from the search result document. Past queries can also be inspected manually to determine their language. The automatic and manual techniques can be combined: manually classified queries are used as seeds during automatic determination to improve accuracy. The results of the automatic classifier can inform subsequent adjustments to the classifier. The manual determination of seeds and the adjustment of the query classifier can be repeated several times to improve accuracy further. All past queries received through the same interface as the current query are aggregated to generate a likelihood score or probability that a query is in the same language as the interface language.
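One simple way to read the final aggregation step is as a fraction of past queries. The sketch below assumes that each past query received through the interface has already been labeled with a language (automatically or manually); the function name and the choice of a plain ratio are assumptions for illustration, since the text does not say how the score is computed.

```python
def interface_language_likelihood(past_query_languages, interface_language):
    """Fraction of past queries at this interface classified as the interface language.

    past_query_languages: one language label per past query (assumed input).
    Returns None when there is no history to estimate from.
    """
    if not past_query_languages:
        return None
    matches = sum(1 for lang in past_query_languages if lang == interface_language)
    return matches / len(past_query_languages)


# Hypothetical history: three of four past queries were classified as English.
likelihood = interface_language_likelihood(["en", "en", "fr", "en"], "en")
```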

Process 300 determines how frequently each term from the query occurs in the corpus within the documents of each language. A vector is generated from the frequency counts; for each language, the vector gives a likelihood score in the range 0 to 1 that the term is in that language. A score vector, for example a probability vector, is generated for each term in the query.

Terms that occur in many different languages, such as proper names (for example, Internet), may unduly influence the score vector for the query. If such a term is found among the query terms, its score can be set arbitrarily to indicate that the term is likely to be in the interface language. Alternatively, such terms can be ignored.

Process 300 can further be processed each vector by level and smooth each vector.In one embodiment, when compute vector, add little smooth value s with noise reduction.For example, if word t occurs n time in language L and occur N time in whole k kind language, then this word is to be P (L|t)=(n+s)/(k * s+N), but not P (L|t)=n/N with the probability of this language is smoothed.Smooth value can be selected according to the size of N and k.For example, s can be selected and increase to increase along with N and along with k increases and reduces.

All of the vectors from the previous steps are multiplied together. The combined vector is multiplied by the probability (or score) that the query is in the interface language, producing a query probability (or score) vector. For each language, this vector contains the probability (or score) that the query is in that language. The language with the highest probability (or score) is selected as the query language attributed to the query.
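The combination step can be sketched as follows. The text does not spell out how the single interface-language probability is applied to a per-language vector, so this sketch makes a loudly labeled assumption: the probability p is assigned to the interface language and the remainder (1 - p) is spread evenly over the other languages. The function name and sample numbers are also hypothetical, and at least one term vector is assumed.

```python
import math


def attribute_query_language(term_vectors, interface_language, p_interface):
    """Multiply per-term vectors, weight by an interface prior, pick the best language.

    term_vectors: list of {language: score}, one dict per query term,
                  all with the same languages (assumed).
    p_interface:  probability that a query is in the interface language.
    """
    languages = list(term_vectors[0])
    others = [lang for lang in languages if lang != interface_language]
    # ASSUMPTION: spread the remaining probability mass evenly over other languages.
    prior = {
        lang: p_interface if lang == interface_language else (1 - p_interface) / len(others)
        for lang in languages
    }
    scores = {
        lang: prior[lang] * math.prod(vec[lang] for vec in term_vectors)
        for lang in languages
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical two-term query received through an English interface.
vectors = [
    {"en": 0.6, "fr": 0.3, "de": 0.1},
    {"en": 0.2, "fr": 0.7, "de": 0.1},
]
best_language, query_scores = attribute_query_language(vectors, "en", p_interface=0.5)
```

In this example the term evidence slightly favors French, but the interface prior tips the decision to English; lowering `p_interface` to 0.4 flips the result to French.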

Process 300 simplifies each term in the query (step 330). In simplifying a term, the process collapses ligatures, removes accents, and transliterates the characters in the term. This is done in the same manner as described above for deriving common forms from the training corpus. However, the particular conversion mapping tables used to simplify query terms differ in some respects from those used to create the synonym mapping table. In particular, the simplification of each term is generally independent of language.

In particular cases, however, the identified query language can affect how query terms are simplified. This matters most when the result of simplification would be meaningless in the query language. For example, in Turkish, unlike in German, 'ue' is a meaningless substitute for 'ü'. Reducing "Türk" to "Tuerk" would be undesirable for Turkish users.

In general, each simplified term from the query is used as a key to look up and retrieve variants from the synonym mapping table (step 340). Each variant is a potential synonym of the original query term. The relative frequency of each variant under its key in each language is used to estimate whether the variant is desired as a synonym of the key for each language (step 350). The estimate is calculated by summing the following product over all languages: the probability that the query is in the language multiplied by the relative frequency of the variant in that language. For example, suppose that "éléphant" is the variant 52% of the time in English and 77% of the time in French. Then, for a query determined to have a 70% probability of being in English and a 30% probability of being in French, the combined estimate for "éléphant" is: 52% × 70% + 77% × 30% = 59.5%. If the calculated estimate exceeds a synonym probability threshold (for example, 50%), the variant is selected to augment the query. Given the language statistics in the synonym mapping table and the probabilities provided by the query language classifier, choosing a specific synonym probability threshold gives good results. In the special case where the variant would be the result of collapsing a ligature in a given language, the relative frequency of the variant is reduced (for example, to one quarter) when the estimate for the variant is calculated. This penalty on the variant's relative frequency reflects the potential risk of having improperly collapsed a ligature in the variant.
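The estimate in the example above can be reproduced with a few lines of Python. The function name, the dictionary inputs, and the `collapsed_in` parameter are hypothetical; the percentages, the 50% threshold, and the one-quarter penalty come from the text.

```python
def synonym_estimate(query_lang_probs, variant_rel_freq, collapsed_in=(), penalty=0.25):
    """Sum over languages of P(query in language) * relative frequency of the variant.

    collapsed_in: languages in which the variant would result from collapsing
                  a ligature; its relative frequency is penalized there.
    """
    total = 0.0
    for lang, p_query in query_lang_probs.items():
        freq = variant_rel_freq.get(lang, 0.0)
        if lang in collapsed_in:
            freq *= penalty
        total += p_query * freq
    return total


# Worked example from the text: 52% x 70% + 77% x 30% = 59.5%.
estimate = synonym_estimate({"en": 0.70, "fr": 0.30}, {"en": 0.52, "fr": 0.77})
selected = estimate > 0.50  # synonym probability threshold of 50%

# Same variant, but penalized in French as a possible collapsed ligature.
penalized = synonym_estimate(
    {"en": 0.70, "fr": 0.30}, {"en": 0.52, "fr": 0.77}, collapsed_in={"fr"}
)
```

With the penalty applied in French, the estimate falls to about 42%, below the 50% threshold, so the variant would no longer be selected.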

Each selected variant is added to the query (step 360), unless the variant is a stop word and does not occur in the likely query language; such variants are ignored. Each variant selected for an original term from the query augments that original term. Each variant is added as a disjunction with the original term. For example, the query "eléphant trunk" is augmented to "(eléphant or elephant or éléphant) trunk", assuming that elephant and éléphant were selected as variants for eléphant.
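The augmentation step can be sketched as a small string-building function. The function name and input layout are assumptions, and the stop-word check is deliberately omitted; the sketch only shows how selected variants are joined to the original term as a disjunction.

```python
def augment_query(terms, selected_variants):
    """Join each original term with its selected variants using "or".

    terms: the original query terms, in order.
    selected_variants: {original_term: [variant, ...]} (assumed layout).
    """
    parts = []
    for term in terms:
        # Skip a variant identical to the original term to avoid duplicates.
        variants = [v for v in selected_variants.get(term, []) if v != term]
        if variants:
            parts.append("(" + " or ".join([term] + variants) + ")")
        else:
            parts.append(term)
    return " ".join(parts)


augmented = augment_query(
    ["eléphant", "trunk"],
    {"eléphant": ["elephant", "éléphant"]},
)
```

For the example in the text, `augmented` is the string "(eléphant or elephant or éléphant) trunk"; terms with no selected variants pass through unchanged.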

The process searches the search corpus with the augmented query (step 370). The search corpus contains documents in their original, unaltered form. Apart from the effect of the augmented query, retrieving and providing results from the corpus is otherwise unaffected.

If the likely query language is poorly represented in the search corpus (that is, it accounts for a very small proportion of all documents), including variants from the synonym mapping table may be undesirable. Adding variants to the search query increases the risk of matching documents outside the desired language, potentially flooding the results with a large number of documents in other languages. However, when the original query terms contain only unaccented letters and no collapsible ligatures (for example, "ueber" reduced to "uber"), variants should be sought without regard to the likely query language. In one embodiment, the decision to include variants depends on the interface language rather than the query language.

The embodiment that Fig. 5 A to Figure 34 shows to shine upon the word in the training storehouse or is used for simplifying the conversion mapping table of the word in the search inquiry.Each illustrates the name group of one or more conversion mapping tables.Each conversion mapping table is shown as the delegation in the row among the figure.The conversion mapping table is illustrated as having at least and aforesaid input character and output character.In addition, the row that are labeled as " UCS " show the hexadecimal value of the coding of character according to universal character set (UCS).When not providing the UCS value, each character is in 95 printable ascii characters.

The grouping of conversion mapping tables is a matter of convenience or convention rather than necessity: one or more groups of conversion mapping tables can be combined to form the language-specific mapping for a particular language. The combination of groups used for a particular language can depend on whether the groups are used to map words in the training corpus or to simplify terms in a query.

Fig. 5 A, 5B and 5C show general conversion mapping table group.Usually, these be impossible with about the afoul safety conversion mapping table of the conversion mapping table of language-specific.

Fig. 6 shows Russian conversion mapping table group.This group is used to shine upon the word from Russian document between the generation of synonym mapping table.

Fig. 7 shows Macedonian conversion mapping table group.This group is used to shine upon the word from the Macedonian document between the generation of synonym mapping table.

Fig. 8 shows Ukrainian conversion mapping table group.This group is used to shine upon the word from the Ukrainian document between the generation of synonym mapping table.

Fig. 9 shows Greek conversion mapping table group.This group is used to shine upon the word from the Greek document between the generation of synonym mapping table.

As shown in Figures 10 and 11, some conversion mapping tables also specify the accented equivalent of the collapsed ligature (the column titled "A.E." in the figures). These mapping tables have a two-character input (the collapsible ligature) and an output (the collapsed ligature). This information can be used to determine whether two characters (the input) form a collapsible ligature. Alternatively, this information also indicates that a particular character (the output) may be the result of collapsing a ligature.

Figure 10 shows the Esperanto H/X-system group of conversion mapping tables. This group is used to map words from Esperanto documents during generation of the synonym mapping table.

Figure 11 shows the Ch and ShZh groups of conversion mapping tables. These groups are combined with other groups during generation of the synonym mapping table and during query term simplification.

Figure 12 shows the Croatian group of conversion mapping tables. This group is used to map words from Croatian documents during generation of the synonym mapping table. The General, Ch, ShZh, A-umlaut, O-umlaut, U-umlaut, and Y-umlaut groups are combined and used to simplify query terms identified as Croatian. The A-umlaut, O-umlaut, U-umlaut, and Y-umlaut groups are described below with reference to Figure 23.

Figure 13 shows the Catalan group of conversion mapping tables. This group is used to map words from Catalan documents during generation of the synonym mapping table.

Figure 14 shows the Serbian group of conversion mapping tables. This group is combined with the Croatian group and used to map words from Serbian documents during generation of the synonym mapping table. The General, A-umlaut, O-umlaut, U-umlaut, Y-umlaut, Ch, ShZh, and Serbian groups are combined and used to simplify query terms identified as Serbian.

Figure 15 shows the French group of conversion mapping tables. This group is used to map words from French documents during generation of the synonym mapping table.

Figure 16 shows the Italian group of conversion mapping tables. This group is used to map words from Italian documents during generation of the synonym mapping table.

Figure 17 shows the Portuguese group of conversion mapping tables. This group is used to map words from Portuguese documents during generation of the synonym mapping table.

Figure 18 shows the Romanian group of conversion mapping tables. This group is used to map words from Romanian documents during generation of the synonym mapping table.

Figure 19 shows the Spanish group of conversion mapping tables. This group is used to map words from Spanish documents during generation of the synonym mapping table.

Figure 20 shows the Dutch group of conversion mapping tables. This group is used to map words from Dutch documents during generation of the synonym mapping table. The General, A-umlaut, O-umlaut, U-umlaut, and Dutch-Y groups are combined and used to simplify query terms identified as Dutch.

Figure 21 shows the Danish group of conversion mapping tables. This group is used to map words from Danish documents during generation of the synonym mapping table.

Figure 22 shows the English group of conversion mapping tables. This group is used to map words from English documents during generation of the synonym mapping table.

Figure 22 also shows the German group of conversion mapping tables. This group is used to map words from German documents during generation of the synonym mapping table. The General, Y-umlaut, and German umlaut groups are used to simplify query terms identified as German.

Figure 22 also shows the Dutch-Y group of conversion mapping tables. This group is combined with other groups to simplify query terms identified as Dutch.

Figure 22 also shows the German umlaut group of conversion mapping tables. This group is combined with other groups to simplify query terms identified as German.

Figure 22 also shows the Swedish group of conversion mapping tables. This group is used to map words from Swedish documents during generation of the synonym mapping table. The General, U-umlaut, and Y-umlaut groups are used to simplify query terms identified as Swedish or Finnish.

Figure 23 shows four groups: the A-umlaut, O-umlaut, U-umlaut, and Y-umlaut groups. These groups are combined with other groups to simplify query terms.

Figure 24 shows the Icelandic group of conversion mapping tables. This group is used to map words from Icelandic documents during generation of the synonym mapping table.

Figure 25 shows the Czech group of conversion mapping tables. This group is combined with the ShZh group and used to map words from Czech documents during generation of the synonym mapping table. The General, A-umlaut, O-umlaut, U-umlaut, Y-umlaut, and ShZh groups are used to simplify query terms identified as Czech.

Figure 26 shows the Latvian group of conversion mapping tables. This group is combined with the Ch and ShZh groups and used to map words from Latvian documents during generation of the synonym mapping table. The General, A-umlaut, O-umlaut, U-umlaut, Y-umlaut, Ch, and ShZh groups are used to simplify query terms identified as Latvian.

Figure 27 shows the Lithuanian group of conversion mapping tables. This group is combined with the Ch and ShZh groups and used to map words from Lithuanian documents during generation of the synonym mapping table. The General, A-umlaut, O-umlaut, U-umlaut, Y-umlaut, Ch, and ShZh groups are used to simplify query terms identified as Lithuanian.

Figure 28 shows the Polish group of conversion mapping tables. This group is used to map words from Polish documents during generation of the synonym mapping table.

Figure 29 shows the Slovak group of conversion mapping tables. This group is combined with the ShZh group and used to map words from Slovak documents during generation of the synonym mapping table. The General, A-umlaut, O-umlaut, U-umlaut, Y-umlaut, and ShZh groups are combined and used to simplify query terms identified as Slovak.

Figure 30 shows the Slovene group of conversion mapping tables. This group is combined with the Ch and ShZh groups and used to map words from Slovene documents during generation of the synonym mapping table.

Figure 31 shows the Estonian group of conversion mapping tables. This group is combined with the Ch and ShZh groups and used to map words from Estonian documents during generation of the synonym mapping table. The General, A-umlaut, O-umlaut, U-umlaut, Y-umlaut, Ch, and ShZh groups are combined and used to simplify query terms identified as Estonian.

Figure 32 shows the Hungarian group of conversion mapping tables. This group is used to map words from Hungarian documents during generation of the synonym mapping table.

Figure 33 shows the Esperanto group of conversion mapping tables. This group is combined with the Esperanto H/X-system group and used to map words from Esperanto documents during generation of the synonym mapping table. The General, A-umlaut, O-umlaut, U-umlaut, Y-umlaut, and Esperanto H/X-system groups are combined and used to simplify query terms identified as Esperanto.

Figure 34 shows the Turkish group of conversion mapping tables. This group is used to map words from Turkish documents during generation of the synonym mapping table.

The following table shows which groups of conversion mapping tables can be used to map words during generation of the synonym mapping table. Each language is assigned its character blacklist (as described above) and one or more groups of conversion mapping tables, which together form the set of conversion mapping tables used when deriving common forms from words in the training corpus.

Figure 35 is a schematic diagram of a search engine 3550 that receives multilingual queries and provides multilingual results in response. System 3550 is generally configured to obtain information about the occurrence and frequency of words from various sources, and to generate search results in response to queries based on an analysis of word usage in those sources. Such sources can include, for example, multilingual documents and files found on the Internet.

System 3550 includes one or more interfaces 3552, each in a different language. The interfaces allow users to use and interact with the services of the search engine. In particular, an interface receives queries from users. A query includes a list of terms, each of which can be in any language. The terms in a query need not be in the language of the interface. The particular interface 3552 that receives a user's query depends on the user's choice of interface.

System 3550 can be communicatively connected to a network such as the Internet 3558, and can therefore communicate with various devices connected to the Internet, such as wireless communication devices 3562 and personal computers 3564. The communication flow for any device can be bidirectional, so that system 3550 can receive information (for example, queries or the contents of documents) from a device and can also send information (for example, results) to the device.

Queries received by interface 3552 are provided to query processor 3566. Query processor 3566 processes the query, optionally augments it, and passes it to another component of system 3550. For example, query processor 3566 can cause retrieval system 3570 to generate search results corresponding to the query. Such a retrieval system 3570 can use data retrieval and search techniques such as those used by the Google PageRank™ system. The results generated by retrieval system 3570 can then be provided back to the original querying device.

System 3550 can rely on a number of other components for its proper operation. For example, system 3550 refers to a search corpus 3572 of documents when a request is made. The search corpus can be indexed to make searching more efficient. The search corpus can be populated using information gathered from documents found on the Web (for example, by a web crawler). Documents can also be stored in training corpus 3574 for later processing.

Training corpus 3574 can be processed by synonym processor 3580. Synonym processor 3580 can generate synonym mapping table 3585 from training corpus 3574. Synonym mapping table 3585 can be used by query processor 3566 to augment search queries with synonyms.

Embodiments of the invention and all of the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the invention can be implemented as one or more computer program products, that is, one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus. The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing apparatus" encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. In addition to hardware, the apparatus can include code that creates an execution environment for the computer program in question, for example code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, for example a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (for example, one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subroutines, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, for example an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, for example magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, for example a mobile telephone, a personal digital assistant (PDA), a mobile audio player, or a Global Positioning System (GPS) receiver, to name just a few. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including by way of example semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks, for example internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, embodiments of the invention can be implemented on a computer having a display device, for example a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, for example a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.

Embodiments of the invention can be implemented in a computing system that includes a back-end component, for example a data server, or that includes a middleware component, for example an application server, or that includes a front-end component, for example a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the invention, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, for example a communication network. Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), for example the Internet.

The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Although this specification contains many specifics, these should not be construed as limitations on the scope of the invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of the invention. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the invention have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results.

Claims

1. A computer-implemented method of processing query terms, comprising:

receiving, through a user interface, a search query comprising a query term from a user, the search query having a query language attributed to the search query;

deriving a simplified query term from the query term;

identifying one or more potential synonyms for the query term by looking up the simplified query term in a synonym mapping table, the synonym mapping table mapping each of a plurality of keys to one or more corresponding variants, each variant being a word associated with one or more document languages, and each variant being associated with a variant-language score for each associated language, the variant-language score indicating the relative frequency of the variant among all variants of the same key for the associated language; and

using the attributed query language and the variant-language scores of one or more variants for the simplified query term to select variants for use in augmenting the search query.

2. The method of claim 1, further comprising:

attributing the interface language to the query as the query language.

3. The method of claim 1, wherein:

the search query has multiple query languages attributed to the search query, each having a respective query-language score;

the method further comprising:

using (a) the query-language scores and (b) the variant-language scores of one or more variants for the simplified query term to select variants for use in augmenting the search query.

4. The method of claim 3, wherein using the query-language scores and the variant-language scores comprises:

summing the following product over all languages: for each language, the product of the query-language score for the language and the variant-language score for the language.

5. A system for processing query terms, comprising:

means for receiving, through a user interface, a search query comprising a query term from a user, the search query having a query language attributed to the search query;

means for deriving a simplified query term from the query term;

means for identifying one or more potential synonyms for the query term by looking up the simplified query term in a synonym mapping table, the synonym mapping table mapping each of a plurality of keys to one or more corresponding variants, each variant being a word associated with one or more document languages, and each variant being associated with a variant-language score for each associated language, the variant-language score indicating the relative frequency of the variant among all variants of the same key for the associated language; and

means for using the attributed query language and the variant-language scores of one or more variants for the simplified query term to select variants for use in augmenting the search query.

6. The system of claim 5, further comprising:

means for attributing the interface language to the query as the query language.

7. The system of claim 5, wherein:

the system further comprises:

means for using (a) the query-language scores and (b) the variant-language scores of one or more variants for the simplified query term to select variants for use in augmenting the search query.

8. The system of claim 7, wherein using the query-language scores and the variant-language scores comprises: