WO2014011689A1 - Weight-based stemming for improving search quality - Google Patents
- Publication number
- WO2014011689A1 (PCT/US2013/049798)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- query term
- expanded
- documents
- term
- query
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
Definitions
- the present invention relates generally to electronic information search and retrieval. More specifically, systems and methods are disclosed for improving search quality.
- In a simple information retrieval system, a user typically enters a query comprising one or more query terms and receives a list of documents containing the query terms. Documents that do not contain the query terms are ignored. However, "recall," or the fraction of the documents relevant to the query that are successfully retrieved, is low for this simple information retrieval system. As a result, documents which may be of interest to the user may not be identified in response to the query, and thus never presented to the user.
- One technique used to increase recall is known as "stemming," which involves stripping prefixes or postfixes from a word. Such prefixes and postfixes are common in the English language, and are seen in other languages.
- stemming is typically applied when indexing a body of documents. For example, an occurrence of the word “tickets" in a document would be indexed as “ticket.”
- stemming of the query terms, also known as "term reduction," is performed (the same kind of transformation performed during indexing), and the index is accessed using the stemmed query terms.
- a search for "ticketing" on a search engine employing stemming would return documents containing the word “ticket” (the stem of “ticketing") and documents containing the word “tickets” (which has the same stem, "ticket,” as “ticketing”).
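- The conventional index-time and query-time stemming described above can be sketched as follows. This is a minimal illustration only: `toy_stem` and its suffix list are hypothetical stand-ins for a real stemmer, and the three sample documents are invented for the example.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "s")  # toy suffix list, assumed for illustration only


def toy_stem(word: str) -> str:
    """Strip one known suffix; a stand-in for a real stemming algorithm."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(docs: dict) -> dict:
    """Map each stem to the ids of documents containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Stem the query terms the same way, then look the stems up in the index."""
    results = set()
    for term in query.split():
        results |= index.get(toy_stem(term), set())
    return results


docs = {"d1": "buy a ticket", "d2": "tickets sold out", "d3": "train schedule"}
index = build_stem_index(docs)
print(sorted(search(index, "ticketing")))  # ['d1', 'd2']
```

Because both "ticket" and "tickets" are indexed under the stem "ticket," a query for "ticketing" reaches both documents, which is the recall gain (and the loss of word-level precision) the text describes.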
- Another technique used to increase recall is known as "query expansion," in which one or more query terms are supplemented with additional related query terms.
- One known technique for identifying related terms is analyzing the co-occurrence of terms, or co-occurrence with similar terms, observed in documents during indexing and in query terms submitted in previous search queries (typically obtained by processing query logs) to produce a thesaurus of semantically related terms. Such a technique may, for example, determine that "plane" and "aircraft" are related, and that "hospital" and "medical" are related.
- a search query including the term "hospital” may be expanded to also include the term "medical.”
- a weighting may be applied to an added term based on the observed pairwise degree of co-occurrence between the original term and the expanded term. Such weighting signals to a result ranking process that a document retrieved based on an expanded term with a low degree of co-occurrence should be ranked lower among the retrieved documents.
- Although stemming and query expansion each generally increase recall, they also generally result in reduced "precision," or the fraction of the documents retrieved that are relevant to the query.
- a search may result in many documents which are not of interest to a user in response to a query.
- An aspect of the disclosed subject matter includes a computer-implemented method comprising receiving a search query; identifying a first original query term based on the query; identifying a first expanded query term related to the first original query term; determining a first lexical distance between the first original query term and the first expanded query term; determining a first weight for the first expanded query term based on the determined first lexical distance; identifying a plurality of documents, from among a corpus of documents, as each relevant to the search query, the plurality of documents including a first document identified based on its inclusion of the first expanded query term; ranking the plurality of documents, with the ranking of the first document being based upon the calculated first weight; and generating a response to the search query identifying two or more of the plurality of documents, ordered according to the ranking.
- Another aspect includes a search system comprising a query expansion engine programmed to receive a search query; identify a first original query term based on the query; identify a first expanded query term related to the first original query term; determine a first lexical distance between the first original query term and the first expanded query term; and determine a first weight for the first expanded query term based on the determined first lexical distance; a search system programmed to identify a plurality of documents, from among a corpus of documents, as each relevant to the search query, the plurality of documents including a first document identified based on its inclusion of the first expanded query term; and a ranking engine programmed to rank the plurality of documents, with the ranking of the first document being based upon the calculated first weight, wherein the search system is further programmed to generate a response to the search query identifying two or more of the plurality of documents, ordered according to the ranking.
- FIG. 1 is a block diagram illustrating an example search system.
- FIG. 2 is a block diagram illustrating a computer system on which aspects of the invention may be implemented.
- FIG. 3 illustrates a method of performing a search query.
- FIG. 4 illustrates a method for a search system to process a received search query.
- FIG. 1 is a block diagram of, and FIGS. 3 and 4 illustrate methods for, an example search system 160 that can be used to provide search results relevant to submitted queries as can be implemented in an Internet, an intranet, or another client and server environment.
- the search system 160 is an example of an information retrieval system in which the systems, components, and techniques described below can be implemented. However, those skilled in the art will appreciate that many variations upon the disclosed system are also effective for implementing the inventive aspects of this disclosure.
- a user 110 can interact with the search system 160 through a client device 120.
- the client 120 can be a computer coupled to the search system 160 through a local area network (LAN) or wide area network (WAN), e.g., the Internet. Examples of such a computer include, but are not limited to, a desktop computer, a laptop or notebook computer, a tablet computer, and a smartphone or other mobile telecommunication device.
- the search system 160 and the client device 120 can be one machine.
- a user can install a desktop search application on the client device 120.
- the client device 120 will generally include a random access memory (RAM) 121 and a processor 122.
- a user 110 can submit a query 131a to the search system 160 found behind front-end server 150.
- user 110 may use a web browser application executing on client device 120 to generate an HTTP-formatted query 131a.
- the query 131a is transmitted through a network 140, then to front-end server 150.
- front-end server 150 issues query 131b to search system 160.
- query 131a will simply be relayed or repeated as query 131b, without significant modification of the content of query 131a.
- front-end server 150 will perform additional processing in response to query 131a in order to generate query 131b. For example, query terms might be added in query
- front-end server 150 transmits query 131b to search system 160.
- front-end server 150 may be configured to provide other information services.
- front-end server 150 may be configured to execute a web server or web application engine to provide network-based services via network 140, including providing access to documents or other information stored and made available by content server 170.
- One specific network-based service includes a network-based customer support system accessible to user 110 via a web browser application executing on client device 120.
- query 131a may be transmitted directly from client device 120 to search system 160, without an intermediate front-end server 150, as illustrated by the upper dashed line in FIG. 3.
- search system 160 would directly reply to client device 120, as illustrated by the lower dashed line in FIG. 3.
- the search system 160 can be implemented as, for example, one or more computer programs running on one or more computers in one or more locations that are coupled to each other through a network.
- the search system 160 includes an index database 161 and a search engine 165.
- the search system 160 responds to the query 131b by generating search result 132b, which in step 350 is transmitted to front-end server 150.
- front-end server 150 may simply pass result 132b through as result 132a, or in response to receiving result 132b, in step 360 front-end server 150 may perform additional processing in order to generate result 132a.
- result 132a is transmitted through the network 140 to the client device 120.
- result 132a is in a form that can be presented to the user 110, such as an HTML-formatted search results web page to be displayed in step 380 in a web
- In step 340, when the query 131b is received by the search system 160, the search engine 165 processes query 131b and identifies documents that match or are otherwise responsive to the query 131b.
- "Documents" are understood here to be any form of indexable content, including, but not limited to, textual information in any text or graphics format, images, video, audio, multimedia, presentations, and web pages (which can include embedded hyperlinks and other metadata, and/or programs, for example, in Javascript).
- the search engine 165 will generally include, or have access to, an indexing engine 166 that indexes a corpus of documents and stores indexing information in index database 161. Search engine 165 utilizes the index database 161 to identify documents responsive to query 131a.
- the corpus of documents indexed by the indexing engine 166 may be accessible via content server 170, which is also behind front-end server 150 (in other words, not generally accessible directly from network 140), or may be accessible via one or more content servers 175 accessible to indexing engine 166 and client device 120 via network 140. Indexing may be performed based on features including, but not limited to, the content of a document, information automatically generated from a document (such as, but not limited to, information generated by optical character recognition or machine vision techniques applied to images or videos), a "tag" assigned by a user or administrator to describe or characterize a document, and document metadata.
- search engine 165 will identify more than one document as responsive to query 131a, and the end result 132b will identify more than one document.
- the search engine 165 will generally include a ranking engine 168 that ranks documents that are determined by search engine 165 to be responsive to the query 131b, such that, for example, result 132b may present the most relevant documents first.
- the search system 160 can transmit the result 132b to client device 120 through front-end server 150 and network 140 for presentation to the user 110.
- front-end server 150 may manipulate the result 132b received from search system 160 in order to present it to user 110 in a format consistent with other information services provided by front-end server 150.
- result 132b might be a simple XML-based listing of document identifiers for information available via content server 170, and front-end server 150 is configured to convert these document identifiers into Uniform Resource Identifiers (URIs) included in result 132a, which client device 120 can use to access the documents identified in result 132b.
- search system 160 receives query 131b.
- search engine 165 identifies one or more original query terms based on query 131b.
- a query term specifies one or more sequences of characters (usually words), which may also specify patterns or regular expressions (for example, the query term "cat*" might positively match with "cat” and "catch").
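- As a small illustration of a pattern-style query term, the sketch below matches "cat*" against a list of indexed terms. The source does not specify a pattern syntax; shell-style wildcards via Python's standard `fnmatch` module are assumed here, and `matching_terms` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matching_terms(pattern, indexed_terms):
    """Return the indexed terms matched by a shell-style wildcard pattern."""
    return [term for term in indexed_terms if fnmatchcase(term, pattern)]


print(matching_terms("cat*", ["cat", "catch", "dog", "scat"]))  # ['cat', 'catch']
```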
- query 131b might be an HTTP GET message including the URI
- query 131b may indicate various operators, modifiers, and/or parameters to be used in connection with or in addition to query terms.
- the above HTTP GET messages are merely illustrations, and other message formats may be used.
- Search engine 165 includes query expansion engine 167, which is configured to, in step 420, identify zero or more expanded query terms related to the original query terms.
- query expansion engine 167 may be configured to, for each original query term, identify zero or more related expanded query terms.
- query expansion engine 167 might not identify any expanded query terms.
- the expanded query terms are used in addition to the original query terms. However, in some embodiments there may be situations in which one or more original query terms will be replaced in favor of expanded query terms identified by query expansion engine 167.
- One technique for identifying an expanded query term related to an original query term involves identifying words that have a stem in common with the original query term. For example, in connection with the original query term "tickets," having the stem "ticket," query expansion engine 167 would identify "ticket," "ticketed," and "ticketing" as expanded terms, as each has the same stem "ticket" as the original term. It is noted that although in the English language a given word will usually have only one stem, there are situations, including in non-English languages, in which a term will have multiple stems. Query expansion engine 167 may be configured to identify expanded terms corresponding to all stems identified for a term.
- this identification of related words according to stems is implemented by a dictionary of words that are indexed according to their stem(s). For example, the dictionary entries for "ticket,” “ticketed,” “ticketing,” and “tickets” would each be indexed under the stem "ticket.” With this embodiment, query expansion engine 167 would determine the stem of "tickets” (which may be performed by a dictionary lookup), and perform a lookup on the dictionary using the stem as an index. In another embodiment, each word in a dictionary is associated with other words in the dictionary having a common stem. For example, the dictionary entry for "tickets" would be directly linked to the words “ticket,” “ticketed,” “ticketing,” and “tickets”.
- query expansion engine 167 does not need to determine a stem for the original query term "tickets" before accessing the dictionary.
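- The two dictionary layouts described above can be sketched as follows: one keyed by stem, and one in which each word is linked directly to the words sharing its stem. The `WORD_STEMS` table and all function names are hypothetical; the source does not prescribe a data structure.

```python
from collections import defaultdict

# Hypothetical word -> stem(s) table; a word may carry more than one stem.
WORD_STEMS = {
    "ticket": ["ticket"],
    "ticketed": ["ticket"],
    "ticketing": ["ticket"],
    "tickets": ["ticket"],
    "train": ["train"],
    "trains": ["train"],
}


def build_by_stem(word_stems):
    """First layout: index every dictionary word under each of its stems."""
    by_stem = defaultdict(set)
    for word, stems in word_stems.items():
        for stem in stems:
            by_stem[stem].add(word)
    return by_stem


def build_direct_links(word_stems):
    """Second layout: link each word directly to all words sharing a stem."""
    by_stem = build_by_stem(word_stems)
    return {
        word: set().union(*(by_stem[stem] for stem in stems))
        for word, stems in word_stems.items()
    }


def expand_via_stem(term, word_stems, by_stem):
    """Determine the term's stem(s), then use each stem as the lookup key."""
    expanded = set()
    for stem in word_stems.get(term, []):
        expanded |= by_stem[stem]
    return expanded - {term}


by_stem = build_by_stem(WORD_STEMS)
direct = build_direct_links(WORD_STEMS)
print(sorted(expand_via_stem("tickets", WORD_STEMS, by_stem)))
print(sorted(direct["tickets"] - {"tickets"}))
```

With the second layout, a single lookup on "tickets" yields the related words without a separate stemming step at query time.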
- such dictionaries can be, in part or in whole, automatically generated based on document processing by indexing engine 166.
- Many other techniques for identifying words that have a stem in common with an original query term are within the skill of the art. Stemming techniques useful for the English language include, but are not limited to, Snowball-based stemmers and the Porter stemming algorithm.
- a standard indexing engine, such as the one provided in the Lucene search engine, is used to generate an index and a corresponding dictionary of indexed terms, where the dictionary is sorted in alphabetical order.
- This dictionary can be used to identify candidate expansions by identifying terms in the dictionary which begin with the same n letters as an original query term, such as the first 3 letters.
- such candidate expansions might include “tic,” “tick,” “ticket,” “ticketed,” “ticketing,” “tickled,” “ticklish,” “ticktack,” “ticktock,” “tics,” and “tictac.” Then, stemming is performed on each of the candidate expansions to identify expansions having a stem in common with the original query term.
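- The prefix scan over an alphabetically sorted term dictionary, followed by a stem filter, might look like the sketch below. It assumes the original query term is "tickets"; the in-memory `SORTED_TERMS` list stands in for an index's term dictionary, and `toy_stem` is a hypothetical placeholder for a real stemmer.

```python
import bisect

SORTED_TERMS = sorted([
    "thumb", "tic", "tick", "ticket", "ticketed", "ticketing", "tickled",
    "ticklish", "ticktack", "ticktock", "tics", "tictac", "tidy",
])


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def prefix_candidates(sorted_terms, term, n=3):
    """Collect the terms that share the first n letters of the query term."""
    prefix = term[:n]
    start = bisect.bisect_left(sorted_terms, prefix)  # first term >= prefix
    candidates = []
    for candidate in sorted_terms[start:]:
        if not candidate.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        candidates.append(candidate)
    return candidates


def same_stem_expansions(sorted_terms, term, n=3):
    """Keep only the candidates whose stem equals the query term's stem."""
    target = toy_stem(term)
    return [
        c for c in prefix_candidates(sorted_terms, term, n)
        if c != term and toy_stem(c) == target
    ]


print(same_stem_expansions(SORTED_TERMS, "tickets"))
# ['ticket', 'ticketed', 'ticketing']
```

The binary search keeps the scan proportional to the number of terms sharing the prefix rather than to the size of the whole dictionary.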
- indexing of a document includes identifying a stem for a word included in the document and indexing the document in a document index by the identified stem, and query terms are stemmed (in other words, a query term is reduced to its stem) and documents are identified from the document index based on the stemmed query terms.
- a search for "ticketing" on a search engine employing this conventional technique would return documents containing the word "ticket” (the stem of "ticketing") and documents containing the word "tickets" and/or “ticketless” (which each have the same stem, "ticket,” as “ticketing”).
- this technique for indexing and searching results in reduced precision, as there may be a significant number of words, in many cases having little relevance to one another, that all have the same stem and consequently get indexed together under the same stem. As a result, the document index is less precise.
- the technique discussed in the previous paragraph, in conjunction with other aspects of this disclosure, is able to obtain improved results over this conventional technique: it can use a more precise index database, indexed according to words as found in documents, while still identifying the same breadth of documents as the conventional technique and facilitating an improved ranking of the identified documents.
- Another technique for identifying an expanded term related to an original query term is the use of a thesaurus, in which expanded terms for a given term are associated with each other.
- A thesaurus can associate synonyms without any common stem, such as "cat" and "feline."
- Thesaurus associations may be manually specified by a user or administrator, for example based on domain experience that certain terms are generally more effectively searched together.
- thesaurus associations may be automatically generated based on document processing by indexing engine 166. For example, a frequent co-occurrence of two terms in documents may be used to determine that the terms are sufficiently related to be associated in the thesaurus.
- the thesaurus associations may be generated based on automated analysis of queries submitted to search system 160. For example, the co-occurrence of terms in a single search or refined searches may be used to determine that the terms are sufficiently related to be associated in the thesaurus.
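- A co-occurrence-based thesaurus of this kind could be built roughly as sketched below. The raw document-level pair count and the `min_cooccurrence` threshold are assumptions made for illustration; a practical system would likely normalize counts by term frequency, and the same counting could be applied to logged queries instead of documents.

```python
from collections import Counter
from itertools import combinations


def build_thesaurus(texts, min_cooccurrence=2):
    """Associate term pairs that appear together in enough texts."""
    pair_counts = Counter()
    for text in texts:
        terms = sorted(set(text.lower().split()))
        pair_counts.update(combinations(terms, 2))  # each pair once per text

    thesaurus = {}
    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            thesaurus.setdefault(a, set()).add(b)
            thesaurus.setdefault(b, set()).add(a)
    return thesaurus


texts = [
    "hospital medical staff",
    "medical hospital ward",
    "plane aircraft",
    "aircraft plane hangar",
]
thesaurus = build_thesaurus(texts)
print(thesaurus["hospital"])  # {'medical'}
```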
- query expansion engine 167 is further configured to determine a weight intended for use with ranking of search results. This weighting is determined based on a lexical distance between the original query term for which an expanded query term was identified and the expanded query term.
- a lexical distance indicates a distance between two words according to a particular technique. A smaller value indicates a greater degree of similarity between the two words.
- One trivial example is to calculate an absolute difference in the number of characters for each word. According to this example, the lexical distance between "carry" and "carrier” is 2.
- Other techniques include, but are not limited to, determining a lexical distance based on the Jaro distance or the Jaro-Winkler distance techniques (taking into account that the normalized scores these techniques produce range from 0 for no match to 1 for a perfect match).
- the lexical distance is determined by determining an "edit distance" between an original query term and a corresponding expanded query term.
- An edit distance is determined by calculating a minimum cost of performing edit operations, which typically perform single character edits, to convert a first word to a second word.
- Edit operations may include, but are not limited to, replacement, insertion, deletion, and transposition of characters or character sequences.
- edit operations may have different costs, such as where insertions and deletions have the same cost, and replacements have twice the cost of an insertion.
- edit operations may be performed on phonetic units of one or more characters, rather than just individual characters.
- the Levenshtein distance is used to determine the lexical distance between an original query term and a corresponding expanded query term.
- Algorithms for calculating the Levenshtein distance, including Hirschberg's algorithm and the Wagner-Fischer algorithm, are known in the art.
- Other edit distances are known in the art, including the Damerau-Levenshtein distance, Monge-Elkan distance, and Smith-Waterman distance.
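- For concreteness, the following is a minimal sketch of the Wagner-Fischer dynamic-programming calculation with configurable per-operation costs. With all costs equal to 1 it yields the Levenshtein distance; setting `replace_cost=2` corresponds to the variant in which a replacement costs twice an insertion. The function and parameter names are illustrative, and transpositions are not handled.

```python
def edit_distance(source, target, insert_cost=1, delete_cost=1, replace_cost=1):
    """Minimum total cost of single-character edits turning source into target."""
    # previous[j] = cost of converting the processed prefix of source
    # into the first j characters of target.
    previous = [j * insert_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * delete_cost]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s_char == t_char else replace_cost)
            current.append(min(
                previous[j] + delete_cost,     # delete s_char
                current[j - 1] + insert_cost,  # insert t_char
                substitution,                  # keep or replace
            ))
        previous = current
    return previous[-1]


print(edit_distance("tickets", "ticket"))                     # 1
print(edit_distance("tickets", "ticketing"))                  # 3
print(edit_distance("tickets", "ticketing", replace_cost=2))  # 4
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of the target word.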
- query expansion engine 167 may be configured to determine that the greater of the two respective weightings is the only weighting applied for the expanded query term.
- a weight, reflecting an expected degree of relevance of an expanded query term to a query, is determined based on the determined lexical distance.
- the weight is determined according to a strictly decreasing function of lexical distance (under an assumption that increased lexical distance corresponds to decreased similarity between two terms).
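- One possible strictly decreasing mapping from lexical distance to weight is sketched below. The reciprocal form and the `scale` parameter are assumptions chosen for illustration; the source does not prescribe a particular function.

```python
def weight_from_distance(distance: float, scale: float = 1.0) -> float:
    """Map a non-negative lexical distance to a weight in (0, 1]."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + scale * distance)


weights = [weight_from_distance(d) for d in range(5)]
print(weights[0], weights[1], weights[3])  # 1.0 0.5 0.25
```

A distance of zero (an identical term) receives the full weight of 1.0, and every additional unit of distance lowers the weight.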
- the weight, although based on a lexical distance, may also be based on additional factors.
- the determination of distance and weight are collapsed together, whereby a weight based on a lexical distance is obtained.
- For example, the Jaro-Winkler distance, which generates a score ranging from 0 for no match to 1 for a perfect match between two words, may be directly used for weighting of expanded query terms.
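- A sketch of using a Jaro-Winkler score directly as a weight follows. The prefix scaling factor of 0.1 and the four-character prefix cap are the commonly used defaults for this measure and are assumed here rather than taken from the source.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity: 0.0 for no match, 1.0 for identical strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [ch for ch, m in zip(a, a_matched) if m]
    b_seq = [ch for ch, m in zip(b, b_matched) if m]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix of up to four characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


# The score itself serves as the weight of each expanded term.
weights = {t: jaro_winkler("tickets", t) for t in ("ticket", "tickled")}
print(weights["ticket"] > weights["tickled"])  # True
```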
- query expansion engine 167 determines whether the determined distance is at or above a threshold value. If so, the expanded query term is deemed to be too far removed from the original term, and accordingly the expanded query term is not included as part of a subsequent identification of documents relevant to the query. In an embodiment, after a weight is determined for an expanded query term, query expansion engine 167 determines whether the determined weight is at or below a threshold value. If so, the expanded query term is deemed to be insufficiently relevant to the original term, and accordingly the expanded query term is not included as part of a subsequent identification of documents relevant to the query.
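- The pruning of expanded terms can be sketched as below, dropping a term whose distance is at or above a maximum, or whose weight is at or below a minimum. The threshold values, the sample distances and weights, and the `filter_expansions` helper are all hypothetical.

```python
def filter_expansions(expansions, max_distance=None, min_weight=None):
    """Drop expanded terms that are too distant or too weakly weighted.

    expansions maps each expanded term to a (distance, weight) pair.
    """
    kept = {}
    for term, (distance, weight) in expansions.items():
        if max_distance is not None and distance >= max_distance:
            continue  # too far removed from the original term
        if min_weight is not None and weight <= min_weight:
            continue  # insufficiently relevant to the original term
        kept[term] = (distance, weight)
    return kept


expansions = {"ticket": (1, 0.5), "ticketing": (3, 0.25), "ticketless": (4, 0.2)}
print(sorted(filter_expansions(expansions, max_distance=4)))  # ['ticket', 'ticketing']
print(sorted(filter_expansions(expansions, min_weight=0.25)))  # ['ticket']
```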
- In step 450, the original query terms and expanded query terms identified by query expansion engine 167 are used by search engine 165 to identify documents in index database 161 which are relevant to the query from which the original query terms were identified.
- search engine 165 might be configured to identify each document containing one or more of the original or expanded query terms. As a result, a plurality of documents are identified as relevant to the query, although not necessarily in an order reflecting their relevance to the query.
- the weights generated for the expanded query terms are provided to ranking engine 168. These weights are used in step 460 by ranking engine 168 to rank the identified documents. In an embodiment, where a document is identified based on its inclusion of an expanded query term, the weight corresponding to the expanded query term is used for ranking the document. In a nonlimiting example, one may specify a weight or "boost factor" to the Lucene search engine for query terms using the caret symbol (^) in a search query string. In determining the relevance of documents to a search query, the Lucene search engine will apply the weighting in addition to other ranking factors, such as the frequency at which query terms appear throughout the entire indexed corpus of documents.
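- As an illustration of the boost-factor approach, the sketch below assembles a query string in which each expanded term carries its weight after a caret, in the form accepted by Lucene's classic query syntax (`term^boost`). The `build_boosted_query` helper and the sample weights are hypothetical, and escaping of characters that are special in the query syntax is omitted.

```python
def build_boosted_query(original_terms, expanded_weights):
    """Join original terms with expanded terms written as term^weight."""
    parts = list(original_terms)
    for term, weight in expanded_weights.items():
        parts.append(f"{term}^{weight:g}")
    return " ".join(parts)


query = build_boosted_query(["tickets"], {"ticket": 0.5, "ticketing": 0.25})
print(query)  # tickets ticket^0.5 ticketing^0.25
```

Original terms are left unboosted, so expanded terms with weights below 1 contribute proportionally less to a document's score.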
- This ranking is used in step 470 by search system 165 to generate reply 132b identifying the identified documents ordered according to the ranking.
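- The identification and ranking steps can be pictured with the toy sketch below, in which a document is retrieved if it contains any original or expanded term and is scored by the highest weight among the terms it contains. Scoring by maximum weight alone is an assumption made for illustration; as noted elsewhere in this description, a ranking engine would typically combine the weight with other factors.

```python
def rank_documents(docs, term_weights):
    """Return ids of matching documents, highest-weighted match first."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        matched = [w for term, w in term_weights.items() if term in words]
        if matched:
            scored.append((max(matched), doc_id))
    # Sort by descending weight, breaking ties by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


docs = {
    "d1": "open tickets today",
    "d2": "ticketing desk",
    "d3": "one ticket left",
    "d4": "train times",
}
# Original term at weight 1.0; expanded terms at their distance-based weights.
term_weights = {"tickets": 1.0, "ticket": 0.5, "ticketing": 0.25}
print(rank_documents(docs, term_weights))  # ['d1', 'd3', 'd2']
```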
- documents at or below a particular degree or ranking may be determined to be insufficiently relevant to query 131b, and as a result not identified in reply 132b.
- the identification of documents relevant to query 131a and their ranking may be combined, rather than successive steps.
- ranking engine 168 relies on other factors in addition to the above weighting based on lexical distance.
- weighting based on lexical distance is associated with a respective query term
- other weightings may be based on more document-specific considerations, such as, but not limited to, frequency of citation or access, or a score assigned to a creator or a provider of a given document.
- Other document features which may be used as ranking factors for a customer support ticket system include, but are not limited to ticket age, date of creation, last date of access, ticket status (for example, open or resolved), and number of comments.
- query 131b may include information which causes ranking engine 168 to include, exclude, and/or adjust factors in determining document ranking. For example, query 131b may instruct search system 160 to rank administrator-generated documents more highly than user-generated documents.
- a weight is not calculated, and ranking by ranking engine 168 relies on a lexical distance for expanded query terms.
- recursive expansions may be performed with or without corresponding weightings.
- query expansion engine 167 may identify a first expanded term using a thesaurus to find words associated with an original query term. As expansions identified from a thesaurus are more likely to have a greater lexical distance that does not correspond to their relevance to the original term, query expansion engine 167 may be configured not to associate a weighting based on lexical distance from the original query term with the first expanded term (although another weighting may be applied to the first expanded term to, for example, reduce the weight of the expanded term relative to the original term).
- query expansion engine 167 may generate a second expanded term by identifying words that have a stem in common with the first expanded term, and according a weight to the second expanded term according to its lexical distance from the first expanded term.
- the weight for the second expanded term might be reduced relative to a weight that would be determined were the second expanded term not a recursive expansion.
- a first expanded term may be generated by identifying words that have a stem in common with an original query term, and a second expanded term may be identified using a thesaurus to find words associated with an original query term.
- a first weighting may be determined for the first expanded term based on a lexical distance between the original query term and the first expanded term, and a second weighting may be determined for the second expanded term based on the first weighting. For example, if a weighting X were determined according to some method for the second expanded term, the weighting X might be multiplied by the first weighting to reflect the second expanded term being a recursive expansion and the relevance of the first expanded query term from which it was expanded to the original query term.
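- The multiplicative combination for recursive expansions amounts to the small calculation sketched below. The example terms and the weight values 0.5 are invented for illustration, and `recursive_weight` is a hypothetical helper.

```python
def recursive_weight(first_weight: float, own_weight: float) -> float:
    """Scale a recursive expansion's own weight X by its parent's weight."""
    return first_weight * own_weight


# "cats" -> "cat" (common stem, first weighting 0.5)
# "cat"  -> "feline" (thesaurus, weighting X = 0.5 by some other method)
first_weight = 0.5
x = 0.5
print(recursive_weight(first_weight, x))  # 0.25
```

Since both factors are at most 1, a recursive expansion never outweighs the term from which it was expanded.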
- query 131b may include information instructing search system 160 not to perform query expansion for some or all query terms included in query 131b.
- user 110 might enter a search phrase with a query term enclosed in quotation marks or preceded with a plus sign, which has the result that the query term is not expanded.
- query expansion engine 167 may be configured to identify terms for which it will not attempt to identify expansions, for example by way of a "do not expand" list.
- indexing engine 166 may index certain document data under various fields, such as document type, title, author, or date, enabling query 131b to specify query terms to be used in connection with certain fields.
- a fixed set of predetermined tags or labels may be defined, such as for a status field indicating whether a customer support ticket is new, open, pending, solved, or closed. In this example, a query term for the status field is not expanded.
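- The ways of suppressing expansion described above (quotation marks, a leading plus sign, a "do not expand" list, and fielded terms such as a status field) can be sketched together as follows. The `field:value` syntax, the contents of `DO_NOT_EXPAND` and `UNEXPANDED_FIELDS`, and the `classify_terms` helper are assumptions for illustration; the source does not specify a query grammar.

```python
import re

DO_NOT_EXPAND = {"acme"}        # hypothetical "do not expand" list
UNEXPANDED_FIELDS = {"status"}  # fields drawing on a fixed set of labels

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def classify_terms(query):
    """Return (term, expandable) pairs for each term in the query string."""
    result = []
    for quoted, bare in TOKEN.findall(query):
        if quoted:
            result.append((quoted, False))    # quoted term: never expanded
        elif bare.startswith("+") and len(bare) > 1:
            result.append((bare[1:], False))  # plus-prefixed term
        elif ":" in bare and bare.split(":", 1)[0] in UNEXPANDED_FIELDS:
            result.append((bare, False))      # fielded term, e.g. status:open
        else:
            result.append((bare, bare.lower() not in DO_NOT_EXPAND))
    return result


print(classify_terms('refund "credit card" +invoice status:open tickets'))
```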
- FIG. 2 is a block diagram that illustrates a computer system 200 upon which aspects of the invention may be implemented.
- Computer system 200 includes a bus 202 or other communication mechanism for communicating information, and a processor 204 coupled with bus 202 for processing information.
- Computer system 200 also includes a main memory 206, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 202 for storing information and instructions to be executed by processor 204.
- Main memory 206 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 204.
- Computer system 200 further includes a read only memory (ROM) 208 or other static storage device coupled to bus 202 for storing static information and instructions for processor 204.
- a storage device 210, such as a magnetic disk or optical disk, is provided and coupled to bus 202 for storing information and instructions.
- Computer system 200 may be coupled via bus 202 to a display 212, such as a cathode ray tube (CRT), for displaying information to a computer user.
- An input device 214 is coupled to bus 202 for communicating information and command selections to processor 204.
- Another type of user input device is cursor control 216, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 204 and for controlling cursor movement on display 212.
- This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
- The invention is related to the use of computer system 200 for implementing the techniques described herein. According to one embodiment of the invention, those techniques are performed by computer system 200 in response to processor 204 executing one or more sequences of one or more instructions contained in main memory 206. Such instructions may be read into main memory 206 from another machine-readable medium, such as storage device 210. Execution of the sequences of instructions contained in main memory 206 causes processor 204 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.
- The term "machine-readable medium" refers to any medium that participates in providing data that causes a machine to operate in a specific fashion.
- Various machine-readable media are involved, for example, in providing instructions to processor 204 for execution.
- Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media.
- Non-volatile media includes, for example, optical or magnetic disks, such as storage device 210.
- Volatile media includes dynamic memory, such as main memory 206.
- Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 202.
- Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications. All such media must be tangible to enable the instructions carried by the media to be detected by a physical mechanism that reads the instructions into a machine.
- Machine-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punchcards, papertape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.
- Various forms of machine-readable media may be involved in carrying one or more sequences of one or more instructions to processor 204 for execution.
- The instructions may initially be carried on a magnetic disk of a remote computer.
- The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem.
- A modem local to computer system 200 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal.
- An infrared detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 202.
- Bus 202 carries the data to main memory 206, from which processor 204 retrieves and executes the instructions.
- Computer system 200 also includes a communication interface 218 coupled to bus 202.
- Communication interface 218 provides a two-way data communication coupling to a network link 220 that is connected to a local network 222.
- Communication interface 218 may be an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line.
- Communication interface 218 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN.
- Wireless links may also be implemented.
- Communication interface 218 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
- Network link 220 typically provides data communication through one or more networks to other data devices.
- Network link 220 may provide a connection through local network 222 to a host computer 224 or to data equipment operated by an Internet Service Provider (ISP) 226.
- ISP 226 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet" 228.
- Internet 228 uses electrical, electromagnetic or optical signals that carry digital data streams.
- The signals through the various networks and the signals on network link 220 and through communication interface 218, which carry the digital data to and from computer system 200, are exemplary forms of carrier waves transporting the information.
- Computer system 200 can send messages and receive data, including program code, through the network(s), network link 220 and communication interface 218.
- A server 230 might transmit a requested code for an application program through Internet 228, ISP 226, local network 222 and communication interface 218.
- The received code may be executed by processor 204 as it is received, and/or stored in storage device 210, or other non-volatile storage for later execution. In this manner, computer system 200 may obtain application code in the form of a carrier wave.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2013290306A AU2013290306A1 (en) | 2012-07-09 | 2013-07-09 | Weight-based stemming for improving search quality |
EP13816835.6A EP2870549A4 (en) | 2012-07-09 | 2013-07-09 | Weight-based stemming for improving search quality |
JP2015521758A JP2015525929A (en) | 2012-07-09 | 2013-07-09 | Weight-based stemming to improve search quality |
CA2878891A CA2878891A1 (en) | 2012-07-09 | 2013-07-09 | Weight-based stemming for improving search quality |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/544,890 US8661049B2 (en) | 2012-07-09 | 2012-07-09 | Weight-based stemming for improving search quality |
US13/544,890 | 2012-07-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014011689A1 (en) | 2014-01-16 |
Family ID: 49879306
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2013/049798 WO2014011689A1 (en) | 2012-07-09 | 2013-07-09 | Weight-based stemming for improving search quality |
Country Status (6)
Country | Link |
---|---|
US (1) | US8661049B2 (en) |
EP (1) | EP2870549A4 (en) |
JP (1) | JP2015525929A (en) |
AU (1) | AU2013290306A1 (en) |
CA (1) | CA2878891A1 (en) |
WO (1) | WO2014011689A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3482304A4 (en) * | 2016-07-08 | 2020-03-11 | Newvoicemedia US Inc. | Concept-based search and categorization |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8881005B2 (en) * | 2012-04-20 | 2014-11-04 | King Abdulaziz City For Science And Technology | Methods and systems for large-scale statistical misspelling correction |
US20140040302A1 (en) * | 2012-05-08 | 2014-02-06 | Patrick Sander Walsh | Method and system for developing a list of words related to a search concept |
US20140072226A1 (en) * | 2012-09-13 | 2014-03-13 | International Business Machines Corporation | Searching and Sorting Image Files |
US11144563B2 (en) * | 2012-11-06 | 2021-10-12 | Matthew E. Peterson | Recurring search automation with search event detection |
US10678870B2 (en) * | 2013-01-15 | 2020-06-09 | Open Text Sa Ulc | System and method for search discovery |
CN105447004B (en) * | 2014-08-08 | 2019-12-03 | 北京小度互娱科技有限公司 | The excavation of word, relevant inquiring method and device are recommended in inquiry |
WO2016161432A1 (en) | 2015-04-03 | 2016-10-06 | Xsell Technologies | Method and apparatus to increase personalization and enhance chat experiences on the internet |
JP6340351B2 (en) * | 2015-10-05 | 2018-06-06 | 日本電信電話株式会社 | Information search device, dictionary creation device, method, and program |
US10217020B1 (en) * | 2016-12-19 | 2019-02-26 | Matrox Electronic Systems Ltd. | Method and system for identifying multiple strings in an image based upon positions of model strings relative to one another |
US10586237B2 (en) | 2017-09-20 | 2020-03-10 | XSELL Technologies, Inc. | Method, apparatus, and computer-readable media for customer interaction semantic annotation and analytics |
US10909210B2 (en) * | 2018-03-22 | 2021-02-02 | Ovh | Method and system for defining a web site development strategy |
US11042601B2 (en) | 2018-11-15 | 2021-06-22 | Ovh | Method for attracting users to a web page and server implementing the method |
US11526565B2 (en) | 2019-04-05 | 2022-12-13 | Ovh | Method of and system for clustering search queries |
Family Cites Families (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5317507A (en) | 1990-11-07 | 1994-05-31 | Gallant Stephen I | Method for document retrieval and for word sense disambiguation using neural networks |
US5325298A (en) | 1990-11-07 | 1994-06-28 | Hnc, Inc. | Methods for generating or revising context vectors for a plurality of word stems |
US6470307B1 (en) | 1997-06-23 | 2002-10-22 | National Research Council Of Canada | Method and apparatus for automatically identifying keywords within a document |
CN101051311A (en) * | 2000-04-18 | 2007-10-10 | 韩国电气通信公社 | Method for extracting central term of headword through central term dictionary and information search system of the same |
WO2002019147A1 (en) | 2000-08-28 | 2002-03-07 | Emotion, Inc. | Method and apparatus for digital media management, retrieval, and collaboration |
CA2323883C (en) | 2000-10-19 | 2016-02-16 | Patrick Ryan Morin | Method and device for classifying internet objects and objects stored on computer-readable media |
US20030020749A1 (en) | 2001-07-10 | 2003-01-30 | Suhayya Abu-Hakima | Concept-based message/document viewer for electronic communications and internet searching |
US7194455B2 (en) * | 2002-09-19 | 2007-03-20 | Microsoft Corporation | Method and system for retrieving confirming sentences |
US8055669B1 (en) * | 2003-03-03 | 2011-11-08 | Google Inc. | Search queries improved based on query semantic information |
US20050149499A1 (en) | 2003-12-30 | 2005-07-07 | Google Inc., A Delaware Corporation | Systems and methods for improving search quality |
WO2005096174A1 (en) | 2004-04-02 | 2005-10-13 | Health Communication Network Limited | Method, apparatus and computer program for searching multiple information sources |
CA2545232A1 (en) | 2005-07-29 | 2007-01-29 | Cognos Incorporated | Method and system for creating a taxonomy from business-oriented metadata content |
US7548929B2 (en) | 2005-07-29 | 2009-06-16 | Yahoo! Inc. | System and method for determining semantically related terms |
US7668887B2 (en) | 2005-12-01 | 2010-02-23 | Object Positive Pty Ltd | Method, system and software product for locating documents of interest |
KR100785928B1 (en) * | 2006-07-04 | 2007-12-17 | 삼성전자주식회사 | Method and system for searching photograph using multimodal |
US8234107B2 (en) | 2007-05-03 | 2012-07-31 | Ketera Technologies, Inc. | Supplier deduplication engine |
US20080313202A1 (en) | 2007-06-12 | 2008-12-18 | Yakov Kamen | Method and apparatus for semantic keyword clusters generation |
CN101802812B (en) | 2007-08-01 | 2015-07-01 | 金格软件有限公司 | Automatic context sensitive language correction and enhancement using an internet corpus |
US7788276B2 (en) | 2007-08-22 | 2010-08-31 | Yahoo! Inc. | Predictive stemming for web search with statistical machine translation models |
US8402046B2 (en) * | 2008-02-28 | 2013-03-19 | Raytheon Company | Conceptual reverse query expander |
US8290975B2 (en) | 2008-03-12 | 2012-10-16 | Microsoft Corporation | Graph-based keyword expansion |
KR100931025B1 (en) | 2008-03-18 | 2009-12-10 | 한국과학기술원 | Query expansion method using additional terms to improve accuracy without compromising recall |
US9411886B2 (en) * | 2008-03-31 | 2016-08-09 | Yahoo! Inc. | Ranking advertisements with pseudo-relevance feedback and translation models |
US8473279B2 (en) | 2008-05-30 | 2013-06-25 | Eiman Al-Shammari | Lemmatizing, stemming, and query expansion method and system |
US8041729B2 (en) * | 2009-02-20 | 2011-10-18 | Yahoo! Inc. | Categorizing queries and expanding keywords with a coreference graph |
US8214363B2 (en) | 2009-07-06 | 2012-07-03 | Abhilasha Chaudhary | Recognizing domain specific entities in search queries |
US8280900B2 (en) | 2010-08-19 | 2012-10-02 | Fuji Xerox Co., Ltd. | Speculative query expansion for relevance feedback |
US9589072B2 (en) * | 2011-06-01 | 2017-03-07 | Microsoft Technology Licensing, Llc | Discovering expertise using document metadata in part to rank authors |
2012
- 2012-07-09: US application US13/544,890 (US8661049B2), status Active
2013
- 2013-07-09: JP application JP2015521758A (JP2015525929A), status Pending
- 2013-07-09: EP application EP13816835.6A (EP2870549A4), status Withdrawn
- 2013-07-09: AU application AU2013290306A (AU2013290306A1), status Abandoned
- 2013-07-09: WO application PCT/US2013/049798 (WO2014011689A1), status Application Filing
- 2013-07-09: CA application CA2878891A (CA2878891A1), status Abandoned
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040024759A1 (en) * | 2002-07-31 | 2004-02-05 | International Business Machines Corporation | Communicating state information in a network employing extended queries and extended responses |
US20040059564A1 (en) * | 2002-09-19 | 2004-03-25 | Ming Zhou | Method and system for retrieving hint sentences using expanded queries |
US20090012778A1 (en) * | 2007-07-05 | 2009-01-08 | Nec (China) Co., Ltd. | Apparatus and method for expanding natural language query requirement |
US8180628B2 (en) * | 2007-07-05 | 2012-05-15 | Nec (China) Co., Ltd. | Apparatus and method for expanding natural language query requirement |
US20090164456A1 (en) * | 2007-12-20 | 2009-06-25 | Malcolm Slaney | Expanding a query to include terms associated through visual content |
Non-Patent Citations (1)
Title |
---|
See also references of EP2870549A4 * |
Also Published As
Publication number | Publication date |
---|---|
JP2015525929A (en) | 2015-09-07 |
EP2870549A1 (en) | 2015-05-13 |
EP2870549A4 (en) | 2016-03-09 |
US20140012841A1 (en) | 2014-01-09 |
US8661049B2 (en) | 2014-02-25 |
CA2878891A1 (en) | 2014-01-16 |
AU2013290306A1 (en) | 2015-02-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 13816835, Country of ref document: EP, Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2878891, Country of ref document: CA; Ref document number: 2015521758, Country of ref document: JP, Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | REEP | Request for entry into the european phase | Ref document number: 2013816835, Country of ref document: EP |
| | WWE | Wipo information: entry into national phase | Ref document number: 2013816835, Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2013290306, Country of ref document: AU, Date of ref document: 20130709, Kind code of ref document: A |