US20140207782A1 - System and method for computerized semantic processing of electronic documents including themes - Google Patents

System and method for computerized semantic processing of electronic documents including themes

Info

Publication number
US20140207782A1
US20140207782A1 (application US14/161,159)
Authority
US
United States
Prior art keywords
document
documents
themes
user
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/161,159
Inventor
Yiftach Ravid
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Israel Research and Development 2002 Ltd
Original Assignee
Equivio Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Equivio Ltd
Priority to US14/161,159
Assigned to EQUIVIO LTD. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: RAVID, YIFTACH
Publication of US20140207782A1
Assigned to MICROSOFT ISRAEL RESEARCH AND DEVELOPMENT (2002) LTD. MERGER (SEE DOCUMENT FOR DETAILS). Assignors: EQUIVIO LTD

Classifications

    • G06F17/30011
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/288Entity relationship models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Definitions

  • the present invention relates generally to computerized processing of electronic documents and more particularly to computerized semantic processing of electronic documents.
  • deduplication may refer to Data deduplication, as above, or to Record linkage, in databases, i.e. finding entries that refer to the same entity in two or more files. ‘DeDuping’ may involve removing duplicates in Customer and Address records in a Database or Spreadsheet.
  • Clustering can be used to assist browsing: “browsing tools complement search tools”, e.g. as described in the publication at pages.cs.wisc.edu/~pradheep/Clust-LDA.pdf.
  • Document score the significance of an individual theme in a particular document. For example, if a topic modeling process defines a topic as a distribution over a fixed vocabulary and assumes that each document includes various topics, each with different proportions determined by a per-document distribution over topics, then a document's “score” for a particular theme may be that theme's probability under the document's distribution over the universe of topics; this “score” is typically generated in the course of performing conventional topic-modeling processes. In other words, each topic x has some probability θ_yx of being in document y [Blei, D., Ng, A., Jordan, M. (2003) Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993-1022].
  • Inclusive an e-mail whose subject and/or body is not contained in any other e-mail in a given set of emails. This definition implies that if an email is not an inclusive, its subject and/or body is contained in one of the “inclusive” emails defined for that set of emails. Often, an inclusive culminates and includes an entire email thread.
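The containment test implied by this definition can be sketched as follows (an illustrative Python sketch, not the patented implementation; the email dict shape is an assumption, and a real system would normalize quoting, headers and whitespace before testing containment):

```python
def find_inclusives(emails):
    """emails: list of dicts with 'subject' and 'body' keys (an assumed shape).
    Returns the emails whose combined text is not contained in any other
    email's combined text, per the definition of "inclusive" above."""
    def text(e):
        return e["subject"] + "\n" + e["body"]
    inclusives = []
    for i, e in enumerate(emails):
        # An email is non-inclusive if its text appears inside another email.
        contained = any(text(e) in text(o) for j, o in enumerate(emails) if j != i)
        if not contained:
            inclusives.append(e)
    return inclusives
```

Note that two byte-identical emails would each be contained in the other, so exact duplicates would typically be removed before this pass.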
  • Overlap Themes are considered related, or similar, if they overlap. “Overlap” may be computed by noting that each document can be assigned to more than one theme.
  • X documents are in theme T1, and Y documents in theme T2;
  • the overlap between T1 and T2 is (size of intersection of X and Y)/(size of X).
  • the definition of overlap need not be symmetrical; the number of documents that belong to both themes may be divided by the size of the topic of interest. For example if one theme (say, “parrots”) is included in another theme (say, “birds”) all documents in the small theme are typically also included in the second theme. Then the overlap from the viewpoint of the small theme is 100% and from the viewpoint of the larger theme is less than 100%.
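The asymmetric overlap measure described above can be sketched directly (illustrative only; theme membership sets are invented):

```python
def overlap(theme_a_docs, theme_b_docs):
    """Overlap of theme A with theme B, from A's viewpoint:
    |A intersect B| / |A|. Not symmetric in general."""
    a, b = set(theme_a_docs), set(theme_b_docs)
    if not a:
        return 0.0
    return len(a & b) / len(a)
```

For the “parrots”/“birds” example, overlap(parrots, birds) is 1.0 from the small theme's viewpoint, while overlap(birds, parrots) is less than 1.0.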
  • Pivot a document which is representative, in some sense, of a set of near-duplicates. Any suitable application-specific policy may be employed to define which document is representative, e.g. the document in the set which has the highest/lowest/median number of words.
  • Equivio Zoom's Relevance functionality is a commercially available software tool that uses “supervised” machine learning; hence there is typically a human expert who trains the system.
  • In the themes functionality described herein, by contrast, there is typically no supervision or, in some versions, topics may be “semi-supervised”.
  • Themes functionality as described herein is useful inter alia in training a system by allowing a human trainer to search for relevancy by browsing through a large collection of electronic documents e.g. as described herein. Since in many cases, finding relevant documents to a specific issue is no simple task, themes functionality described herein may facilitate the process of finding such documents.
  • Pruning reducing the size of (the number and/or size of members in) a data set by removing some members of the set, e.g. to achieve a predetermined number/total size of members in the set, including prioritizing removal of set members known to be superfluous, e.g. duplicates, over removal of set members not known to be superfluous, which is lower priority.
  • some computerized topic modeling methods yield models in which certain documents have a mixture of topics.
  • “Topic” refers to the output of conventional topic modeling, whereas “theme” typically refers to that output as further processed in accordance with embodiments of the present invention.
  • Word score the significance of an individual word to an individual theme. For example, if a topic modeling process defines a topic as a distribution over a fixed vocabulary and assumes that each document includes various topics, each with different proportions determined by a per-document distribution over topics, then each word's “score” may be its probability under the distribution over the fixed vocabulary defined by the topic; this “score” is typically generated in the course of performing conventional topic-modeling processes. In other words, each word x has some probability β_yx of being in theme y [Blei, D., Ng, A., Jordan, M. (2003) Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993-1022].
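The “document score” and “word score” definitions above correspond to the per-document topic proportions and per-topic word probabilities produced by standard LDA implementations. A minimal illustration using scikit-learn (an assumption; the patent does not specify an implementation, and the toy corpus is invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "parrot bird feather wing",
    "bird wing flight feather",
    "court filing motion discovery",
    "discovery motion court email",
]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Rows are documents, columns are themes: each row is the document's
# distribution over themes ("document scores").
doc_scores = lda.transform(X)

# Rows are themes, columns are vocabulary words; normalizing each row gives
# each word's probability within the theme ("word scores").
word_scores = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```

Both matrices are exactly the quantities the browsing embodiments below sort and threshold.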
  • Certain embodiments of the present invention seek to provide computerized systems and methods for use of themes in e-discovery and other semantic tasks.
  • Certain embodiments of the present invention seek to provide methods for computerized identification of themes in a large data set.
  • Certain embodiments of the present invention seek to provide methods for use of multi-topic modeling in e-discovery and other semantic tasks.
  • a method for computerized identification of themes in a large data set comprising:
  • a method according to Embodiment 1 wherein the computerized data set member pruning technique comprises thinning out at least one document which passes a document similarity criterion relative to at least one other document not being thinned out, thereby to combat skewing as a result of over-influence of similar, hence over-represented, documents upon the theme identification technique.
  • a method according to Embodiment 2 wherein the thinning out at least one document which passes a document similarity criterion comprises replacing a plurality of emails forming an email thread, with at least one inclusive email, thereby to thin out emails which are included in the inclusive email hence are deemed to pass the document similarity criterion with regard to the inclusive.
  • a method according to Embodiment 2 wherein the thinning out at least one document which passes a document similarity criterion comprises identifying and discarding near-duplicates thereby to thin out at least one document which is deemed to pass the document similarity criterion with regard to a set of near-duplicates of the document, at least one of which is not being thinned out.
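One way to realize this embodiment's thinning is a greedy pass that discards any document too similar to one already kept (a sketch only; the Jaccard word-set similarity and the threshold are assumptions, and a production system's similarity measure and pivot-selection policy may differ):

```python
def jaccard(a, b):
    """Word-set Jaccard similarity between two document texts."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def thin_near_duplicates(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any kept document,
    so each near-duplicate set contributes (roughly) one representative."""
    kept = []
    for d in docs:
        if all(jaccard(d, k) <= threshold for k in kept):
            kept.append(d)
    return kept
```
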
  • a method according to Embodiment 1 wherein the computerized theme identification technique comprises topic modeling.
  • LDA Latent Dirichlet allocation
  • a browsing system operative in conjunction with a stored representation of a multiplicity of electronic documents and their distribution over a plurality of themes, the system comprising:
  • theme-to-document flitting apparatus for retrieving and presenting to a user, documents whose document score for at least one user-selected theme is high;
  • document-level browsing apparatus for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes.
  • theme-to-word flitting apparatus for retrieving and presenting to a user, words whose word score for at least one user-selected theme is high;
  • word-level browsing apparatus for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes
  • wherein the at least one computerized data set member pruning technique comprises at least one technique other than random selection.
  • a method according to Embodiment 5 wherein the topic modeling which allows documents to have a plurality of topics comprises one of the following computerized techniques: Latent Dirichlet allocation (LDA), PLSI, and Pachinko allocation.
  • a method according to Embodiment 3 wherein the thinning out at least one document which passes a document similarity criterion comprises replacing a plurality of emails forming an email thread, with a single inclusive email.
  • An e-discovery method comprising:
  • Step 1. Input: a set of electronic documents.
  • Step 2. Extract text from the data collection.
  • Step 3. Compute Near-duplicates (ND) on the dataset.
  • Step 4. Compute Email threads (ET) on the dataset.
  • Step 5. Run topic modeling on a subset of the dataset, including data manipulation.
  • Step 3 includes all documents having the same DuplicateSubsetID, i.e. documents having identical text.
  • Step 3 includes all documents x in the set for which there is another document y in the set such that the similarity between the two is greater than some threshold.
  • Step 3 includes at least one pivot document selected by a policy, such as maximum number of words in the document.
  • ET Email threads
  • Words whose number of appearances is more than (parameter) times the number of words in the document.
  • a method according to Embodiment a1 wherein the output of step 5 includes an assignment of documents to the themes, and an assignment of words (features) to themes, wherein each feature x has some probability P_xy of being in theme y, and wherein the P matrix is used to construct names for at least one theme.
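One plausible use of the P matrix for naming, echoing the naming note elsewhere in the description (words frequent within the theme and infrequent outside it); the distinctiveness scoring rule here is an assumption, since the claim leaves the exact policy open:

```python
def name_theme(P, theme, k=3):
    """P: dict mapping theme -> {word: probability} (rows of the P matrix).
    Returns a name built from the k words scoring highest in this theme
    relative to their best score in any other theme."""
    others = [t for t in P if t != theme]
    def distinctiveness(word):
        rival = max((P[t].get(word, 0.0) for t in others), default=0.0)
        return P[theme][word] - rival
    top = sorted(P[theme], key=distinctiveness, reverse=True)[:k]
    return " ".join(top)
```
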
  • a method for computerized Early Case Assessment comprising:
  • ND near-duplicates
  • ET Email threads
  • Select pivot and inclusive; e. Run topic modeling; g. Generate theme names; and h. Explore the data by browsing themes.
  • a computer program comprising computer program code means for performing any of the methods shown and described herein when the program is run on a computer; and a computer program product, comprising a typically non-transitory computer-usable or -readable medium e.g. non-transitory computer-usable or -readable storage medium, typically tangible, having a computer readable program code embodied therein, the computer readable program code adapted to be executed to implement any or all of the methods shown and described herein. It is appreciated that any or all of the computational steps shown and described herein may be computer-implemented.
  • non-transitory is used herein to exclude transitory, propagating signals or waves, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.
  • Any suitable processor, display and input means may be used to process, display e.g. on a computer screen or other computer output device, store, and accept information such as information used by or generated by any of the methods and apparatus shown and described herein; the above processor, display and input means including computer programs, in accordance with some or all of the embodiments of the present invention.
  • any or all functionalities of the invention shown and described herein, such as but not limited to steps of flowcharts, may be performed by a conventional personal computer processor, workstation or other programmable device or computer or electronic computing device or processor, either general-purpose or specifically constructed, used for processing; a computer display screen and/or printer and/or speaker for displaying; machine-readable memory such as optical disks, CDROMs, DVDs, BluRays, magnetic-optical discs or other discs; RAMs, ROMs, EPROMs, EEPROMs, magnetic or optical or other cards, for storing, and keyboard or mouse for accepting.
  • the term “process” as used above is intended to include any type of computation or manipulation or transformation of data represented as physical, e.g. electronic, phenomena which may occur or reside e.g. within registers and/or memories of a computer or processor.
  • the term processor includes a single processing unit or a plurality of distributed or remote such units.
  • the above devices may communicate via any conventional wired or wireless digital communication means, e.g. via a wired or cellular telephone network or a computer network such as the Internet.
  • the apparatus of the present invention may include, according to certain embodiments of the invention, machine readable memory containing or otherwise storing a program of instructions which, when executed by the machine, implements some or all of the apparatus, methods, features and functionalities of the invention shown and described herein.
  • the apparatus of the present invention may include, according to certain embodiments of the invention, a program as above which may be written in any conventional programming language, and optionally a machine for executing the program such as but not limited to a general purpose computer which may optionally be configured or activated in accordance with the teachings of the present invention. Any of the teachings incorporated herein may wherever suitable operate on signals representative of physical objects or substances.
  • the term “computer” should be broadly construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, personal computers, servers, computing system, communication devices, processors (e.g. digital signal processor (DSP), microcontrollers, field programmable gate array (FPGA), application specific integrated circuit (ASIC), etc.) and other electronic computing devices.
  • processors e.g. digital signal processor (DSP), microcontrollers, field programmable gate array (FPGA), application specific integrated circuit (ASIC), etc.
  • DSP digital signal processor
  • FPGA field programmable gate array
  • ASIC application specific integrated circuit
  • Any suitable input device such as but not limited to a sensor, may be used to generate or otherwise provide information received by the apparatus and methods shown and described herein.
  • Any suitable output device or display may be used to display or output information generated by the apparatus and methods shown and described herein.
  • Any suitable processor may be employed to compute or generate information as described herein e.g. by providing one or more modules in the processor to perform functionalities described herein.
  • Any suitable computerized data storage e.g. computer memory may be used to store information received by or generated by the systems shown and described herein.
  • Functionalities shown and described herein may be divided between a server computer and a plurality of client computers. These or any other computerized components shown and described herein may communicate between themselves via a suitable computer network.
  • FIG. 1 is a simplified flowchart illustration of a method for use of themes in e-discovery, according to certain embodiments.
  • FIG. 2 is a simplified flowchart illustration of a method for early case assessment, according to certain embodiments.
  • FIGS. 3a-3b, taken together, form a simplified flowchart illustration of a method for associating topics with documents, according to certain embodiments.
  • FIG. 4 is a simplified flowchart illustration of a “navigating” or browsing method for generating suitable displays to facilitate computer-aided theme exploration, suitable e.g. for implementing step 100 in FIGS. 3a-3b, taken together, according to certain embodiments.
  • FIG. 5 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments.
  • the screen display facilitates theme-level browsing, according to certain embodiments.
  • FIG. 6 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, flitting from document-level to theme-level or word-level is facilitated.
  • FIG. 7 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, document-level browsing is facilitated.
  • FIG. 8 is a simplified flowchart illustration, according to certain embodiments, of a method for utilizing computerized themes functionality under these circumstances.
  • Computational components described and illustrated herein can be implemented in various forms, for example, as hardware circuits such as but not limited to custom VLSI circuits or gate arrays or programmable hardware devices such as but not limited to FPGAs, or as software program code stored on at least one tangible or intangible computer readable medium and executable by at least one processor, or any suitable combination thereof.
  • a specific functional component may be formed by one particular sequence of software code, or by a plurality of such, which collectively behave or act as described herein with reference to the functional component in question.
  • the component may be distributed over several code sequences such as but not limited to objects, procedures, functions, routines and programs and may originate from several computer files which typically operate synergistically.
  • Data can be stored on one or more tangible or intangible computer readable media stored at one or more different locations, different network nodes or different storage devices at a single node or location.
  • Suitable computer data storage or information retention apparatus may include apparatus which is primary, secondary, tertiary or off-line; which is of any type or level or amount or category of volatility, differentiation, mutability, accessibility, addressability, capacity, performance and energy use; and which is based on any suitable technologies such as semiconductor, magnetic, optical, paper and others.
  • FIGS. 3a-3b, taken together, form a simplified flowchart illustration of a method for associating topics with documents, according to certain embodiments.
  • The method of FIGS. 3a-3b typically includes some or all of the following steps, suitably ordered e.g. as shown:
  • Select one document (say, the pivot document) from each set of near-duplicates, thereby to yield a set X1 of documents.
  • a name may comprise one or more of the words most frequently found in the documents pertaining to the topic and less frequently or infrequently found in documents not pertaining to the topic.
  • Step 40a may be performed only on non-emails or may be performed on all documents, e-mails and non-emails alike (e.g. e-mails are considered documents).
  • Step 40b is typically performed on e-mail bodies, i.e. without their attachments.
  • Step 40c is typically performed on e-mails without attachments. After inclusives are identified, near-duplicate identification is applied to them, and typically just one or a few e-mail/s from each group of text-similar e-mails is/are selected. For example, if an email thread has several inclusives, only one of them might be selected.
  • Typically, random step 50 is performed after near-duplicate and inclusive steps 20, 30 and 40, to enable a user to ascertain whether further random pruning is necessary, since it is possible that steps 20, 30, 40 reduce the size of the data set sufficiently without requiring any random pruning.
  • random pruning may occur before steps 20 , 30 , and 40 .
  • Typically, random step 50 is performed only when it is desired to reduce processing time, whereas for a small set of documents, e.g. fewer than 400 thousand documents, step 50 may be omitted.
  • the system computes cost (monetary or in terms of time) of topic modeling both with random selection and without.
  • the system may for example compute the time or cost to compute a topic model on a random sample which is, say, 50%/10%/1% the size of the original data set.
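The cost computation might, for instance, time the model on a small random sample and extrapolate (a rough sketch; linear extrapolation is an assumption, since topic-model cost need not scale linearly with corpus size, and the `fit` callable is a stand-in for the actual modeling step):

```python
import random
import time

def estimate_full_cost(docs, fit, fraction=0.01):
    """Fit the model on a random sample of the data set, time it, and
    extrapolate the elapsed time to the full data set size."""
    sample = random.sample(docs, max(1, int(len(docs) * fraction)))
    start = time.perf_counter()
    fit(sample)
    elapsed = time.perf_counter() - start
    return elapsed * (len(docs) / len(sample))
```
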
  • FIG. 4 is a “navigating” or browsing method for generating suitable displays to facilitate computer-aided theme exploration, suitable e.g. for implementing step 100 in FIGS. 3a-3b, taken together.
  • the method of FIG. 4 may be employed to facilitate computer-aided exploration of any set of themes, which need not have been generated using any or all of steps 10 - 90 in FIGS. 3 a - 3 b , taken together.
  • the method of FIG. 4 may include some or all of the following steps, suitably ordered e.g. as shown:
  • step 420 Sort themes by a default or user-selected (in step 410 ) theme attribute and display themes in order determined by sort process OR display only themes which match a criterion (example criteria: more than 85% of documents in theme are relevant to user-selected predicate, theme name includes “Kennedy”, theme includes more than 1000 documents).
  • Document attribute may include metadata (Custodian, date) or theme related data (e.g. relevance of document to selected predicate, e.g. using Equivio relevance software tool) and display documents in themes in order determined by sort process.
  • metadata Custodian, date
  • theme related data e.g. relevance of document to selected predicate, e.g. using Equivio relevance software tool
  • 470 responsive to a user's selection of a document attribute (e.g. metadata (Custodian, date)), compute distribution and display (e.g. as histogram): number (or %) of documents under (say) custodian C or date D belonging to each theme.
  • a document attribute e.g. metadata (Custodian, date)
  • compute distribution and display e.g. as histogram: number (or %) of documents under (say) custodian C or date D belonging to each theme.
  • Themes functionality herein is particularly useful for identifying relevant documents in a large collection of electronic documents which is sparse in that only a small number of documents are relevant to a particular issue. This is especially the case if it is not possible to identify keywords which can be used to tag relevant documents on the basis of a simple keyword search.
  • FIG. 5 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments.
  • each theme is presented together with a bar (right side of screen) indicating relevance to a user-selected issue.
  • The order of presentation of the themes is in accordance with the length of the bar.
  • the screen display facilitates theme-level browsing, according to certain embodiments.
  • the bars may indicate the number of files per theme, that are Relevant (e.g. as determined manually, or automatically e.g. by Equivio Zoom's Relevance functionality) to a user-selected issue which may if desired be shown on, and selected via, a suitable GUI (not shown).
  • FIG. 6 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments.
  • documents are presented along with (on the left) words in the documents and their word scores, relative to an individual theme, as well as themes related to the individual theme.
  • the words and themes may be presented in descending order of their word scores and relatedness to the individual theme, respectively. If a related theme or word is selected (e.g. clicked upon), a different “semantic view” is generated; for example, of all documents in the selected related theme, using a screen format which may be similar to that of FIG. 6 . As shown, flitting from document-level to theme-level or word-level is facilitated.
  • FIG. 7 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, document-level browsing is facilitated.
  • FIG. 8 is a simplified flowchart illustration, according to certain embodiments, of a method for utilizing computerized Themes functionality under these circumstances.
  • the method of FIG. 8 typically includes some or all of the following steps, suitably ordered e.g. as shown:
  • 1010 use computerized Themes functionality to identify an initial “seed” set of (say 5-30) relevant documents in a large sparse collection of electronic documents.
  • step 1010 may be performed in any suitable manner.
  • step 1010 may comprise:
  • Some topic modeling software provides “document scores” for each document relative to each theme, indicating significance of each of the various themes to each document.
  • the theme can be regarded as highly significant to the former documents and less significant to the latter documents.
  • A suitable metric for similarity between Document 1's “thematic distribution” and Document 2's “thematic distribution” may, for example, be the Euclidean distance (e.g. based on a sum of squares) between the vector of Document 1's document scores over all themes and the vector of Document 2's document scores over all themes.
  • Other distance metrics may also be employed, e.g. L-infinity distance (max entry distance) and L1 (Manhattan) distance.
  • Certain embodiments allow a 3-tier browsing system to be generated, in which a user can browse at the word, document/file or topic level, and can move from one level to another.
  • A user may look at a presentation of topics, arranged, say, by relevance to an issue, and the system may present to her or him words or documents whose scores for the theme/s the user has selected are high.
  • The system may, for example, compute word scores or document scores for all words or documents, sort the words or documents, and present to the user only those whose word or document scores are high. The user may then select one of those words or documents, thereby browsing to a different level.
  • When s/he does select, say, a document scoring high for the topic s/he was previously viewing, the system then shows the document, and also identifies and displays indications of themes to which the document is strongly related, and words whose word scores are high for those themes.
  • the system may do this by computing the degree of relatedness of the document to all themes (each of the themes), sorting the themes on this basis, and presenting to the user only those themes for which the document's degree of relatedness is high. Again the user can change level, from the document level up to the topic level or down to the word level, or the user may continue to browse at the document level, e.g.
  • the system may compute all documents' distributions over all themes identified, and may also compute the distances between these distributions, either in advance for all document pairs, or in real time for a user-designated document. The system may then present, responsive to a user request for documents similar to document D, the top few documents from a list of documents sorted in accordance with the documents' respective distances from Document D. Alternatively, a user may perform “word-level” browsing by moving from one word to another word which has a similar distribution over N identified topics.
  • the system may compute all words' distributions over all themes identified, and may also compute the distances between these distributions, either in advance for all word pairs, or in real time for a user-designated word.
  • the system may then present, responsive to a user request for words similar to an individual word W of interest, the top few words from a list of words sorted in accordance with the words' respective distances from Word W.
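The “top few” presentation step described above reduces to sorting candidate items by the distance of their theme distributions from the selected item (an illustrative sketch; Euclidean distance is used here, though the text permits other metrics, and the dict shape is an assumption):

```python
def most_similar(target_dist, all_dists, top_n=3):
    """target_dist: theme-score vector of the selected word or document.
    all_dists: dict mapping item -> theme-score vector.
    Returns the top_n items closest to the target."""
    def dist(d):
        return sum((a - b) ** 2 for a, b in zip(target_dist, d)) ** 0.5
    ranked = sorted(all_dists, key=lambda item: dist(all_dists[item]))
    return ranked[:top_n]
```

As the description notes, these distances may be computed in advance for all pairs, or in real time for a single user-designated item.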
  • Another embodiment of the invention e.g. as described above with reference to FIG. 4 , is a browsing system operative in conjunction with a stored representation of a multiplicity of electronic documents and their distribution over a plurality of themes, the system comprising some or all of the following:
  • theme-to-word flitting apparatus for retrieving and presenting to a user, words whose word score for at least one user-selected theme is high;
  • theme-to-document flitting apparatus for retrieving and presenting to a user, documents whose document score for at least one user-selected theme is high;
  • document-level browsing apparatus for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes
  • word-level browsing apparatus for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes
  • the number of themes to identify is selected by a user and any suitable number of themes may be requested by the user such as 10, 20, 50, 100, 200 or 500 themes.
  • the number of themes selected may be, perhaps, 200 themes for a collection of a few hundred thousand electronic documents, and proportionally more or fewer themes if the number of documents in the collection is proportionally larger or smaller.
  • Any suitable “view” of themes may be provided, such as themes sorted by number of files or meta-data attributes of the files, themes sorted by various attributes of the words in the theme name, themes sorted by relevance to an issue and so forth.
  • a particular advantage of certain embodiments is that documents which are known to be mutually similar or near duplicates are “thinned” so that they do not over-influence or skew the topic modeling process.
  • thinning need not result in retaining only a single pivot or only a single inclusive email; instead, one may, if appropriate, reduce the influence of repeated or highly related materials without eliminating the repetition entirely.
  • Topic modeling step e.g. step v of FIG. 2, step 3 of FIG. 2, step 70 of FIG. 3:
  • a topic model is a computational functionality analyzing a set of documents and yielding “topics” that occur in the set of documents typically including (a) what the topics are and (b) what each document's balance of topics is.
  • Topic models may analyze large volumes of unlabeled text and each “topic” may consist of a cluster of words that occur together frequently.
  • a topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic. Standard statistical techniques can be used to invert this process, inferring the set of topics that were responsible for generating a collection of documents.
  • Topic modeling includes any or all of the above, as well as any computerized functionality which inputs text/s and uses a processor to generate and output a list of semantic topics which the text/s are assumed to pertain to, wherein each “topic” comprises a list of keywords assumed to represent a semantic concept.
  • Topic modeling includes but is not limited to any and all of: the Topic modeling functionality described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998; Probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999; Latent Dirichlet allocation (LDA), developed by David Blei, Andrew Ng, and Michael I. Jordan in 2002 and allowing documents to have a mixture of topics; extensions on LDA, such as but not limited to Pachinko allocation;
  • Topic modeling e.g. as published in 2002, 2003, 2004; Hofmann Topic modeling e.g. as published in 1999, 2001; topic modeling using the synchronic approach; topic modeling using the diachronic approach; Topic modeling functionality which attempts to fit appropriate model parameters to the data corpus using heuristic/s for maximum likelihood fit, topic modeling functionality with provable guarantees; topic modeling functionality which uses singular value decomposition (SVD), topic modeling functionality which uses the method of moments, topic modeling functionality which uses an algorithm based upon non-negative matrix factorization (NMF); and topic modeling functionality which allows correlations among topics.
  • Topic modeling implementations may for example employ Mallet (software project), Stanford Topic Modeling Toolkit, or GenSim—Topic Modeling for Humans.
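By way of illustration only, the generative story described above (choose a topic for each word slot according to the document's topic distribution, then draw a word from that topic) can be sketched as follows; the topic names, vocabulary, and probabilities are toy values, not part of the disclosure:

```python
import random

def generate_document(topics, topic_dist, n_words, seed=0):
    """Generate a document per the topic-model story: for each word slot,
    pick a topic according to the document's topic distribution, then
    draw a word from that topic's distribution over the vocabulary."""
    rng = random.Random(seed)
    topic_names = list(topics)
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names,
                            weights=[topic_dist[t] for t in topic_names])[0]
        vocab = list(topics[topic])
        words.append(rng.choices(vocab,
                                 weights=[topics[topic][w] for w in vocab])[0])
    return words

# toy model: two topics, each a distribution over a tiny vocabulary
topics = {
    "CAT_related": {"milk": 0.4, "meow": 0.3, "kitten": 0.3},
    "DOG_related": {"puppy": 0.4, "bark": 0.3, "bone": 0.3},
}
doc = generate_document(topics, {"CAT_related": 0.9, "DOG_related": 0.1}, 20)
```

Topic-modeling algorithms such as LDA invert this process, inferring `topics` and each document's `topic_dist` from a collection of observed documents.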
  • a system for computerized derivation of leads from a huge body of data comprising:
  • an electronic repository including a multiplicity of accesses to a respective multiplicity of electronic documents and metadata including metadata parameters having metadata values characterizing each of the multiplicity of electronic documents;
  • a relevance rater using a processor to run a first computer algorithm on the multiplicity of electronic documents which yields a relevance score which rates relevance of each of the multiplicity of electronic documents to an issue
  • a metadata-based relevant-irrelevant document discriminator using a processor to rapidly run a second computer algorithm on at least some of the metadata which yields leads, each lead comprising at least one metadata value for at least one metadata parameter, which value correlates with relevance of the electronic documents to the issue.
  • the application is operative to find outliers of a given metadata and relevancy score (i.e. relevant or not relevant).
  • the system can identify themes with high relevancy score based on the given application.
  • the above system, without theme-exploring, may compute the outliers for a given metadata, where each document appears once in each metadata. In the theme-exploring setting, for a given set of themes, the same document might fall into several of the metadata.
  • step i. Input: a set of electronic documents.
  • the documents could be in: Text format, Native files (PDF, Word, PPT, etc.), ZIP files, PST, Lotus notes, MSG, etc.
  • Step ii Extract text from the data collection. Text extraction can be done by third-party software such as: Oracle Outside In, ISYS, dtSearch, iFilter, etc.
  • Step iii Compute Near-duplicate (ND) on the dataset.
  • Step iiia DuplicateSubsetID: all documents having the same DuplicateSubsetID have identical text.
  • Step iiib EquiSetID: all documents having the same EquiSetID are similar (for each document x in the set there is another document y in the set, such that the similarity between the two is greater than some threshold).
  • Step iiic Pivot: 1 if the document is a representative of the set (and 0 otherwise). Typically, for each EquiSet only one document is selected as Pivot.
  • the pivot document can be selected by a policy (for example maximum number of words, minimum number of words, median number of words, minimum docid, etc.). When using theme networking (TN) it is recommended to use the maximum-number-of-words pivot policy, as it is desirable for the largest documents to be in the model.
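A minimal sketch of pivot selection under a configurable policy; the function and data names are hypothetical, and the only requirement taken from the text is that exactly one document per EquiSet is flagged as pivot:

```python
def select_pivots(equisets, policy="max_words"):
    """Mark one representative (pivot) document per EquiSet.
    equisets maps an EquiSetID to a list of (doc_id, word_count) pairs;
    returns {doc_id: 1 if pivot else 0}."""
    score = {"max_words": lambda wc: wc,    # recommended for theme networking
             "min_words": lambda wc: -wc}[policy]
    flags = {}
    for members in equisets.values():
        pivot_id = max(members, key=lambda m: score(m[1]))[0]
        for doc_id, _ in members:
            flags[doc_id] = 1 if doc_id == pivot_id else 0
    return flags

# toy EquiSets: "b" has the most words in E1, "c" is alone in E2
sets = {"E1": [("a", 120), ("b", 300)], "E2": [("c", 50)]}
```

Other policies (median number of words, minimum docid) would slot in as additional entries in the `score` table.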
  • Step iv. Compute Email threads (ET) on the dataset.
  • the following teachings may be used: WO 2009/004324, entitled “A Method for Organizing Large Numbers of Documents” and/or suitable functionalities in commercially available e-discovery systems such as those of Equivio.
  • the output of this phase is a collection of trees, and all leaves of the trees are marked as inclusive. Note that family information is accepted (to group e-mails with their attachments).
  • Step v. Run a topic modeling algorithm (such as LDA) on a subset of the dataset, including feature extraction. Resulting topics are defined as themes.
  • the subset includes the following documents:
  • the data collection includes fewer files (usually about 50% of the total size); and the data does not include similar documents, so if a document appears many times in the original data collection it will have the same weight as if it appeared once.
  • the first step in the topic modeling algorithm is to extract features from each document.
  • a method suitable for the Feature extraction of step v may include obtaining features as follows:
  • a topic modeling algorithm uses features to create the model for the topic-modeling step v above.
  • the features are words; to generate a list of words from each document one may remove the following:
      • Words with length less than 3
      • Words with length greater than 20
      • Words that do not start with an alpha character
      • Words that are stop words
      • Words that appear more than 0.2 times the number of words in the document
      • Words that appear in fewer than 0.01 times the number of documents
      • Words that appear in more than 0.2 times the number of documents
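The word filters above can be sketched as follows; the 0.2 and 0.01 thresholds come from the text, while the function and argument names are illustrative:

```python
def filter_words(doc_words, doc_freq, n_docs, stop_words):
    """Apply the word filters listed above to one document's words.
    doc_words: tokens of the document; doc_freq: {word: number of
    documents in the collection containing that word}."""
    n = len(doc_words)
    counts = {}
    for w in doc_words:
        counts[w] = counts.get(w, 0) + 1
    kept = []
    for w in doc_words:
        if len(w) < 3 or len(w) > 20:
            continue                      # too short / too long
        if not w[0].isalpha():
            continue                      # does not start with an alpha character
        if w in stop_words:
            continue                      # stop word
        if counts[w] > 0.2 * n:
            continue                      # too frequent within this document
        df = doc_freq.get(w, 0)
        if df < 0.01 * n_docs or df > 0.2 * n_docs:
            continue                      # document frequency out of bounds
        kept.append(w)
    return kept
```

For example, with a 1000-document collection, a word appearing in 500 documents exceeds the 0.2 document-frequency ceiling and is dropped, as are stop words and very short tokens.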
  • Step viii Theme names.
  • the output of step v includes an assignment of documents to the themes, and an assignment of words (features) to themes.
  • Each feature x has some probability P_xy of being in theme y.
  • Steps a-h Early Case Assessment (FIG. 2, including some or all of the following steps a-h):
      • a. Select 100,000 documents at random
      • b. Run near-duplicates (ND)
      • c. Run Email threads (ET)
      • d. Select pivot and inclusive documents
      • e. Run topic modeling using the above feature selection
  • the input of the topic modeling is a set of documents.
  • the first phase of the topic modeling is to construct a set of features for each document.
  • the feature getting method described above may be used to construct the set of features.
  • f. Run the model on all other documents (optional).
  • Explore the data by browsing themes: one may open a list of documents belonging to a certain theme; from a document one may see all themes connected to that document, and move on to other themes.
  • the list of documents might be filtered by a condition set by the user, for example filtering all documents by date, relevancy, file size, etc.
  • the above procedure assists users in early case assessment when the data is known and one would like to know what is in the data, and assess the contents of the data collection. In early case assessment one may randomly sample the dataset to get results faster.
  • Post Case Assessment. This process uses some or all of steps i-v above, but in this setting the entire dataset is not used; rather, only the documents that are relevant to the case are used. If near-duplicates (ND) and Email threads (ET) have already been run, there is no need to re-run them.
  • 1st pass review is a quick review of the documents that can be handled manually or by automatic predictive coding software; the user wishes to review the results and get an idea of the themes of the documents that passed that review. This phase is essential because the number of such documents might be extremely large. Also, there are cases in which, for some sub-issues, there are only a few documents.
  • the above building block can generate a procedure for such cases.
  • g. Only documents that passed the 1st review phase are taken, and themes are calculated for them.
  • each resulting topic is defined as a theme, and for each theme the list of documents related to that theme is displayed.
  • the user has an option to select a meta-data attribute (for example relevance of the document to an issue, custodian, date range, file type, etc.) and the system will display, for each theme, the percentage of that meta-data in the theme. Such a presentation assists the user in evaluating the theme.
  • An LDA model might have themes that can be classified as CAT_related and DOG_related.
  • a theme has probabilities of generating various words, such as milk, meow, and kitten, which can be classified and interpreted by the viewer as “CAT_related”.
  • the word cat itself will have high probability given this theme.
  • the DOG_related theme likewise has probabilities of generating each word: puppy, bark, and bone might have high probability. Words without special relevance, such as "the" (a function word), will have roughly even probability between classes (or can be placed into a separate category).
  • a theme is not strongly defined, neither semantically nor epistemologically. It is identified on the basis of supervised labeling and (manual) pruning on the basis of their likelihood of co-occurrence.
  • a lexical word may occur in several themes with a different probability, however, with a different typical set of neighboring words in each theme.
  • Each document is assumed to be characterized by a particular set of themes. This is akin to the standard bag of words model assumption, and makes the individual words exchangeable.
  • N documents are selected to create the model, and then the model is applied on the remaining documents.
  • theme_3 is better than theme_4.
  • the number of words can be reduced by, for example, taking only those words in each theme whose score is greater than the maximum word score in that theme divided by some constant.
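This word-reduction rule can be sketched directly; the word scores and the value of the constant are illustrative:

```python
def prune_theme_words(word_scores, constant=10.0):
    """Keep only words whose score exceeds the theme's maximum word
    score divided by a constant."""
    cutoff = max(word_scores.values()) / constant
    return {w: s for w, s in word_scores.items() if s > cutoff}

# toy theme: "the" scores far below the cutoff and is pruned
theme = {"gas": 0.30, "pipeline": 0.12, "the": 0.002}
```

A larger constant keeps more words per theme; a smaller constant keeps only the strongest words.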
  • a particular advantage of certain embodiments of the invention is that collections of electronic documents are hereby analyzed semantically by a processor on a scale that would be impossible manually.
  • Output of topic modeling may include the n most frequent words from the m most frequent topics found in an individual document.
  • the methods shown and described herein are particularly useful in processing or analyzing or sorting or searching bodies of knowledge including hundreds, thousands, tens of thousands, or hundreds of thousands of electronic documents or other computerized information repositories, some or many of which are themselves at least tens or hundreds or even thousands of pages long. This is because practically speaking, such large bodies of knowledge can only be processed, analyzed, sorted, or searched using computerized technology.
  • software components of the present invention including programs and data may, if desired, be implemented in ROM (read only memory) form including CD-ROMs, EPROMs and EEPROMs, or may be stored in any other suitable typically non-transitory computer-readable medium such as but not limited to disks of various kinds, cards of various kinds and RAMs.
  • Components described herein as software may, alternatively, be implemented wholly or partly in hardware and/or firmware, if desired, using conventional techniques, and vice-versa. Each module or component may be centralized in a single location or distributed over several locations.
  • Any computer-readable or machine-readable media described herein is intended to include non-transitory computer- or machine-readable media.
  • Any computations or other forms of analysis described herein may be performed by a suitable computerized method. Any step described herein may be computer-implemented.
  • the invention shown and described herein may include (a) using a computerized method to identify a solution to any of the problems or for any of the objectives described herein, the solution optionally includes at least one of a decision, an action, a product, a service or any other information described herein that impacts, in a positive manner, a problem or objectives described herein; and (b) outputting the solution.
  • the system may, if desired, be implemented as a web-based system employing software, computers, routers and telecommunication equipment as appropriate.
  • a server may store certain applications, for download to clients, which are executed at the client side, the server side serving only as a storehouse.
  • Some or all functionalities e.g. software functionalities shown and described herein may be deployed in a cloud environment.
  • Clients e.g. mobile communication devices such as smartphones may be operatively associated with, but external to the cloud.
  • the scope of the present invention is not limited to structures and functions specifically described herein and is also intended to include devices which have the capacity to yield a structure, or perform a function, described herein, such that even though users of the device may not use the capacity, they are, if they so desire, able to modify the device to obtain the structure or function.
  • a system embodiment is intended to include a corresponding process embodiment.
  • each system embodiment is intended to include a server-centered “view” or client centered “view”, or “view” from any other node of the system, of the entire functionality of the system, computer-readable medium, apparatus, including only those functionalities performed at that server or client or node.

Abstract

System and method for computerized identification of themes in a large data set, the system comprising reducing the number of data set members in a large data set, using at least one computerized data set member pruning technique other than random selection; and using a computerized theme identification technique for identifying a plurality of themes in the reduced data set.

Description

  • Priority is claimed from U.S. Provisional Patent Application No. 61/755,242, entitled “Computerized systems and methods for use of themes in e-discovery” and filed Jan. 22, 2013, the entire contents of which are hereby incorporated herein by reference.
  • FIELD OF THIS DISCLOSURE
  • The present invention relates generally to computerized processing of electronic documents and more particularly to computerized semantic processing of electronic documents.
  • BACKGROUND FOR THIS DISCLOSURE
  • Wikipedia on “Data_deduplication” states that “In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are intelligent (data) compression and single-instance (data) storage . . . . For example a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy for deduplication ratio of roughly 100 to 1.”
  • Also according to Wikipedia, the term deduplication may refer to Data deduplication, as above, or to Record linkage, in databases, i.e. finding entries that refer to the same entity in two or more files. ‘DeDuping’ may involve removing duplicates in Customer and Address records in a Database or Spreadsheet.
  • The importance of topic modeling for browsing is known, e.g. at the following http-www-linked publication: cs.princeton.edu/˜blei/topicmodeling.html.
  • It is also known that “Clustering can be used to assist browsing. Browsing tools complement search tools” e.g. as described at the following http-linked publication: pages.cs.wisc.edu/˜pradheep/Clust-LDA.pdf.
  • Other state of the art related technologies are described inter alia in:
    • 1. Papadimitriou, Christos; Raghavan, Prabhakar; Tamaki, Hisao; Vempala, Santosh (1998). “Latent Semantic Indexing: A probabilistic analysis” (Postscript). Proceedings of ACM PODS. http://www.cs.berkeley.edu/˜christos/ir.ps.
    • 2. Hofmann, Thomas (1999). “Probabilistic Latent Semantic Indexing” (PDF). Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval. http://www.cs.brown.edu/˜th/papers/Hofmann-SIGIR99.pdf.
    • 3. Blei, David M.; Ng, Andrew Y.; Jordan, Michael I.; Lafferty, John (January 2003). “Latent Dirichlet allocation”. Journal of Machine Learning Research 3: 993-1022. doi:10.1162/jmlr.2003.3.4-5.993. http://jmlr.csail.mit.edu/papers/v3/blei03a.html.
    • 4. Blei, David M. (April 2012). “Introduction to Probabilistic Topic Models” (PDF). Comm. ACM 55 (4): 77-84. doi:10.1145/2133806.2133826. http://www.cs.princeton.edu/˜blei/papers/Blei2011.pdf
    • 5. Sanjeev Arora; Rong Ge; Ankur Moitra (April 2012). “Learning Topic Models-Going beyond SVD”. arXiv:1204.1956.
    • 6. Girolami, Mark; Kaban, A. (2003). “On an Equivalence between PLSI and LDA”. Proceedings of SIGIR 2003. New York: Association for Computing Machinery. ISBN 1-58113-646-3.
    • 7. Griffiths, Thomas L.; Steyvers, Mark (Apr. 6, 2004). “Finding scientific topics”. Proceedings of the National Academy of Sciences 101 (Suppl. 1): 5228-5235. doi:10.1073/pnas.0307752101. PMC 387300. PMID 14872004.
    • 8. Minka, Thomas; Lafferty, John (2002). “Expectation-propagation for the generative aspect model”. Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence. San Francisco, Calif.: Morgan Kaufmann ISBN 1-55860-897-4.
    • 9. Blei, David M.; Lafferty, John D. (2006). “Correlated topic models”. Advances in Neural Information Processing Systems 18.
    • 10. Blei, David M.; Jordan, Michael I.; Griffiths, Thomas L.; Tenenbaum; Joshua B (2004). “Hierarchical Topic Models and the Nested Chinese Restaurant Process”. Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference. MIT Press. ISBN 0-262-20152-6.
    • 11. Quercia, Daniele; Harry Askham, Jon Crowcroft (2012). “TweetLDA: Supervised Topic Classification and Link Prediction in Twitter”. ACM WebSci.
    • 12. Li, Fei-Fei; Perona, Pietro. “A Bayesian Hierarchical Model for Learning Natural Scene Categories”. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) 2: 524-531.
    • 13. Wang, Xiaogang; Grimson, Eric (2007). “Spatial Latent Dirichlet Allocation”. Proceedings of Neural Information Processing Systems Conference (NIPS).
      Topic modeling (Wikipedia): In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. An early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998. [1] Another one, called Probabilistic latent semantic indexing (PLSI), was created by Thomas Hofmann in 1999.[2] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael Jordan in 2002, allowing documents to have a mixture of topics.[3] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.
      Topics in LDA (Wikipedia): In LDA, each document may be viewed as a mixture of various topics. This is similar to probabilistic latent semantic analysis (pLSA), except that in LDA the topic distribution is assumed to have a Dirichlet prior. In practice, this results in more reasonable mixtures of topics in a document. It has been noted, however, that the pLSA model is equivalent to the LDA model under a uniform Dirichlet prior distribution.[12]
  • The disclosures of all publications and patent documents mentioned in the specification, and of the publications and patent documents cited therein directly or indirectly, are hereby incorporated by reference. Materiality of such publications and patent documents to patentability is not conceded.
  • SUMMARY OF CERTAIN EMBODIMENTS
  • The following terms may be construed either in accordance with any definition thereof appearing in the prior art literature or in accordance with the specification, or as follows:
  • Document score: the significance of an individual theme in a particular document. For example, if a topic modeling process defines a topic as a distribution over a fixed vocabulary and assumes that each document includes various topics each with different proportions determined by a per-document distribution over topics, then a document's “score” for a particular theme may be that theme's level of probability under the document's distribution over a universe of topics; this “score” is typically generated in the course of performing conventional topic-modeling processes. In other words, each topic x has some probability θ_yx of being in document y [Blei, D., Ng, A., Jordan, M. (2003) Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993-1022].
  • Duplicate types:
      • Exact duplicate: Two documents which have entirely the same bits in the same order.
      • Text exact duplicate: Two documents whose extracted texts are exact duplicates.
      • Near-duplicate: Two documents are near-duplicate if the resemblance between them is above a threshold. For example, conventional w-shingling techniques may be employed.
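One conventional w-shingling sketch of this near-duplicate test follows; the window size and threshold are illustrative choices, not mandated by the definition above:

```python
def shingles(text, w=3):
    """Return the set of w-word shingles of a text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

def resemblance(a, b, w=3):
    """Jaccard similarity between the two texts' shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

def near_duplicate(a, b, threshold=0.8, w=3):
    """Two documents are near-duplicates if their resemblance is above a threshold."""
    return resemblance(a, b, w) >= threshold
```

In practice, large collections would hash the shingles and sample them (e.g. min-hashing) rather than compare full shingle sets pairwise.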
  • File: electronic document
  • Inclusive: an e-mail whose subject and/or body is not contained in any other e-mail in a given set of emails. This definition implies that if an email is not an inclusive, its subject and/or body is contained in one of the “inclusive” emails defined for that set of emails. Often, an inclusive culminates and includes an entire email thread.
  • Overlap: Themes are considered related, or similar, if they overlap. “Overlap” may be computed by noting that each document can be assigned to more than one theme. Suppose X documents are in theme T1, and Y documents in T2; the overlap between T1 and T2 is (size of intersection of X and Y)/(size of X). The definition of overlap need not be symmetrical; the number of documents that belong to both themes may be divided by the size of the theme of interest. For example, if one theme (say, “parrots”) is included in another theme (say, “birds”), all documents in the small theme are typically also included in the second theme. Then the overlap from the viewpoint of the small theme is 100% and from the viewpoint of the larger theme is less than 100%.
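The overlap computation just defined, sketched over sets of document ids (names illustrative), including the asymmetry in the parrots/birds example:

```python
def overlap(theme_docs, t1, t2):
    """Overlap of theme t1 with theme t2: the number of documents
    belonging to both themes, divided by the size of t1.
    Note the asymmetry: overlap(d, t1, t2) != overlap(d, t2, t1) in general."""
    x, y = theme_docs[t1], theme_docs[t2]
    return len(x & y) / len(x)

# toy themes: "parrots" is wholly contained in "birds"
themes = {"parrots": {1, 2}, "birds": {1, 2, 3, 4}}
```

Here the overlap from the viewpoint of the small theme is 100%, and from the viewpoint of the larger theme only 50%.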
  • Pivot: a document which is representative, in some sense, of a set of near-duplicates. Any suitable application-specific policy may be employed to define which document is representative, e.g. the document in the set which has the highest/lowest/median number of words.
  • Equivio Zoom's Relevance functionality is a commercially available software tool that uses “supervised” machine learning, hence there is typically a human expert that trains the system. In the themes functionality described herein there is typically no supervision or, in some versions, topics may be “semi-supervised”. Themes functionality as described herein is useful inter alia in training a system by allowing a human trainer to search for relevancy by browsing through a large collection of electronic documents e.g. as described herein. Since in many cases finding documents relevant to a specific issue is no simple task, the themes functionality described herein may facilitate the process of finding such documents.
  • pruning: reducing the size of (number and/or size of members in) a data set by removing some members of the set, e.g. to achieve a predetermined number/total size of members in the set, prioritizing removal of set members known to be superfluous (e.g. duplicates) over removal of set members not known to be superfluous.
  • Similarity/relatedness:
      • similar/related documents/files: Various operational definitions (metrics) are possible for similar documents e.g.
  • (a.) documents A, B respectively belonging to theme set A and theme set B where many themes are common to theme sets A and B, or, more generally, that the distributions of the two documents over all themes are close; or
  • (b.) The text of the documents are similar (near-duplicate).
      • similar/related themes: various operational definitions (metrics) are possible for “similarity” of themes, such as but not limited to themes which have many/few documents in common, or themes whose names have many/few words in common, or themes which “overlap” to a considerable degree (e.g. over a threshold).
  • Theme: A set of documents which relate to a single subject; each document may simultaneously relate to several subjects hence be included in several themes. For example, some computerized topic modeling methods yield models in which certain documents have a mixture of topics.
  • topic: Typically, “topic” as used herein refers to output by conventional topic modeling whereas “theme” typically refers to that output as further processed in accordance with embodiments of the present invention.
  • Unique documents: Documents in a collection that do not have any other documents which are near-duplicates.
  • Word score: the significance of an individual word to an individual theme. For example, if a topic modeling process defines a topic as a distribution over a fixed vocabulary and assumes that each document includes various topics each with different proportions determined by a per-document distribution over topics, then each word's “score” may be its level of probability given the distribution over the fixed vocabulary defined by the topic; this “score” is typically generated in the course of performing conventional topic-modeling processes. In other words, each word x has some probability β_yx of being in theme y [Blei, D., Ng, A., Jordan, M. (2003) Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993-1022].
  • Certain embodiments of the present invention seek to provide computerized systems and methods for use of themes in e-discovery and other semantic tasks.
  • Certain embodiments of the present invention seek to provide methods for computerized identification of themes in a large data set.
  • Certain embodiments of the present invention seek to provide methods for use of multi-topic modeling in e-discovery and other semantic tasks.
  • Embodiments include:
  • Embodiment 1
  • A method for computerized identification of themes in a large data set, the method comprising:
  • reducing the number of data set members in a large data set, using at least one computerized data set member pruning technique other than random selection; and
  • using a computerized theme identification technique for identifying a plurality of themes in the reduced data set.
  • Embodiment 2
  • A method according to Embodiment 1 wherein the computerized data set member pruning technique comprises thinning out at least one document which passes a document similarity criterion relative to at least one other document not being thinned out, thereby to combat skewing as a result of over-influence of similar, hence over-represented, documents upon the theme identification technique.
  • Embodiment 3
  • A method according to Embodiment 2 wherein the thinning out at least one document which passes a document similarity criterion comprises replacing a plurality of emails forming an email thread, with at least one inclusive email, thereby to thin out emails which are included in the inclusive email hence are deemed to pass the document similarity criterion with regard to the inclusive.
  • Embodiment 4
  • A method according to Embodiment 2 wherein the thinning out at least one document which passes a document similarity criterion comprises identifying and discarding near-duplicates thereby to thin out at least one document which is deemed to pass the document similarity criterion with regard to a set of near-duplicates of the document, at least one of which is not being thinned out.
  • Embodiment 5
  • A method according to Embodiment 1 wherein the computerized theme identification technique comprises topic modeling.
  • Embodiment 6
  • A method according to Embodiment 5 wherein the topic modeling allows documents to have a plurality of topics.
  • According to Wikipedia, “An early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998. [1] Another one, called Probabilistic latent semantic indexing (PLSI), was created by Thomas Hofmann in 1999.[2] Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei, Andrew Ng, and Michael I. Jordan in 2002, allowing documents to have a mixture of topics.[3] Other topic models are generally extensions on LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics.”
  • Embodiment 7
  • A browsing system operative in conjunction with a stored representation of a multiplicity of electronic documents and their distribution over a plurality of themes, the system comprising:
  • theme-to-document flitting apparatus for retrieving and presenting to a user, documents whose document score for at least one user-selected theme is high; and
  • document-level browsing apparatus for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes.
  • Embodiment 8
  • A system according to Embodiment 7 and also comprising:
  • theme-to-word flitting apparatus for retrieving and presenting to a user, words whose word score for at least one user-selected theme is high;
  • word-level browsing apparatus for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes,
  • thereby to provide 3-tier browsing apparatus facilitating browsing at word, document and topic levels responsive to user-initiated flitting between the levels.
  • Embodiment 9
  • A method according to Embodiment 1 and also comprising:
  • facilitating theme-to-word flitting by retrieving and presenting to a user, words whose word score for at least one user-selected theme is high.
  • Embodiment 10
  • A method according to Embodiment 1 and also comprising:
  • facilitating theme-to-document flitting for retrieving and presenting to a user, documents whose document score for at least one user-selected theme is high.
  • Embodiment 11
  • A method according to Embodiment 1 and also comprising:
  • facilitating document-level browsing for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes.
  • Embodiment 12
  • A method according to Embodiment 1 and also comprising:
  • facilitating word-level browsing for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes.
  • Embodiment 13
  • A method according to Embodiment 1 wherein the number of data set members in the large data set is further reduced subsequent to the using step and prior to a manual review process.
  • Embodiment 14
  • A method according to Embodiment 1 wherein the reducing is effected using:
  • random selection; and
  • at least one computerized data set member pruning technique other than random selection.
  • Embodiment 15
  • A method according to Embodiment 14 wherein the random selection is performed after the computerized data set member pruning technique.
  • Embodiment 16
  • A method according to Embodiment 14 wherein the random selection is performed before the computerized data set member pruning technique.
  • Embodiment 17
  • A method according to Embodiment 5 wherein the topic modeling which allows documents to have a plurality of topics comprises one of the following computerized techniques: Latent Dirichlet allocation (LDA), PLSI, and Pachinko allocation.
  • Embodiment 18
  • A method according to Embodiment 3 wherein the thinning out at least one document which passes a document similarity criterion comprises replacing a plurality of emails forming an email thread, with a single inclusive email.
  • Embodiment 19
  • A method according to Embodiment 4 wherein the identifying and discarding near-duplicates is effected using Equivio Zoom near-duplicate functionality.
  • The present invention also typically includes at least the following embodiments:
  • Embodiment a1
  • An e-discovery method comprising:
  • Step 1: Input a set of electronic documents.
    Step 2: Extract text from the data collection.
    Step 3: Compute Near-duplicates (ND) on the dataset.
    Step 4: Compute Email threads (ET) on the dataset.
    Step 5: Run topic modeling on a subset of the dataset, including data manipulation.
  • Embodiment a2
  • A method according to Embodiment a1 wherein, in the output of step 3, all documents having the same DuplicateSubsetID have identical text.
  • Embodiment a3
  • A method according to Embodiment a1 wherein the output of step 3 includes all documents x in the set for which there is another document y in the set, such that the similarity between the two is greater than some threshold.
  • Embodiment a4
  • A method according to Embodiment a1 wherein the output of step 3 includes at least one pivot document selected by a policy such as maximum words in the document.
  • Embodiment a5
  • A method according to Embodiment a1 wherein the subset includes inclusives of Email threads (ET).
  • Embodiment a6
  • A method according to Embodiment a1 wherein the subset includes Pivots from documents and attachments, but not emails.
  • Embodiment a7
  • A method according to Embodiment a1 wherein the data manipulation includes, if the document is an e-mail, removing all e-mail headers in the document, but keeping the subject line and the body of the e-mail.
  • Embodiment a8
  • A method according to Embodiment a7 and also comprising replicating the subject line so as to give added weight to the subject words.
  • Embodiment a9
  • A method according to Embodiment a1 wherein the data manipulation includes Tokenization of the text using separators.
  • Embodiment a10
  • A method according to Embodiment a1 wherein the data manipulation includes ignoring the following features:
  • Words with length less than (parameter)
  • Words with length greater than (parameter)
  • Words that do not start with an alpha character.
  • (Optionally)—words that contain digits
  • (Optionally)—words that contain non-AlphaNumeric characters, optionally excluding some subset characters such as ‘_’.
  • Words that are stop words.
  • Words that appear more than (parameter) × the number of words in the document.
  • Words that appear in fewer than (parameter) × the number of documents.
  • Words that appear in more than (parameter) × the number of documents.
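  • The per-word filters of Embodiments a9-a10 can be sketched as follows; the separator set, length bounds, and stop list are illustrative stand-ins for the unspecified (parameter) values, and the corpus-level frequency filters would form a second pass over document counts:

```python
import re

STOP_WORDS = {"the", "and", "of", "to", "a"}  # illustrative stop list

def keep_token(tok, min_len=3, max_len=20,
               drop_digits=True, drop_non_alnum=True, allowed="_"):
    """Apply the per-word filters of Embodiment a10 (thresholds illustrative)."""
    if not (min_len <= len(tok) <= max_len):      # length bounds
        return False
    if not tok[0].isalpha():                      # must start with an alpha character
        return False
    if drop_digits and any(c.isdigit() for c in tok):
        return False
    if drop_non_alnum and any(not (c.isalnum() or c in allowed) for c in tok):
        return False                              # non-alphanumeric, '_' excluded
    if tok.lower() in STOP_WORDS:
        return False
    return True

def tokenize(text, separators=r"[\s,.;:!?()\[\]{}\"']+"):
    """Tokenize using separators (Embodiment a9), then filter (Embodiment a10)."""
    return [t for t in re.split(separators, text) if t and keep_token(t)]

print(tokenize("The merger_2014 closed; see Q3 report and NDA-v2 draft"))
# → ['closed', 'see', 'report', 'draft']
```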
  • Embodiment a11
  • A method according to Embodiment a1 wherein the output of step 5 includes an assignment of documents to themes and an assignment of words (features) to themes, wherein each feature x has some probability P_xy of being in theme y, and wherein the P matrix is used to construct names for at least one theme.
  • Embodiment a12
  • A method for computerized Early Case Assessment comprising:
  • a. Select at random a set of documents;
    b. Run near-duplicates (ND);
    c. Run Email threads (ET);
    d. Select pivot and inclusive documents;
    e. Run topic modeling;
    g. Generate theme names; and
    h. Explore the data by browsing themes.
  • Embodiment a13
  • A method according to Embodiment a1 wherein, for Post Case Assessment, rather than using an entire dataset, only the documents that are relevant to the case are used.
  • Embodiment a14
  • A method according to Embodiment a1 and also comprising displaying for each theme the list of documents that are related to that theme.
  • Embodiment a15
  • A method according to Embodiment a14 wherein the user has an option to select a meta-data field, and the system will display, for each theme, the percentage of that meta-data in that theme.
  • Also provided, excluding signals, is a computer program comprising computer program code means for performing any of the methods shown and described herein when the program is run on a computer; and a computer program product, comprising a typically non-transitory computer-usable or -readable medium e.g. non-transitory computer-usable or -readable storage medium, typically tangible, having a computer readable program code embodied therein, the computer readable program code adapted to be executed to implement any or all of the methods shown and described herein. It is appreciated that any or all of the computational steps shown and described herein may be computer-implemented. The operations in accordance with the teachings herein may be performed by a computer specially constructed for the desired purposes or by a general purpose computer specially configured for the desired purpose by a computer program stored in a typically non-transitory computer readable storage medium. The term “non-transitory” is used herein to exclude transitory, propagating signals or waves, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.
  • Any suitable processor, display and input means may be used to process, display e.g. on a computer screen or other computer output device, store, and accept information such as information used by or generated by any of the methods and apparatus shown and described herein; the above processor, display and input means including computer programs, in accordance with some or all of the embodiments of the present invention. Any or all functionalities of the invention shown and described herein, such as but not limited to steps of flowcharts, may be performed by a conventional personal computer processor, workstation or other programmable device or computer or electronic computing device or processor, either general-purpose or specifically constructed, used for processing; a computer display screen and/or printer and/or speaker for displaying; machine-readable memory such as optical disks, CDROMs, DVDs, BluRays, magnetic-optical discs or other discs; RAMs, ROMs, EPROMs, EEPROMs, magnetic or optical or other cards, for storing, and keyboard or mouse for accepting. The term “process” as used above is intended to include any type of computation or manipulation or transformation of data represented as physical, e.g. electronic, phenomena which may occur or reside e.g. within registers and/or memories of a computer or processor. The term processor includes a single processing unit or a plurality of distributed or remote such units.
  • The above devices may communicate via any conventional wired or wireless digital communication means, e.g. via a wired or cellular telephone network or a computer network such as the Internet.
  • The apparatus of the present invention may include, according to certain embodiments of the invention, machine readable memory containing or otherwise storing a program of instructions which, when executed by the machine, implements some or all of the apparatus, methods, features and functionalities of the invention shown and described herein. Alternatively or in addition, the apparatus of the present invention may include, according to certain embodiments of the invention, a program as above which may be written in any conventional programming language, and optionally a machine for executing the program such as but not limited to a general purpose computer which may optionally be configured or activated in accordance with the teachings of the present invention. Any of the teachings incorporated herein may wherever suitable operate on signals representative of physical objects or substances.
  • The embodiments referred to above, and other embodiments, are described in detail in the next section.
  • Any trademark occurring in the text or drawings is the property of its owner and occurs herein merely to explain or illustrate one example of how an embodiment of the invention may be implemented.
  • Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions, utilizing terms such as, “processing”, “computing”, “estimating”, “selecting”, “ranking”, “grading”, “calculating”, “determining”, “generating”, “reassessing”, “classifying”, “generating”, “producing”, “stereo-matching”, “registering”, “detecting”, “associating”, “superimposing”, “obtaining” or the like, refer to the action and/or processes of a computer or computing system, or processor or similar electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities within the computing system's registers and/or memories, into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices. The term “computer” should be broadly construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, personal computers, servers, computing system, communication devices, processors (e.g. digital signal processor (DSP), microcontrollers, field programmable gate array (FPGA), application specific integrated circuit (ASIC), etc.) and other electronic computing devices.
  • The present invention may be described, merely for clarity, in terms of terminology specific to particular programming languages, operating systems, browsers, system versions, individual products, and the like. It will be appreciated that this terminology is intended to convey general principles of operation clearly and briefly, by way of example, and is not intended to limit the scope of the invention to any particular programming language, operating system, browser, system version, or individual product.
  • Elements separately listed herein need not be distinct components and alternatively may be the same structure.
  • Any suitable input device, such as but not limited to a sensor, may be used to generate or otherwise provide information received by the apparatus and methods shown and described herein. Any suitable output device or display may be used to display or output information generated by the apparatus and methods shown and described herein. Any suitable processor may be employed to compute or generate information as described herein e.g. by providing one or more modules in the processor to perform functionalities described herein. Any suitable computerized data storage e.g. computer memory may be used to store information received by or generated by the systems shown and described herein. Functionalities shown and described herein may be divided between a server computer and a plurality of client computers. These or any other computerized components shown and described herein may communicate between themselves via a suitable computer network.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Certain embodiments of the present invention are illustrated in the following drawings:
  • FIG. 1 is a simplified flowchart illustration of a method for use of themes in e-discovery, according to certain embodiments.
  • FIG. 2 is a simplified flowchart illustration of a method for early case assessment, according to certain embodiments.
  • FIGS. 3a-3b, taken together, are a simplified flowchart illustration of a method for associating topics with documents, according to certain embodiments.
  • FIG. 4 is a simplified flowchart illustration of a “navigating” or browsing method for generating suitable displays to facilitate computer-aided theme exploration, suitable e.g. for implementing step 100 in FIGS. 3a-3b, taken together, according to certain embodiments.
  • FIG. 5 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. The screen display facilitates theme-level browsing, according to certain embodiments.
  • FIG. 6 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, flitting from document-level to theme-level or word-level is facilitated.
  • FIG. 7 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, document-level browsing is facilitated.
  • FIG. 8 is a simplified flowchart illustration, according to certain embodiments, of a method for utilizing computerized themes functionality under these circumstances.
  • The methods of the flowchart figures each include some or all of the illustrated steps, suitably ordered e.g. as shown.
  • Computational components described and illustrated herein can be implemented in various forms, for example, as hardware circuits such as but not limited to custom VLSI circuits or gate arrays or programmable hardware devices such as but not limited to FPGAs, or as software program code stored on at least one tangible or intangible computer readable medium and executable by at least one processor, or any suitable combination thereof. A specific functional component may be formed by one particular sequence of software code, or by a plurality of such, which collectively act or behave or act as described herein with reference to the functional component in question. For example, the component may be distributed over several code sequences such as but not limited to objects, procedures, functions, routines and programs and may originate from several computer files which typically operate synergistically.
  • Data can be stored on one or more tangible or intangible computer readable media stored at one or more different locations, different network nodes or different storage devices at a single node or location.
  • It is appreciated that any computer data storage technology, including any type of storage or memory and any type of computer components and recording media that retain digital data used for computing for an interval of time, and any type of information retention technology, may be used to store the various data provided and employed herein. Suitable computer data storage or information retention apparatus may include apparatus which is primary, secondary, tertiary or off-line; which is of any type or level or amount or category of volatility, differentiation, mutability, accessibility, addressability, capacity, performance and energy use; and which is based on any suitable technologies such as semiconductor, magnetic, optical, paper and others.
  • DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS
  • FIGS. 3a-3b, taken together, are a simplified flowchart illustration of a method for associating topics with documents, according to certain embodiments. The method of FIGS. 3a-3b typically includes some or all of the following steps, suitably ordered e.g. as shown:
  • 10: Provide a collection of thousands or millions of electronic documents (D), e.g. including a mixture of one or more of:
  • non-emails
  • e-mails with attachments
  • e-mails without attachments
  • 20: Run near-duplicate identifying functionality on the collection, thereby to identify all sets of near-duplicates in the collection.
  • 30: Run email-thread identifying functionality on all emails in the collection, thereby to identify all email threads in the collection.
  • 40: Perform one, some or all of the following steps to pare down the collection of documents (D), thereby to yield a pared-down collection (Z):
  • 40a. Select one (say) document (e.g. pivot document) to represent each near-duplicate set, thereby to yield a set X1 of documents.
  • 40b. Select (only) inclusive emails to represent each email thread, thereby to yield a set X2 of inclusive emails.
  • 40c. From the set X2 of inclusive emails, select one (say) document from each near-duplicate set, thereby to yield a set X3; e.g. first select all inclusives, then take only one inclusive from each set of “similar” inclusives (e.g. sets defined as near-duplicates by Equivio Zoom near-duplicate functionality).
  • 50: If the number of documents in Z exceeds a threshold, use random selection to reduce the number of documents in Z to below the threshold.
  • 60: Select a suitable number, N, of themes to be identified.
  • 70: Perform topic modeling using the documents in Z, thereby to yield N themes.
  • 80: Apply the topic model generated in step 70, to dataset D, thereby to yield topics wherein documents may belong to more than one topic; use topics as themes
  • 90: Assign names to the themes. Each word in the set of all words in all documents has some probability to be in a theme; this probability may comprise the “word score”. Typically, the M (predetermined integer e.g. 5) top scoring words are selected to represent the theme i.e. to constitute the theme's name. According to certain embodiments, a name may comprise one or more of the words most frequently found in the documents pertaining to the topic and less frequently or infrequently found in documents not pertaining to the topic.
  • 100: Generate displays (e.g. as per FIG. 4) to facilitate computer-aided exploration of (browsing between) themes and the documents and/or words they include, where themes are represented in the displays by the theme names selected in step 90.
  • Step 40a may be performed only on non-emails, or may be performed on all documents, e-mails and non-emails alike (e.g. e-mails are considered documents).
  • Step 40b is typically performed on e-mail bodies, i.e. without their attachments.
  • Step 40c is typically performed on e-mails without attachments. After identifying inclusives, near-duplicate identification is applied to these, and typically just one or a few e-mails from each group of text-similar e-mails are selected. For example: if an email thread has several inclusives, only one of them might be selected.
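  • Steps 40a-40b can be sketched as follows, assuming near-duplicate groups and email threads have already been computed in steps 20 and 30; the grouping data, the maximum-words pivot policy, and the substring test for inclusiveness are illustrative simplifications, not the Equivio implementation:

```python
def select_pivots(near_dup_sets):
    """Step 40a sketch: choose one pivot per near-duplicate set,
    here by the maximum-words policy of Embodiment a4 (one possible policy)."""
    return [max(group, key=lambda doc: len(doc.split())) for group in near_dup_sets]

def select_inclusives(thread):
    """Step 40b sketch: keep only emails not contained in any other email of the
    thread; an email subsumed by a later reply-with-history is thinned out."""
    return [m for m in thread
            if not any(m != other and m in other for other in thread)]

# Hypothetical inputs; in practice these groupings come from the
# near-duplicate and email-thread engines of steps 20 and 30.
near_dups = [
    ["draft merger agreement", "draft merger agreement v2 with edits"],
    ["quarterly report"],
]
thread = ["Are we on for Monday?", "Yes.\n> Are we on for Monday?"]

print(select_pivots(near_dups))   # → ['draft merger agreement v2 with edits', 'quarterly report']
print(select_inclusives(thread))  # the first email is quoted in the reply, hence thinned out
```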
  • Typically, random step 50 is performed after near-duplicate and inclusive steps 20, 30 and 40, to enable a user to ascertain that further random pruning is necessary since it is possible that steps 20, 30, 40 reduce the size of the data set sufficiently without requiring any random pruning. However, alternatively or in addition, random pruning may occur before steps 20, 30, and 40.
  • Typically, random step 50 is performed only when it is desired to reduce processing time whereas for a small set of documents, e.g. less than 400 thousand documents, step 50 may be omitted. Optionally, the system computes cost (monetary or in terms of time) of topic modeling both with random selection and without. The system may for example compute the time or cost to compute a topic model on a random sample which is, say, 50%/10%/1% the size of the original data set.
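  • The naming heuristic of step 90 — select the M top-scoring words per theme — can be sketched as follows, assuming the word scores are available as a themes-by-words probability matrix (the vocabulary and score values below are invented for illustration):

```python
import numpy as np

def name_themes(topic_word, vocab, top_m=5):
    """Name each theme with its M top-scoring words (step 90 sketch).

    topic_word: array of shape (n_themes, n_words); row t holds the
    word scores (probabilities) of each word for theme t."""
    names = []
    for row in topic_word:
        top = np.argsort(row)[::-1][:top_m]      # indices of highest word scores
        names.append(" ".join(vocab[i] for i in top))
    return names

vocab = ["merger", "contract", "earnings", "revenue", "draft", "forecast"]
topic_word = np.array([
    [0.40, 0.30, 0.01, 0.02, 0.25, 0.02],        # theme 0: deal documents
    [0.02, 0.03, 0.35, 0.30, 0.02, 0.28],        # theme 1: financial reporting
])
print(name_themes(topic_word, vocab, top_m=3))
# → ['merger contract draft', 'earnings revenue forecast']
```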
  • FIG. 4 is a “Navigating” or browsing method for generating suitable displays to facilitate computer-aided theme exploration, suitable e.g. for implementing step 100 in FIGS. 3 a-3 b, taken together. Alternatively, the method of FIG. 4 may be employed to facilitate computer-aided exploration of any set of themes, which need not have been generated using any or all of steps 10-90 in FIGS. 3 a-3 b, taken together. The method of FIG. 4 may include some or all of the following steps, suitably ordered e.g. as shown:
  • 410: Receive e.g. from user, a theme attribute by which to sort themes, e.g.
      • Number of documents in theme
      • Document Score-related attribute e.g. theme's average or median or mode document score
      • How many times has theme been accessed in the past, using stored history of user/group of users
      • Theme name (can be sorted in alphabetical order)
      • % (richness) or absolute number of documents belonging to the theme which match a predicate (e.g. are relevant to a predicate, e.g. using the Equivio relevance software tool). A predicate is a logical combination of conditions that the documents must satisfy. Examples of conditions: specific document types, specific languages, above/below a relevance score generated e.g. by Equivio Zoom's relevance functionality. A predicate may be user-selected e.g. via a suitable GUI.
  • 420: Sort themes by a default or user-selected (in step 410) theme attribute and display themes in order determined by sort process OR display only themes which match a criterion (example criteria: more than 85% of documents in theme are relevant to user-selected predicate, theme name includes “Kennedy”, theme includes more than 1000 documents).
  • 430: display theme attribute, in association with displayed theme e.g. how many documents belonging to theme match a predicate (e.g. are relevant to a predicate, e.g. using Equivio relevance software tool).
  • 440: responsive to a user's selection of (e.g. clicking on a displayed) theme, identify themes which are similar to the user-selected theme by identifying themes which have many (number > threshold) documents in common with the user-selected theme.
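  • Step 440's notion of theme similarity — shared documents above a threshold — can be sketched as follows; the theme names, document IDs and threshold value are illustrative:

```python
def similar_themes(theme_docs, selected, threshold=2):
    """Step 440 sketch: themes sharing many documents with the selected
    theme (overlap count > threshold; threshold illustrative)."""
    target = theme_docs[selected]
    return [t for t, docs in theme_docs.items()
            if t != selected and len(docs & target) > threshold]

# Hypothetical theme-to-document assignments (sets of document IDs).
theme_docs = {
    "merger":    {1, 2, 3, 4, 5},
    "contracts": {3, 4, 5, 6},
    "payroll":   {7, 8},
}
print(similar_themes(theme_docs, "merger", threshold=2))
# → ['contracts']
```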
  • 450: responsive to a user's selection of (e.g. clicking on a displayed) theme,
  • sort the documents in the theme by a default or user-selected document attribute, which may include metadata (custodian, date) or theme-related data (e.g. relevance of the document to the selected predicate, e.g. using the Equivio relevance software tool), and display the documents in the theme in the order determined by the sort process.
  • 460: responsive to a user's selection of (e.g. clicking on a displayed) document,
  • Select and display files whose distributions, e.g. rank distributions, over topics are similar to the selected document's distribution e.g. rank distribution over topics. For example, take the vector of scores of the selected document over all themes e.g., for 5 themes, (0.4, 0.01, 0.7, 0, 0); then display all documents whose distance from the above is less than a constant. Any suitable distance metric or function may be employed such as but not limited to Euclidean distance, L-infinity distance (max entry distance), L−1 distance, and Manhattan distance.
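  • Step 460's document-level browsing can be sketched with NumPy, using the distance metrics named above; the score vectors and the distance constant are illustrative:

```python
import numpy as np

def similar_documents(doc_theme, selected, max_dist=0.5, metric="euclidean"):
    """Step 460 sketch: documents whose theme-score vectors lie within
    max_dist of the selected document's vector (constants illustrative)."""
    diffs = doc_theme - doc_theme[selected]
    if metric == "euclidean":
        d = np.sqrt((diffs ** 2).sum(axis=1))
    elif metric == "linf":                       # L-infinity: max entry distance
        d = np.abs(diffs).max(axis=1)
    else:                                        # L-1 / Manhattan distance
        d = np.abs(diffs).sum(axis=1)
    hits = np.where((d <= max_dist) & (np.arange(len(doc_theme)) != selected))[0]
    return hits[np.argsort(d[hits])]             # nearest documents first

# Scores of 4 documents over 5 themes, e.g. (0.4, 0.01, 0.7, 0, 0) as in step 460.
doc_theme = np.array([
    [0.4, 0.01, 0.7, 0.0, 0.0],
    [0.4, 0.00, 0.6, 0.0, 0.0],
    [0.0, 0.90, 0.0, 0.1, 0.0],
    [0.3, 0.05, 0.7, 0.0, 0.0],
])
print(similar_documents(doc_theme, selected=0, max_dist=0.2))
# → [1 3]
```

  • The same vector-distance machinery supports word-level browsing (step similar distributions over themes), with word-theme scores substituted for document-theme scores.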
  • 470: responsive to a user's selection of a document attribute (e.g. metadata (Custodian, date)), compute distribution and display (e.g. as histogram): number (or %) of documents under (say) custodian C or date D belonging to each theme.
  • The Themes functionality herein is particularly useful for identifying relevant documents in a large collection of electronic documents which is sparse in that only a small number of documents are relevant to a particular issue. This is especially the case if it is not possible to identify keywords which can be used to tag relevant documents on the basis of a simple keyword search.
  • Computerized systems for identifying relevant documents in a large collection of electronic documents exist, such as Equivio Zoom's Relevance functionality.
  • However, for a sparse document set, it is sometimes necessary to seed the initial training with pre-identified relevant documents, rather than randomly selecting a training set which might include a tiny or zero amount of relevant documents. For example, the current Equivio Zoom user guide describes (in section 6.3, from page 58 onward) a process of Adding Seed Files to an Issue.
  • FIG. 5 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, each theme is presented together with a bar (right side of screen) indicating relevance to a user-selected issue, and the order of presentation of the themes is in accordance with the length of the bar. For example, the bars may indicate the number of files per theme that are Relevant (e.g. as determined manually, or automatically e.g. by Equivio Zoom's Relevance functionality) to a user-selected issue, which may if desired be shown on, and selected via, a suitable GUI (not shown). The screen display facilitates theme-level browsing, according to certain embodiments.
  • FIG. 6 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, documents are presented along with (on the left) words in the documents and their word scores, relative to an individual theme, as well as themes related to the individual theme. The words and themes may be presented in descending order of their word scores and relatedness to the individual theme, respectively. If a related theme or word is selected (e.g. clicked upon), a different “semantic view” is generated; for example, of all documents in the selected related theme, using a screen format which may be similar to that of FIG. 6. As shown, flitting from document-level to theme-level or word-level is facilitated.
  • FIG. 7 is a simplified screenshot illustration of an example display screen generated by a system constructed and operative in accordance with certain embodiments. As shown, document-level browsing is facilitated.
  • FIG. 8 is a simplified flowchart illustration, according to certain embodiments, of a method for utilizing computerized Themes functionality under these circumstances. The method of FIG. 8 typically includes some or all of the following steps, suitably ordered e.g. as shown:
  • 1010. use computerized Themes functionality to identify an initial “seed” set of (say 5-30) relevant documents in a large sparse collection of electronic documents.
  • 1020. generate a training set of documents including the initial “seed” set of relevant documents and at least an equal number of documents randomly selected from the large sparse collection of electronic documents.
  • 1030. operate computerized relevant document identification system, e.g. Equivio Zoom's Relevance functionality on the training set, thereby to successfully identify the rare relevant documents in the large sparse collection.
  • It is appreciated that step 1010 may be performed in any suitable manner. For example, if at least one relevant document is known, step 1010 may comprise:
  • a. running the “themes” functionality to obtain a “thematic distribution” for the relevant document e.g. an indication of the significance of each of the various themes to the document. Some topic modeling software provides “document scores” for each document relative to each theme, indicating significance of each of the various themes to each document. Alternatively, if the top key words on the key word list of a theme occur relatively frequently in some documents and relatively infrequently in others, the theme can be regarded as highly significant to the former documents and less significant to the latter documents.
  • b. selecting documents within the large collection of electronic documents whose “thematic distribution” is similar, using a suitable metric, to the “thematic distribution” of the document known to be relevant. A suitable metric for similarity between Document 1's “thematic distribution” and Document 2's “thematic distribution” may for example be a Euclidean (e.g. sum-of-squares-based) distance between the document scores of Document 1, taken over all themes, and the document scores of Document 2, taken over all themes. Other distance metrics may also be employed, e.g. L-infinity distance (max entry distance), L-1 distance, and Manhattan distance.
  • It is appreciated that computerized processing tends to generate clusters (and topics) that are artifactual. For example—presence of the word “weekend” might trigger definition of a cluster of documents which, upon inspection, would be found to include a mass of emails about an unrelated variety of subjects united only by the fact that the emails were written on a Friday hence include an exhortation to “have a nice weekend”. In multi-topic processing (e.g. topic modeling in which one document can be assigned to several topics), this is of less relevance: of the many topics found, some are safely ignored as artifactual and the system as a whole remains workable. In clustering (in which each document can belong to only one topic) however, important documents can be assigned to an artifactual cluster and thereby effectively disappear since disregarding the artifactual cluster tends to lead to disregarding documents assigned thereto.
  • It is appreciated that the systems and methods shown and described herein enable a 3-tier browsing system to be generated, in which a user can browse at the word, document/file or topic level, and can move from one level to another. For example, a user may look at a presentation of topics, arranged say by relevance to an issue, and the system may present to her or him words or documents whose scores for the theme/s the user has selected are high. The system may for example compute word scores or document scores for all words or documents, sort the words or documents, and present to the user only those whose word or document scores are high. The user may then select one of those words or documents, thereby browsing to a different level. When s/he does select, say, a document scoring high for the topic s/he previously was viewing, the system then shows the document, and also identifies and displays indications of themes to which the document is strongly related, and words whose word scores are high for the themes to which the document is strongly related. The system may do this by computing the degree of relatedness of the document to each of the themes, sorting the themes on this basis, and presenting to the user only those themes for which the document's degree of relatedness is high. Again the user can change level, from the document level up to the topic level or down to the word level, or the user may continue to browse at the document level, e.g. to documents whose distribution over the identified themes is similar (using a suitable distance metric) to the distribution over the identified themes of the document of previous interest. To support this, the system may compute all documents' distributions over all themes identified, and may also compute the distances between these distributions, either in advance for all document pairs, or in real time for a user-designated document.
The system may then present, responsive to a user request for documents similar to document D, the top few documents from a list of documents sorted in accordance with the documents' respective distances from Document D. Alternatively, a user may perform “word-level” browsing by moving from one word to another word which has a similar distribution over N identified topics. To support this, the system may compute all words' distributions over all themes identified, and may also compute the distances between these distributions, either in advance for all word pairs, or in real time for a user-designated word. The system may then present, responsive to a user request for words similar to an individual word W of interest, the top few words from a list of words sorted in accordance with the words' respective distances from Word W.
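The distance-based browsing described above may be sketched as follows. This is an illustrative sketch only: the document identifiers, the theme distributions, and the choice of Jensen-Shannon divergence as the "suitable distance metric" are assumptions for the example, not part of the invention as claimed. The same routine applies unchanged to word-level browsing, with words' distributions over themes in place of documents'.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two theme distributions (lists of probabilities)."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def similar_documents(doc_id, distributions, top_k=3):
    """Return the top_k documents whose theme distributions are closest to doc_id's."""
    target = distributions[doc_id]
    others = [(other, js_divergence(target, dist))
              for other, dist in distributions.items() if other != doc_id]
    others.sort(key=lambda pair: pair[1])  # smallest distance first
    return [other for other, _ in others[:top_k]]

# Hypothetical distributions of four documents over three themes:
dists = {
    "D1": [0.7, 0.2, 0.1],
    "D2": [0.6, 0.3, 0.1],
    "D3": [0.1, 0.1, 0.8],
    "D4": [0.1, 0.2, 0.7],
}
print(similar_documents("D1", dists, top_k=2))  # → ['D2', 'D4']
```

Precomputing all pairwise distances in advance trades O(n²) storage for instant response; computing them in real time for a user-designated document needs only one pass over the collection.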
  • Another embodiment of the invention, e.g. as described above with reference to FIG. 4, is a browsing system operative in conjunction with a stored representation of a multiplicity of electronic documents and their distribution over a plurality of themes, the system comprising some or all of the following:
  • theme-to-word flitting apparatus for retrieving and presenting to a user, words whose word score for at least one user-selected theme is high;
  • theme-to-document flitting apparatus for retrieving and presenting to a user, documents whose document score for at least one user-selected theme is high;
  • document-level browsing apparatus for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes; and
  • word-level browsing apparatus for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes,
  • thereby to provide 2- or 3-tier browsing apparatus facilitating browsing at word, document and topic levels responsive to user-initiated flitting between the levels.
  • It is appreciated that any suitable parameters and work-processes may be employed. For example, a set of electronic documents comprising thousands, tens or hundreds of thousands, or millions of electronic documents may be processed as described herein.
  • Typically, the number of themes to identify is selected by a user, and any suitable number of themes may be requested, such as 10, 20, 50, 100, 200 or 500 themes. For example, the number of themes selected may be, perhaps, 200 themes for a collection of a few hundred thousand electronic documents, and proportionally more or fewer themes if the number of documents in the collection is proportionally larger or smaller.
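The proportional rule of thumb above can be expressed as a small illustrative heuristic. The reference values and clamping bounds below are assumptions drawn from the example figures in this paragraph, not prescribed parameters:

```python
def suggested_theme_count(n_docs, ref_docs=300_000, ref_themes=200,
                          minimum=10, maximum=500):
    """Illustrative heuristic: scale the 200-themes-per-few-hundred-thousand-
    documents example linearly with collection size, clamped to a range."""
    return max(minimum, min(maximum, round(ref_themes * n_docs / ref_docs)))

print(suggested_theme_count(300_000))  # → 200
print(suggested_theme_count(30_000))   # → 20
```

In practice the user may simply override the suggestion with any of the example values (10, 20, 50, 100, 200 or 500).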
  • Any suitable “view” of themes may be provided, such as themes sorted by number of files or meta-data attributes of the files, themes sorted by various attributes of the words in the theme name, themes sorted by relevance to an issue and so forth.
  • A particular advantage of certain embodiments is that documents which are known to be mutually similar or near duplicates are “thinned” so that they do not over-influence or skew the topic modeling process.
  • It is appreciated that thinning need not result in retaining only a single pivot or only a single inclusive email; instead one may, if appropriate, reduce the influence of repeated or highly related materials without eliminating the repetition entirely.
  • Regarding topic-modeling steps herein e.g. step v of FIG. 2, step 3 of FIG. 2, step 70 of FIG. 3:
  • A topic model is a computational functionality analyzing a set of documents and yielding “topics” that occur in the set of documents typically including (a) what the topics are and (b) what each document's balance of topics is. According to Wikipedia, “Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in documents about cats, and “the” and “is” will appear equally in both. A document typically concerns multiple topics in different proportions”. Topic models may analyze large volumes of unlabeled text and each “topic” may consist of a cluster of words that occur together frequently.
  • Another definition, from the following http location:
  • faculty.washington.edu/jwilker/559/SteyversGriffiths.pdf, is that topic modeling functionality proceeds from an assumption “that documents are mixtures of topics, where a topic is a probability distribution over words. A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic. Standard statistical techniques can be used to invert this process, inferring the set of topics that were responsible for generating a collection of documents.”
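The generative procedure quoted above may be sketched as follows. This is illustrative only: the topics, their vocabularies and the topic mixture are invented for the example; a real topic model would infer such structures from a corpus rather than have them specified.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical topics: each is a probability distribution over words.
topics = {
    "pets":    {"dog": 0.4, "cat": 0.4, "bone": 0.1, "meow": 0.1},
    "finance": {"stock": 0.5, "price": 0.3, "market": 0.2},
}

def draw(dist):
    """Draw one key from a {item: probability} distribution."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # guard against floating-point rounding

def generate_document(topic_mixture, n_words):
    """Per the quoted generative process: for each word position, choose a
    topic from the document's mixture, then draw a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = draw(topic_mixture)
        words.append(draw(topics[topic]))
    return words

doc = generate_document({"pets": 0.7, "finance": 0.3}, n_words=10)
print(doc)  # ten words drawn from the two topic vocabularies
```

"Inverting" this process — inferring the topics and mixtures that plausibly generated an observed collection — is what LDA and related methods do with standard statistical techniques.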
  • Topic modeling as used herein includes any or all of the above, as well as any computerized functionality which inputs text/s and uses a processor to generate and output a list of semantic topics which the text/s are assumed to pertain to, wherein each “topic” comprises a list of keywords assumed to represent a semantic concept.
  • Topic modeling includes but is not limited to any and all of: the Topic modeling functionality described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998; Probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann in 1999; Latent Dirichlet allocation (LDA), developed by David Blei, Andrew Ng, and Michael I. Jordan in 2002 and allowing documents to have a mixture of topics; extensions on LDA, such as but not limited to Pachinko allocation;
  • Griffiths & Steyvers Topic modeling e.g. as published in 2002, 2003, 2004; Hofmann Topic modeling e.g. as published in 1999, 2001; topic modeling using the synchronic approach; topic modeling using the diachronic approach; Topic modeling functionality which attempts to fit appropriate model parameters to the data corpus using heuristic/s for maximum likelihood fit, topic modeling functionality with provable guarantees; topic modeling functionality which uses singular value decomposition (SVD), topic modeling functionality which uses the method of moments, topic modeling functionality which uses an algorithm based upon non-negative matrix factorization (NMF); and topic modeling functionality which allows correlations among topics. Topic modeling implementations may for example employ Mallet (software project), Stanford Topic Modeling Toolkit, or GenSim—Topic Modeling for Humans.
  • Earlier presented embodiments are now described for use either independently or in suitable combination with the embodiments described above:
  • When enhancing expert-based computerized analysis of a set of digital documents, a system for computerized derivation of leads from a huge body of data may be provided, the system comprising:
  • an electronic repository including a multiplicity of accesses to a respective multiplicity of electronic documents and metadata including metadata parameters having metadata values characterizing each of the multiplicity of electronic documents;
  • a relevance rater using a processor to run a first computer algorithm on the multiplicity of electronic documents which yields a relevance score which rates relevance of each of the multiplicity of electronic documents to an issue; and
  • a metadata-based relevant-irrelevant document discriminator using a processor to rapidly run a second computer algorithm on at least some of the metadata which yields leads, each lead comprising at least one metadata value for at least one metadata parameter, which value correlates with relevance of the electronic documents to the issue.
  • The application is operative to find outliers of a given metadata and relevancy score (i.e. relevant, not relevant). When theme-exploring is used, the system can identify themes with a high relevancy score based on the given application. The above system, without theme-exploring, may compute the outliers for a given metadata parameter, where each document appears once under each metadata parameter. In the theme-exploring setting, for a given set of themes, the same document might fall under several metadata values.
  • Method for Use of Themes in e-Discovery (FIG. 1):
  • step i. Input: a set of electronic documents. The documents could be in:
    Text format, native files (PDF, Word, PPT, etc.), ZIP files, PST, Lotus Notes, MSG, etc.
    Step ii: Extract text from the data collection. Text extraction can be done by third-party software such as Oracle Outside In, ISYS, dtSearch, iFilter, etc.
    Step iii: Compute Near-duplicate (ND) on the dataset.
    The following teachings may be used: U.S. Pat. No. 8,015,124, entitled “A Method for Determining Near Duplicate Data Objects”; and/or WO 2007/086059, entitled “Determining Near Duplicate “Noisy” Data Objects”, and/or suitable functionalities in commercially available e-discovery systems such as those of Equivio.
  • For each document compute the following:
  • Step iiia: DuplicateSubsetID: all documents having the same DuplicateSubsetID have identical text.
    Step iiib: EquiSetID: all documents having the same EquiSetID are similar (for each document x in the set there is another document y in the set, such that the similarity between the two is greater than some threshold).
    Step iiic: Pivot: 1 if the document is a representative of the set (and 0 otherwise). Typically, for each EquiSet only one document is selected as Pivot. The pivot document can be selected by a policy, for example maximum number of words, minimum number of words, median number of words, minimum docid, etc. When using theme networking (TN) it is recommended to use maximum number of words as the pivot policy, as it is desirable for the largest documents to be in the model.
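Pivot selection under the policies of step iiic may be sketched as follows. This is an illustrative sketch; the representation of an EquiSet as a list of (doc_id, text) pairs is an assumption made for the example.

```python
def select_pivot(equiset, policy="max_words"):
    """Select one representative (pivot) document from an EquiSet.
    equiset is a list of (doc_id, text) pairs; policies follow step iiic."""
    word_counts = {doc_id: len(text.split()) for doc_id, text in equiset}
    if policy == "max_words":   # recommended for theme networking
        return max(word_counts, key=word_counts.get)
    if policy == "min_words":
        return min(word_counts, key=word_counts.get)
    if policy == "min_docid":
        return min(doc_id for doc_id, _ in equiset)
    raise ValueError("unknown policy: " + policy)

equiset = [("doc7", "short text"),
           ("doc3", "a somewhat longer near duplicate text")]
print(select_pivot(equiset))  # → doc3 (most words)
```

Under the recommended max-words policy the largest member of each EquiSet enters the model, per the note above.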
    Step iv. Compute Email threads (ET) on the dataset. The following teachings may be used: WO 2009/004324, entitled “A Method for Organizing Large Numbers of Documents” and/or suitable functionalities in commercially available e-discovery systems such as those of Equivio.
    The output of this phase is a collection of trees, and all leaves of the trees are marked as inclusive. Note that family information is taken into account (to group e-mails with their attachments).
    Step v. Run a topic modeling algorithm (such as LDA) on a subset of the dataset, including feature extraction. Resulting topics are defined as themes. The subset includes the following documents:
      • Inclusive from Email threads (ET)
      • Pivots from all documents that are not e-mails. i.e. pivots from documents and attachments.
  • The data collection then includes fewer files (usually about 50% of the total size), and the data does not include similar documents; therefore, if a document appears many times in the original data collection it will have the same weight as if it appeared once.
  • In building the model, only documents with more than 25 (parameter) words and fewer than 20,000 words were used. The idea behind this limitation was to improve performance, and to avoid being influenced by high word frequencies when a document has few features.
  • If the dataset is extremely large, at most 100,000 (parameter) documents may be selected at random to build the model, and after building the model, it may be applied on all other documents.
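Selection of the model-building subset of step v (inclusives from e-mail threads, plus pivots of documents that are not e-mails) may be sketched as follows. This is illustrative only; the per-document flags are hypothetical names standing in for the outputs of the ND and ET steps.

```python
def select_model_documents(docs):
    """Select the subset used to build the topic model per step v:
    inclusive e-mails, plus pivots of non-e-mail documents.
    Each doc is a dict with hypothetical 'is_email', 'inclusive', 'pivot' flags."""
    subset = []
    for doc in docs:
        if doc["is_email"] and doc["inclusive"]:
            subset.append(doc["id"])       # inclusive from Email threads (ET)
        elif not doc["is_email"] and doc["pivot"]:
            subset.append(doc["id"])       # pivot from near-duplicates (ND)
    return subset

docs = [
    {"id": "e1", "is_email": True,  "inclusive": True,  "pivot": False},
    {"id": "e2", "is_email": True,  "inclusive": False, "pivot": False},
    {"id": "w1", "is_email": False, "inclusive": False, "pivot": True},
    {"id": "w2", "is_email": False, "inclusive": False, "pivot": False},
]
print(select_model_documents(docs))  # → ['e1', 'w1']
```

After the model is built on this thinned subset (optionally capped at 100,000 randomly sampled documents), it may be applied to all remaining documents.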
  • The first step in the topic modeling algorithm is to extract features from each document.
  • A method suitable for the Feature extraction of step v may include obtaining features as follows:
  • A topic modeling algorithm uses features to create the model for the topic-modeling step v above. The features are words; to generate a list of words from each document one may do the following:
  • If the document is an e-mail, remove all e-mail headers in the document, but keep the subject line and the body. One may repeat the subject line several times to give some weight to the subject words. Tokenize the text using separators such as spaces, semicolons, colons, tabs, new lines, etc. Ignore the following features:
    Words with length less than 3 (parameter)
    Words with length greater than 20 (parameter)
    Words that do not start with an alpha character.
    Words that are stop words.
    Words that appear more than 0.2 times the number of words in the document. (parameter)
    Words that appear in fewer than 0.01 times the number of documents. (parameter)
    Words that appear in more than 0.2 times the number of documents. (parameter)
    Stemming and part-of-speech tags may also be used as features.
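The within-document filters listed above may be sketched as follows. This is illustrative only: the stop-word list and separator set are assumptions, and the collection-level document-frequency filters (the 0.01 and 0.2 document thresholds) would require a second pass over the whole dataset, which is omitted here.

```python
import re

STOP_WORDS = {"the", "and", "for", "with"}  # illustrative stop-word list

def extract_features(text, min_len=3, max_len=20, max_doc_freq=0.2):
    """Tokenize one document and apply the word-level filters of step v."""
    # Split on the listed separators: spaces, semicolons, colons, tabs, newlines, etc.
    tokens = [t.lower() for t in re.split(r"[ \t\n;:,.]+", text) if t]
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    features = []
    for t in tokens:
        if len(t) < min_len or len(t) > max_len:
            continue  # word too short or too long
        if not t[0].isalpha():
            continue  # must start with an alphabetic character
        if t in STOP_WORDS:
            continue  # stop word
        if counts[t] > max_doc_freq * len(tokens):
            continue  # appears too often within this document
        features.append(t)
    return features

print(extract_features("The cat and the cat chased a mouse; 42 times with joy"))
# → ['cat', 'cat', 'chased', 'mouse', 'times', 'joy']
```

The thresholds are parameters, per the list above, and may be tuned per collection.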
    Step viii. Theme names. The output of step v includes an assignment of documents to themes, and an assignment of words (features) to themes. Each feature x has some probability P_xy of being in theme y. Using the P matrix, construct names for the themes.
  • In e-discovery one may use the following scenarios: Early Case Assessment, Post Case Assessment and provision of helpful User Interfaces.
  • Early Case Assessment (FIG. 2, Including Some or all of the Following Steps a-h):
    a. Select at random 100,000 documents
    b. Run near-duplicates (ND)
    c. Run Email threads (ET)
    d. Select pivot and inclusive
    e. Run topic modeling using the above feature selection. The input of the topic modeling is a set of documents. The first phase of the topic modeling is to construct a set of features for each document. The feature-extraction method described above may be used to construct the set of features.
    f. Run the model on all other documents (optional).
    g. Generate theme names e.g. using step viii above.
    h. Explore the data by browsing themes; one may open a list of documents belonging to a certain theme, see from a document all themes connected to that document, and go on to other themes.
    The list of documents might be filtered by a condition set by the user, for example filtering all documents by date, relevancy, file size, etc.
    The above procedure assists users in early case assessment when the data is known and one would like to know what is in the data, and assess the contents of the data collection.
    In early case assessment one may randomly sample the dataset to get results faster.
  • Post Case Assessment: This process uses some or all of steps i-v above, but in this setting an entire dataset is not used; rather, only the documents that are relevant to the case. If near-duplicates (ND) and Email threads (ET) have already been run, there is no need to re-run them.
  • 1st-pass review is a quick review of the documents that can be handled manually or by automatic predictive coding software; the user wishes to review the results and get an idea of the themes of the documents that passed that review. This phase is essential because the number of such documents might be extremely large. Also, there are cases in which, in some sub-issues, there are only a few documents.
  • The above building blocks can generate a procedure for such cases. Here, only documents that passed the 1st review phase are taken, and themes are calculated for them.
  • User Interface using the output of steps i-v and displaying results thereof. Upon running the topic modeling, each resulting topic is defined as a theme, and for each theme the list of documents related to that theme is displayed. The user has an option to select a meta-data attribute (for example whether the document is relevant to an issue, custodian, date-range, file type, etc.) and the system will display, for each theme, the percentage of that meta-data value in the theme. Such a presentation assists the user in evaluating the theme.
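The per-theme meta-data percentage display may be computed as sketched below. This is illustrative; the theme-to-document assignments and the relevance flags are invented for the example.

```python
def metadata_percentages(theme_docs, metadata):
    """For each theme, the percentage of its documents having a given
    boolean meta-data flag (e.g. relevant to an issue)."""
    return {theme: 100.0 * sum(metadata[d] for d in docs) / len(docs)
            for theme, docs in theme_docs.items()}

# Hypothetical theme assignments and per-document relevance flags:
theme_docs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d2", "d5"]}
relevant = {"d1": True, "d2": True, "d3": False, "d4": False, "d5": True}
print(metadata_percentages(theme_docs, relevant))  # → {'t1': 50.0, 't2': 100.0}
```

Note that document d2 contributes to both themes, reflecting that in topic modeling a document may belong to several themes.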
  • An LDA model might have themes that can be classified as CAT_related and DOG_related. A theme has probabilities of generating various words, such as milk, meow, and kitten, which can be classified and interpreted by the viewer as “CAT_related”. The word cat itself will have high probability given this theme. The DOG_related theme likewise has probabilities of generating each word: puppy, bark, and bone might have high probability. Words without special relevance, such as the (see function word), will have roughly even probability between classes (or can be placed into a separate category). A theme is not strongly defined, neither semantically nor epistemologically. It is identified on the basis of supervised labeling and (manual) pruning on the basis of their likelihood of co-occurrence. A lexical word may occur in several themes with a different probability, however, with a different typical set of neighboring words in each theme.
  • Each document is assumed to be characterized by a particular set of themes. This is akin to the standard bag of words model assumption, and makes the individual words exchangeable.
  • Processing a large data set requires time and space; in the context of the current invention, N documents are selected to create the model, and the model is then applied to the remaining documents.
  • When selecting the documents to build the model, a few options may be possible:
      • O1. Take all documents.
      • O2. Take one document from each set of exact-duplicate documents.
      • O3. Take one document from each EquiSet (e.g. as per U.S. Pat. No. 8,015,124, entitled “A Method for Determining Near Duplicate Data Objects”; and/or WO 2007/086059, entitled “Determining Near Duplicate “Noisy” Data Objects”).
    • O4. Take the inclusives from the data collection. Another option is to randomly sample X documents from the collection, as described above.
      Options O2, O3 and O4 aim to create themes that are known to the user, and also not to over-weight documents that already appear in a known set.
      The input for the algorithm is a text document that can be parsed into a bag-of-words. When processing an e-mail, one may notice that the e-mail contains a header (From, To, CC, Subject) and a body. The body of an e-mail can be formed by a series of e-mails.
  • For example:
    From: A
    To: B
    Subject: CCCCC
    Body1 Body1
       From: B
       To: A
       Subject: CCCCC
       Body2 Body2

    While processing e-mails for topic modeling one can consider removing all e-mail headers within the body, and setting a weight to the subject by repeating the subject line. In the above example the processed text would be:
  • CCCCC
    CCCCC
    CCCCC
    Body1 Body1
    Body2 Body2
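The transformation illustrated above (dropping all header lines, including headers of quoted e-mails in the body, keeping the bodies, and weighting the subject by repetition) may be sketched as follows. A subject weight of 3 is assumed here to match the tripled subject line in the example.

```python
HEADER_PREFIXES = ("from:", "to:", "cc:", "subject:")

def preprocess_email(raw, subject_weight=3):
    """Strip all header lines from an e-mail (including headers of quoted
    e-mails within the body), keep only the bodies, and repeat the subject
    line subject_weight times to give its words extra weight."""
    subject, body_lines = None, []
    for line in raw.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("subject:"):
            if subject is None:
                subject = stripped.split(":", 1)[1].strip()
            continue  # drop repeated subject lines of quoted e-mails
        if lowered.startswith(HEADER_PREFIXES):
            continue  # drop header lines
        if stripped:
            body_lines.append(stripped)
    return [subject] * subject_weight + body_lines

raw = """From: A
To: B
Subject: CCCCC
Body1 Body1
From: B
To: A
Subject: CCCCC
Body2 Body2"""
print("\n".join(preprocess_email(raw)))
# → CCCCC (three times), then Body1 Body1, Body2 Body2
```

The resulting word list can then be fed to the feature-extraction filters of step v.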

    Step viii (Theme names) is now described in detail:
    Let P(w_i,t_j) be the probability that the feature w_i belongs to theme t_j. In known implementations the theme name is a list of the words with the highest probability. This solution is good when the dataset is sparse, i.e. the vocabularies of the themes differ from each other. In e-discovery the issues are highly connected and therefore there are cases in which the “top” words appear in two or more themes. In this setting, an algorithm in the spirit of the “stable marriage” problem was used to pair words with themes. The algorithm may include:
  • Order the themes by some criterion (size, quality, # of relevant documents, etc.); i.e. theme_3 is better than theme_4.
    (1) Create an empty set S
    (2) Sort themes by the criterion
    (3) For j = 0; j < maximum words in theme name; j++
    (4)   For i = 0; i < number of themes; i++ do
    (5)     For theme_i, assign the word with the highest score that is not in S, and add that word to S
  • After X words are assigned to each theme, the number of words can be reduced by, for example, taking only those words in each theme whose score is greater than the maximum word score in that theme divided by some constant.
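The stable-marriage-style naming of step viii may be sketched as follows. This is illustrative only: the themes, words and probabilities are invented, and ties are broken by the order in which words appear in each theme's distribution.

```python
def name_themes(word_probs, theme_order, words_per_theme=2):
    """Greedily assign distinctive name words to themes.
    word_probs: {theme: {word: probability}} (the P matrix of step viii);
    theme_order: themes sorted by some quality criterion, best first.
    Each word is used in at most one theme's name."""
    used = set()
    names = {theme: [] for theme in theme_order}
    for _ in range(words_per_theme):          # outer loop (3) of the algorithm
        for theme in theme_order:             # inner loop (4), best theme first
            # step (5): highest-probability word for this theme not yet taken
            candidates = sorted(word_probs[theme].items(),
                                key=lambda kv: kv[1], reverse=True)
            for word, _p in candidates:
                if word not in used:
                    used.add(word)
                    names[theme].append(word)
                    break
    return names

probs = {
    "t1": {"merger": 0.5, "deal": 0.3, "price": 0.2},
    "t2": {"merger": 0.4, "audit": 0.4, "report": 0.2},
}
print(name_themes(probs, ["t1", "t2"]))
# → {'t1': ['merger', 'deal'], 't2': ['audit', 'report']}
```

Note that the shared top word "merger" goes to the better-ranked theme t1, so t2's name remains distinctive even though its vocabulary overlaps t1's.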
  • Typically, electronic documents do not bear, or do not need to bear, any pre-annotation or labeling or meta-data, or if they do, such is not employed by the topic modeling which instead is derived by analyzing the actual texts.
  • A particular advantage of certain embodiments of the invention is that collections of electronic documents are hereby analyzed semantically by a processor on a scale that would be impossible manually. Output of topic modeling may include the n most frequent words from the m most frequent topics found in an individual document.
  • It is appreciated that when presenting documents, it need not be the case that all documents whose document score for at least one user-selected theme is high in a defined sense (e.g. over a certain threshold) are displayed. Similarly, it need not be the case that all documents whose distributions over the plurality of themes are similar in a defined sense to the distribution of a user-selected document over the plurality of themes are displayed. Instead, only a subset of the documents may be displayed, e.g. only such documents as answer at least one individual criterion. So, for example, a search engine could be used on the data collection, and then results of the search query might be presented using the embodiments shown and described herein. Alternatively or in addition, a predicate may be used as a criterion, e.g. presenting only documents in English or only documents relevant to a given issue.
  • The methods shown and described herein are particularly useful in processing or analyzing or sorting or searching bodies of knowledge including hundreds, thousands, tens of thousands, or hundreds of thousands of electronic documents or other computerized information repositories, some or many of which are themselves at least tens or hundreds or even thousands of pages long. This is because practically speaking, such large bodies of knowledge can only be processed, analyzed, sorted, or searched using computerized technology.
  • It is appreciated that terminology such as “mandatory”, “required”, “need” and “must” refer to implementation choices made within the context of a particular implementation or application described herewithin for clarity and are not intended to be limiting, since in an alternative implementation, the same elements might be defined as not mandatory and not required, or might even be eliminated altogether.
  • It is appreciated that software components of the present invention including programs and data may, if desired, be implemented in ROM (read only memory) form including CD-ROMs, EPROMs and EEPROMs, or may be stored in any other suitable typically non-transitory computer-readable medium such as but not limited to disks of various kinds, cards of various kinds and RAMs. Components described herein as software may, alternatively, be implemented wholly or partly in hardware and/or firmware, if desired, using conventional techniques, and vice-versa. Each module or component may be centralized in a single location or distributed over several locations.
  • Included in the scope of the present invention, inter alia, are electromagnetic signals carrying computer-readable instructions for performing any or all of the steps or operations of any of the methods shown and described herein, in any suitable order including simultaneous performance of suitable groups of steps as appropriate; machine-readable instructions for performing any or all of the steps of any of the methods shown and described herein, in any suitable order; program storage devices readable by machine, tangibly embodying a program of instructions executable by the machine to perform any or all of the steps of any of the methods shown and described herein, in any suitable order; a computer program product comprising a computer useable medium having computer readable program code, such as executable code, having embodied therein, and/or including computer readable program code for performing, any or all of the steps of any of the methods shown and described herein, in any suitable order; any technical effects brought about by any or all of the steps of any of the methods shown and described herein, when performed in any suitable order; any suitable apparatus or device or combination of such, programmed to perform, alone or in combination, any or all of the steps of any of the methods shown and described herein, in any suitable order; electronic devices each including a processor and a cooperating input device and/or output device and operative to perform in software any steps shown and described herein; information storage devices or physical records, such as disks or hard drives, causing a computer or other device to be configured so as to carry out any or all of the steps of any of the methods shown and described herein, in any suitable order; a program pre-stored e.g. 
in memory or on an information network such as the Internet, before or after being downloaded, which embodies any or all of the steps of any of the methods shown and described herein, in any suitable order, and the method of uploading or downloading such, and a system including server/s and/or client/s for using such; a processor configured to perform any combination of the described steps or to execute any combination of the described modules; and hardware which performs any or all of the steps of any of the methods shown and described herein, in any suitable order, either alone or in conjunction with software. Any computer-readable or machine-readable media described herein is intended to include non-transitory computer- or machine-readable media.
  • Any computations or other forms of analysis described herein may be performed by a suitable computerized method. Any step described herein may be computer-implemented. The invention shown and described herein may include (a) using a computerized method to identify a solution to any of the problems or for any of the objectives described herein, the solution optionally includes at least one of a decision, an action, a product, a service or any other information described herein that impacts, in a positive manner, a problem or objectives described herein; and (b) outputting the solution.
  • The system may, if desired, be implemented as a web-based system employing software, computers, routers and telecommunication equipment as appropriate.
  • Any suitable deployment may be employed to provide functionalities e.g. software functionalities shown and described herein. For example, a server may store certain applications, for download to clients, which are executed at the client side, the server side serving only as a storehouse. Some or all functionalities e.g. software functionalities shown and described herein may be deployed in a cloud environment. Clients e.g. mobile communication devices such as smartphones may be operatively associated with, but external to the cloud.
  • The scope of the present invention is not limited to structures and functions specifically described herein and is also intended to include devices which have the capacity to yield a structure, or perform a function, described herein, such that even though users of the device may not use the capacity, they are, if they so desire, able to modify the device to obtain the structure or function.
  • Features of the present invention which are described in the context of separate embodiments may also be provided in combination in a single embodiment.
  • For example, a system embodiment is intended to include a corresponding process embodiment. Also, each system embodiment is intended to include a server-centered “view” or client centered “view”, or “view” from any other node of the system, of the entire functionality of the system, computer-readable medium, apparatus, including only those functionalities performed at that server or client or node.
  • Conversely, features of the invention, including method steps, which are described for brevity in the context of a single embodiment or in a certain order may be provided separately or in any suitable subcombination or in a different order. “e.g.” is used herein in the sense of a specific example which is not intended to be limiting. Devices, apparatus or systems shown coupled in any of the drawings may in fact be integrated into a single platform in certain embodiments or may be coupled via any appropriate wired or wireless coupling such as but not limited to optical fiber, Ethernet, Wireless LAN, HomePNA, power line communication, cell phone, PDA, Blackberry GPRS, Satellite including GPS, or other mobile delivery. It is appreciated that in the description and drawings shown and described herein, functionalities described or illustrated as systems and sub-units thereof can also be provided as methods and steps therewithin, and functionalities described or illustrated as methods and steps therewithin can also be provided as systems and sub-units thereof. The scale used to illustrate various elements in the drawings is merely exemplary and/or appropriate for clarity of presentation and is not intended to be limiting.

Claims (20)

1. A method for computerized identification of themes in a large data set, the method comprising:
reducing the number of data set members in a large data set, using at least one computerized data set member pruning technique other than random selection; and
using a computerized theme identification technique for identifying a plurality of themes in the reduced data set.
2. A method according to claim 1 wherein said computerized data set member pruning technique comprises thinning out at least one document which passes a document similarity criterion relative to at least one other document not being thinned out, thereby to combat skewing as a result of over-influence of similar, hence over-represented, documents upon said theme identification technique.
3. A method according to claim 2 wherein said thinning out at least one document which passes a document similarity criterion comprises replacing a plurality of emails forming an email thread, with at least one inclusive email, thereby to thin out emails which are included in said inclusive email hence are deemed to pass the document similarity criterion with regard to said inclusive email.
4. A method according to claim 2 wherein said thinning out at least one document which passes a document similarity criterion comprises identifying and discarding near-duplicates thereby to thin out at least one document which is deemed to pass the document similarity criterion with regard to a set of near-duplicates of said document, at least one of which is not being thinned out.
5. A method according to claim 1 wherein said computerized theme identification technique comprises topic modeling.
6. A method according to claim 5 wherein said topic modeling allows documents to have a plurality of topics.
7. A browsing system operative in conjunction with a stored representation of a multiplicity of electronic documents and their distribution over a plurality of themes, the system comprising:
theme-to-document flitting apparatus for retrieving and presenting to a user, documents whose document score for at least one user-selected theme is high; and
document-level browsing apparatus for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes.
8. A system according to claim 7 and also comprising:
theme-to-word flitting apparatus for retrieving and presenting to a user, words whose word score for at least one user-selected theme is high;
word-level browsing apparatus for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes,
thereby to provide 3-tier browsing apparatus facilitating browsing at word, document and topic levels responsive to user-initiated flitting between the levels.
9. A method according to claim 1 and also comprising:
facilitating theme-to-word flitting by retrieving and presenting to a user, words whose word score for at least one user-selected theme is high.
10. A method according to claim 1 and also comprising:
facilitating theme-to-document flitting for retrieving and presenting to a user, documents whose document score for at least one user-selected theme is high.
11. A method according to claim 1 and also comprising:
facilitating document-level browsing for retrieving and presenting to a user, documents whose distributions over the plurality of themes are similar to the distribution of a user-selected document over the plurality of themes.
12. A method according to claim 1 and also comprising:
facilitating word-level browsing for retrieving and presenting to a user, words whose distributions over the plurality of themes are similar to the distribution of a user-selected word over the plurality of themes.
13. A method according to claim 1 wherein the number of data set members in the large data set is further reduced subsequent to said using step and prior to a manual review process.
14. A method according to claim 1 wherein said reducing is effected using:
random selection; and
at least one computerized data set member pruning technique other than random selection.
15. A method according to claim 14 wherein said random selection is performed after said computerized data set member pruning technique.
16. A method according to claim 14 wherein said random selection is performed before said computerized data set member pruning technique.
17. A method according to claim 5 wherein said topic modeling which allows documents to have a plurality of topics comprises one of the following computerized techniques: Latent Dirichlet allocation (LDA), PLSI, and Pachinko allocation.
18. A method according to claim 3 wherein said thinning out at least one document which passes a document similarity criterion comprises replacing a plurality of emails forming an email thread, with a single inclusive email.
19. A method according to claim 4 wherein said identifying and discarding near-duplicates is effected using Equivio Zoom near-duplicate functionality.
20. A method according to claim 9 and wherein said facilitating comprises retrieving and presenting to a user, only those words whose word score for at least one user-selected theme is high and which answer to at least one additional criterion.
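The browsing operations recited in claims 7–12 can be illustrated with a small sketch: given each document's distribution over themes (e.g., as produced by a topic model such as LDA, per claims 5–6 and 17), retrieve documents whose score for a user-selected theme is high (theme-to-document flitting), and documents whose theme distributions are similar to that of a user-selected document (document-level browsing). This is an illustrative sketch only, not the patented implementation; the function names, the cosine-similarity choice, and the toy distributions below are all assumptions.

```python
# Toy per-document theme distributions (each row sums to 1):
# 4 hypothetical documents, 3 themes. In practice these would come
# from a topic model such as LDA.
from math import sqrt

doc_theme = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.65, 0.25, 0.10],
    "doc_c": [0.10, 0.10, 0.80],
    "doc_d": [0.05, 0.15, 0.80],
}

def docs_for_theme(theme_idx, threshold=0.5):
    """Theme-to-document flitting: documents scoring high on the selected theme."""
    return sorted(d for d, dist in doc_theme.items() if dist[theme_idx] >= threshold)

def cosine(u, v):
    """Cosine similarity between two theme distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def similar_docs(doc_id, k=1):
    """Document-level browsing: the k documents with the most similar distribution."""
    ref = doc_theme[doc_id]
    others = [(d, cosine(ref, dist)) for d, dist in doc_theme.items() if d != doc_id]
    others.sort(key=lambda pair: -pair[1])
    return [d for d, _ in others[:k]]
```

For example, `docs_for_theme(2)` retrieves the documents dominated by the third theme, and `similar_docs("doc_a")` retrieves the document whose theme mixture most resembles doc_a's. The word-level operations of claims 8–9 and 12 follow the same pattern with per-word theme scores in place of per-document distributions.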
US14/161,159 2013-01-22 2014-01-22 System and method for computerized semantic processing of electronic documents including themes Abandoned US20140207782A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/161,159 US20140207782A1 (en) 2013-01-22 2014-01-22 System and method for computerized semantic processing of electronic documents including themes

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361755242P 2013-01-22 2013-01-22
US14/161,159 US20140207782A1 (en) 2013-01-22 2014-01-22 System and method for computerized semantic processing of electronic documents including themes

Publications (1)

Publication Number Publication Date
US20140207782A1 true US20140207782A1 (en) 2014-07-24

Family

ID=51208555

Family Applications (3)

Application Number Title Priority Date Filing Date
US14/062,233 Abandoned US20140207786A1 (en) 2013-01-22 2013-10-24 System and methods for computerized information governance of electronic documents
US14/161,221 Active 2034-05-03 US10002182B2 (en) 2013-01-22 2014-01-22 System and method for computerized identification and effective presentation of semantic themes occurring in a set of electronic documents
US14/161,159 Abandoned US20140207782A1 (en) 2013-01-22 2014-01-22 System and method for computerized semantic processing of electronic documents including themes

Family Applications Before (2)

Application Number Title Priority Date Filing Date
US14/062,233 Abandoned US20140207786A1 (en) 2013-01-22 2013-10-24 System and methods for computerized information governance of electronic documents
US14/161,221 Active 2034-05-03 US10002182B2 (en) 2013-01-22 2014-01-22 System and method for computerized identification and effective presentation of semantic themes occurring in a set of electronic documents

Country Status (1)

Country Link
US (3) US20140207786A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7933859B1 (en) 2010-05-25 2011-04-26 Recommind, Inc. Systems and methods for predictive coding
CN104731828B (en) * 2013-12-24 2017-12-05 华为技术有限公司 A kind of cross-cutting Documents Similarity computational methods and device
CN105893611B (en) * 2016-04-27 2020-04-07 南京邮电大学 Method for constructing interest topic semantic network facing social network
US10666792B1 (en) * 2016-07-22 2020-05-26 Pindrop Security, Inc. Apparatus and method for detecting new calls from a known robocaller and identifying relationships among telephone calls
US10558657B1 (en) 2016-09-19 2020-02-11 Amazon Technologies, Inc. Document content analysis based on topic modeling
US10255283B1 (en) * 2016-09-19 2019-04-09 Amazon Technologies, Inc. Document content analysis based on topic modeling
JP6946081B2 (en) * 2016-12-22 2021-10-06 キヤノン株式会社 Information processing equipment, information processing methods, programs
US10902066B2 (en) * 2018-07-23 2021-01-26 Open Text Holdings, Inc. Electronic discovery using predictive filtering

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6208988B1 (en) * 1998-06-01 2001-03-27 Bigchalk.Com, Inc. Method for identifying themes associated with a search query using metadata and for organizing documents responsive to the search query in accordance with the themes
US7158961B1 (en) * 2001-12-31 2007-01-02 Google, Inc. Methods and apparatus for estimating similarity
US20070156732A1 (en) * 2005-12-29 2007-07-05 Microsoft Corporation Automatic organization of documents through email clustering
US20090276467A1 (en) * 2008-04-30 2009-11-05 Scholtes Johannes C System and method for near and exact de-duplication of documents
US20090282086A1 (en) * 2006-06-29 2009-11-12 International Business Machines Corporation Method and system for low-redundancy e-mail handling
US20120095952A1 (en) * 2010-10-19 2012-04-19 Xerox Corporation Collapsed gibbs sampler for sparse topic models and discrete matrix factorization
US20120278266A1 (en) * 2011-04-28 2012-11-01 Kroll Ontrack, Inc. Electronic Review of Documents
US20140195518A1 (en) * 2013-01-04 2014-07-10 Opera Solutions, Llc System and Method for Data Mining Using Domain-Level Context
US20150046151A1 (en) * 2012-03-23 2015-02-12 Bae Systems Australia Limited System and method for identifying and visualising topics and themes in collections of documents

Family Cites Families (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7421395B1 (en) * 2000-02-18 2008-09-02 Microsoft Corporation System and method for producing unique account names
US20020065857A1 (en) * 2000-10-04 2002-05-30 Zbigniew Michalewicz System and method for analysis and clustering of documents for search engine
US20020111792A1 (en) * 2001-01-02 2002-08-15 Julius Cherny Document storage, retrieval and search systems and methods
US20070043774A1 (en) * 2001-06-27 2007-02-22 Inxight Software, Inc. Method and Apparatus for Incremental Computation of the Accuracy of a Categorization-by-Example System
US20030195937A1 (en) * 2002-04-16 2003-10-16 Kontact Software Inc. Intelligent message screening
DE10239321B3 (en) 2002-08-27 2004-04-08 Pari GmbH Spezialisten für effektive Inhalation Aerosol therapy device
US20040204939A1 (en) * 2002-10-17 2004-10-14 Daben Liu Systems and methods for speaker change detection
US7743061B2 (en) * 2002-11-12 2010-06-22 Proximate Technologies, Llc Document search method with interactively employed distance graphics display
US20050060643A1 (en) * 2003-08-25 2005-03-17 Miavia, Inc. Document similarity detection and classification system
CA2574554A1 (en) 2004-07-21 2006-01-26 Equivio Ltd. A method for determining near duplicate data objects
US9218623B2 (en) * 2005-12-28 2015-12-22 Palo Alto Research Center Incorporated System and method for providing private stable matchings
WO2007086059A2 (en) 2006-01-25 2007-08-02 Equivio Ltd. Determining near duplicate 'noisy' data objects
US7587418B2 (en) 2006-06-05 2009-09-08 International Business Machines Corporation System and method for effecting information governance
US7873640B2 (en) * 2007-03-27 2011-01-18 Adobe Systems Incorporated Semantic analysis documents to rank terms
GB2450546A (en) 2007-06-29 2008-12-31 Philip Giokas Metered dispensation of fluid
US9317593B2 (en) * 2007-10-05 2016-04-19 Fujitsu Limited Modeling topics using statistical distributions
AU2008255269A1 (en) 2008-02-05 2009-08-20 Nuix Pty. Ltd. Document comparison method and apparatus
US8359365B2 (en) 2008-02-11 2013-01-22 Nuix Pty Ltd Systems and methods for load-balancing by secondary processors in parallel document indexing
US8280886B2 (en) * 2008-02-13 2012-10-02 Fujitsu Limited Determining candidate terms related to terms of a query
US8209665B2 (en) * 2008-04-08 2012-06-26 Infosys Limited Identification of topics in source code
US8254698B2 (en) * 2009-04-02 2012-08-28 Check Point Software Technologies Ltd Methods for document-to-template matching for data-leak prevention
US8346685B1 (en) 2009-04-22 2013-01-01 Equivio Ltd. Computerized system for enhancing expert-based processes and methods useful in conjunction therewith
US8458154B2 (en) * 2009-08-14 2013-06-04 Buzzmetrics, Ltd. Methods and apparatus to classify text communications
US8392175B2 (en) * 2010-02-01 2013-03-05 Stratify, Inc. Phrase-based document clustering with automatic phrase extraction
US8745091B2 (en) * 2010-05-18 2014-06-03 Integro, Inc. Electronic document classification
US8316030B2 (en) * 2010-11-05 2012-11-20 Nextgen Datacom, Inc. Method and system for document classification or search using discrete words
US20120215749A1 (en) 2011-02-08 2012-08-23 Pierre Van Beneden System And Method For Managing Records Using Information Governance Policies
US20140164171A1 (en) * 2011-09-12 2014-06-12 Tian Lu System and method for automatic segmentation and matching of customers to vendible items
US9355170B2 (en) * 2012-11-27 2016-05-31 Hewlett Packard Enterprise Development Lp Causal topic miner
US20140207786A1 (en) 2013-01-22 2014-07-24 Equivio Ltd. System and methods for computerized information governance of electronic documents
US20150113388A1 (en) 2013-10-22 2015-04-23 Qualcomm Incorporated Method and apparatus for performing topic-relevance highlighting of electronic text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
USDOJ Adopts Equivio Technology for Near-Duplicate Detection and Email Thread Analysis, January 16, 2008, http://www.equivio.com/press_item.php?ID=14 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10002182B2 (en) 2013-01-22 2018-06-19 Microsoft Israel Research And Development (2002) Ltd System and method for computerized identification and effective presentation of semantic themes occurring in a set of electronic documents
US11010548B2 (en) 2016-07-15 2021-05-18 At&T Intellectual Property I, L.P. Data analytics system and methods for text data
US11663495B2 (en) 2016-07-15 2023-05-30 Intuit Inc. System and method for automatic learning of functions
US11049190B2 (en) 2016-07-15 2021-06-29 Intuit Inc. System and method for automatically generating calculations for fields in compliance forms
US11663677B2 (en) 2016-07-15 2023-05-30 Intuit Inc. System and method for automatically generating calculations for fields in compliance forms
US10725896B2 (en) 2016-07-15 2020-07-28 Intuit Inc. System and method for identifying a subset of total historical users of a document preparation system to represent a full set of test scenarios based on code coverage
US10275444B2 (en) 2016-07-15 2019-04-30 At&T Intellectual Property I, L.P. Data analytics system and methods for text data
US11520975B2 (en) 2016-07-15 2022-12-06 Intuit Inc. Lean parsing: a natural language processing system and method for parsing domain-specific languages
US20180053120A1 (en) * 2016-07-15 2018-02-22 Intuit Inc. System and method for identifying a subset of total historical users of a document preparation system to represent a full set of test scenarios based on statistical analysis
US11222266B2 (en) 2016-07-15 2022-01-11 Intuit Inc. System and method for automatic learning of functions
US10579721B2 (en) 2016-07-15 2020-03-03 Intuit Inc. Lean parsing: a natural language processing system and method for parsing domain-specific languages
US10642932B2 (en) 2016-07-15 2020-05-05 At&T Intellectual Property I, L.P. Data analytics system and methods for text data
US11797887B2 (en) * 2016-11-11 2023-10-24 International Business Machines Corporation Facilitating mapping of control policies to regulatory documents
US20210117794A1 (en) * 2016-11-11 2021-04-22 International Business Machines Corporation Facilitating mapping of control policies to regulatory documents
AU2018355543B2 (en) * 2017-10-27 2021-01-21 Intuit Inc. System and method for identifying a subset of total historical users of a document preparation system to represent a full set of test scenarios based on statistical analysis
US11068932B2 (en) * 2017-12-12 2021-07-20 Wal-Mart Stores, Inc. Systems and methods for processing or mining visitor interests from graphical user interfaces displaying referral websites
US11163956B1 (en) 2019-05-23 2021-11-02 Intuit Inc. System and method for recognizing domain specific named entities using domain specific word embeddings
US11687721B2 (en) 2019-05-23 2023-06-27 Intuit Inc. System and method for recognizing domain specific named entities using domain specific word embeddings
US11783128B2 (en) 2020-02-19 2023-10-10 Intuit Inc. Financial document text conversion to computer readable operations

Also Published As

Publication number Publication date
US10002182B2 (en) 2018-06-19
US20140207786A1 (en) 2014-07-24
US20140207783A1 (en) 2014-07-24

Similar Documents

Publication Publication Date Title
US20140207782A1 (en) System and method for computerized semantic processing of electronic documents including themes
US11645317B2 (en) Recommending topic clusters for unstructured text documents
Hoffart et al. Discovering emerging entities with ambiguous names
US11573996B2 (en) System and method for hierarchically organizing documents based on document portions
US10146862B2 (en) Context-based metadata generation and automatic annotation of electronic media in a computer network
WO2017097231A1 (en) Topic processing method and device
US20190318407A1 (en) Method for product search using the user-weighted, attribute-based, sort-ordering and system thereof
US20140214835A1 (en) System and method for automatically classifying documents
US10482146B2 (en) Systems and methods for automatic customization of content filtering
WO2017051425A1 (en) A computer-implemented method and system for analyzing and evaluating user reviews
US10747759B2 (en) System and method for conducting a textual data search
CA2956627A1 (en) System and engine for seeded clustering of news events
Shawon et al. Website classification using word based multiple n-gram models and random search oriented feature parameters
US20190213208A1 (en) Interactive patent visualization systems and methods
Tsarev et al. Supervised and unsupervised text classification via generic summarization
Dangre et al. System for Marathi news clustering
Wang et al. High-level semantic image annotation based on hot Internet topics
US20210240334A1 (en) Interactive patent visualization systems and methods
Zhang et al. A semantics-based method for clustering of Chinese web search results
Abuoda et al. Automatic Tag Recommendation for the UN Humanitarian Data Exchange.
Jing Searching for economic effects of user specified events based on topic modelling and event reference
Choudhury et al. Content-based and link-based methods for categorical webpage classification
Afolabi et al. Topic Modelling for Research Perception: Techniques, Processes and a Case Study
Thaker et al. A novel approach for extracting and combining relevant web content
Bartolome et al. Clustering and Topic Analysis in CS 5604 Information Retrieval Fall 2016

Legal Events

Date Code Title Description
AS Assignment

Owner name: EQUIVIO LTD., ISRAEL

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:RAVID, YIFTACH;REEL/FRAME:033052/0402

Effective date: 20140325

AS Assignment

Owner name: MICROSOFT ISRAEL RESEARCH AND DEVELOPMENT (2002) LTD

Free format text: MERGER;ASSIGNOR:EQUIVIO LTD;REEL/FRAME:039495/0216

Effective date: 20160221

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION