WO2015125088A1 - Document characterization method - Google Patents
Document characterization method
- Publication number
- WO2015125088A1 (PCT/IB2015/051239)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- class
- text
- rules
- characterization
- Prior art date
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
Definitions
- The present invention relates to natural language processing focused on identifying relevant information in the contents of an input document in digital format, also defined as automatic document characterization, in a step-by-step process in which each step uses the information already obtained to search for and define the next data.
- The state of the art discloses several methods and techniques related to the classification or characterization of documents according to their content.
- Patent US8380718 discloses a system and method for grouping similar documents. Frequencies of occurrences are determined for terms and noun phrases within a set of documents. A sub-group of the documents is selected by removing those documents having terms and noun phrases that fall outside a bounded range of upper and lower conditions for frequency of occurrence.
- European patent EP1365329 (B1) discloses a classification method wherein one document is classified into at least one document class by selecting terms for use in the classification from among terms that occur in the document. A similarity between the input document and each class is calculated using information saved for every document class. The calculated similarity to each class is corrected. The class to which the input document belongs is determined in accordance with the corrected similarity to each class.
- Document US2013110843 discloses a method and system for classifying insurance files for identification, sorting and efficient collection of subrogation claims.
- That invention determines whether an insurance claim has enough merit to warrant claim recovery efforts, utilizing software code for partially describing a set of documents having unstructured and structured file data containing terms and phrases having contextual bases; code for transforming the terms and phrases; code for iterating a classification process to determine rules that best classify the set of documents based upon context; code for incorporating the rules into an induction and knowledge representation, thesauri taxonomies and text summarization to classify subrogation claims; and code for calculating a base score and a concept vector to identify the selected claims that demonstrate a given probability of subrogation recovery.
- The present invention relates to natural language processing focused on identifying specific information elements in the contents of an input document in digital format, also defined as automatic document characterization, in a step-by-step process in which each step uses the information already obtained to search for and define the next data.
- This text processing approach targets unstructured documents: documents whose relevant information elements must be identified in contents that do not follow a specific layout, order, or syntax. Unstructured documents, such as legal documents, do not present a specific sequence of elements, and the same content can be expressed in different ways depending on the language used.
- The information elements that are searched for in the document allow specific characteristics of the document to be determined. This is the case for document classes or categories and other associated characteristics, such as dates, names of natural or legal persons mentioned in the text, and other relevant information mentioned in the text that is related to the document class or multiclass.
- The method for automatic document characterization described here includes document classification, which is a sequential process for associating documents to pre-defined document classes, and has been addressed by different methods and techniques, such as naive Bayes and Support Vector Machines, among others. However, these methods are not well suited to the natural overlapping of classes, which in this type of document is not only feasible but expected. Also, these methods, except one, focus on specific keywords and do not easily treat different language-dependent expressions as equivalent concepts or ideas.
- The process for automatic document characterization processes the document contents five times (OCR, Class and Multiclass determination, Names detection, Document class-related characterization, Document date detection), since a different approach is needed for each of the information elements.
- The first revision aims to refine the input text, acknowledging that the results of the Optical Character Recognition (OCR) might split whole words by inserting spurious spaces and line breaks.
- A dictionary of valid words in a given language is used to validate word merges; split words are thus corrected, which makes the remaining revisions easier.
- The second revision focuses on the document classes: it searches for semantic rules that might occur in the text and, according to the rules that occur, determines one or more classes that might apply.
- The third revision focuses on the names referenced in the contents.
- This revision uses a list of relevant keywords: keywords that might be included at the beginning or at the end of a name, and keywords that might precede or follow a name without being part of it.
- The relevance of each of the defined keywords can be associated with the main class assigned to the document in the previous step, thereby allowing a fine-tuned detection process.
- The fourth revision is related to the detailed characterization of the document, based on the class or multiclass that has been previously defined. Given this class or multiclass, there is a list of characteristics that a document has. As an example, in the case of a real estate lease agreement, it is useful to know the "lease rent" or the "leasable area", so this information must be searched for.
- The final revision searches only for valid dates, but also considers some of the preceding text and the information already defined to help determine the issue date. Each class may have one or more relevant dates, for example the date on which a lease agreement commences and the date on which it ends. At this step, the dates to search for will be pre-established, since the class of the document will be known.
- This invention aims to produce a much faster, more complete and more accurate characterization result than the manual characterization that a legal expert may provide by reading the document contents.
- This speed and accuracy comparison is feasible when the process is implemented as a computing system that users can access for document characterization.
- The accuracy of the characterization is supported by a rule calibration process that takes into account a set of previously classified documents and adjusts each rule's coefficients to match these documents' classes as closely as possible, which allows this accuracy to improve continuously.
- FIG. 1 is a block diagram that represents the document characterization process
- FIG. 2a is a block diagram that represents the Class and Multiclass determination process
- FIG. 2b is a block diagram that represents the Class and Multiclass determination process with essential rules
- FIG. 3 is a block diagram that represents an example of an embodiment of the invention. In this particular scenario, the "Existing purchase" rule was used as an example.
- FIG. 4 is a block diagram that represents the Class and Multiclass calibration process.
- FIG. 5 is a block diagram that represents the Names detection process.
- FIG. 6 is a block diagram that represents the Document class-related characterization.
- FIG. 7 is a block diagram that represents the Document date detection process.
- FIG. 8 is a block diagram that describes a computer system for executing the document characterization, from the reception of the document to be characterized, the retrieval of words from the dictionary and text rules from the database, to the final display of resulting characteristics of the provided document.
- The first step is the revision of the text, with the recognition of individual words 101 by Optical Character Recognition (OCR).
- The Determination of Class or Multiclass 102 is then made, aligned with a Class or Multiclass rule calibration process 102a, an iterative process that improves the rule score definition and yields a better classification.
- A names detection process 103 is then completed, with the objective of obtaining the relevant names of people and organizations in the document.
- This process 103 takes into consideration the class with which the document has been associated, to make the search 105 more precise, considering a relevant set of keywords and expressions 103a to identify the names contained in the document.
- A document class-related characterization 104 then searches for information taking into consideration the class previously defined.
- The final step consists of the search and definition of the document's valid issue date and other relevant dates 106, which, like the search for names, takes into consideration the class that has been defined.
- This final document characterization 107 includes dates, names of natural or legal persons mentioned in the text and other relevant information mentioned in the text, producing a detailed description of the document quickly and accurately. This speed and accuracy comparison is feasible when the process is implemented as a computing system that users can access for document characterization.
- The document text is reviewed, recognizing individual words obtained through an Optical Character Recognition (OCR) 101 processing means. If the document originates from a word processor, this refinement process may produce just a minimal difference or no difference at all.
- Each individual word is concatenated with the next one and then compared to the list of valid words in the language of the document, stored in the dictionary 101a. If the concatenated word is found, it is kept for the following processes.
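- The merge step can be illustrated with a minimal sketch. The function name `merge_split_words`, the sample dictionary `VALID_WORDS` and the choice to skip past a merged pair are assumptions made for illustration; the text only states that each word is concatenated with the next one and kept when the result is a valid dictionary word.

```python
def merge_split_words(tokens, dictionary):
    """Rejoin words that OCR split in two, validating each merge against a dictionary."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            candidate = tokens[i] + tokens[i + 1]
            # Keep the concatenation only when it is a valid word (assumed lowercase dictionary).
            if candidate.lower() in dictionary:
                merged.append(candidate)
                i += 2  # both fragments are consumed by the merge
                continue
        merged.append(tokens[i])
        i += 1
    return merged


# Hypothetical dictionary and OCR output, for illustration only.
VALID_WORDS = {"the", "agreement", "is", "signed", "lease"}
print(merge_split_words(["the", "agree", "ment", "is", "signed"], VALID_WORDS))
# ['the', 'agreement', 'is', 'signed']
```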
- The process depends on a plurality of rules defined by means of the specification of logic text-related rules 301, each of them defining a set of keywords that represent equivalent verbal expressions 301a. These expressions are described by keywords, synonyms, conjugations, different numbers of words and/or equivalent verbalizations 301a of the same concept, also considering different levels of separation between these words, thus allowing more flexibility in text rule detection (2a.201 and 2b.201).
- Each rule may be related to one or more classes (2a.202 and 2b.202a), but its relevance for each class is determined by a relevance coefficient, which may have a positive or negative value 305.
- A rule can be considered essential, in which case the coefficient should be high, and the absence of this rule might even lead to a penalty for the document class in which the rule is considered essential 2b.202b.
- FIG. 2b illustrates the duality of an essential rule 2b.202b, due to the different values that it may have for different classes.
- The "existing purchase" rule may be relevant to indicate that the document is effectively a purchase document.
- A document such as a real estate lease agreement won't have this rule as essential, and the weight that this rule has in the document's total score may be very low (2a.203 and 2b.203).
- The rule "existing purchase" 301 applies to any type of purchase, such as real estate, chattels or shares.
- This rule may be found in slightly different ways 301a.
- A possible expression that relates to this rule is "... the parties agree to subscribe the present purchase agreement" 301b.
- An alternative expression is "... the seller sells ... and the buyer buys ..." 301b, regardless of the item being considered in the sale.
- The methodology for all the other rules works in a similar way, implementing a set of keywords that represent equivalent terms (302 and 304).
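- One way to picture such a rule is as a list of equivalent expressions, each expression being a sequence of keyword sets with a maximum separation between consecutive keywords. The sketch below is an assumption made for illustration, not the data model of the invention: the names `Expression`, `Rule`, `max_gap` and `rule_applies`, the keyword sets and the greedy left-to-right matching are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Expression:
    keyword_sets: list  # one set of equivalent words (synonyms, conjugations) per position
    max_gap: int = 3    # words allowed between two consecutive keywords


@dataclass
class Rule:
    name: str
    expressions: list   # equivalent verbalizations of the same concept


def expression_matches(expr, words):
    """Greedy scan: every keyword set must be hit in order, within max_gap words."""
    for start, word in enumerate(words):
        if word not in expr.keyword_sets[0]:
            continue
        pos, matched = start, True
        for keywords in expr.keyword_sets[1:]:
            window = words[pos + 1 : pos + 2 + expr.max_gap]
            hit = next((k for k, w in enumerate(window) if w in keywords), None)
            if hit is None:
                matched = False
                break
            pos += 1 + hit
        if matched:
            return True
    return False


def rule_applies(rule, text):
    words = re.findall(r"[a-z]+", text.lower())
    return any(expression_matches(expr, words) for expr in rule.expressions)


# Hypothetical encoding of the "existing purchase" rule discussed above.
EXISTING_PURCHASE = Rule("existing purchase", [
    Expression([{"purchase"}, {"agreement", "contract"}], max_gap=0),
    Expression([{"seller"}, {"sells", "sold"}, {"buyer"}, {"buys", "bought"}], max_gap=4),
])

print(rule_applies(EXISTING_PURCHASE, "The seller sells the property and the buyer buys it."))
# True
```

- The greedy scan takes the first keyword hit inside each window and does not backtrack, which keeps the sketch short; a fuller matcher would need to try alternative hits.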
- A different type of contract is the lease agreement.
- The main rule that is found in this kind of agreement is "existing lease", which can be found as
- The classification process searches the document from beginning to end, identifying each word individually and comparing it to the rules defined 301a. If the word matches any of the words of any of the rules, the following text is processed to complete each of the candidate rules being evaluated, while each new word is also compared with the other rules 301b.
- The classification process ends with a set of rules found to apply to each document (2a.202 and 2b.202). Then, the final classification score for each document class is calculated by this formula 305.
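- The formula itself is not reproduced in this text, so the following is only an assumed scoring scheme consistent with the description: each class sums the relevance coefficients of the rules found, subtracts a penalty for every essential rule that is absent, and a document receives every class whose score reaches a threshold (which permits multiclass results). The coefficient values, the `penalty` and the `threshold` are hypothetical.

```python
def class_scores(found_rules, coefficients, essential, penalty=5.0):
    """Assumed scoring: weighted sum of found rules minus a penalty per missing essential rule."""
    scores = {}
    for cls, rule_weights in coefficients.items():
        score = sum(w for rule, w in rule_weights.items() if rule in found_rules)
        missing = essential.get(cls, set()) - found_rules
        scores[cls] = score - penalty * len(missing)
    return scores


def assign_classes(scores, threshold=1.0):
    """Every class at or above the threshold is assigned, so classes may overlap."""
    return sorted(cls for cls, score in scores.items() if score >= threshold)


# Hypothetical relevance coefficients (positive or negative) and essential rules.
COEFFICIENTS = {
    "purchase contract": {"existing purchase": 4.0, "existing lease": -2.0},
    "lease agreement": {"existing lease": 4.0, "existing purchase": 0.5},
}
ESSENTIAL = {
    "purchase contract": {"existing purchase"},
    "lease agreement": {"existing lease"},
}

scores = class_scores({"existing purchase"}, COEFFICIENTS, ESSENTIAL)
print(scores)                  # {'purchase contract': 4.0, 'lease agreement': -4.5}
print(assign_classes(scores))  # ['purchase contract']
```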
- This detection, using the detection means, is based on the fact that documents mention persons' names preceded by some specific verbal expressions 502a, such as "Mr." or "Mrs.", and that the name might be followed by a personal ID reference, or by a verb that specifies the action for which this person is being referenced.
- Preceding text might also include equivalent expressions, and the name usually ends with "LLC", "Inc." or other company-type acronyms usually found in legal documents.
- A document associated with the "purchase contract" document class, which is determined by the occurrence of the rule "existing purchase" and others, may include expressions such as "... as the seller" or "... also known as the buyer" as text that follows the names in the document.
- These additional document class-dependent rules improve complete name detection, especially the detection of the end of a name. Taking into consideration the previously defined class, it is also possible to recognize names in the document as either relevant or non-relevant. The type of class previously defined has a large impact on the way the mentioned process 502b is carried out.
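- A minimal sketch of keyword-driven name detection follows. The regular expressions, the `CLASS_ROLES` table and the function `detect_names` are assumptions for illustration; they cover only the example keywords quoted above ("Mr.", "Mrs.", "LLC", "Inc.", "as the seller", "also known as the buyer").

```python
import re

# One or more capitalized words (assumed shape of a name).
NAME = r"[A-Z][\w&'-]*(?:\s+[A-Z][\w&'-]*)*"
# Keywords that precede a person's name without being part of it.
PERSON = re.compile(r"\b(?:Mr\.|Mrs\.)\s+(" + NAME + r")")
# Keywords that close an organization's name and are part of it.
COMPANY = re.compile(r"\b(" + NAME + r",?\s+(?:LLC|Inc\.))")

# Hypothetical class-dependent expressions that may follow a name.
CLASS_ROLES = {"purchase contract": ["seller", "buyer"]}


def detect_names(text, doc_class):
    """Return (name, kind, role) tuples; role comes from class-dependent trailing text."""
    roles = CLASS_ROLES.get(doc_class, [])
    results = []
    for kind, pattern in (("person", PERSON), ("organization", COMPANY)):
        for m in pattern.finditer(text):
            role = None
            if roles:
                following = re.match(
                    r",?\s+(?:also known as|as)\s+the\s+(" + "|".join(roles) + r")\b",
                    text[m.end():],
                )
                if following:
                    role = following.group(1)
            results.append((m.group(1), kind, role))
    return results


sample = ("This agreement is made between Mr. John Smith, as the seller, "
          "and Acme Holdings LLC, also known as the buyer.")
print(detect_names(sample, "purchase contract"))
# [('John Smith', 'person', 'seller'), ('Acme Holdings LLC', 'organization', 'buyer')]
```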
- This characterization refers to the detection of other specific data from the document, such as numeric amounts and proper names, among others. In some cases it is very useful to obtain particular information from a document, such as the lease rent or the leasable area in a real estate lease agreement.
- Once the Class or Multiclass has been defined 602a and the names search is finished 601, the list of characteristics of a particular document is defined. Given that list, the information in the document is searched 602. This is of great utility to the characterization 603 of the document, since it takes the class of the document into consideration to build a complete characterization.
- An example of this definition is the real estate lease agreement 603a.
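- As a sketch of this class-driven search, a table can map each document class to the characteristics expected for it, each with a search pattern. The table `CLASS_CHARACTERISTICS`, its regular expressions and the function `characterize` are hypothetical and cover only the lease rent and leasable area mentioned above.

```python
import re

# Hypothetical list of characteristics per document class, each with a search pattern.
CLASS_CHARACTERISTICS = {
    "real estate lease agreement": {
        "lease rent": re.compile(
            r"rent\s+of\s+(?:USD|\$)\s?([\d,]+(?:\.\d+)?)", re.I),
        "leasable area": re.compile(
            r"leasable\s+area\s+of\s+([\d,]+(?:\.\d+)?)\s*(?:square\s+meters|m2)", re.I),
    }
}


def characterize(text, doc_class):
    """Search the text only for the characteristics defined for the given class."""
    found = {}
    for name, pattern in CLASS_CHARACTERISTICS.get(doc_class, {}).items():
        m = pattern.search(text)
        found[name] = m.group(1) if m else None
    return found


sample = ("The tenant shall pay a monthly rent of USD 2,500 "
          "for a leasable area of 120 square meters.")
print(characterize(sample, "real estate lease agreement"))
# {'lease rent': '2,500', 'leasable area': '120'}
```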
- The final step, detecting the document issue date, considers that more than one valid date may be included in the document, but only one of them can be considered the document issue date. This detection also takes into account the existence of many verbal forms to describe a date 701.
- The document date detection process is performed by recognizing each valid date and calculating a score through the detection means 702, which depends on the following criteria: the document issue date must be prior to the current date; it must be complete, which means it should include day, month, and year; and it can be preceded by an expression that references a location.
- The definitive document issue date is the one that gets the highest score, or the first one in case of a tie 703. This process, the techniques to detect valid dates in the document, and the list of keywords that allow determination of the document issue date are configurable in an application.
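- The scoring and tie-break can be sketched as follows. The equal weight of one point per criterion, the single "day Month year" date form and the location pattern are assumptions for illustration; the text does not give the weights, and it notes that many verbal forms of a date exist.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Only one verbal form is handled here: optional day, month name, four-digit year.
DATE = re.compile(r"\b(?:(\d{1,2})\s+)?(" + "|".join(MONTHS) + r")\s+(\d{4})\b", re.I)
# Hypothetical location expression right before the date, e.g. "In Santiago, on ".
LOCATION_BEFORE = re.compile(r"\b(?:[Ii]n|[Aa]t)\s+[A-Z][a-z]+,\s*(?:on\s+)?$")


def detect_issue_date(text, today):
    """Score every valid date; the highest score wins and the first one wins a tie."""
    best = None
    for m in DATE.finditer(text):
        day, month, year = m.group(1), MONTHS[m.group(2).lower()], int(m.group(3))
        try:
            candidate = date(year, month, int(day) if day else 1)
        except ValueError:
            continue  # not a valid calendar date
        score = 0
        score += 1 if day is not None else 0          # complete: day, month and year
        score += 1 if candidate < today else 0        # prior to the current date
        score += 1 if LOCATION_BEFORE.search(text[:m.start()]) else 0
        if best is None or score > best[0]:           # strict ">" keeps the first on a tie
            best = (score, candidate)
    return best[1] if best else None


sample = ("In Santiago, on 5 March 2014, the parties agree that the lease "
          "commences on 1 April 2014 and ends on 31 March 2019.")
print(detect_issue_date(sample, today=date(2015, 2, 18)))
# 2014-03-05
```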
- Given that the Class or Multiclass of the document has been defined 704, one or more dates that are relevant in the document must then be searched for.
- The rules used previously to define the document are used in the search for the key dates, since in some cases the text concerning a rule has relevant information nearby.
- The relevant dates of the document will already be defined in a pre-established list. The search for these dates is then made 705, completing the document characterization process 706.
- This process uses a Genetic Algorithm (404), also known as GA, with an objective function that defines the optimality of a specific combination of coefficients assigned to the rules. The objective function is based on measuring the differences between the classes pre-assigned to the training set documents 401 and the classes assigned automatically with a given candidate combination of coefficients for each rule and each document. The process then iteratively narrows this difference to arrive at a final improved set of coefficients 405.
- The improvement of the set of coefficients is a constant and iterative process, which enables the legal documents to be characterized correctly.
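- A toy version of such a calibration is sketched below. Everything specific in it is an assumption: the two rules and classes, the tiny training set, single-class prediction by highest score, uniform crossover, Gaussian mutation, and keeping the best half of the population each generation. The objective simply counts training documents whose automatically assigned class differs from the pre-assigned one.

```python
import random

RULES = ["existing purchase", "existing lease"]
CLASSES = ["purchase contract", "lease agreement"]

# Hypothetical training set: (rules found in the document, pre-assigned class).
TRAINING = [
    ({"existing purchase"}, "purchase contract"),
    ({"existing lease"}, "lease agreement"),
    ({"existing purchase", "existing lease"}, "lease agreement"),
]


def predict(coeffs, found):
    # coeffs[c][r] is the relevance coefficient of rule r for class c.
    scores = [sum(coeffs[c][r] for r, rule in enumerate(RULES) if rule in found)
              for c in range(len(CLASSES))]
    return CLASSES[scores.index(max(scores))]


def errors(coeffs):
    """Objective: number of training documents classified differently from their label."""
    return sum(predict(coeffs, found) != label for found, label in TRAINING)


def random_individual(rng):
    return [[rng.uniform(-5, 5) for _ in RULES] for _ in CLASSES]


def crossover(a, b, rng):
    return [[rng.choice(pair) for pair in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def mutate(ind, rng, rate=0.2):
    return [[w + rng.gauss(0, 1) if rng.random() < rate else w for w in row]
            for row in ind]


def calibrate(generations=50, size=20, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=errors)
        if errors(population[0]) == 0:
            break
        parents = population[: size // 2]  # the best half survives unchanged
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(size - len(parents))]
        population = parents + children
    return min(population, key=errors)


best = calibrate()
print(errors(best), "training documents still misclassified")
```

- Because the best half of each generation is carried over unchanged, the error of the best candidate never increases from one generation to the next.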
- FIG. 8 illustrates an exemplary embodiment of a computing system based on a Web server 800 that stores and executes an embodiment of the present invention.
- This Web server stores an application that implements the characterization method and an application for OCR 803 that will process the documents received for characterization.
- The Web application 801 is the main application that users access to execute both the OCR and Characterization processes, by connecting from a user computer 808 over a network 807 and using this Web application to characterize documents 809 stored on their own computer's local disk.
- The Web server reacts to a user request for document characterization by receiving the document, processing it by OCR, loading the dictionary of valid words 802, stored in a file on the server's local disk, into local memory 804, and retrieving the rules collection and predefined document classes stored in a database 806 accessible to the server.
- The characterization process is then executed by the CPU 805, resulting in a Web page with the information obtained by the characterization of the supplied document, which is sent back to the user's computer and displayed on its screen or display.
- The calibration process operates in a similar way.
- An administrative user loads several pre-characterized documents to the Web server and executes the calibration process through another option in the same Web application.
- This calibration process finally modifies the coefficients of the rules, updating them in the database.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201461941002P | 2014-02-18 | 2014-02-18 | |
US61/941,002 | 2014-02-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015125088A1 (en) | 2015-08-27 |
Family
ID=53877689
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2015/051239 WO2015125088A1 (en) | 2014-02-18 | 2015-02-18 | Document characterization method |
Country Status (3)
Country | Link |
---|---|
CL (1) | CL2016002090A1 (en) |
PE (1) | PE20161166A1 (en) |
WO (1) | WO2015125088A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11003889B2 (en) | 2018-10-22 | 2021-05-11 | International Business Machines Corporation | Classifying digital documents in multi-document transactions based on signatory role analysis |
US11017221B2 (en) | 2018-07-01 | 2021-05-25 | International Business Machines Corporation | Classifying digital documents in multi-document transactions based on embedded dates |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020083090A1 (en) * | 2000-12-27 | 2002-06-27 | Jeffrey Scott R. | Document management system |
US20040205448A1 (en) * | 2001-08-13 | 2004-10-14 | Grefenstette Gregory T. | Meta-document management system with document identifiers |
US20070206884A1 (en) * | 2006-03-03 | 2007-09-06 | Masahiro Kato | Image processing apparatus, recording medium, computer data signal, and image processing method |
US20100046842A1 (en) * | 2008-08-19 | 2010-02-25 | Conwell William Y | Methods and Systems for Content Processing |
-
2015
- 2015-02-18 PE PE2016001498A patent/PE20161166A1/en not_active Application Discontinuation
- 2015-02-18 WO PCT/IB2015/051239 patent/WO2015125088A1/en active Application Filing
-
2016
- 2016-08-18 CL CL2016002090A patent/CL2016002090A1/en unknown
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11017221B2 (en) | 2018-07-01 | 2021-05-25 | International Business Machines Corporation | Classifying digital documents in multi-document transactions based on embedded dates |
US11810070B2 (en) | 2018-07-01 | 2023-11-07 | International Business Machines Corporation | Classifying digital documents in multi-document transactions based on embedded dates |
US11003889B2 (en) | 2018-10-22 | 2021-05-11 | International Business Machines Corporation | Classifying digital documents in multi-document transactions based on signatory role analysis |
US11769014B2 (en) | 2018-10-22 | 2023-09-26 | International Business Machines Corporation | Classifying digital documents in multi-document transactions based on signatory role analysis |
Also Published As
Publication number | Publication date |
---|---|
CL2016002090A1 (en) | 2016-12-30 |
PE20161166A1 (en) | 2016-10-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Scaffidi et al. | Red Opal: product-feature scoring from reviews | |
US11663254B2 (en) | System and engine for seeded clustering of news events | |
US20210109958A1 (en) | Conceptual, contextual, and semantic-based research system and method | |
US10891699B2 (en) | System and method in support of digital document analysis | |
US11514096B2 (en) | Natural language processing for entity resolution | |
US7783629B2 (en) | Training a ranking component | |
US8983963B2 (en) | Techniques for comparing and clustering documents | |
US8478052B1 (en) | Image classification | |
US8355997B2 (en) | Method and system for developing a classification tool | |
US9734192B2 (en) | Producing sentiment-aware results from a search query | |
US20040049499A1 (en) | Document retrieval system and question answering system | |
US20130060769A1 (en) | System and method for identifying social media interactions | |
US20060217962A1 (en) | Information processing device, information processing method, program, and recording medium | |
KR101511656B1 (en) | Ascribing actionable attributes to data that describes a personal identity | |
KR20160030943A (en) | Performing an operation relative to tabular data based upon voice input | |
Ding et al. | Auto-categorization of HS code using background net approach | |
US20110219299A1 (en) | Method and system of providing completion suggestion to a partial linguistic element | |
JP2019530063A (en) | System and method for tagging electronic records | |
CN106997390A (en) | A kind of equipment part or parts commodity transaction information search method | |
KR20210047229A (en) | Recommendation System and METHOD Reflecting Purchase Criteria and Product Reviews Sentiment Analysis | |
KR20160149050A (en) | Apparatus and method for selecting a pure play company by using text mining | |
CA2956627A1 (en) | System and engine for seeded clustering of news events | |
CN113591476A (en) | Data label recommendation method based on machine learning | |
Wahyudi et al. | Topic modeling of online media news titles during COVID-19 emergency response in Indonesia using the latent dirichlet allocation (LDA) algorithm | |
WO2015125088A1 (en) | Document characterization method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15751446 Country of ref document: EP Kind code of ref document: A1 |
|
DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | ||
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 001498-2016 Country of ref document: PE |
|
WWE | Wipo information: entry into national phase |
Ref document number: NC2016/0001532 Country of ref document: CO |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15751446 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 240417) |