US20110258150A1 - Systems and methods for training document analysis system for automatically extracting data from documents


Info

Publication number
US20110258150A1
Authority
US
United States
Prior art keywords
data
document
image
documents
extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/007,430
Inventor
Depankar Neogi
Steven K. Ladd
Girish Welling
Arjun Kumar
Vartika Singh
Matthew Duggan
Tushar Mahata
Xiaobin Yang
Jian-Wu Xu
Janice O'Neil
Nirupam Sarkar
Gopal Krishna
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gruntworx LLC
Original Assignee
COPANION Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by COPANION Inc filed Critical COPANION Inc
Priority to US13/007,430 priority Critical patent/US20110258150A1/en
Priority to US13/166,966 priority patent/US20110249905A1/en
Assigned to COPANION, INC. reassignment COPANION, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: NEOGI, DEPANKAR, SARKAR, NIRUPAM, MAHATA, TUSHAR, YANG, XIAOBIN, DUGGAN, MATTHEW, WELLING, GIRISH, XU, Jian-wu, KUMAR, ARJUN, SINGH, VARTIKA, KRISHNA, GOPAL, LADD, STEVEN K., O'NEIL, JANICE
Publication of US20110258150A1 publication Critical patent/US20110258150A1/en
Assigned to GRUNTWORX, LLC reassignment GRUNTWORX, LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: COPANION, INC.
Abandoned legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • G06V30/182Extraction of features or characteristics of the image by coding the contour of the pattern
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/26Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Definitions

  • This invention relates generally to systems and methods to extract data from electronic documents, and more particularly to systems and methods for training a document analysis system for automatically extracting data from documents.
  • Data extraction can be primarily clerical in nature, such as in inputting information on customer survey forms. Data extraction can also be an essential portion of larger technical tasks, such as preparing income tax returns, processing healthcare records or handling insurance claims.
  • Electronic Data Interchange attempts to eliminate human processing efforts by coding and transmitting the document information in strictly formatted messages.
  • Electronic Data Interchange is known for custom computer systems, cumbersome software and bloated standards that defeated its rapid spread throughout the supply chain. Because EDI is perceived as too expensive, the vast majority of businesses have avoided implementing it.
  • applications of XML, XBRL and other computer-readable document files are quite limited compared to the use of documents in paper and digital image formats (such as PDF and TIFF.)
  • Outsourcing, the second method of data extraction, requires the same worker education, expertise, training, software knowledge and/or cultural understanding.
  • outsourced data extraction workers must recognize documents, find relevant information on the documents, extract and enter the data appropriately and accurately in particular software programs. Since outsourcing is manual, just as is conventional data extraction, it is also complex, time-consuming and error-prone.
  • Outsourcing firms such as Accenture, Datamatics, Hewlett Packard, IBM, Infosys, Tata, and Wipro, often reduce costs by offshoring data extraction work to locations with low wage data extraction workers. For example, extraction of data from US tax and financial documents is a function that has been implemented using thousands of well-educated, English-speaking workers in India and other low wage countries.
  • the first step of outsourcing requires organizations to scan financial, health, tax and/or other documents and save the resulting image files.
  • These image files can be accessed by data extraction workers via several methods.
  • One method stores the image files on the source organizations' computer systems; the data extraction workers view the image files over networks (such as the Internet or private networks.)
  • Another method stores the image files on third-party computers systems; the data extraction workers view the image files over networks.
  • An alternative method transmits the image files from source organizations over networks and stores the image files for viewing by the data extraction workers on the data extraction organizations' computer system.
  • an accountant may scan the various tax forms containing client financial data and transmit the scanned image files to an outsourcing firm.
  • An employee of the outsourcing firm extracts the client financial data and enters it into an income tax software program.
  • the resulting tax software data file is then transmitted back to the accountant.
  • Outsourcing and offshoring are accompanied with concerns over security risks associated with fraud and identity theft. These security concerns apply to employees and temporary workers as well as outsourced workers and offshore workers who have access to documents with sensitive information.
  • the third general method of data extraction involves partial automation, often combining optical character recognition, human inspection and workflow management software.
  • Automation requires customizing and/or programming data extraction software tools to properly recognize and process a specific set of documents for a specific domain. Because such customization projects often cost upwards of hundreds of thousands of dollars, data extraction automation is usually limited to large organizations that can afford significant capital investments.
  • the first step of a partially automated data extraction operation is to scan financial, health, tax and/or other documents and save the resulting image files.
  • the scanned images are compared to a database of known documents. Images that are not identified are routed to data extraction workers for conventional processing. Images that are identified have data extracted using templates, either location-based or label-based, along with optical character recognition (OCR) technology.
  • Optical character recognition is imperfect, often mistaking more than one percent of the characters on clean, high quality documents. Many documents are neither clean nor high quality, suffering from being folded or marred before scanning, distorted during scanning and degraded during post-scanning binarization. As a result, some of the labels needed to identify data are often not recognizable; therefore, some of the data cannot be automatically extracted.
  • Inspection requires workers with the same capabilities of data extraction workers, namely specific education, domain expertise, particular training, software knowledge and/or cultural understanding. Inspection workers must recognize documents, find relevant information on the documents and insure that the data has been accurately extracted and appropriately entered in particular software programs. Typically, any changes made by inspection workers must be reviewed and approved by other, more senior, inspection workers before replacing the data extracted by optical character recognition. Because automation requires human inspection, source documents with sensitive information are exposed in their entirety to data extraction workers.
  • the invention is directed to systems and methods for training a document analysis system for automatically extracting data from documents.
  • a method for training a document analysis system to automatically extract data from each document wherein the document analysis system receives and processes jobs from a plurality of users, in which each job may contain multiple electronic documents, to classify each document into a corresponding document category and to extract data from the electronic documents.
  • the method includes: automatically analyzing images and text features extracted from each received electronic document to associate the electronic document with a corresponding document category; and comparing the extracted text features with a set of text features associated with corresponding category of each received document, in which the set of text features includes a set of characters, words, and phrases.
  • the method further includes storing the extracted text features as the data contained in the corresponding electronic document. If, however, the extracted text features are found to include at least one text feature that does not belong to the set of text features associated with the corresponding electronic document category, the method further includes submitting the unrecognized text features to a training phase, in which the text features are recognized as belonging to the set of text features associated with the corresponding electronic document category, and then using the now-recognized text features to automatically modify that set of text features. In this way, the extracting of data, regardless of which document category the corresponding document belongs to, improves as the training method is subjected to more and more unrecognized text features and the sets of text features are modified accordingly.
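  • The training feedback loop described above can be sketched in a few lines of Python. This is illustrative only (the patent does not disclose an implementation); the category names, sample feature sets and helper names are hypothetical:

    # Hypothetical sketch of the training feedback loop described above.
    # known_features maps each document category to its set of known
    # characters, words and phrases; the contents are illustrative.
    known_features = {
        "W-2": {"wages", "employer", "social security"},
        "1099-INT": {"interest income", "payer"},
    }

    def process_document(category, extracted_features, training_queue):
        """Store recognized features; queue unrecognized ones for training."""
        known = known_features[category]
        unrecognized = [f for f in extracted_features if f not in known]
        if unrecognized:
            # Features not in the category's set go to the training phase.
            training_queue.append((category, unrecognized))
        return [f for f in extracted_features if f in known]

    def apply_training(category, newly_recognized):
        """Once training recognizes the features, grow the category's set."""
        known_features[category].update(newly_recognized)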
  • FIG. 1 is a system diagram of a document data extraction system 100 according to a preferred embodiment of the disclosed subject matter
  • FIG. 2 is a system diagram of the image capture system 110 according to a preferred embodiment of the disclosed subject matter
  • FIG. 3 is a system diagram of the web server system 120 according to a preferred embodiment of the disclosed subject matter
  • FIG. 4 is a system diagram of the document processing system 130 according to a preferred embodiment of the disclosed subject matter
  • FIG. 5 is a system diagram of the image processing system 422 according to a preferred embodiment of the disclosed subject matter
  • FIG. 6 is a system diagram of the classification system 432 according to a preferred embodiment of the disclosed subject matter.
  • FIG. 7 is a system diagram of the grouping system 442 according to a preferred embodiment of the disclosed subject matter.
  • FIG. 8 is a system diagram of the data extraction system 452 according to a preferred embodiment of the disclosed subject matter.
  • FIG. 9 is an illustration of a three-step document submission process according to a preferred embodiment of the disclosed subject matter.
  • FIG. 10 is an illustration of the nine types of point patterns according to a preferred embodiment of the disclosed subject matter.
  • FIG. 11 is an illustration of image processing prior to OCR according to a preferred embodiment of the disclosed subject matter
  • FIG. 12 is an illustration of a log polar histogram according to a preferred embodiment of the disclosed subject matter
  • FIG. 13 is a flow diagram of the service control manager 410 according to a preferred embodiment of the disclosed subject matter
  • FIG. 14 is an illustration of label contour matching according to a preferred embodiment of the disclosed subject matter.
  • FIG. 15 is a flow diagram of a CRK classifier according to a preferred embodiment of the disclosed subject matter.
  • FIG. 16 is a schematic of a CRK classifier according to a preferred embodiment of the disclosed subject matter.
  • FIG. 17 is an illustration of relative location matching of labels according to a preferred embodiment of the disclosed subject matter.
  • FIG. 18 is an exemplary computer system on which the described invention may run according to a preferred embodiment of the disclosed subject matter
  • FIG. 19 is an illustration of boxes containing labels and values
  • FIG. 20 is an illustration of check boxes
  • FIG. 21 is an illustration of address blocks
  • FIG. 22 is an illustration of an instruction block
  • FIG. 23 is an illustration of a table
  • FIG. 24 is an illustration of a multi-copy form
  • FIG. 25 is an illustration of an image with (A) confetti, (B) confetti with identified labels and (C) confetti with identified labels and labels with potential table headers grouped horizontally;
  • FIG. 26 is an illustration of a table with a header that needs reconstruction
  • FIG. 27 is an illustration of a table with an instruction block at the bottom
  • FIG. 28 is an illustration of a portion of a table with noise removed and most data correctly extracted
  • FIG. 29 is an illustration of row formation in a table
  • FIG. 30 is an illustration of column formation in a table
  • FIG. 31 is an illustration of header association for a table
  • FIG. 32 is an illustration of a table with extracted data viewed through a debug tool; note the incorrectly formed rows due to the “Corrected” overlay. Rows 1, 2, 3, 5, 6, 7, 8, and 9 are merged, but row 4 and the rest of the table were extracted properly;
  • FIG. 33 is an illustration of an image being extracted via a process of progressive refinement and reduced character set OCR
  • FIG. 34 is an illustration of an image being extracted via a process of progressive refinement based on increasing knowledge about the form
  • FIG. 35 is an illustration of an image being extracted via a process of progressive refinement based on utilizing knowledge gained from one form to extract data from another form;
  • FIG. 36 is an illustration of data external to the input image that is used to extract and verify data from the input image
  • FIG. 37 is an illustration of a form with an obscured label
  • FIG. 38 is an illustration of the data extracted from the form shown in FIG. 37 ;
  • FIG. 39 is an illustration of a form with a degraded image that results in incorrectly extracted data
  • FIG. 40 is an illustration of the data extracted from the form shown in FIG. 39 ;
  • FIG. 41 is an illustration of a portion of a W-2 form
  • FIG. 42 is an illustration of the internal representation of the data corresponding to the form in FIG. 41 as a partial layout graph
  • FIG. 43 is an illustration of the internal representation of the data corresponding to the form in FIG. 41 after labels are detected;
  • FIG. 44 is an illustration of the labels associated with a 1099-OID form
  • FIG. 45 is an illustration of a table
  • FIG. 46 is an illustration of the table shown in FIG. 45 with columns identified;
  • FIG. 47 is an illustration of the table shown in FIG. 45 with columns and labels identified;
  • FIG. 48 is an illustration of the table shown in FIG. 45 with columns, labels and header identified;
  • FIG. 49 is an illustration of the table shown in FIG. 45 with columns, labels, header and rows identified;
  • FIG. 50 is an illustration of four occurrences of image fields for “Wages, tips, other comp.” box on a single W-2 form;
  • FIG. 51 is an illustration of the data records corresponding to the image fields shown in FIG. 50 .
  • Preferred embodiments of the present invention provide a method and system for extracting data from paper and digital documents into a format that is searchable, editable and manageable.
  • FIG. 1 is a system diagram of a document data extraction system 100 according to a preferred embodiment of the invention.
  • System 100 has an image capture system 110, a web server system 120 and a document processing system 130.
  • the image capture system 110 is connected to the web server system 120 by a network such as a local-area network (LAN), a wide-area network (WAN) or the Internet.
  • the preferred implementation transfers all data over the network using Secure Sockets Layer (SSL) technology with enhanced 128-bit encryption.
  • Encryption certificates can be purchased from well respected certificate authorities such as VeriSign and thawte or can be generated by using numerous key generation tools in the market today, many of which are available as open source.
  • the files may be transferred over a non-secure network, albeit in a less secure manner.
  • the web server system 120 is connected to the document processing system 130 via software within a computer system.
  • Other embodiments of the invention may integrate the document processing system 130 with the image capture system 110. In this case, the web server system 120 is not necessary.
  • System 110 is an image capture system that receives physical documents and scans them.
  • the image capture system 110 is described in greater detail below.
  • System 120 is a web server system that receives the scanned documents and returns the extracted data over the Internet. Some embodiments of the invention may not have a web server system 120 .
  • the web server system 120 is described in greater detail below.
  • System 130 is a document processing system.
  • the document processing system 130 extracts the received data into files and databases per a predetermined scheme.
  • the document processing system 130 comprises several modules that are part of a highly distributed architecture consisting of several independent processes, data repositories and databases that communicate and pass messages to each other via well-defined standard and proprietary interfaces. Even though the document processing system 130 may be built in a loosely coupled manner to achieve maximum scalability and throughput, the same results can be achieved if the document processing system 130 were more tightly coupled in a single process, with each module being a logical entity of the same process. Furthermore, the document processing system 130 supports multiple different product types, which may process anywhere from hundreds to millions of documents every day for tens to thousands of customers in different markets.
  • the document processing system 130 utilizes server(s) hosted in a secure data center so that documents from healthcare, insurance, banking, government, tax and other applications are processed per security policies that are HIPAA, GLBA, SAS70, etc. compliant.
  • the document processing system 130 includes mechanisms for learning documents. The document processing system 130 is described in greater detail below.
  • FIG. 2 is a system diagram of the image capture system 110 according to a preferred embodiment of the invention.
  • System 110 has a scanning system 212 , a user interface system 222 , a data acquisition system 225 , a data transfer system 232 and an image pre-processing system 235 .
  • Source documents 210 in the form of papers are physically placed on an input tray of a commercial scanner.
  • Source documents in the form of data files are received over a network by the user interface system 222 .
  • the user interface system 222 communicates with the scanning system 212 via software within a computer system, or, optionally over a computer network.
  • the user interface system 222 may be part of the scanning system 212 in some embodiments of the image capture system 110 .
  • the user interface system 222 communicates with the data acquisition system 225 via software within a computer system.
  • the user interface system 222 communicates with the data transfer system 232 via software within a computer system.
  • the data acquisition system 225 communicates with the scanning system 212 via a physical connection, such as a high-speed Universal Serial Bus (USB) 2.0, or, optionally, over a network.
  • the data acquisition may also be part of the scanning system 212 in certain embodiments of the image capture system 110 .
  • the data acquisition system 225 communicates with the image pre-processing system 235 via software within a computer system.
  • the data transfer system 232 communicates with the image pre-processing system 235 via software within a computer system.
  • the data acquisition system and the data transfer system may also be part of the scanning system 212 in some embodiments of the image capture system 110 .
  • Element 210 is a source document in the form of either one or more physical sheets of paper, or a digital file containing images of one or more sheets of paper.
  • the digital file can be in one of many formats, such as PDF, TIFF, BMP, or JPEG.
  • System 212 is a scanning system.
  • conventional scanning systems may be used such as those from Bell+Howell, Canon, Fujitsu, Kodak, Panasonic and Xerox. These embodiments include scanners connected directly to a computer, shared scanners connected to a computer over a network, and smart scanners that include embedded computational functionality to add third-party applications.
  • the scanning system 212 captures an image of the scanned document as a computer file; the file is often in a standard format such as PDF, TIFF, BMP, or JPEG.
  • System 222 is a user interface system. Under preferred embodiments, the user interface system 222 runs in a browser and presents a user with a three-step means for submitting documents to be organized as shown in FIG. 9 .
  • the user interface system 222 provides a mechanism for selecting a job from a list of jobs; additionally, it allows jobs to be added to the job list.
  • the user interface system 222 provides a mechanism for initiating the scanning of physical papers; additionally, it provides a browsing mechanism for selecting a file on a computer or network.
  • one or more sets of papers can be scanned and one or more files can be selected.
  • the user interface system 222 provides a mechanism for sending the job information and selected documents over a network to the server system.
  • the user interface system 222 also presents a user with the status of jobs that have been submitted as submitted or completed; optionally, it presents the expected completion date and time of submitted jobs that have not been completed.
  • the user interface system 222 also presents a user with a mechanism for receiving submitted documents and extracted data.
  • the user interface system 222 also provides a mechanism for deleting files from the system.
  • Other embodiments of the user interface system 222 may run within an application that provides the scan feature as part of a broader function, or within a simple data entry system that is composed of only a touch screen and/or one or more buttons.
  • the user interface system 222 may also be embodied by a programmable API that provides the same or similar functionality to another application program.
  • System 225 is a data acquisition system.
  • the data acquisition system 225 controls the settings of the scanning system.
  • Many scanning systems in use today require users to manually set scanner settings so that images are captured, for example, at 300 dots per inch (dpi) as binary data (black-and-white.)
  • Commercial scanners and scanning software modify the original source document image, which often includes high-resolution and, possibly, color or gray-scale elements. The resolution is often reduced to limit file size. Color and gray-scale elements are often binarized, e.g. converted to black or white pixels, via a process known as thresholding, also to reduce file size.
  • the data acquisition system sets the scan parameters of the scanning system. The data acquisition system commands the scanning system to begin operation and receives the scanned document computer file from the scanning operation.
  • the data acquisition system 225 could be part of the scanning system 212, in certain embodiments. Moreover, the operation of the data acquisition system 225 could be automatically triggered by the scan function, in certain embodiments.
  • System 232 is a data transfer system. Under preferred embodiments, the data transfer system 232 manages the SSL connection and associated data transfer with the server system. The data transfer system 232 could be part of the scanning system 212 , in certain embodiments. Moreover, the operation of the data transfer system 232 could be automatically triggered by the scan function, in certain embodiments.
  • System 235 is an optional image pre-processing system.
  • the image pre-processing system 235 enhances the image quality of scanned images for a given resolution and other scanner settings.
  • the image pre-processing system 235 may be implemented as part of the image capture system as depicted on FIG. 2 or as part of the server system as depicted on FIG. 3 .
  • the image pre-processing system may also be implemented within the scanning system 212 , in certain embodiments. Details of the image pre-processing system 235 are described in further detail below as part of the document processing system 130 .
  • FIG. 3 is a system diagram of the web server system 120 according to a preferred embodiment of the invention.
  • System 120 has a web services system 310 , an authentication system 312 and a content repository 322 .
  • the web services system 310 communicates with the authentication system 312 via software within a computer system.
  • the web services system 310 communicates with the content repository 322 via software within a computer system.
  • System 310 is a web services system.
  • the web services system 310 provides the production system connection to the network that interfaces with the image capture system.
  • a network could be a local-area network (LAN), a wide-area network (WAN) or the Internet.
  • Standard web servers include Apache, RedHat JBoss Web Server, Microsoft IIS, Sun Java System Web Server, IBM Websphere, etc.
  • users upload their source electronic documents or download their organized electronic documents and extracted data in a secure manner using HTTP or HTTPS. Other mechanisms for secure data transfer can also be used.
  • the web service system 310 also relays necessary parameters to the application servers that will process the electronic document.
  • System 312 is an authentication system.
  • the authentication system 312 allows secure and authorized access to the content repository 322 .
  • an LDAP authentication system is used; however, other authentication systems can also be used.
  • an LDAP server is used to process queries and updates to an LDAP information directory in which, for example, a company could very efficiently store information such as access control privileges.
  • document organization and access rights are managed by the access control privileges stored in the LDAP repository.
  • System 322 is a content repository.
  • the content repository 322 can be a simple file system, a relational database, an object oriented database, any other persistent storage system or technology, or a combination of one or more of these.
  • the content repository 322 is based on Java Specification Request 170 (JSR 170).
  • JSR 170 is a standard implementation-independent way to access content bi-directionally on a granular level within a content repository.
  • the content repository 322 is a generic application “data store” that can be used for storing both text and binary data (images, word processor documents, PDFs, etc.)
  • data could be stored in a relational database (RDBMS) or a file system or as an XML document.
  • most content repositories provide advanced services such as uniform access control, searching, versioning, observation, locking, and more.
  • documents in the content repository 322 are available to the end user via a portal.
  • the user can click on a web browser application button “View Source Document” in the portal and view the original scanned document over a secure network.
  • the content repository 322 serves as an off-site secure storage facility for users' electronic documents.
  • FIG. 4 is a system diagram of the document processing system 130 according to a preferred embodiment of the invention.
  • System 130 has a service control manager 410 , a job database 414 , an image processing system 422 , a classification system 432 , a grouping system 442 and a data extraction system 452 .
  • the service control manager 410 communicates with the job database 414 via software within a computer system.
  • the service control manager 410 communicates with the image processing system 422 via software within a computer system.
  • the service control manager 410 communicates with the classification system 432 via software within a computer system.
  • the service control manager 410 communicates with the grouping system 442 via software within a computer system.
  • the service control manager 410 communicates with the data extraction system 452 via software within a computer system.
  • the image processing system 422 communicates with the job database 414 via software within a computer system.
  • the classification system 432 communicates with the job database 414 via software within a computer system.
  • the grouping system 442 communicates with the job database 414 via software within a computer system.
  • the data extraction system 452 communicates with the job database 414 via software within a computer system.
  • the image processing system 422 communicates with the classification system 432 via software within a computer system.
  • the classification system 432 communicates with the grouping system 442 via software within a computer system.
  • the grouping system 442 communicates with the data extraction system 452 via software within a computer system.
  • the document processing system 130 can be implemented as a set of communicating programs or as a single integrated program.
  • System 410 is a service control manager.
  • Service control manager 410 is a system that controls the state machine for each job.
  • the state machine identifies the different states and the steps that a job has to progress through in order to achieve its final objective, in this case being data extracted from an electronic document.
  • the service control manager 410 is designed to be highly scalable and distributed. Under preferred embodiments, the service control manager 410 is multi-threaded to handle hundreds or thousands of jobs at any given time.
  • the service control manager 410 also implements message queues to communicate with other processes regarding their own states. Alternately, the service control manager 410 can be implemented in other architectures; for example, one can implement a complete database driven approach to step through all the different steps required to process such a job.
  • the service control manager 410 subscribes to events for each new incoming job that needs to be processed. Once a new job arrives, the service control manager 410 pre-processes the job by taking the electronic document and separating each page into its own bitmap image for further processing. For example, if an electronic document has 30 pages, the system will create 30 images for processing. Each job in the system is given a unique identity. Furthermore, each page is given a unique page identity that is linked to the job identity. After the service control manager 410 has created image files by pre-processing the document into individual pages, it transitions the state of each page to image processing.
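  • A minimal sketch of this pre-processing step, assuming pdf2image as one possible rasterizer (the patent does not name a library; all names below are illustrative):

    import uuid
    from pdf2image import convert_from_path

    def preprocess_job(document_path, job_states):
        """Split a multi-page document into per-page bitmaps with linked IDs."""
        job_id = str(uuid.uuid4())                # unique identity for the job
        pages = convert_from_path(document_path)  # one bitmap image per page
        for index, image in enumerate(pages):
            page_id = f"{job_id}-page-{index}"    # page identity linked to job
            image.save(f"{page_id}.png")
            job_states[page_id] = "image_processing"  # state transition
        return job_id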
  • Job database 414 is used to store the images and data associated with each of the jobs being processed.
  • a “job” is defined as a set of source documents and all intermediate and final processing outputs.
  • Job database 414 can be file system storage, a relational database, an XML document or a combination of these.
  • job database 414 uses file system storage to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.
  • System 422 is an image processing system.
  • the image processing system 422 removes noise from the page image and properly orients the page so that document image analysis can be performed more accurately. The accuracy of the data extraction greatly depends on the quality of the image; thus image processing is included under preferred embodiments.
  • the image processing system 422 performs connected component analysis and, utilizing a line detection system, creates “confetti” images which are small sections of the complete page image. Under preferred embodiments, the confetti images are accompanied by the coordinates of the image sub-section.
  • the image processing system 422 is described in greater detail below.
  • System 432 is a classification system.
  • the classification system 432 recognizes the page as one of a pre-identified set of types of documents.
  • a major difficulty in categorizing a page as one of a large number of documents is the high dimensionality of the feature space.
  • Conventional approaches that depend on text categorization alone are faced with a native feature space that consists of many unique terms (words as well as phrases) that occur in documents, which can be hundreds or thousands of terms for even a moderate-sized collection of unique documents.
  • multiple systems that categorize income tax documents such as W-2, 1099-INT, K-1 and other forms have experienced poor accuracy because of the thousands of variations of tax documents.
  • the preferred implementation uses a combination of image pattern recognition and text analysis to distinguish documents and machine learning technology to scale to large numbers of documents.
  • the classification system 432 is described in greater detail below.
  • System 442 is a grouping system.
  • the grouping system 442 groups pages that have been categorized by the classification system 432 as specific instances of a pre-identified set of types of documents into sets of multi-page documents.
  • the grouping system 442 is described in greater detail below.
  • System 452 is a data extraction system.
  • the data extraction system 452 extracts data from pages that have been categorized by the classification system 432 as specific instances of a pre-identified set of types of documents.
  • the document images are not of uniformly high quality.
  • the document images can be skewed, streaked, smudged, populated with artifacts and otherwise degraded in ways that cannot be fully compensated by image processing.
  • the document layout can appear to be random.
  • the relevant content (data labels and data values) can be quite small, impaired by lines and background shading or otherwise not be processed well by OCR.
  • the data extraction system 452 uses OCR data extraction, non-OCR visual recognition, contextual feature matching, business intelligence and output formatting, all with machine learning elements, to accurately extract and present data from a wide range of documents.
  • FIG. 5 is a system diagram of the image processing system 422 according to a preferred embodiment of the invention.
  • System 422 has an image feature extraction system 510 , a working image database 522 , an image identification system 530 , a trained image database 532 and an image training system 534 .
  • the image feature extraction system 510 is connected to the working image database 522 via software within a computer system.
  • the image feature extraction system 510 is connected to the image identification system 530 via software within a computer system.
  • the image identification system 530 is connected to the working image database 522 via software within a computer system.
  • the image identification system 530 is connected to the trained image database 532 via software within a computer system.
  • the image training system 534 is connected to the working image database 522 via software within a computer system.
  • the image training system 534 is connected to the trained image database 532 via software within a computer system.
  • Image feature extraction system 510 is an image feature extraction system.
  • Image feature extraction system 510 extracts images from the submitted job artifacts.
  • Image feature extraction system 510 normalizes images into a uniform consistent form for further image processing.
  • Image feature extraction system 510 binarizes color and grayscale images.
  • a document can be captured as a color, grayscale or binary image by a scanning device. Common problems seen in images from scanning devices include noise, skew and incorrect orientation, which are addressed by the techniques described below.
  • the preferred embodiment of the binarization system utilizes local thresholding where the threshold value varies based on the local content in the document image.
  • the preferred implementation is built on an adaptive thresholding technique which exploits local image contrast (reference: IEICE Electronics Express, Vol. 1, No 16, pp. 501-506.)
  • the adaptive nature of this technique is based on flexible weights that are computed based on local mean and standard deviations calculated for the gray values in the primary local zone or window.
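  • A minimal sketch of such locally adaptive thresholding, written here in the style of Sauvola's method, which likewise weights the local mean by the local standard deviation (the window size and the k and R parameters are illustrative choices, not values from the patent):

    import numpy as np
    from scipy.ndimage import uniform_filter

    def adaptive_binarize(gray, window=25, k=0.2, R=128.0):
        """Binarize with a threshold computed from local mean and std."""
        gray = gray.astype(np.float64)
        mean = uniform_filter(gray, window)
        mean_sq = uniform_filter(gray * gray, window)
        std = np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))
        threshold = mean * (1.0 + k * (std / R - 1.0))  # flexible local weights
        return (gray > threshold).astype(np.uint8) * 255  # dark text -> 0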
  • the preferred embodiment experimentally determines optimum median filters across a large set of document images for each application space.
  • image feature extraction system 510 removes noise in the form of dots, specks and blobs from document images. In the preferred embodiment, minimum and maximum dot sizes to be removed are specified. The preferred embodiment also performs image reversal so that white text or line objects on black backgrounds are detected and inverted to black-on-white. The preferred embodiment also performs two noise removal techniques.
  • the first technique starts with any small region of a binary image.
  • the preferred implementation takes a 35×35 pixel region. In this region all background pixels are assigned value “0.” Pixels adjacent to background are given value “1.” A matrix is developed in this manner. In effect each pixel is given a value called the “distance transform” equal to its distance from the closest background pixel.
  • the preferred implementation runs a smoothing technique on this distance transform. Smoothing is a process by which data points are averaged with their neighbors in a series; this typically has the effect of blurring the sharp edges in the smoothed data. Smoothing is sometimes referred to as filtering, because smoothing has the effect of suppressing high frequency signals and enhancing low frequency signals. Of the many different methods of smoothing, the preferred implementation uses a Gaussian kernel.
  • the preferred implementation performs Gaussian smoothing with a filter using a variance of 0.5 and a 3×3 kernel or convolution mask on the distance transform. Thresholding with a threshold value of 0.85 is performed on the convolved images and the resulting data is converted to its binary space.
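  • A sketch of this first technique using SciPy; the normalization step before thresholding is an assumption, since the patent states only the parameter values:

    import numpy as np
    from scipy.ndimage import distance_transform_edt, gaussian_filter

    def denoise_region(region_binary):
        """Distance transform, Gaussian smoothing, then threshold at 0.85."""
        dist = distance_transform_edt(region_binary)
        # variance 0.5 -> sigma = sqrt(0.5); truncate keeps roughly 3x3 support
        smoothed = gaussian_filter(dist, sigma=np.sqrt(0.5), truncate=1.0)
        smoothed /= max(smoothed.max(), 1e-9)   # normalize before thresholding
        return (smoothed > 0.85).astype(np.uint8)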
  • the second technique uses connected component analysis to identify small or bad blocks.
  • a sliding mask is created of a known size.
  • the preferred implementation uses a mask that is 35×35 pixels. This mask slides over the entire image and is used to detect the number of blobs (connected components) that are less than 10 pixels in size. If the number of such blobs is greater than five, then all of them are removed. This process is repeated by sliding the mask over the entire image.
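  • A sketch of this second technique with SciPy's connected-component labeling (the stride of the sliding mask is an assumption):

    import numpy as np
    from scipy.ndimage import label

    def remove_small_blobs(binary, mask=35, min_size=10, max_count=5):
        """Erase clusters of tiny blobs found under a sliding 35x35 mask."""
        out = binary.copy()
        h, w = binary.shape
        for y in range(0, h - mask + 1, mask):
            for x in range(0, w - mask + 1, mask):
                window = out[y:y + mask, x:x + mask]
                labels, count = label(window)
                sizes = np.bincount(labels.ravel())[1:]  # skip background
                small = [i + 1 for i, s in enumerate(sizes) if s < min_size]
                if len(small) > max_count:               # more than five blobs
                    window[np.isin(labels, small)] = 0   # remove them as noise
        return out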
  • Image feature extraction system 510 corrects skew, small angular rotations, in document images. Skew correction not only improves the visual appearance of the document but also improves baseline determination, simplifies interpretation of page layout and improves text recognition. Several available image processing libraries do skew correction. The preferred implementation of skew detection uses part of the open source Leptonica image processing library.
  • Image feature extraction system 510 corrects document orientation.
  • Documents originally in either portrait or landscape format may be rotated by 0, 90, 180 or 270 degrees during scanning.
  • the preferred implementation of orientation correction performs OCR on small words or phrase images at all four orientations: 0, 90, 180 and 270 degrees.
  • Small samples are selected from a document and the confidence is averaged across the sample. The orientation that has the highest confidence determines the correct orientation of the document.
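  • A sketch of this orientation test, using pytesseract as one possible OCR engine (the patent does not specify an engine, and confidence reporting is engine-specific):

    import pytesseract
    from pytesseract import Output

    def best_orientation(samples):
        """samples: small word/phrase images (PIL) cut from one document."""
        scores = {}
        for angle in (0, 90, 180, 270):
            confidences = []
            for image in samples:
                rotated = image.rotate(angle, expand=True)
                data = pytesseract.image_to_data(rotated, output_type=Output.DICT)
                confidences += [float(c) for c in data["conf"] if float(c) >= 0]
            scores[angle] = sum(confidences) / max(len(confidences), 1)
        return max(scores, key=scores.get)  # rotation with highest confidence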
  • Image feature extraction system 510 performs connected component analysis using a very standard technique.
  • the preferred implementation of connected component analysis uses the open source Image Processing Library 98 (IPL98.)
  • Image feature extraction system 510 detects text lines using the technique described by Okun et al. (reference: Robust Text Detection from Binarized Document Images ) to identify candidate text segment blocks of consistent heights. For a page from a book, this method may identify a whole line as a block, while for a form with many boxes this method will identify the text in each box.
  • Image feature extraction system 510 generates confetti information by storing the coordinates of all of the text blocks in the working image database 522 .
  • Image feature extraction system 510 performs image processing on the confetti images. Traditionally, if image processing is performed on document images, the entire document image is subject to a single type of image processing. This “single algorithm” process might, for example, thin the characters on the document image. In some cases, the accuracy of text extraction with OCR might improve after thinning; however, in other cases on the same document, the accuracy of text extraction with OCR might improve with thickening. Image feature extraction system 510 applies multiple morphological operators to individual confetti images. Then, for each variation of each confetti image (including the original, unprocessed version and all processed versions), image feature extraction system 510 extracts text with OCR. Optionally, image feature extraction system 510 extracts text with different OCR engines.
  • Image feature extraction system 510 determines the contour of image areas within confetti boxes.
  • the contour of an image within a confetti is illustrated in FIG. 14 .
  • the size of the confetti image area is first normalized.
  • 256 equidistant points on the contour are chosen, and the relative location of these points is recorded in a log-polar histogram as illustrated in FIG. 12 .
  • Values for log r are placed in 3 bins, while values for the angle are placed in 8 bins.
  • the relative location of a point with respect to another is therefore a number from 1 through 24.
  • the feature vector for the shape of the contour as illustrated in FIG. 14 is a 256×256 matrix of numbers from 1 through 24 that considers all 256 points and their relative locations (reference: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 4, pp. 509-522).
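  • The relative-location encoding can be sketched as follows; the exact binning of log r is an assumption, since the patent specifies only 3 radial bins and 8 angular bins:

    import numpy as np

    def relative_location_matrix(points, r_max):
        """points: (256, 2) contour samples; returns 256x256 codes in 1..24."""
        diff = points[None, :, :] - points[:, None, :]   # pairwise offsets
        r = np.linalg.norm(diff, axis=2)
        theta = np.arctan2(diff[..., 1], diff[..., 0])   # angle in -pi..pi
        r_bin = np.clip((np.log1p(r) / np.log1p(r_max) * 3).astype(int), 0, 2)
        a_bin = ((theta + np.pi) / (2 * np.pi) * 8).astype(int) % 8
        codes = r_bin * 8 + a_bin + 1                    # 3 x 8 bins -> 1..24
        np.fill_diagonal(codes, 0)                       # ignore self-pairs
        return codes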
  • Working image database 522 is a working image database.
  • Working image database 522 is used to support both the processing of jobs and the image training system 534 .
  • Working image database 522 can be a file system, a relational database, an XML document or a combination of these.
  • the working image database 522 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.
  • System 530 is an image identification system.
  • the image identification system 530 looks for point and line features.
  • the preferred implementation performs image layout analysis using two image properties: the points of intersection of lines, and the edge points of text paragraphs. Every unique representation of points is referred to as a unique class in the system and represents a unique point pattern in the system database.
  • the preferred implementation uses a heuristically developed convolution method only on black pixels to perform a faster computation.
  • the system identifies nine types of points: four T's, four L's, and one cross (X) using nine masks; examples of these nine point patterns are shown in FIG. 10 .
  • the preferred implementation of point pattern matching is performed by creating a string from the points detected in the image and then using the Levenshtein distance to measure the gap between the trained set with the input image.
  • the Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
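  • The Levenshtein distance itself is standard; a compact dynamic-programming implementation is sketched below:

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of insertions, deletions and substitutions."""
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,       # deletion
                                   current[j - 1] + 1,    # insertion
                                   previous[j - 1] + (ca != cb)))  # substitution
            previous = current
        return previous[-1]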
  • the image identification system 530 selects the extracted text from the sets of extracted text for each confetti image according to rules stored in the trained image database 532 .
  • extracted text values that exceed specified OCR engine-specific thresholds are candidates for selection.
  • the best text value that is produced from the image after applying the morphological operators is chosen based on OCR confidence, similarity and presence in a dictionary.
  • the image identification system 530 selects the text value from a contextually limited lexicon (words and characters) that is stored in the trained image database 532 .
  • the image identification system 530 requests the image feature extraction system 510 to perform a “second pass” OCR operation using an engine specifically tailored for extracting the type of characters that the image identification system 530 identified as present in the confetti image.
  • the “second pass” OCR would be conducted with a currency character recognition system that is tuned to identify numerical and certain special characters.
  • the currency character recognition system utilizes OCR technology tailored to the reduced character set associated with currency values.
  • the currency character set is defined as the digits [0-9] and the special character set [$.,-()].
  • the preferred implementation performs character segmentation to break up the image into individual characters. It then uses a normalized bitmap of the image of each character as a feature vector. This feature vector is passed into a neural network based classifier that was trained on more than 10,000 instances of each character that are stored in the trained image database 532 .
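  • One way to realize such a reduced-character-set second pass is shown below with Tesseract's character whitelist; this is an illustrative alternative, since the patent describes a custom neural-network classifier rather than any particular engine:

    import pytesseract

    # Restrict recognition to the currency character set defined above.
    CURRENCY_CONFIG = r'--psm 7 -c tessedit_char_whitelist=0123456789$.,-()'

    def second_pass_currency_ocr(confetti_image):
        return pytesseract.image_to_string(confetti_image, config=CURRENCY_CONFIG)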
  • Label identification by traditional means of matching extracted text to a database of expected values is often not possible; this is caused by the inability of OCR engines to accurately extract text from very small and degraded images.
  • the present invention's use of both multiple versions of the confetti images (original and image processed) and multiple OCR engines significantly reduces but does not eliminate the problem of inaccurate text extraction. Two additional techniques are used to identify text from images.
  • the image identification system 530 performs contour matching by comparing the contour shape features extracted by the feature extraction system 510 , with the corresponding features of known confetti images stored in the trained image database 532 . Similarity between images is determined by a point-wise comparison of feature vectors.
  • the preferred implementation uses a KNN classifier for this process.
  • FIG. 14 illustrates label contour matching.
  • Trained image database 532 is used to support both the processing of jobs and the image training system 534 .
  • Trained image database 532 can be a file system, a relational database, an XML document or a combination of these.
  • the trained image database 532 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.
  • the trained image database 532 grows “smarter” by recognizing more images and more rules pertaining to restricting OCR with contextual information.
  • as the trained image database 532 grows and the machine learning system sees more trained images, its image identification accuracy increases.
  • System 534 is an image training system.
  • the image training system 534 performs computations on the data in its document database corresponding to the images that are in place, and generates datasets used by the image identification system for recognizing the content in source document images.
  • the results of the training and re-training process are image datasets that are updated in the trained image database 532 .
  • the image training system 534 implements a continuous learning process in which images and text that are not properly identified are sent to training. The training process results in an expanded data set in the trained image database 532, thereby improving the accuracy of the system over time. As the trained image database 532 grows, the system requires an asymptotically lower percentage of images to be trained. Preferred implementations use machine learning supported by the image training system 534 that adapts to a growing set of document images. Additional documents add additional image features that must be analyzed.
  • the learning system receives documents from the working image database 522 that were provided by the image identification system 530 . These documents are not trained and do not have corresponding model data in the trained image database 532 . All such documents are made persistent in the trained image database 532 .
  • Preferred implementations of the training system include tuning and optimization to handle noise generated during both the training phase and the testing phase.
  • the training phase is also called the learning phase, since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function of the dataset.
  • the learning technique in the preferred implementation is supervised learning.
  • Applications in which training data comprises examples of input vectors along with their corresponding target vectors are known as supervised learning problems.
  • Example input vectors include key words and line patterns of the document layouts.
  • Example target vectors include possible classes of output in the organized document.
  • Supervised learning avoids the unstable states that can be reached by unsupervised learning and reinforcement learning systems.
  • FIG. 6 is a system diagram of the classification system 432 according to a preferred embodiment of the invention.
  • System 432 has class feature extraction systems 610 , working class databases 622 , class identification systems 630 , trained class databases 632 , class training systems 634 , a voting system 640 , a trained voting decision tree 642 and a voting training system 644 .
  • the class feature extraction system (i) 610 is connected to the working class database (i) 622 via software within a computer system.
  • the class feature extraction system (i) 610 is connected to the class identification system (i) 630 via software within a computer system.
  • the class identification system (i) 630 is connected to the working class database (i) 622 via software within a computer system.
  • the class identification system (i) 630 is connected to the trained class database (i) 632 via software within a computer system.
  • the class training system (i) 634 is connected to the working class database (i) 622 via software within a computer system.
  • the class training system (i) 634 is connected to the trained class database (i) 632 via software within a computer system.
  • the class identification system (i) 630 is connected to the voting system 640 via software within a computer system.
  • the voting system 640 is connected to the trained voting decision tree 642 via software within a computer system.
  • the trained voting decision tree 642 is connected to the voting training system 644 via software within a computer system.
  • classification system 432 is composed of four classification subsystems whose outputs are evaluated by the voting system 640 .
  • the four classification subsystems are the CTI, CRK, SVM and CCS subsystems, each discussed below.
  • Each of the above subsystems has a class feature extraction system 610, a working class database 622, a class identification system 630, a trained class database 632 and a class training system 634.
  • Each system 610 is a class feature extraction system.
  • Class feature extraction systems 610 receive extracted text and image features (discussed above.)
  • the CTI classification subsystem and the CRK classification subsystem use the extracted text features.
  • the SVM classification subsystem addresses the problem of classifying documents as OCR results improve; as document quality, scanning practices, image processing or OCR engines improve, the extracted source document text differs from the extracted text of the training documents, causing classification to worsen.
  • the SVM class feature extraction system 610 filters extracted text features, passing on only those text features that match a dictionary entry.
  • the SVM class feature extraction system 610 matches OCR text output of a text document against a large dictionary. If no dictionary match is found, the OCR text is discarded. A feature vector that consists of all OCR text that matches the dictionary is passed to an SVM-based classifier to determine the document class.
  • the SVM classification subsystem is made resilient to OCR errors by introducing typical OCR errors into the dictionary.
  • the classifier remains robust to OCR improvements because the dictionary includes correct English words.
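  • A sketch of this dictionary-filtered SVM classification using scikit-learn; the dictionary contents, training documents and labels are illustrative:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    dictionary = {"wages", "employer", "interest", "income", "dividends"}

    def filter_tokens(tokens):
        """Discard OCR output that matches no dictionary entry."""
        return [t for t in tokens if t.lower() in dictionary]

    # Illustrative training data: dictionary-filtered text of labeled documents.
    train_docs = [["wages", "employer", "xj3k"], ["interest", "income"]]
    train_labels = ["W-2", "1099-INT"]

    vectorizer = CountVectorizer(vocabulary=sorted(dictionary))
    texts = [" ".join(filter_tokens(doc)) for doc in train_docs]
    classifier = LinearSVC().fit(vectorizer.transform(texts), train_labels)

    def classify(ocr_tokens):
        features = vectorizer.transform([" ".join(filter_tokens(ocr_tokens))])
        return classifier.predict(features)[0]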
  • the CCS classification subsystem addresses the problem of classifying documents with poor image quality that do not OCR well; such documents have poor text extraction and therefore poor text-based classification.
  • the CCS classification subsystem uses robust image features exclusively to classify documents.
  • the CCS class feature extraction system 610 first creates a code book using seven randomly selected documents. Each of these documents is divided into 10×10 pixel blocks. The K-means algorithm is applied to the blocks to generate 150 clusters. The mean of each cluster is taken as the representative codeword for that cluster. The clusters are arbitrarily numbered from 1 to 150; the result forms a vocabulary for representing source document images as a feature vector of this vocabulary.
  • Each source document image is divided into four quadrants.
  • a vector is formed for each quadrant following the term frequency inverse document frequency (TF-IDF) model.
  • TF-IDF: term frequency inverse document frequency
  • K-means approach is used.
  • a test document is encoded to the feature vector form, and its Euclidean distance is computed from each of the clusters. The labels of the closest clusters are assigned to the document.
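A minimal sketch of the CCS encoding and nearest-cluster lookup under stated assumptions: random arrays stand in for pixel blocks, raw codeword counts stand in for the TF-IDF weighting described above, and scikit-learn's KMeans supplies the clustering:

```python
import numpy as np
from sklearn.cluster import KMeans

# --- Codebook creation (150 clusters, as in the text) ---
rng = np.random.default_rng(0)
blocks = rng.random((5000, 100))   # stand-in for 10x10-pixel blocks, flattened
codebook = KMeans(n_clusters=150, n_init=4, random_state=0).fit(blocks)

def encode_quadrant(quadrant_blocks):
    """Codeword histogram for one quadrant (raw term frequency; the
    TF-IDF weighting described above is omitted for brevity)."""
    words = codebook.predict(quadrant_blocks)
    return np.bincount(words, minlength=150).astype(float)

def encode_document(quadrants):
    """Concatenate the four quadrant histograms into one feature vector."""
    return np.concatenate([encode_quadrant(q) for q in quadrants])

def k_nearest_classes(doc_vec, cluster_means, cluster_tags, k=3):
    """Assign the class tags of the K clusters nearest in Euclidean distance."""
    dists = np.linalg.norm(cluster_means - doc_vec, axis=1)
    return [cluster_tags[i] for i in np.argsort(dists)[:k]]

doc_vec = encode_document([rng.random((40, 100)) for _ in range(4)])
```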
  • Each system 622 is a working class database.
  • Working class databases 622 are used to support both the processing of jobs and the class training systems 634 .
  • Working class databases 622 can be file systems, relational databases, XML documents or a combination of these.
  • the working class databases 622 use file systems to store large blobs (binary large objects) and relational databases to store pointers to the blobs and other information pertinent to processing the job.
  • Class Identification System 630 is a class identification system. Class identification system 630 functions differently for each of the four classification subsystems.
  • the class identification system 630 presents the extracted text to a key word identification system.
  • the key word identification system receives the confetti text and interfaces with the trained class database 632 .
  • the trained class database 632 consists of a global dictionary, global priority words and the point pattern signatures of all the trained forms, all of which are created by the class training system 634 .
  • stop words are removed from the list of extracted words. Stop words are common words—for example: “a,” “the,” “it,” “not,” and, in the case of income tax documents, phrases and words including “Internal Revenue Service,” “OMB,” “name,” “address,” etc.
  • the stop words are provided by the trained class database 632 and, in the preferred embodiment, are domain specific.
  • the priority of each word is calculated as a function of the line height (LnHt) of the word, partial or full match (PFM) with the form name, and the total number of words in the form (N).
  • PFM: partial or full match with the form name
  • N: total number of words in the form
  • Partial or full match increases the priority if the word partially or fully matches the form name.
  • the calculation divides by the total number of words in the form (N) to normalize the frequency if the form has a large number of words.
  • the vector space creation system stores in a table the priority of each word in the form.
  • a vector is described as (a1, a2, . . . , ak), where a1, a2, . . . , ak are the magnitudes in the respective dimensions.
  • word-priority vectors are stored:
  • OMB: 10, employer: 5, employer: 5, wages: 5, compensation: 5, compensation: 5, dependent: 5, wages: 10, social: 5, security: 5, income: 5, tax: 5, federal: 5, name: 5, address: 5
  • the normalized values for the priorities are:
  • the ranking system calculates the cosine distance of two vectors V1 and V2 as: cos θ = (V1 · V2) / (|V1| |V2|), where V1 · V2 is the dot product of the two vectors and |V| represents the magnitude of a vector.
  • the class which has the maximum cosine distance with the form is the class to which the form is classified.
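A minimal sketch of the priority and cosine calculations, assuming dictionaries that map words to priority magnitudes; the exact combination of LnHt, PFM and N is not spelled out in the text, so the weighting below is illustrative:

```python
import math

def word_priority(line_height, pfm_boost, n_words_in_form):
    """Priority as a function of line height (LnHt), a partial/full-match
    boost (PFM) and the form's word count (N). The exact weighting is not
    given in the text, so this combination is illustrative."""
    return line_height * pfm_boost / n_words_in_form

def cosine_similarity(v1, v2):
    """cos(theta) = (V1 . V2) / (|V1| |V2|) over word-priority dictionaries."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    mag = math.sqrt(sum(x * x for x in v1.values())) * \
          math.sqrt(sum(x * x for x in v2.values()))
    return dot / mag if mag else 0.0

form = {"omb": 10, "employer": 5, "wages": 5}
trained = {"W-2": {"omb": 10, "employer": 5, "wages": 10},
           "1099-INT": {"payer": 8, "interest": 6}}
# The form is classified to the class with the maximum cosine measure.
best = max(trained, key=lambda c: cosine_similarity(form, trained[c]))
print(best)  # W-2
```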
  • the class identification system 630 performs point pattern matching based on the image features collected during image processing. As mentioned earlier, the point pattern matching of documents is performed by creating a string from the points detected in the image and then using Levenshtein distance to measure the gap between the trained set and the input image.
  • the results of the ranking and the point pattern matching are used to determine the class matching values. If the system is not successful in finding a class match within a defined threshold, the document is marked as unclassified.
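A minimal sketch of the point-pattern comparison, assuming each detected point type is encoded as one symbol so that a page becomes a string; the Levenshtein distance then measures the gap between a trained pattern and an input pattern:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical encoding: each of the nine point types (FIG. 10) maps to
# one symbol, and a page becomes the string of its points in reading order.
trained_pattern = "ABACCADBA"
input_pattern = "ABACADBA"       # one point missed during scanning
gap = levenshtein(trained_pattern, input_pattern)
print(gap)                       # 1; compared against the class threshold
```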
  • the class identification system 630 first identifies a source document as a member of a particular group of classes then identifies the source document as a member of a particular individual class.
  • the CRK class identification system 630 performs hierarchical classification with a binary classifier system using regularized least squares and a multi-class classifier using K-nearest neighbor.
  • An example flow diagram of an example CRK class identification system 630 used in classifying income tax documents is shown in FIG. 15 .
  • the class identification system 630 identifies a source document using a support vector machine operating on a set of trained data. If the lookup fails, the source document is marked as unclassified.
  • the class identification system 630 works much like the CTI class identification system 630 .
  • the CCS class identification system 630 compares the code vectors for each quadrant of source documents with code vectors in the trained class database 632 using the K-means approach.
  • the trained class database 632 is organized into clusters representing documents in the training set with similar image properties as defined by the feature vectors.
  • the mean point of each cluster within the feature vector space is used to represent each cluster.
  • each cluster is tagged with all document classes that occurred within the cluster.
  • the distance of the feature vector of a source document from the mean of each cluster is computed, and the K nearest clusters are considered.
  • the document class tags of these clusters are chosen as plausible classes of the source document.
  • the CCS trained class database 632 stores code vectors of all the trained forms, all of which are created by the CCS class training system 634 .
  • System 632 is a trained class database.
  • Trained class database 632 is used to support both the processing of jobs and the class training system 634 .
  • Trained class database 632 can be a file system, a relational database, an XML document or a combination of these.
  • the trained class database 632 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.
  • the trained class database 632 grows “smarter” by recognizing more documents. As the machine learning system sees more classification data, its classification accuracy increases.
  • System 634 is a class training system.
  • the class training system 634 adapts to a growing set of documents; additional documents add additional features that must be analyzed.
  • Preferred implementations of the class training system 634 include tuning and optimization to handle noise generated during both the training phase and the testing phase.
  • the training phase is also called the learning phase, since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function of the dataset.
  • the learning technique that is used to bootstrap the system in the preferred implementation is supervised learning.
  • Applications in which training data comprises examples of input vectors along with their corresponding target vectors are known as supervised learning problems.
  • Example input vectors include key words and line patterns of the document layouts.
  • Example target vectors include possible classes of output in the organized document.
  • Supervised learning avoids the unstable states that can be reached by unsupervised learning and reinforcement learning systems.
  • semi-supervised learning is utilized.
  • data that is flowing through the system is analyzed and those data that the system failed to correctly identify are isolated. These data are passed through a retraining phase, and the training data in the system are updated after appropriate regression testing.
  • the learning system receives documents from the trained class database 632 . These documents are not trained and do not have corresponding classification model data in the class database. All such documents are made persistent in the trained class database 632 .
  • the trained class database 632 has several tables which contain the document class information as well as image processing information (which is discussed in greater detail below.) The following tables are part of the training database:
  • Class training system 634 utilizes a training process management system that manages the distribution of the training task.
  • a user called a “trainer” logs into the system, in which the trainer has privileges at one of three trainer levels:
  • the training process manager directs document processing based on the document state:
  • the form class state is changed to trained, not synched if allowed by policy.
  • the document class has the following states:
  • the class training system 634 combines the document image, the manually classified information and the corresponding text.
  • New trained data that passes regression testing is inserted by the class training system 634 into the trained class database 632 .
  • Chi-square feature selection attempts to select the most relevant keywords (bag-of-words) for each class.
  • This approach ranks the relevance of each word for a particular class so that a sufficient number of features are obtained.
  • Term frequency inverse document frequency is used to represent each document:
  • tf_i = n_i / Σk nk, where n_i is the number of occurrences of term i in the document and Σk nk is the number of occurrences of all terms in the document.
  • Each vector is normalized into unit Euclidean norm.
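A minimal sketch of the TF-IDF representation with unit-norm scaling, assuming precomputed document frequencies (the chi-square keyword selection is assumed to have already pruned the vocabulary):

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, doc_freq, n_docs):
    """tf_i = n_i / sum_k n_k; each tf is weighted by idf = log(N / df_i)
    and the resulting vector is normalized to unit Euclidean norm."""
    counts = Counter(t for t in doc_tokens if t in doc_freq)
    total = sum(counts.values())
    vec = {t: (n / total) * math.log(n_docs / doc_freq[t])
           for t, n in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec

# Hypothetical document frequencies over a 100-document training set:
doc_freq = {"wages": 40, "employer": 55, "omb": 20}
print(tfidf_vector(["wages", "wages", "employer", "noise"], doc_freq, 100))
```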
  • System 640 is a voting system.
  • the voting system 640 uses the output of each of the classifier subsystems 630 to choose the best classification result for an image, based on empirical observations of each classifier subsystem behavior on a large training dataset. These empirical observations are encoded into a trained voting decision tree 642 .
  • the voting system 640 uses the trained voting decision tree 642 to choose the final classification of an image.
  • the trained voting decision tree 642 is built using the voting training system 644.
  • System 642 is a trained voting decision tree.
  • the trained voting decision tree 642 is used to support the voting system 640 .
  • Trained voting decision tree 642 can be encoded as part of a program, file, relational database, XML document or a combination of these.
  • the trained voting decision tree 642 is encoded as a program within a decision making process. As the system grows “smarter” by recognizing more images, the trained voting decision tree 642 evolves, resulting in a system with increasing image identification accuracy.
  • System 644 is a voting training system.
  • the voting training system 644 considers the real classifications of a training dataset and the respective outputs of each of the classifier subsystems 630. Using this data, the voting training system 644 builds a decision tree, giving appropriate weights and preference to the correct results of each of the classification subsystems 630. This approach maximizes the correctness of the final classification, especially when each classification subsystem 630 is adept at classifying different, not necessarily disjoint, subsets of documents.
  • FIG. 7 is a system diagram of a grouping system 442 according to a preferred embodiment of the invention.
  • System 442 has a group feature extraction system 710 , a working group database 722 , a group identification system 730 , a trained group database 732 and a group training system 734 .
  • the group feature extraction system 710 is connected to the working group database 722 via software within a computer system.
  • the group feature extraction system 710 is connected to the group identification system 730 via software within a computer system.
  • the group identification system 730 is connected to the working group database 722 via software within a computer system.
  • the group identification system 730 is connected to the trained group database 732 via software within a computer system.
  • the group training system 734 is connected to the working group database 722 via software within a computer system.
  • the group training system 734 is connected to the trained group database 732 via software within a computer system.
  • System 710 is a group feature extraction system.
  • Group feature extraction system 710 receives document information including the class identifier and text data for each page.
  • System 710 identifies data features that potentially indicate that a page belongs to a document set.
  • the preferred implementation identifies page numbers and account numbers.
  • Working group database 722 is used to support both the processing of jobs and the group training system 734 .
  • Working group database 722 can be a file system, a relational database, an XML document or a combination of these.
  • the working group database 722 uses a relational database to store pointers to the information pertinent to processing the job.
  • System 730 is a group identification system.
  • Group identification system 730 utilizes the class identifier, the page numbers and the account numbers extracted by system 710 to group pages of a job that belong together.
  • the preferred implementation uses an iterative grouping process that begins by assuming that all pages belong to independent groups. At each iteration step, the process attempts to merge existing groups using a merging confidence. The process terminates when group membership converges and there is no further change to the set of groups.
  • the group identification system 730 uses a merging confidence that is determined from matching and mismatching criteria that is stored in the trained group database 732 . Matching criteria between two groups contribute towards an increased confidence to merge the groups, while mismatching criteria contribute towards keeping the groups separate. The final merging confidence is used to decide whether to merge the two groups. This process is repeated for every pair of groups, in each iteration step of the process.
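A minimal sketch of the iterative merging loop, assuming a caller-supplied merge_confidence function that combines the matching and mismatching criteria into a single score (the criteria themselves live in the trained group database and are not reproduced here):

```python
def group_pages(pages, merge_confidence, threshold=0.5):
    """Iteratively merge page groups until membership converges."""
    groups = [[p] for p in pages]      # start: every page is its own group
    changed = True
    while changed:
        changed = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # merge_confidence combines matching criteria (raise the
                # score) and mismatching criteria (lower it).
                if merge_confidence(groups[i], groups[j]) >= threshold:
                    groups[i].extend(groups.pop(j))
                    changed = True
                    break
            if changed:
                break
    return groups

# Toy criterion (hypothetical): pages sharing an account number merge.
pages = [{"page": 1, "acct": "123"}, {"page": 1, "acct": "999"},
         {"page": 2, "acct": "123"}]
conf = lambda g1, g2: 1.0 if g1[0]["acct"] == g2[0]["acct"] else 0.0
print(group_pages(pages, conf))
```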
  • Trained group database 732 is used to support both the processing of jobs and the group training system 734 .
  • Trained group database 732 can be a file system, a relational database, an XML document or a combination of these.
  • the trained group database 732 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.
  • the trained group database 732 grows “smarter” as it accumulates more document group data. As the machine learning system sees more data, its group identification accuracy increases.
  • System 734 is a group training system.
  • the group training system 734 extracts matching criteria from a large set of correctly grouped documents and adapts to a growing set of document data.
  • Preferred implementations of the group training system 734 include tuning and optimization to handle noise generated during both the training phase and the testing phase.
  • the training phase is also called the learning phase, since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function of the dataset.
  • FIG. 8 is a system diagram of a data extraction system 452 according to a preferred embodiment of the invention.
  • System 452 has a data feature extraction system 810 , a working data database 822 , a data identification system 830 , a trained data database 832 and a data training system 834 .
  • the data feature extraction system 810 is connected to the working data database 822 via software within a computer system.
  • the data feature extraction system 810 is connected to the data identification system 830 via software within a computer system.
  • the data identification system 830 is connected to the working data database 822 via software within a computer system.
  • the data identification system 830 is connected to the trained data database 832 via software within a computer system.
  • the data training system 834 is connected to the working data database 822 via software within a computer system.
  • the data training system 834 is connected to the trained data database 832 via software within a computer system.
  • System 810 is a data feature extraction system.
  • the data feature extraction system 810 constructs an Image Form Model, which is a working representation of the layout of the confetti and text in the document image.
  • the data feature extraction system 810 identifies layout features that potentially carry data.
  • the preferred implementation identifies boxes (illustrated in FIG. 19 ), check boxes (illustrated in FIG. 20 ), text, lines and tables.
  • the Image Form Model also contains references to the image features like lines and points that have been identified earlier.
  • the data feature extraction system 810 identifies canonical labels that occur in an image by searching through the extracted text data for corresponding expected labels.
  • data feature extraction system 810 utilizes inexact string matching algorithms that use Levenshtein distance to identify expected labels.
  • An iterative technique that uses increasingly inexact string comparison on an increasingly narrower search space is utilized. If certain canonical labels are still not found because of severe OCR errors, image identification system 530 is used to find canonical labels using contour matching. The success of this technique is enhanced by the narrowed search for the corresponding missing expected labels.
  • the data feature extraction system 810 identifies data-containing features including boxes, real and virtual, check boxes, label-value pairs, and tables.
  • the data feature extraction system 810 also identifies formatted data that are often not associated with a label, e.g. address blocks (illustrated in FIG. 21 ), phone numbers and account numbers.
  • the data feature extraction system 810 also identifies regions of text that are not associated with any data, such as disclaimers and other text blocks that contain instructions for the reader rather than extractable data (referred to as instruction blocks and illustrated in FIG. 22 ).
  • Working data database 822 is used to support both the processing of jobs and the data training system 834 .
  • Working data database 822 can be a file system, a relational database, an XML document or a combination of these.
  • the working data database 822 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.
  • the working data database 822 consists of a flexible data structure that stores all of the features that the data feature extraction system 810 identifies along with the spatial relationships between them.
  • the most primitive element of the data-structure is a Feature data-structure, which is a recursive data-structure that contains a set of Features.
  • a Feature also maintains references to nearby Features; in the preferred implementation, four sets that correspond to references to Features above, below, to the left, and to the right of the Feature.
  • a Feature provides iterators to traverse the five sets associated with it.
  • a Feature also provides the ability to tag on a confidence metric. In the preferred implementation, the confidence is an integer in the range [0-100]. It is assigned by the algorithms that create the Feature, and is used as an estimate of the accuracy of the extracted data.
  • the primitive Feature data-structure is sub-classed into specific features. At the lowest level are the primitive features: confetti, word, line, and point. At the next level are label and value. Finally, there are features corresponding to each of the data-containing features: box, check-box, label-value pair, and table. There are also features corresponding to the elements of certain composite features, like table headers, table rows, and table columns, and features corresponding to form-specific items such as address blocks, phone numbers, and instruction blocks.
  • the Feature data-structure supports operations to merge a set of features into another. For example, a label feature and a value feature that correspond to each other are merged into a Label-value pair feature. A set of value features that have been identified as a row of a table are merged into a row feature. A set of labels that have been identified as a table header are merged into a table header feature. In each of these cases, the set of features that were merged into the result are all contained within. They are accessed by enumerating the contained features. As with any feature, the respective algorithm can assign a confidence to the merged feature.
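A minimal sketch of the Feature data-structure under stated assumptions: Python sets hold the four directional reference sets, a list holds contained Features, and merge builds a composite such as a label-value pair:

```python
class Feature:
    """Recursive layout feature: contains a set of sub-Features, keeps
    reference sets to nearby Features in four directions, and carries a
    confidence in the range [0-100] assigned by the creating algorithm."""

    def __init__(self, kind, confidence=0):
        self.kind = kind                          # e.g. "label", "value"
        self.contained = []                       # merged child Features
        self.above, self.below = set(), set()     # directional references
        self.left, self.right = set(), set()
        self.confidence = confidence

    @staticmethod
    def merge(kind, parts, confidence):
        """Merge a set of Features (e.g. a label and its matching value)
        into a composite Feature that contains them all."""
        composite = Feature(kind, confidence)
        composite.contained.extend(parts)
        return composite

label = Feature("label", confidence=90)
value = Feature("value", confidence=75)
label.right.add(value)        # the value lies to the right of its label
pair = Feature.merge("label-value", [label, value], confidence=80)
for part in pair.contained:   # contained features are enumerable
    print(part.kind, part.confidence)
```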
  • System 830 is a data identification system.
  • Data identification system 830 utilizes the Image Form Model created by system 810 to search for correlations between labels and values.
  • the preferred implementation uses the classification of a particular page to determine the expected labels.
  • the expected label set is a subset of the universe of labels, which is available in the trained data database 832 .
  • System 830 uses the expected label set to search for data in the image form model for the image.
  • the layout features that have been identified in System 810 are used to aid the process of correlating labels with data.
  • the data identification system 830 performs relative location matching by comparing the locations of the identified confetti images with locations of unidentified confetti images, both stored in the working data database 822 .
  • FIG. 17 illustrates relative matching of labels.
  • Data identification system 830 includes the ability to handle errors and noise. In some situations, poor image quality results in certain expected labels to be missing. Data identification system 830 uses relative location matching by comparing the relative location of identified labels and unidentified text in the image form model, with learned data in the trained data database 832 .
  • Some images include multiple copies of form data. For example, in the image of a Form W-2 shown in FIG. 24 , the data to be extracted is repeated four times.
  • FIG. 50 illustrates four “Wages, tips, other comp.” boxes that appear on a W-2 form;
  • FIG. 51 shows the corresponding data record.
  • the data identification system 830 improves the accuracy of data extraction by utilizing each copy of data on an image with the following process for extracting data from multi-copy forms:
  • data identification system 830 organizes the data extracted from such multi-form images into a set of m records as indicated by the layout. Accuracy of extracted data is improved by using a voting strategy to determine which of the m extracted data to select. In addition, if all extracted data instances are identical, then the extracted data is considered to be correct with high confidence. Conversely, if extracted data instances are different, then the extracted data is flagged.
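A minimal sketch of the voting strategy for multi-copy forms, assuming majority vote with a review flag on any disagreement:

```python
from collections import Counter

def vote(values):
    """Pick among the m copies of a value extracted from a multi-copy
    form; flag the result whenever the copies disagree."""
    tally = Counter(v for v in values if v is not None)
    if not tally:
        return None, True
    best, count = tally.most_common(1)[0]
    flagged = count < len(values)    # identical copies => high confidence
    return best, flagged

# Four copies of box 1 from the four records on a W-2 page:
print(vote(["9060.83", "9060.83", "060.83", "9060.83"]))  # ('9060.83', True)
```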
  • the data identification system 830 extracts data from tables (illustrated in FIG. 23 ) using a layout based strategy.
  • the strategy addresses the following problems with extracting data from tables.
  • the data identification system 830 handles wrapped columns as a special case.
  • In step 8 above, if tables break repeatedly at a row count of one, the rows are partitioned into two sets, the odds and the evens. Steps 7 through 11 then operate on each of the two sets to produce two interleaved tables. These two interleaved tables are merged to form the extracted table.
  • Trained data database 832 is used to support both the processing of jobs and the data training system 834 .
  • Trained data database 832 can be a file system, a relational database, an XML document or a combination of these.
  • the trained data database 832 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.
  • the trained data database 832 grows “smarter” as it accumulates more document data. As the machine learning system sees more data, its data identification accuracy increases.
  • the trained data database 832 contains information that is used to extract data.
  • the trained data database 832 includes:
  • System 834 is a data training system.
  • the data training system 834 adapts to a growing set of document data; additional document data add additional features that must be analyzed.
  • Preferred implementations of the data training system 834 include tuning and optimization to handle noise generated during both the training phase and the testing phase.
  • the training phase is also called the learning phase, since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function of the dataset.
  • the invention extracts data from an image via a process of progressive refinement and reduced character set OCR (as illustrated in FIG. 33 ) in order to overcome the imperfections of OCR or low quality documents.
  • the scanned image is processed by generic OCR which, in this example, produces errors in both the label portion and the value portion of the box.
  • the OCR output for the label portion is correctly identified as “Medicare Tax Withheld”.
  • the value related to the identified label is known to be a monetary amount, so the part of the image that corresponds to the value is reprocessed by a restricted-character-set OCR.
  • This OCR process is trained to identify only the characters possible in a monetary amount, i.e. the digits [0-9] and certain special characters [$ , . ( ) -].
  • the reduced search space greatly increases the accuracy of the restricted-character-set OCR output, and it produces the correct value of 131.52.
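One way to realize such a restricted-character-set pass is Tesseract's character whitelist, shown here via the pytesseract wrapper; the engine choice, page-segmentation mode and crop coordinates are illustrative assumptions, not taken from the specification:

```python
import pytesseract
from PIL import Image

# Characters legal in a monetary amount: digits plus special characters.
MONEY_CHARS = "0123456789$,.()-"

def reread_amount(image_path, box):
    """Second OCR pass over just the value region, restricted to the
    monetary character set. `box` is (left, top, right, bottom) pixels."""
    crop = Image.open(image_path).crop(box)
    config = f"--psm 7 -c tessedit_char_whitelist={MONEY_CHARS}"
    return pytesseract.image_to_string(crop, config=config).strip()

# Hypothetical usage; the path and coordinates are placeholders:
# print(reread_amount("w2_page1.png", (620, 410, 780, 440)))  # "131.52"
```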
  • the invention extracts data from an image via a process of progressive refinement that utilizes a reduced search space as more is learned about the form being extracted (as illustrated in FIG. 34 ).
  • poor OCR is used to identify the correct label.
  • the OCR output is used to identify the class of the form because the classification process is very robust to poor OCR.
  • the label search is constrained to only the labels that are expected in W-2 forms. This greatly reduces the search space, and therefore increases the accuracy of extraction.
  • constraints are added to reduce the search space. This reduction in search space permits prior processes to be rerun, significantly improving the overall extraction accuracy.
  • the invention extracts data from an image via a process of progressive refinement that utilizes data external to the form image being extracted (as illustrated in FIG. 35 ).
  • data that was extracted from the 1099-OID form is used to extract data from the 1099-G form.
  • the Recipient's identifier number of the 1099-G form is light and washed out, and results in poor OCR output.
  • the two forms are in the same job, and they both have the same Recipient's name (John Smith).
  • the Recipient's identification number on the 1099-G form can be inferred to be 432-10-9876, the same as the Recipient's identification number on the 1099-OID form.
  • the invention extracts data from an image via a process of progressive refinement that utilizes data not extracted from any image (as illustrated in FIG. 36 ).
  • data that is available in a “pro-forma” file is used to identify data on a form.
  • the pro-forma file contains taxpayer information from the previous year's tax return that has been quality checked, including the taxpayer name, taxpayer Social Security Number, spouse name, spouse Social Security Number, dependent names and Social Security Numbers, and other information about the tax forms included in the previous year's tax return. All this information is available to the data extraction process, and is assumed to be accurate.
  • the pro-forma external data enables the verification and correction of low-confidence OCR-extracted data.
  • the invention utilizes a set of known-value databases to augment the results of conventional data extraction methods such as OCR.
  • the known-value databases are obtained from vendors or public sources; the known-value databases are also built from data extracted from forms that have been submitted by users of the data extraction system.
  • Known-value databases contain, for example, information on employers, banks and financial institutions and their corresponding addresses and identification numbers.
  • FIG. 37 shows a 1099-G form in which the payer's name is struck out, making it difficult to OCR correctly. As can be seen in FIG. 38 , the payer's name has not been extracted because of the missing label.
  • a known-value database of the issuers of 1099-G forms (which are the revenue departments of the 50 states) provides the payer's name by a simple lookup. This finding is verified by comparing the lookup results against the relevant OCR output.
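A minimal sketch of the lookup-and-verify step, assuming a small in-memory known-value table and Python's difflib as the fuzzy comparator against the OCR output:

```python
import difflib

# Hypothetical known-value table: the issuers of 1099-G forms are the
# revenue departments of the 50 states.
STATE_REVENUE_DEPTS = {
    "MA": "Massachusetts Department of Revenue",
    "NY": "New York State Department of Taxation and Finance",
}

def recover_payer(state_code, ocr_text):
    """Look up the payer name, then verify it against the OCR output."""
    candidate = STATE_REVENUE_DEPTS.get(state_code)
    if candidate is None:
        return None
    score = difflib.SequenceMatcher(None, candidate.lower(),
                                    ocr_text.lower()).ratio()
    return candidate if score > 0.4 else None  # accept only loose agreement

print(recover_payer("MA", "Massachuset s Dep rtment of Reven e"))
```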
  • the invention utilizes known constraints between the semantics of extracted data elements to identify potentially incorrectly extracted data.
  • the constraints are specified by subject matter experts (for example, bankers in the case of loan origination forms); the constraints are also determined by analysis of data extracted from forms that have been submitted by users of the data extraction system.
  • FIG. 39 is an image of a W-2 form with a faded digit in the value for box 1 “Wages, tips and other compensation.”
  • the extracted value corresponding to the “Wages, tips and other compensation” label is 060.83 (versus the correct value of 9060.83.)
  • the extracted value is flagged as incorrect when it is compared to the extracted value for Federal income tax withheld (106.11).
  • the constraints for a W-2 form specify that Federal income tax withholdings cannot exceed total wages.
  • the invention utilizes known constraints between the semantics of extracted data elements to correct potentially incorrectly extracted data.
  • the constraints are specified by subject matter experts (for example, Certified Public Accountants in the case of income tax forms); the constraints are also determined by analysis of data extracted from forms that have been submitted by users of the data extraction system. In the above example illustrated in FIG. 39 and FIG. 40,
  • the constraints for a W-2 form specify that, for wages below a threshold amount, in most cases “Wages, tips and other compensation” is equal to “Social security wages” and “Medicare wages and tips.”
  • the constraints indicate that when “Wages, tips and other compensation” is flagged as incorrect and differs by a single digit from “Social security wages,” then the value from “Social security wages” replaces the value of “Wages, tips and other compensation.”
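A minimal sketch of the two W-2 constraints just described, assuming string-valued amounts as extracted and a simplified single-digit-difference test:

```python
def differs_by_one_digit(a, b):
    """True when b equals a with exactly one character inserted, dropped
    or changed (a simplified edit-distance-1 test on the digit strings)."""
    if a == b:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)
    return len(long_) - len(short) == 1 and any(
        long_[:i] + long_[i + 1:] == short for i in range(len(long_)))

def check_and_correct(record):
    """Constraint 1: federal withholding cannot exceed total wages.
    Constraint 2: a flagged wage value that differs by one digit from
    Social Security wages is replaced by the latter."""
    if float(record["fed_tax_withheld"]) > float(record["wages"]):
        record["flagged"] = True
    if record.get("flagged") and differs_by_one_digit(record["wages"],
                                                      record["ss_wages"]):
        record["wages"] = record["ss_wages"]
        record["flagged"] = False
    return record

rec = {"wages": "060.83", "fed_tax_withheld": "106.11", "ss_wages": "9060.83"}
print(check_and_correct(rec))   # wages corrected to "9060.83"
```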
  • the invention utilizes known constraints in the layout of data elements to narrow the search space and thereby more accurately extract data.
  • the layout constraints are specified by technical experts; the constraints are also determined by analysis of data extracted from forms that have been submitted by users of the data extraction system.
  • FIG. 41 illustrates the relationship of layout elements in a portion of a W-2 form.
  • the label “Social security wages” is to the left of the label “Social security tax withheld.”
  • This layout relationship and others, specified by experts or determined by analysis, are used to infer missing labels and also identify spurious data such as pencil marks, tick marks and other noise.
  • the invention predicts occurrences of instruction blocks based on detected layout patterns from forms that have been submitted by users of the data extraction system.
  • the invention eliminates such instruction blocks from further data extraction, thus simplifying the extraction process and thereby improving the accuracy of data extraction.
  • the invention detects tables using column layout and the expected header layout based on detected layout patterns from forms that have been submitted by users of the data extraction system.
  • Known constraints, in the form of relationships between header elements, are used to predict headers when they are not correctly detected.
  • the layout of multiple occurrences of a particular extracted artifact, e.g. four occurrences of each expected data element in a W-2, is used to identify the four logical records in the W-2.
  • the mechanism that was used to identify a particular data artifact, e.g. a label identified by correct OCR text versus a predicted label, is used to attach a confidence to the extracted data.
  • the invention utilizes layout data structure to extract data from form images.
  • the use of a layout data structure is illustrated in the context of a portion of a W-2 form image shown in FIG. 41 .
  • the low-level layout graph of confetti is created; its internal representation is partially illustrated in FIG. 42. While the left, right, top, and bottom connection sets exactly map the layout, for brevity only the right and down sets for each confetti are shown in FIG. 42.
  • the layout graph is modified by identifying the detected labels (shown as light grey blocks).
  • the label-value correlations are determined (shown by the dark grey blocks). Note that the illustration shows the right set of each of the features shown.
  • layout relations of the contained features do not cross out of the container; this aspect of the data structure significantly improves the efficiency of the data structure. Also shown are the down sets of each feature. The contained features can be seen to maintain layout relations within the container, leaving it to the container to maintain external layout relations.
  • the invention extracts data from an image via a process of progressive refinement that utilizes contour matching (as described above). While contour matching on its own is of limited value over a large universe of labels, coupled with the progressive refinement technique, contour matching is robust. As an example, the labels from the 1099-OID form of FIG. 35 are shown in FIG. 44. Since there is significant similarity between the contours for “PAYER's federal identification number” and “RECIPIENT's federal identification number,” it is inappropriate to differentiate these two labels using their contours. However, differentiating “RECIPIENT's name” from “PAYER'S name, street address, city, state, ZIP code and telephone no” is appropriate. Accordingly, contour matching is used in those cases in which the set of options is small.
  • the invention utilizes contour matching along with text-based label matching as part of the progressive refinement process.
  • the search space for labels is restricted to labels that occur in a 1099-OID.
  • all the labels except “RECIPIENT's name” and “Original Issue discount for 2009” were identified by text-based matching. Contour matching is then used to distinguish between these two labels.
  • FIG. 13 is a system diagram of the service control manager 410 .
  • System 410 has a main thread 1301 , task queues 1302 , database client thread controllers 1303 , task queues 1304 , slave controllers 1305 and SCM queue 1306 .
  • the main thread 1301 controls the primary state machine for all the jobs in the system.
  • Task queues 1302 provide message queues for database communication.
  • Database client thread controllers 1303 manage the database server interface.
  • Task queues 1304 provide message queues for communication with slave controllers.
  • Slave controllers 1305 manage various slave processes via the slave controller interface.
  • the SCM queue 1306 provides a mechanism for the various controllers to communicate with the main thread.
  • various threads communicate with each other using message queues. Whenever a new document is received for processing, the main thread is notified and it requests the database client thread to retrieve the job for processing based on the states and the queue of other jobs in the system.
  • a finite state machine for that job is created and the job starts to be processed.
  • the main thread puts the job on a particular task queue based on the state machine instructions. For example, if the job needs to be image processed, then the job will be placed on the image processing task queue. If the slave controller for the image processing slave finds an idle image processing slave process, then the job is picked up from that queue and given to the slave process for processing. Once the slave finishes performing its assigned task, it returns the job to the slave controller which puts the job back on the SCM queue 1306 . The main thread sequentially picks up the job from the SCM queue 1306 and decides on the next state of the job based on the finite state machine states. Once a job is completed, the finite state machine for the job is closed and the extracted document is returned to the content repository 322 and made available to the client's portal as a finished and processed document.
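A minimal sketch of the queue-driven job flow, assuming Python's queue module and a linear list of stages in place of the full finite state machine:

```python
import queue

# Simplified job states in processing order; the real finite state
# machine is richer than this linear list.
STAGES = ["image_processing", "classification", "grouping",
          "data_extraction", "done"]

scm_queue = queue.Queue()                      # jobs returning to main thread
task_queues = {s: queue.Queue() for s in STAGES[:-1]}

def main_thread_step(job):
    """Advance a job to its next state and queue it for the right slave."""
    job["state"] = STAGES[STAGES.index(job["state"]) + 1]
    if job["state"] == "done":
        print(f"job {job['id']} returned to the content repository")
    else:
        task_queues[job["state"]].put(job)

job = {"id": 7, "state": "image_processing"}
task_queues["image_processing"].put(job)        # dispatched by main thread
worked = task_queues["image_processing"].get()  # an idle slave picks it up
scm_queue.put(worked)                           # slave returns finished task
main_thread_step(scm_queue.get())               # main thread advances the job
```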
  • FIG. 18 is a diagram that depicts the various components of a computerized document data extraction system, according to certain embodiments of the invention.
  • An exemplary document data extraction system may include a host computer 1801 that contains volatile memory, 1802 , a persistent storage device such as a hard drive, 1808 , a processor, 1803 , and a network interface, 1804 . Using the network interface, the system computer can interact with databases, 1805 , 1806 .
  • While FIG. 18 illustrates a system in which the system computer is separate from the various databases, some or all of the databases may be housed within the host computer, eliminating the need for a network interface.
  • the programmatic processes may be executed on a single host, as shown in FIG. 18 , or they may be distributed across multiple hosts.
  • the host computer shown in FIG. 18 may serve as a document data analysis system.
  • the host computer receives electronic documents from multiple users.
  • Workstations may be connected to a graphical display device, 1807 , and to input devices such as a mouse, 1809 , and a keyboard, 1810 .
  • the active user's workstation may comprise a handheld device.
  • the flow charts included in this application describe the logical steps that are embodied as computer executable instructions that could be stored in computer readable medium, such as various memories and disks, that, when executed by a processor, such as a server or server cluster, cause the processor to perform the logical steps.
  • While text extraction and recognition may be performed with OCR and OCR-like techniques, it is not limited to such. Other techniques could be used, including image recognition-like techniques.
  • image features include inherent image features, e.g. lines, line crossings, etc. that are put in place by the document authors (or authors of an original source or blank document) to organize the document or the like. They were typically not included as a means of identifying the document, even though the inventors have discovered that they can be used as such, especially with the use of machine learning techniques.
  • Preferred embodiments of the invention may incorporate classification techniques described in the following patent applications, each of which is hereby incorporated by reference herein in its entirety:

Abstract

A method of training a document analysis system to extract data from documents is provided. The method includes: automatically analyzing images and text features extracted from a document to associate the document with a corresponding document category; comparing the extracted text features with a set of text features associated with the corresponding category of the document, in which the set of text features includes a set of characters, words, and phrases; if the extracted text features are found to consist of the characters, words, and phrases belonging to the set of text features associated with the corresponding document category, storing the extracted text features as the data contained in the corresponding document; and, if the extracted text features are found to include at least one text feature that does not belong to the set of text features associated with the corresponding document category, submitting the unrecognized text features to a training phase.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit under 35 U.S.C. §119(e) of U.S. Provisional Patent Application No. 61/295,210, filed Jan. 15, 2010, which is hereby incorporated by reference herein in its entirety.
  • This application is also related to the following applications filed concurrently herewith on Jan. 14, 2011:
  • U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically extracting data from electronic documents containing multiple layout features;”
  • U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically extracting data from electronic documents using external data;”
  • U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically correcting data extracted from electronic documents using known constraints for semantics of extracted data elements;”
  • U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically reducing data search space and improving data extraction accuracy using known constraints in a layout of extracted data elements;”
  • U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically processing electronic documents using multiple image transformation algorithms;”
  • U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically extracting data from electronic documents using multiple character recognition engines;”
  • U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically extracting data by narrowing data search scope using contour matching;”
  • U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically extracting data from electronic document page including multiple copies of a form;” and
  • U.S. patent application Ser. No. ______, entitled “Systems and methods for automatically grouping electronic document pages.”
  • FIELD OF THE INVENTION
  • This invention relates generally to systems and methods to extract data from electronic documents, and more particularly to systems and methods for training document analysis system for automatically extracting data from documents.
  • BACKGROUND
  • Millions of documents are produced every day that are reviewed, processed, stored, audited and transformed into computer-readable data. Examples include accounts payable, collections, educational forms, financial statements, government documents, human resource records, insurance claims, legal papers, medical records, mortgages, nonprofit reports, payroll records, shipping documents and tax forms.
  • These documents generally require data to be extracted in order to be processed. Data extraction can be primarily clerical in nature, such as in inputting information on customer survey forms. Data extraction can also be an essential portion of larger technical tasks, such as preparing income tax returns, processing healthcare records or handling insurance claims.
  • Various techniques, such as Electronic Data Interchange (EDI), attempt to eliminate human processing efforts by coding and transmitting the document information in strictly formatted messages. Electronic Data Interchange is known for custom computer systems, cumbersome software and bloated standards that defeated its rapid spread throughout the supply chain. Perceiving it as too expensive, the vast majority of businesses have avoided implementing EDI. Similarly, applications of XML, XBRL and other computer-readable document files are quite limited compared to the use of documents in paper and digital image formats (such as PDF and TIFF.)
  • Ideally, these documents would be capable of being both read by people and automatically processed by computers. Since paper and digital image files comprise an overwhelming percentage of all documents, it would be most practical to train computers to extract data from human-readable documents.
  • To date, there have been three general methods of performing data extraction on documents: conventional, outsourcing and automation.
  • Conventional data extraction, the first method, requires workers with specific education, domain expertise, particular training, software knowledge and/or cultural understanding. Data extraction workers must recognize documents, identify and extract relevant information on the documents and enter the data appropriately and accurately in particular software programs. Such manual data extraction is complex, time-consuming and error-prone. As a result, the cost of data extraction is often quite high; numerous studies estimate the cost of processing invoices in excess of ten dollars each. The cost is especially high when the data extraction is performed by accountants, lawyers, physicians and other highly paid professionals as part of their work. For example, professional tax preparers report spending hours on each client tax return transcribing salary, interest, dividend and capital gains data; they also admit to human data extraction/entry accuracies of less than 90%.
  • Conventional data extraction also exposes all documents in their entirety to data extraction workers. These documents may have sensitive information related to individuals' and organizations' education, employment, family, financial, health, insurance, legal, tax, and/or other matters.
  • Whereas conventional data extraction is entirely paper-based, outsourcing and automation begin by converting paper to digital image files. This step is straightforward, aided by high quality, fast, affordable scanners that are available from many vendors including Bell+Howell, Canon, Epson, Fujitsu, Kodak, Panasonic and Xerox.
  • Once paper documents are converted to digital image files, document processing can be made more productive through the use of workflow software that routes the documents to the lowest-cost labor available, either in-house or outsourced, on-shore or overseas. Primary processing can be done by junior personnel; exceptions can be handled by more highly trained people. Despite the potential productivity gains that are enabled with workflow software in the form of improved labor utilization, manual document processing remains a fundamentally expensive process.
  • Outsourcing, the second method of data extraction, requires the same worker education, expertise, training, software knowledge and/or cultural understanding. As with conventional data extraction, outsourced data extraction workers must recognize documents, find relevant information on the documents, extract and enter the data appropriately and accurately in particular software programs. Since outsourcing is manual, just as is conventional data extraction, it is also complex, time-consuming and error-prone. Outsourcing firms such as Accenture, Datamatics, Hewlett Packard, IBM, Infosys, Tata, and Wipro, often reduce costs by offshoring data extraction work to locations with low wage data extraction workers. For example, extraction of data from US tax and financial documents is a function that has been implemented using thousands of well-educated, English-speaking workers in India and other low wage countries.
  • The first step of outsourcing requires organizations to scan financial, health, tax and/or other documents and save the resulting image files. These image files can be accessed by data extraction workers via several methods. One method stores the image files on the source organizations' computer systems; the data extraction workers view the image files over networks (such as the Internet or private networks.) Another method stores the image files on third-party computers systems; the data extraction workers view the image files over networks. An alternative method transmits the image files from source organizations over networks and stores the image files for viewing by the data extraction workers on the data extraction organizations' computer system.
  • For example, an accountant may scan the various tax forms containing client financial data and transmit the scanned image files to an outsourcing firm. An employee of the outsourcing firm extracts the client financial data and enters it into an income tax software program. The resulting tax software data file is then transmitted back to the accountant.
  • Quality problems with offshore data extraction work have been reported by many customers. Outsourced service providers address these problems by hiring better educated and/or more experienced workers, providing them more extensive training, extracting and entering data two or more times and/or exhaustively checking their work for quality errors. These measures reduce the cost savings expected from offshore outsourcing.
  • Outsourcing and offshoring are accompanied with concerns over security risks associated with fraud and identity theft. These security concerns apply to employees and temporary workers as well as outsourced workers and offshore workers who have access to documents with sensitive information.
  • Although the transmission of scanned image files to the data extraction organization may be secured by cryptographic techniques, the sensitive data and personal identifying information are in the clear, i.e., unencrypted, when read by data extraction workers prior to entry in the appropriate computer systems. Data extraction organizations publicly recognize the need for information security. Some data extraction organizations claim to investigate and perform background checks of employees. Many data extraction organizations claim to strictly limit physical access to the rooms in which the employees enter the data; further, such rooms may be isolated. Paper, writing materials, cameras or other recording technology may be forbidden in the rooms. Additionally, employees may be subject to inspection to ensure that nothing is copied or removed. Since such seemingly comprehensive security precautions are primarily physical in nature, they are imperfect.
  • Because of these imperfections, lapses in physical security have occurred. For example, Social Security Numbers and bank routing numbers are only nine digits; bank account numbers are usually of similar length. Memorizing these important numbers would not be difficult and would allow a nefarious employee to have direct access to the money held in those accounts. For example, in 2004 employees of MphasiS in Pune, India allegedly stole $426,000 from Citibank customers. The owners, managers, staff, guards and contractors of data extraction organizations may misuse some or all of the unencrypted confidential information in their care. Further, breaches of physical and information system security by external parties can occur. Because data extraction organizations are increasingly located in foreign countries, there is often little or no recourse for American citizens victimized in this manner.
  • Information security has been identified for seven consecutive years as the most important technology initiative by the Top Technology Initiatives survey of the American Institute of Certified Public Accountants (AICPA.) National and state laws have been enacted and new regulations have been implemented to address these security concerns, particularly those related to outsourced data extraction that is performed offshore.
  • The third general method of data extraction involves partial automation, often combining optical character recognition, human inspection and workflow management software.
  • Software tools that facilitate the automated extraction and transformation of document information are available from several vendors including ABBYY, AnyDoc Software, EMC Captiva, Kofax and Nuance. The relative operating cost savings facilitated by these tools is proportional to the amount of automation, which depends on the application, quality of software customization, variety and quality of documents and other factors.
  • Automation requires customizing and/or programming data extraction software tools to properly recognize and process a specific set of documents for a specific domain. Because such customization projects often cost upwards of hundreds of thousands of dollars, data extraction automation is usually limited to large organizations that can afford significant capital investments.
  • The first step of a partially automated data extraction operation is to scan financial, health, tax and/or other documents and save the resulting image files. The scanned images are compared to a database of known documents. Images that are not identified are routed to data extraction workers for conventional processing. Images that are identified have data extracted using templates, either location-based or label-based, along with optical character recognition (OCR) technology.
  • Optical character recognition is imperfect, often mistaking more than one percent of the characters on clean, high quality documents. Many documents are neither clean nor high quality, suffering from being folded or marred before scanning, distorted during scanning and degraded during post-scanning binarization. As a result, some of the labels needed to identify data are often not recognizable; therefore, some of the data cannot be automatically extracted.
  • Using conventional software tools, vendors report being able to extract up to 80-90% of the data on a limited number of typical forms. When a wide range of forms exists, such as the 10,000 plus variations of W-2, 1099, K-1 and other personal income tax forms, automated data extraction is quite limited. Despite years of efforts, several tax document automation vendors claim 50% or less data extraction and admit to numerous errors with conventional data extraction methods.
  • Correcting errors entails human inspection. Inspection requires workers with the same capabilities as data extraction workers, namely specific education, domain expertise, particular training, software knowledge and/or cultural understanding. Inspection workers must recognize documents, find relevant information on the documents and ensure that the data has been accurately extracted and appropriately entered in particular software programs. Typically, any changes made by inspection workers must be reviewed and approved by other, more senior, inspection workers before replacing the data extracted by optical character recognition. Because automation requires human inspection, source documents with sensitive information are exposed in their entirety to data extraction workers.
  • SUMMARY OF INVENTION
  • The invention is directed to systems and methods for training document analysis system for automatically extracting data from documents.
  • In a preferred embodiment, a method is provided for training a document analysis system to automatically extract data from each document, wherein the document analysis system receives and processes jobs from a plurality of users, in which each job may contain multiple electronic documents, to classify each document into a corresponding document category and to extract data from the electronic documents. The method includes: automatically analyzing images and text features extracted from each received electronic document to associate the electronic document with a corresponding document category; and comparing the extracted text features with a set of text features associated with the corresponding category of each received document, in which the set of text features includes a set of characters, words, and phrases.
  • If the extracted text features are found to consist of the characters, words, and phrases belonging to the set of text features associated with the corresponding electronic document category, the method further includes storing the extracted text features as the data contained in the corresponding electronic document. If, however, the extracted text features are found to include at least one text feature that does not belong to the set of text features associated with the corresponding electronic document category, the method further includes submitting the unrecognized text features to a training phase in which the text features are recognized as belonging to the set of text features associated with the corresponding electronic document category and then using the now-recognized text features to automatically modify the set of text features associated with the corresponding electronic document category, so that data extraction, regardless of which document category the corresponding document belongs to, improves as the training method is subjected to more and more unrecognized text features and the set of text features is modified accordingly.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The invention is illustrated in the figures of the accompanying drawings, which are meant to be exemplary and not limiting, in which like references are intended to refer to like or corresponding parts, and in which:
  • FIG. 1 is a system diagram of a document data extraction system 100 according to a preferred embodiment of the disclosed subject matter;
  • FIG. 2 is a system diagram of the image capture system 110 according to a preferred embodiment of the disclosed subject matter;
  • FIG. 3 is a system diagram of the web server system 120 according to a preferred embodiment of the disclosed subject matter;
  • FIG. 4 is a system diagram of the document processing system 130 according to a preferred embodiment of the disclosed subject matter;
  • FIG. 5 is a system diagram of the image processing system 422 according to a preferred embodiment of the disclosed subject matter;
  • FIG. 6 is a system diagram of the classification system 432 according to a preferred embodiment of the disclosed subject matter;
  • FIG. 7 is a system diagram of the grouping system 442 according to a preferred embodiment of the disclosed subject matter;
  • FIG. 8 is a system diagram of the data extraction system 452 according to a preferred embodiment of the disclosed subject matter;
  • FIG. 9 is an illustration of a three-step document submission process according to a preferred embodiment of the disclosed subject matter;
  • FIG. 10 is an illustration of the nine types of point patterns according to a preferred embodiment of the disclosed subject matter;
  • FIG. 11 is an illustration of image processing prior to OCR according to a preferred embodiment of the disclosed subject matter;
  • FIG. 12 is an illustration of a log polar histogram according to a preferred embodiment of the disclosed subject matter;
  • FIG. 13 is a flow diagram of the service control manager 410 according to a preferred embodiment of the disclosed subject matter;
  • FIG. 14 is an illustration of label contour matching according to a preferred embodiment of the disclosed subject matter;
  • FIG. 15 is a flow diagram of a CRK classifier according to a preferred embodiment of the disclosed subject matter;
  • FIG. 16 is a schematic of a CRK classifier according to a preferred embodiment of the disclosed subject matter;
  • FIG. 17 is an illustration of relative location matching of labels according to a preferred embodiment of the disclosed subject matter;
  • FIG. 18 is an exemplary computer system on which the described invention may run according to a preferred embodiment of the disclosed subject matter;
  • FIG. 19 is an illustration of boxes containing labels and values;
  • FIG. 20 is an illustration of check boxes;
  • FIG. 21 is an illustration of address blocks;
  • FIG. 22 is an illustration of an instruction block;
  • FIG. 23 is an illustration of a table;
  • FIG. 24 is an illustration of a multi-copy form;
  • FIG. 25 is an illustration of an image with (A) confetti, (B) confetti with identified labels and (C) confetti with identified labels and labels with potential table headers grouped horizontally;
  • FIG. 26 is an illustration of a table with a header that needs reconstruction;
  • FIG. 27 is an illustration of a table with an instruction block at the bottom;
  • FIG. 28 is an illustration of a portion of a table with noise removed and most data correctly extracted;
  • FIG. 29 is an illustration of row formation in a table;
  • FIG. 30 is an illustration of column formation in a table;
  • FIG. 31 is an illustration of header association for a table;
  • FIG. 32 is an illustration of a table with extracted data viewed through a debug tool; note the incorrectly formed rows due to the “Corrected” overlay. Rows 1, 2, 3, 5, 6, 7, 8, and 9 are merged, but row 4 and the rest of the table were extracted properly;
  • FIG. 33 is an illustration of an image being extracted via a process of progressive refinement and reduced character set OCR;
  • FIG. 34 is an illustration of an image being extracted via a process of progressive refinement based on increasing knowledge about the form;
  • FIG. 35 is an illustration of an image being extracted via a process of progressive refinement based on utilizing knowledge gained from one form to extract data from another form;
  • FIG. 36 is an illustration of data external to the input image that is used to extract and verify data from the input image;
  • FIG. 37 is an illustration of a form with an obscured label;
  • FIG. 38 is an illustration of the data extracted from the form shown in FIG. 37;
  • FIG. 39 is an illustration of a form with a degraded image that results in incorrectly extracted data;
  • FIG. 40 is an illustration of the data extracted from the form shown in FIG. 39;
  • FIG. 41 is an illustration of a portion of a W-2 form;
  • FIG. 42 is an illustration of the internal representation of the data corresponding to the form in FIG. 41 as a partial layout graph;
  • FIG. 43 is an illustration of the internal representation of the data corresponding to the form in FIG. 41 after labels are detected;
  • FIG. 44 is an illustration of the labels associated with a 1099-OID form;
  • FIG. 45 is an illustration of a table;
  • FIG. 46 is an illustration of the table shown in FIG. 45 with columns identified;
  • FIG. 47 is an illustration of the table shown in FIG. 45 with columns and labels identified;
  • FIG. 48 is an illustration of the table shown in FIG. 45 with columns, labels and header identified;
  • FIG. 49 is an illustration of the table shown in FIG. 45 with columns, labels, header and rows identified;
  • FIG. 50 is an illustration of four occurrences of image fields for “Wages, tips, other comp.” box on a single W-2 form;
  • FIG. 51 is an illustration of the data records corresponding to the image fields shown in FIG. 50.
  • DETAILED DESCRIPTION
  • While the prior art attempts to reduce the cost of data extraction through the use of low cost labor and partial automation, none of the above methods of data extraction (1) eliminates the human labor and its accompanying requirements of education, domain expertise, training, software knowledge and/or cultural understanding, (2) minimizes the time spent entering and quality checking the data, (3) minimizes errors, (4) protects the privacy of the owners of the data without being dependent on the security systems of data extraction organizations and (5) eliminates the cost for significant up-front engineering efforts. What is needed, therefore, is a method of performing data extraction that overcomes the above-mentioned limitations and that includes the features enumerated above.
  • Preferred embodiments of the present invention provide a method and system for extracting data from paper and digital documents into a format that is searchable, editable and manageable.
  • FIG. 1 is a system diagram of a document data extraction system 100 according to a preferred embodiment of the invention. System 100 has an image capture system 110, a web server system 120 and a document processing system 130. In the preferred embodiment, the image capture system 110 is connected to the web server system 120 by a network such as a local-area network (LAN), a wide-area network (WAN) or the Internet. The preferred implementation transfers all data over the network using Secure Sockets Layer (SSL) technology with enhanced 128-bit encryption. Encryption certificates can be purchased from well-respected certificate authorities such as VeriSign and thawte, or can be generated using the numerous key generation tools on the market today, many of which are available as open source. Alternatively, the files may be transferred over a non-secure network, albeit in a less secure manner. The web server system 120 is connected to the document processing system 130 via software within a computer system. Other embodiments of the invention may integrate the document processing system 130 with the image capture system 110; in this case, the web server system 120 is not necessary.
  • Under typical operation, System 110 is an image capture system that receives physical documents and scans them. The image capture system 110 is described in greater detail below.
  • Under typical operation, System 120 is a web server system that receives the scanned documents and returns the extracted data over the Internet. Some embodiments of the invention may not have a web server system 120. The web server system 120 is described in greater detail below.
  • Under typical operation, System 130 is a document processing system. The document processing system 130 extracts the received data into files and databases per a predetermined scheme. Under preferred embodiments, the document processing system 130 comprises several modules that are part of a highly distributed architecture consisting of several independent processes, data repositories and databases which communicate and pass messages to each other via well-defined standard and proprietary interfaces. Even though the document processing system 130 may be built in a loosely coupled manner to achieve maximum scalability and throughput, the same results can be achieved if the document processing system 130 were more tightly coupled in a single process with each module being a logical entity of the same process. Furthermore, the document processing system 130 supports multiple different product types which may process anywhere from hundreds to millions of documents every day for tens to thousands of customers in different markets. Under preferred embodiments, the document processing system 130 utilizes server(s) hosted in a secure data center so that documents from healthcare, insurance, banking, government, tax and other applications are processed per security policies that are HIPAA, GLBA, SAS70, etc. compliant. The document processing system 130 includes mechanisms for learning documents. The document processing system 130 is described in greater detail below.
  • FIG. 2 is a system diagram of the image capture system 110 according to a preferred embodiment of the invention. System 110 has a scanning system 212, a user interface system 222, a data acquisition system 225, a data transfer system 232 and an image pre-processing system 235. Source documents 210 in the form of papers are physically placed on an input tray of a commercial scanner. Source documents in the form of data files are received over a network by the user interface system 222. The user interface system 222 communicates with the scanning system 212 via software within a computer system, or, optionally, over a computer network. The user interface system 222 may be part of the scanning system 212 in some embodiments of the image capture system 110. The user interface system 222 communicates with the data acquisition system 225 via software within a computer system. The user interface system 222 communicates with the data transfer system 232 via software within a computer system. The data acquisition system 225 communicates with the scanning system 212 via a physical connection, such as a high-speed Universal Serial Bus (USB) 2.0, or, optionally, over a network. The data acquisition system 225 may also be part of the scanning system 212 in certain embodiments of the image capture system 110. The data acquisition system 225 communicates with the image pre-processing system 235 via software within a computer system. The data transfer system 232 communicates with the image pre-processing system 235 via software within a computer system. The data acquisition system and the data transfer system may also be part of the scanning system 212 in some embodiments of the image capture system 110.
  • Element 210 is a source document in the form of either one or more physical sheets of paper, or a digital file containing images of one or more sheets of paper. The digital file can be in one of many formats, such as PDF, TIFF, BMP, or JPEG.
  • System 212 is a scanning system. Under preferred embodiments, conventional scanning systems may be used such as those from Bell+Howell, Canon, Fujitsu, Kodak, Panasonic and Xerox. These embodiments include scanners connected directly to a computer, shared scanners connected to a computer over a network, and smart scanners that include embedded computational functionality to add third-party applications. The scanning system 212 captures an image of the scanned document as a computer file; the file is often in a standard format such as PDF, TIFF, BMP, or JPEG.
  • System 222 is a user interface system. Under preferred embodiments, the user interface system 222 runs in a browser and presents a user with a three-step means for submitting documents to be organized, as shown in FIG. 9. In step one, the user interface system 222 provides a mechanism for selecting a job from a list of jobs; additionally, it allows jobs to be added to the job list. In step two, the user interface system 222 provides a mechanism for initiating the scanning of physical papers; additionally, it provides a browsing mechanism for selecting a file on a computer or network. Optionally, one or more sets of papers can be scanned and one or more files can be selected. In step three, the user interface system 222 provides a mechanism for sending the job information and selected documents over a network to the server system. Under preferred embodiments, the user interface system 222 also presents a user with the status of submitted jobs, marked as submitted or completed; optionally, it presents the expected completion date and time of submitted jobs that have not been completed. The user interface system 222 also presents a user with a mechanism for receiving submitted documents and extracted data. The user interface system 222 also provides a mechanism for deleting files from the system. Other embodiments of the user interface system 222 may run within an application that provides the scan feature as part of a broader function, or within a simple data entry system composed of only a touch screen and/or one or more buttons. Furthermore, the user interface system 222 may also be embodied by a programmable API that provides the same or similar functionality to another application program.
  • System 225 is a data acquisition system. Under preferred embodiments, the data acquisition system 225 controls the settings of the scanning system. Many scanning systems in use today require users to manually set scanner settings so that images are captured, for example, at 300 dots per inch (dpi) as binary data (black-and-white). Commercial scanners and scanning software modify the original source document images, which often include high resolution and, possibly, color or gray-scale elements. The resolution is often reduced to limit file size. Color and gray-scale elements are often binarized, i.e. converted to black or white pixels, via a process known as thresholding, also to reduce file size. Under preferred embodiments, the data acquisition system sets the scan parameters of the scanning system. The data acquisition system commands the scanning system to begin operation and receives the scanned document computer file from the scanning operation. The data acquisition system 225 could be part of the scanning system 212 in certain embodiments. Moreover, the operation of the data acquisition system 225 could be automatically triggered by the scan function in certain embodiments. Reference “System for Optimal Document Scanning,” U.S. patent application Ser. No. 12/351,302.
  • System 232 is a data transfer system. Under preferred embodiments, the data transfer system 232 manages the SSL connection and associated data transfer with the server system. The data transfer system 232 could be part of the scanning system 212, in certain embodiments. Moreover, the operation of the data transfer system 232 could be automatically triggered by the scan function, in certain embodiments.
  • System 235 is an optional image pre-processing system. The image pre-processing system 235 enhances the image quality of scanned images for a given resolution and other scanner settings. The image pre-processing system 235 may be implemented as part of the image capture system as depicted in FIG. 2 or as part of the server system as depicted in FIG. 3. When part of the image capture system, the image pre-processing system may also be implemented within the scanning system 212, in certain embodiments. Details of the image pre-processing system 235 are described further below as part of the document processing system 130.
  • FIG. 3 is a system diagram of the web server system 120 according to a preferred embodiment of the invention. System 120 has a web services system 310, an authentication system 312 and a content repository 322. The web services system 310 communicates with the authentication system 312 via software within a computer system. The web services system 310 communicates with the content repository 322 via software within a computer system.
  • System 310 is a web services system. Under preferred embodiments, the web services system 310 provides the production system connection to the network that interfaces with the image capture system. Such a network could be a local-area network (LAN), a wide-area network (WAN) or the Internet. As described above, the preferred implementation transfers all data over the network using Secure Sockets Layer (SSL) technology with enhanced 128-bit encryption. Standard web server software includes Apache, RedHat JBoss Web Server, Microsoft IIS, Sun Java System Web Server, IBM WebSphere, etc. Under preferred embodiments, users upload their source electronic documents or download their organized electronic documents and extracted data in a secure manner using HTTP or HTTPS. Other mechanisms for secure data transfer can also be used. The web services system 310 also relays necessary parameters to the application servers that will process the electronic document.
  • System 312 is an authentication system. The authentication system 312 allows secure and authorized access to the content repository 322. Under preferred embodiments, an LDAP authentication system is used; however, other authentication systems can also be used. In general, an LDAP server is used to process queries and updates to an LDAP information directory. For example, a company could store all of the following very efficiently in an LDAP directory:
      • The company employee phone book and organizational chart
      • External customer contact information
      • Infrastructure services information, including NIS maps, email aliases, and so on
      • Configuration information for distributed software packages
      • Public certificates and security keys
  • Under a preferred embodiment, document organization and access rights are managed by the access control privileges stored in the LDAP repository.
  • System 322 is a content repository. The content repository 322 can be a simple file system, a relational database, an object oriented database, any other persistent storage system or technology, or a combination of one or more of these. Under a preferred embodiment, the content repository 322 is based on Java Specification Request 170 (JSR 170). JSR 170 is a standard, implementation-independent way to access content bi-directionally on a granular level within a content repository. The content repository 322 is a generic application “data store” that can be used for storing both text and binary data (images, word processor documents, PDFs, etc.). One key feature of a content repository is that one does not have to worry about how the data is actually stored: data could be stored in a relational database (RDBMS), a file system or an XML document. In addition to providing services for storing and retrieving the data, most content repositories provide advanced services such as uniform access control, searching, versioning, observation, locking, and more.
  • Under preferred embodiments, documents in the content repository 322 are available to the end user via a portal. For example, in the current implementation of the system, the user can click on a web browser application button “View Source Document” in the portal and view the original scanned document over a secure network. Essentially, the content repository 322 serves as an off-site secure storage facility for users' electronic documents.
  • FIG. 4 is a system diagram of the document processing system 130 according to a preferred embodiment of the invention. System 130 has a service control manager 410, a job database 414, an image processing system 422, a classification system 432, a grouping system 442 and a data extraction system 452. The service control manager 410 communicates with the job database 414 via software within a computer system. The service control manager 410 communicates with the image processing system 422 via software within a computer system. The service control manager 410 communicates with the classification system 432 via software within a computer system. The service control manager 410 communicates with the grouping system 442 via software within a computer system. The service control manager 410 communicates with the data extraction system 452 via software within a computer system. The image processing system 422 communicates with the job database 414 via software within a computer system. The classification system 432 communicates with the job database 414 via software within a computer system. The grouping system 442 communicates with the job database 414 via software within a computer system. The data extraction system 452 communicates with the job database 414 via software within a computer system. The image processing system 422 communicates with the classification system 432 via software within a computer system. The classification system 432 communicates with the grouping system 442 via software within a computer system. The grouping system 442 communicates with the data extraction system 452 via software within a computer system. The document processing system 130 can be implemented as a set of communicating programs or as a single integrated program.
  • System 410 is a service control manager. Service control manager 410 is a system that controls the state machine for each job. The state machine identifies the different states and the steps that a job has to progress through in order to achieve its final objective, in this case the extraction of data from an electronic document. In the current system, the service control manager 410 is designed to be highly scalable and distributed. Under preferred embodiments, the service control manager 410 is multi-threaded to handle hundreds or thousands of jobs at any given time. The service control manager 410 also implements message queues to communicate with other processes regarding their own states. Alternately, the service control manager 410 can be implemented in other architectures; for example, one can implement a completely database-driven approach to step through all the different steps required to process such a job.
  • In preferred implementations the service control manager 410 subscribes to events for each new incoming job that needs to be processed. Once a new job arrives, the service control manager 410 pre-processes the job by taking the electronic document and separating each image (or page) into its own bitmap image for further processing, as sketched below. For example, if an electronic document has 30 pages, the system will create 30 images for processing. Each job in the system is given a unique identity. Furthermore, each page is given a unique page identity that is linked to the job identity. After the service control manager 410 has created image files by pre-processing the document into individual pages, it transitions the state of each page to image processing.
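  • The following is a minimal sketch of this pre-processing step, assuming the job arrives as a multi-page PDF and using the pdf2image wrapper around Poppler; the library choice, file naming and identity scheme are illustrative assumptions, not details from the source:

```python
# Hypothetical sketch of the page-splitting step; names are assumptions.
import uuid
from pdf2image import convert_from_path  # Poppler-based PDF rasterizer

def preprocess_job(pdf_path):
    """Split a multi-page electronic document into per-page bitmaps,
    assigning a unique job identity and linked page identities."""
    job_id = uuid.uuid4().hex
    page_ids = []
    for index, image in enumerate(convert_from_path(pdf_path, dpi=300)):
        page_id = f"{job_id}-p{index + 1:04d}"  # page identity linked to job
        image.save(f"{page_id}.png")            # one bitmap image per page
        page_ids.append(page_id)
    return job_id, page_ids  # each page now transitions to image processing
```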
  • System 414 is a job database. Job database 414 is used to store the images and data associated with each of the jobs being processed. A “job” is defined as a set of source documents and all intermediate and final processing outputs. Job database 414 can be file system storage, a relational database, an XML document or a combination of these. In preferred implementations, job database 414 uses file system storage to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.
  • System 422 is an image processing system. The image processing system 422 removes noise from the page image and properly orients the page so that document image analysis can be performed more accurately. The accuracy of the data extraction greatly depends on the quality of the image; thus image processing is included under preferred embodiments. The image processing system 422 performs connected component analysis and, utilizing a line detection system, creates “confetti” images which are small sections of the complete page image. Under preferred embodiments, the confetti images are accompanied by the coordinates of the image sub-section. The image processing system 422 is described in greater detail below.
  • System 432 is a classification system. The classification system 432 recognizes the page as one of a pre-identified set of types of documents. A major difficulty in categorizing a page as one of a large number of documents is the high dimensionality of the feature space. Conventional approaches that depend on text categorization alone are faced with a native feature space that consists of many unique terms (words as well as phrases) that occur in documents, which can be hundreds or thousands of terms for even a moderate-sized collection of unique documents. In one domain, multiple systems that categorize income tax documents such as W-2, 1099-INT, K-1 and other forms have experienced poor accuracy because of the thousands of variations of tax documents. The preferred implementation uses a combination of image pattern recognition and text analysis to distinguish documents and machine learning technology to scale to large numbers of documents. The classification system 432 is described in greater detail below.
  • System 442 is a grouping system. The grouping system 442 groups pages that have been categorized by the classification system 432 as specific instances of a pre-identified set of types of documents into sets of multi-page documents. The grouping system 442 is described in greater detail below.
  • System 452 is a data extraction system. The data extraction system 452 extracts data from pages that have been categorized by the classification system 432 as specific instances of a pre-identified set of types of documents. There are many difficulties in extracting data accurately from documents not specifically designed for automatic data extraction. Typically, the document images are not of uniformly high quality. The document images can be skewed, streaked, smudged, populated with artifacts and otherwise degraded in ways that cannot be fully compensated by image processing. The document layout can appear to be random. The relevant content (data labels and data values) can be quite small, impaired by lines and background shading or otherwise not be processed well by OCR. In the above-mentioned domain of tax document automation, vendors using conventional data extraction methods claim 50% or less data extraction and admit to numerous errors. The data extraction system 452 uses OCR data extraction, non-OCR visual recognition, contextual feature matching, business intelligence and output formatting, all with machine learning elements, to accurately extract and present data from a wide range of documents. The data extraction system 452 is described in greater detail below.
  • FIG. 5 is a system diagram of the image processing system 422 according to a preferred embodiment of the invention. System 422 has an image feature extraction system 510, a working image database 522, an image identification system 530, a trained image database 532 and an image training system 534. The image feature extraction system 510 is connected to the working image database 522 via software within a computer system. The image feature extraction system 510 is connected to the image identification system 530 via software within a computer system. The image identification system 530 is connected to the working image database 522 via software within a computer system. The image identification system 530 is connected to the trained image database 532 via software within a computer system. The image training system 534 is connected to the working image database 522 via software within a computer system. The image training system 534 is connected to the trained image database 532 via software within a computer system.
  • System 510 is an image feature extraction system. Image feature extraction system 510 extracts images from the submitted job artifacts. Image feature extraction system 510 normalizes images into a uniform consistent form for further image processing. Image feature extraction system 510 binarizes color and grayscale images. A document can be captured as a color, grayscale or binary image by a scanning device. Common problems seen in images from scanning devices include:
      • poor contrast due to lack of sufficient or controlled lighting
      • non-uniform image background intensity due to uneven illumination
      • immoderate amounts of random noise due to limited sensitivity of the sensors
  • Many document images are rich in color and have complex backgrounds. Accurately processing such documents typically requires time-consuming processing and manual tuning of various parameters. Detecting text in such documents is difficult for typical optical character recognition systems that are optimized for binary images on clean backgrounds. For the data extraction system to work well, document images must be binarized and the text must be readable. Typically, general purpose scanners binarize images using global thresholding with a single threshold value, generally chosen from statistics of the whole image. Global thresholding is not well adapted to images that suffer from common illumination or noise problems. Global thresholding often results in characters that are broken, merged or degraded; further, thousands of connected components can be created by binarization noise. Images degraded by global thresholding are typically candidates for low quality data extraction.
  • The preferred embodiment of the binarization system utilizes local thresholding, where the threshold value varies based on the local content in the document image. The preferred implementation is built on an adaptive thresholding technique which exploits local image contrast (reference: IEICE Electronics Express, Vol. 1, No. 16, pp. 501-506). The adaptive nature of this technique is based on flexible weights that are computed from the local mean and standard deviation of the gray values in the primary local zone or window. The preferred embodiment experimentally determines optimum median filters across a large set of document images for each application space. Reference “Systems and Methods for Handling and Distinguishing Binarized Background Artifacts in the Vicinity of Document Text and Image Features Indicative of a Document Category,” US 2009/0119296 A1.
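  • The cited contrast-based technique is not reproduced here; as an illustrative stand-in, the Gaussian-weighted local mean thresholding available in OpenCV shows the general idea of a threshold that varies with local content. The window size and offset below are assumptions, not values from the source:

```python
import cv2

def binarize_local(path):
    """Binarize with a threshold that varies with local content instead
    of a single global value; parameters here are illustrative."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Each pixel's threshold is the Gaussian-weighted mean of its 35x35
    # neighborhood minus a small offset, so text survives uneven lighting.
    return cv2.adaptiveThreshold(
        gray, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # local weighted mean
        cv2.THRESH_BINARY,
        35,   # odd block size (assumed)
        10)   # offset subtracted from the local mean (assumed)
```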
  • The preferred embodiment of image feature extraction system 510 removes noise in the form of dots, specks and blobs from document images. In the preferred embodiment, minimum and maximum dot sizes to be removed are specified. The preferred embodiment also performs image reversal so that white text or line objects on black backgrounds are detected and inverted to black-on-white. The preferred embodiment also performs two noise removal techniques.
  • The first technique starts with any small region of a binary image. The preferred implementation takes a 35×35 pixel region. In this region all background pixels are assigned the value “0,” and pixels adjacent to the background are given the value “1.” A matrix is developed in this manner; in effect, each pixel is given a value, called the “distance transform,” equal to its distance from the closest background pixel. The preferred implementation runs a smoothing technique on this distance transform. Smoothing is a process by which data points are averaged with their neighbors in a series; this typically has the effect of blurring the sharp edges in the smoothed data. Smoothing is sometimes referred to as filtering, because smoothing has the effect of suppressing high frequency signals and enhancing low frequency signals. Of the many different methods of smoothing, the preferred implementation uses a Gaussian kernel. In particular, the preferred implementation performs Gaussian smoothing on the distance transform with a 3×3 kernel (convolution mask) and a variance of 0.5. Thresholding with a threshold value of 0.85 is performed on the convolved images, and the resulting data is converted back to binary space.
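  • A sketch of this first technique, assuming OpenCV primitives; the 35×35 region, 3×3 kernel with variance 0.5 and the 0.85 threshold come from the text above, while the normalization of the distance transform before thresholding is an assumption:

```python
import numpy as np
import cv2

def denoise_region(region):
    """First technique: distance transform of a 35x35 binary region,
    Gaussian smoothing (3x3 kernel, variance 0.5), threshold at 0.85.
    `region` is uint8 with foreground 255 and background 0."""
    # Each foreground pixel gets its distance to the nearest background pixel.
    dist = cv2.distanceTransform(region, cv2.DIST_L2, 3)
    dist /= dist.max() or 1.0                              # scale into [0, 1]
    smooth = cv2.GaussianBlur(dist, (3, 3), np.sqrt(0.5))  # sigma^2 = 0.5
    return (smooth >= 0.85).astype(np.uint8) * 255         # back to binary space
```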
  • The second technique uses connected component analysis to identify small or bad blocks. In this method a sliding mask of known size is created. The preferred implementation uses a mask that is 35×35 pixels. This mask slides over the entire image and is used to detect the number of blobs (connected components) that are less than 10 pixels in size. If the number of such blobs is greater than five, all of those blobs are removed. This process is repeated by sliding the mask over the entire image.
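  • A sketch of this second technique; the text does not specify the window stride or connectivity, so both are assumed here:

```python
import numpy as np
import cv2

def remove_speckle(binary, win=35, max_area=10, max_blobs=5):
    """Second technique: slide a 35x35 mask over the image; wherever
    more than five connected components under 10 pixels fall inside
    the window, erase them."""
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    tiny = {i for i in range(1, num) if stats[i, cv2.CC_STAT_AREA] < max_area}
    out = binary.copy()
    h, w = binary.shape
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            window = labels[y:y + win, x:x + win]
            present = tiny.intersection(np.unique(window).tolist())
            if len(present) > max_blobs:            # more than five tiny blobs
                mask = np.isin(window, list(present))
                out[y:y + win, x:x + win][mask] = 0  # erase them as noise
    return out
```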
  • Image feature extraction system 510 corrects skew (small angular rotations) in document images. Skew correction not only improves the visual appearance of the document but also improves baseline determination, simplifies interpretation of page layout and improves text recognition. Several available image processing libraries perform skew correction. The preferred implementation of skew detection uses part of the open source Leptonica image processing library.
  • Image feature extraction system 510 corrects document orientation. Documents, originally in either portrait or landscape format, may be rotated by 0, 90, 180 or 270 degrees during scanning. The preferred implementation of orientation correction performs OCR on small word or phrase images at all four orientations: 0, 90, 180 and 270 degrees. Small samples are selected from a document and the confidence is averaged across the sample. The orientation that has the highest confidence determines the correct orientation of the document.
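  • A sketch of this orientation test, assuming the pytesseract binding for Tesseract; how the small samples are selected is not specified, so they are simply passed in:

```python
import pytesseract
from PIL import Image

def detect_orientation(samples):
    """Rotate a few small word/phrase images to each of the four
    orientations and keep the one with the highest mean OCR confidence.
    `samples` is a list of PIL images (selection policy is assumed)."""
    best_angle, best_conf = 0, -1.0
    for angle in (0, 90, 180, 270):
        confs = []
        for img in samples:
            data = pytesseract.image_to_data(
                img.rotate(angle, expand=True),
                output_type=pytesseract.Output.DICT)
            confs += [float(c) for c in data["conf"] if float(c) >= 0]
        mean = sum(confs) / len(confs) if confs else 0.0
        if mean > best_conf:
            best_angle, best_conf = angle, mean
    return best_angle
```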
  • Image feature extraction system 510 performs connected component analysis using a standard technique. The preferred implementation of connected component analysis uses the open source Image Processing Library 98 (IPL98).
  • Image feature extraction system 510 detects text lines using the technique described by Okun et al. (reference: Robust Text Detection from Binarized Document Images) to identify candidate text segments as blocks of consistent height. For a page from a book, this method may identify a whole line as a block, while for a form with many boxes this method will identify the text in each box.
  • Image feature extraction system 510 generates confetti information by storing the coordinates of all of the text blocks in the working image database 522.
  • Image feature extraction system 510 performs image processing on the confetti images. Traditionally, if image processing is performed on document images, the entire document image is subject to a single type of image processing. This “single algorithm” process might, for example, thin the characters on the document image. In some cases, the accuracy of text extraction with OCR might improve after thinning; in other cases on the same document, the accuracy of text extraction with OCR might improve with thickening. Image feature extraction system 510 therefore applies multiple morphological operators to individual confetti images. Then, for each variation of each confetti image (including the original, unprocessed version and all processed versions), image feature extraction system 510 extracts text with OCR. Optionally, image feature extraction system 510 extracts text with different OCR engines. Several OCR software programs are available on the market today. The preferred implementation uses Tesseract, an open source engine that allows custom modifications. The extracted text output (text, OCR engine used and corresponding confidence value) is saved for each version of each confetti image. An illustration of source document images before and after image processing is shown in FIG. 11.
  • Image feature extraction system 510 determines the contour of image areas within confetti boxes. The contour of an image within a confetti box is illustrated in FIG. 14. The size of the confetti image area is first normalized. In preferred implementations, 256 equidistant points on the contour are chosen, and the relative location of these points is recorded in a log-polar histogram as illustrated in FIG. 12. Values for log r are placed in 3 bins, while values for the angle are placed in 8 bins. The relative location of one point with respect to another is therefore a number from 1 through 24.
  • The feature vector for the shape of the contour as illustrated in FIG. 14 is a 256×256 matrix of numbers from 1 through 24 that considers all 256 points and their relative locations (reference: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 4, pp. 509-522).
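  • A sketch of this log-polar contour feature, assuming NumPy; the 256 points, 3 log-radius bins and 8 angle bins come from the text, while the exact bin edges are assumptions:

```python
import numpy as np

def contour_feature(points):
    """points: (256, 2) array of normalized contour coordinates.
    Returns a 256x256 matrix of bin numbers 1..24 (3 log-radius bins
    x 8 angle bins), one per ordered point pair."""
    diff = points[None, :, :] - points[:, None, :]   # pairwise offsets
    r = np.linalg.norm(diff, axis=2)
    theta = np.arctan2(diff[..., 1], diff[..., 0])   # angle in [-pi, pi)
    log_r = np.log(r + 1e-9)                         # avoid log(0) on diagonal
    edges = np.linspace(log_r.min(), log_r.max(), 4)[1:3]  # 2 inner edges
    r_bin = np.digitize(log_r, edges)                # 0, 1 or 2
    a_bin = (((theta + np.pi) / (2 * np.pi)) * 8).astype(int) % 8
    return r_bin * 8 + a_bin + 1                     # values 1 through 24
```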
  • System 522 is a working image database. Working image database 522 is used to support both the processing of jobs and the image training system 534. Working image database 522 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the working image database 522 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.
  • System 530 is an image identification system. The image identification system 530 looks for point and line features. The preferred implementation performs image layout analysis using two image properties: the points of intersection of lines, and the edge points of text paragraphs. Every unique representation of points is referred to as a unique class in the system and represents a unique point pattern in the system database. The preferred implementation uses a heuristically developed convolution method, applied only to black pixels, to perform a faster computation. The system identifies nine types of points: four T's, four L's, and one cross (X) using nine masks; examples of these nine point patterns are shown in FIG. 10.
  • The preferred implementation of point pattern matching is performed by creating a string from the points detected in the image and then using the Levenshtein distance to measure the gap between the trained set and the input image. The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
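  • A sketch of the matching step, assuming each detected point type is encoded as a character in reading order; the standard dynamic-programming Levenshtein distance below then measures the gap between a trained pattern string and the input image's string:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to transform string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical use: encode each detected point (four T's, four L's, one X)
# as a character in reading order, then pick the closest trained pattern:
# best = min(trained_patterns, key=lambda p: levenshtein(p, page_pattern))
```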
  • The image identification system 530 selects the extracted text from the sets of extracted text for each confetti image according to rules stored in the trained image database 532. In preferred implementations of the image identification system 530, extracted text values that exceed specified OCR engine-specific thresholds are candidates for selection. The best text value that is produced from the image after applying the morphological operators is chosen based on OCR confidence, similarity and presence in a dictionary.
  • In preferred implementations, based on the results of the “first pass” OCR performed by the image feature extraction system 510, the image identification system 530 selects the text value from a contextually limited lexicon (words and characters) that is stored in the trained image database 532. In preferred implementations, the image identification system 530 requests the image feature extraction system 510 to perform a “second pass” OCR operation using an engine specifically tailored for extracting the type of characters that the image identification system 530 identified as present in the confetti image. As an example, if the image identification system 530 identified the confetti image as containing characters associated only with currency values (such as the digits 0-9, dollar sign, period, comma, minus sign, parentheses and asterisk) then the “second pass” OCR would be conducted with a currency character recognition system that is tuned to identify numerical and certain special characters. The currency character recognition system utilizes OCR technology tailored to the reduced character set associated with currency values. In the preferred implementation, the currency character set is defined as the digits [0-9] and the special character set [$.,−( )]. The preferred implementation performs character segmentation to break up the image into individual characters. It then uses a normalized bitmap of the image of each character as a feature vector. This feature vector is passed into a neural network based classifier that was trained on more than 10,000 instances of each character that are stored in the trained image database 532.
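  • With Tesseract, the engine named in the preferred implementation, a reduced character set can be approximated with a character whitelist; the neural-network character classifier described above is not reproduced in this sketch, and the exact flags and page-segmentation mode are assumptions:

```python
import pytesseract
from PIL import Image

# Restrict recognition to the currency character set named above:
# digits 0-9 plus $ . , - ( ) and * (flags are assumptions).
CURRENCY_CONFIG = r'--psm 7 -c tessedit_char_whitelist=0123456789$.,-()*'

def ocr_currency(confetti_path):
    """Second-pass OCR over a confetti image identified as holding a
    currency value, using a reduced character set."""
    return pytesseract.image_to_string(
        Image.open(confetti_path), config=CURRENCY_CONFIG).strip()
```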
  • Label identification by traditional means of matching extracted text to a database of expected values is often not possible; this is caused by the inability of OCR engines to accurately extract text from very small and degraded images. The present invention's use of both multiple versions of the confetti images (original and image processed) and multiple OCR engines significantly reduces but does not eliminate the problem of inaccurate text extraction. Two additional techniques are used to identify text from images.
  • The image identification system 530 performs contour matching by comparing the contour shape features extracted by the feature extraction system 510 with the corresponding features of known confetti images stored in the trained image database 532. Similarity between images is determined by a point-wise comparison of feature vectors. The preferred implementation uses a KNN classifier for this process. FIG. 14 illustrates label contour matching.
  • System 532 is a trained image database. Trained image database 532 is used to support both the processing of jobs and the image training system 534. Trained image database 532 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the trained image database 532 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job. As the system grows “smarter” by recognizing more images and more rules for restricting OCR with contextual information, the trained image database 532 grows. As the machine learning system sees more trained images, its image identification accuracy increases.
  • System 534 is an image training system. The image training system 534 performs computations on the image data already in its document database and generates datasets used by the image identification system for recognizing the content in source document images. The results of the training and re-training process are image datasets that are updated in the trained image database 532.
  • The image training system 534 implements a continuous learning process in which images and text that are not properly identified are sent to training. The training process results in an expanded data set in the trained image database 532, thereby improving the accuracy of the system over time. As the trained image database 532 grows, the system requires an asymptotically lower percentage of images to be trained. Preferred implementations use machine learning supported by the image training system 534 that adapts to a growing set of document images. Additional documents add additional image features that must be analyzed.
  • The learning system receives documents from the working image database 522 that were provided by the image identification system 530. These documents are not trained and do not have corresponding model data in the trained image database 532. All such documents are made persistent in the trained image database 532.
  • Preferred implementations of the training system include tuning and optimization to handle noise generated during both the training phase and the testing phase. The training phase is also called the learning phase, since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function of the dataset.
  • The learning technique in the preferred implementation is supervised learning. Applications in which training data comprises examples of input vectors along with their corresponding target vectors are known as supervised learning problems. Example input vectors include key words and line patterns of the document layouts. Example target vectors include possible classes of output in the organized document. Supervised learning avoids the unstable states that can be reached by unsupervised learning and reinforcement learning systems.
  • FIG. 6 is a classification system 432 according to a preferred embodiment of the invention. System 432 has class feature extraction systems 610, working class databases 622, class identification systems 630, trained class databases 632, class training systems 634, a voting system 640, a trained voting decision tree 642 and a voting training system 644. The class feature extraction system (i) 610 is connected to the working class database (i) 622 via software within a computer system. The class feature extraction system (i) 610 is connected to the class identification system (i) 630 via software within a computer system. The class identification system (i) 630 is connected to the working class database (i) 622 via software within a computer system. The class identification system (i) 630 is connected to the trained class database (i) 632 via software within a computer system. The class training system (i) 634 is connected to the working class database (i) 622 via software within a computer system. The class training system (i) 634 is connected to the trained class database (i) 632 via software within a computer system. The class identification system (i) 630 is connected to the voting system 640 via software within a computer system. The voting system 640 is connected to the trained voting decision tree 642 via software within a computer system. The trained voting decision tree 642 is connected to the voting training system 644 via software within a computer system.
  • Under the preferred embodiment, classification system 432 is composed of four classification subsystems whose outputs are evaluated by the voting system 640. The four classification subsystems are:
      • Combined text and image (CTI) classification subsystem
      • CRK classification subsystem
      • SVM classification subsystem
      • CCS classification subsystem
  • Each of the above subsystems has a class feature extraction system 610, a working class database 622, a class identification system 630, a trained class database 632 and a class training system 634.
  • Each system 610 is a class feature extraction system. Class feature extraction systems 610 receive extracted text and image features (discussed above). The CTI classification subsystem and the CRK classification subsystem use the extracted text features.
  • The SVM classification subsystem addresses the problem of classifying documents as OCR results improve; as document quality, scanning practices, image processing or OCR engines improve, the extracted source document text differs from the extracted text of the training documents, causing classification to worsen. The SVM class feature extraction system 610 filters extracted text features, passing on only those text features that match a dictionary entry.
  • In the preferred implementation, the SVM class feature extraction system 610 matches OCR text output of a text document against a large dictionary. If no dictionary match is found, the OCR text is discarded. A feature vector that consists of all OCR text that matches the dictionary is passed to an SVM-based classifier to determine the document class.
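  • A sketch of this filter-then-classify flow, assuming scikit-learn as a stand-in for the SVM-based classifier; the dictionary and training corpus are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def dictionary_filter(tokens, dictionary):
    """Discard OCR tokens that do not match the large dictionary; only
    matching tokens become features, as described above."""
    return [t for t in tokens if t.lower() in dictionary]

def train_svm_classifier(token_lists, labels):
    """token_lists: dictionary-filtered tokens per training document;
    labels: their known document classes (training data is assumed)."""
    model = make_pipeline(
        TfidfVectorizer(analyzer=lambda toks: toks),  # tokens are pre-split
        LinearSVC())
    model.fit(token_lists, labels)
    return model
```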
  • The SVM classification subsystem is made resilient to OCR errors by introducing typical OCR errors into the dictionary. However, the classifier remains robust to OCR improvements because the dictionary includes correct English words.
  • The CCS classification subsystem addresses the problem of classifying documents with poor image quality that do not OCR well; such documents have poor text extraction and therefore poor text-based classification. The CCS classification subsystem uses robust image features exclusively to classify documents.
  • In the preferred implementation, the CCS class feature extraction system 610 first creates a code book using seven randomly selected documents. Each of these documents is divided into 10×10 pixel blocks. The K-means algorithm is applied to these blocks to generate 150 clusters. The mean of each cluster is taken as its representative codeword. The clusters are arbitrarily numbered from 1 to 150; the result forms a vocabulary for representing source document images as a feature vector of this vocabulary.
  • Each source document image is divided into four quadrants. A vector is formed for each quadrant following the term frequency inverse document frequency (TF-IDF) model. At the classification step, a K-means approach is used. A test document is encoded to the feature vector form, and its Euclidean distance is computed from each of the clusters. The labels of the closest clusters are assigned to the document.
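  • A sketch of the CCS feature pipeline, assuming scikit-learn's K-means and the 10×10 blocks, 150 codewords and four quadrants described above; the block extraction details and the omitted TF-IDF weighting are simplifications:

```python
import numpy as np
from sklearn.cluster import KMeans

BLOCK, CODEWORDS = 10, 150

def _blocks(img):
    """Cut a grayscale page image into flattened 10x10 pixel blocks."""
    return np.array([img[y:y + BLOCK, x:x + BLOCK].ravel()
                     for y in range(0, img.shape[0] - BLOCK + 1, BLOCK)
                     for x in range(0, img.shape[1] - BLOCK + 1, BLOCK)])

def build_codebook(sample_images):
    """Cluster blocks from the sample documents into 150 codewords;
    the cluster means serve as the visual vocabulary."""
    return KMeans(n_clusters=CODEWORDS, n_init=10).fit(
        np.vstack([_blocks(img) for img in sample_images]))

def quadrant_vector(img, codebook):
    """Encode each of the four quadrants as a 150-bin codeword histogram
    (the TF-IDF weighting step is omitted from this sketch)."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    quads = (img[:h, :w], img[:h, w:], img[h:, :w], img[h:, w:])
    return np.concatenate([
        np.bincount(codebook.predict(_blocks(q)), minlength=CODEWORDS)
        for q in quads])
```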
  • Each system 622 is a working class database. Working class databases 622 are used to support both the processing of jobs and the class training systems 634. Working class databases 622 can be file systems, relational databases, XML documents or a combination of these. In preferred implementations, the working class databases 622 use file systems to store large blobs (binary large objects) and relational databases to store pointers to the blobs and other information pertinent to processing the job.
  • System 630 is a class identification system. Class identification system 630 functions differently for each of the four classification subsystems.
  • In the first case, the CTI classification subsystem, the class identification system 630 presents the extracted text to a key word identification system. The key word identification system receives the confetti text and interfaces with the trained class database 632. The trained class database 632 consists of a global dictionary, global priority words and the point pattern signatures of all the trained forms, all of which are created by the class training system 634.
  • Under the preferred embodiment, stop words are removed from the list of extracted words. Stop words are common words, for example: “a,” “the,” “it,” “not,” and, in the case of income tax documents, phrases and words including “Internal Revenue Service,” “OMB,” “name,” “address,” etc. The stop words are provided by the trained class database 632 and, in the preferred embodiment, are domain specific.
  • In the preferred implementation, the priority of each word is calculated as a function of the line height (LnHt) of the word, its partial or full match (PFM) with the form name and the total number of words in the form (N). The approximate value of the priority is formulated as

  • Pr = (Σ LnHt × PFM) / N
  • The summation gives more priority to a word whose frequency is higher in a particular form. The partial or full match (PFM) factor increases the priority if the word partially or fully matches the form name. The calculation divides by the total number of words in the form (N) to normalize the frequency if the form has a large number of words.
  • The vector space creation system stores in a table the priority of each word in the form. A vector is described as (a1, a2, . . . ak) where a1, a2 . . . ak are the magnitudes in the respective dimensions. For example, for the input words and corresponding line heights of a W-2 tax form, the following word-priority vectors are stored:
  • OMB 10
    employer 5
    employer 5
    wages 5
    compensation 5
    compensation 5
    dependent 5
    wages 10
    social 5
    security 5
    income 5
    tax 5
    federal 5
    name 5
    address 5
  • The normalized values for the priorities are:
  • OMB 0.666667
    employer 0.666667
    wages 1.000000
    compensation 0.666667
    dependent 0.333333
    social 0.333333
    security 0.333333
    income 0.333333
    tax 0.333333
    federal 0.333333
    name 0.333333
    address 0.333333
  • In such a vector space, the words with larger font size or higher frequency will have higher priority.
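  • These normalized values follow from the formula above with N = 15 total words and PFM taken as 1: for example, “wages” receives (5 + 10)/15 = 1.0 and “OMB” receives 10/15 ≈ 0.667. A sketch of the computation, with the PFM handling assumed:

```python
from collections import defaultdict

def word_priorities(words_with_heights, pfm=None):
    """Pr = (sum of LnHt x PFM) / N per distinct word; `pfm` maps a word
    to its partial/full-match factor (taken as 1 when absent)."""
    pfm = pfm or {}
    n = len(words_with_heights)          # total number of words on the form
    totals = defaultdict(float)
    for word, line_height in words_with_heights:
        totals[word] += line_height * pfm.get(word, 1.0)
    return {word: total / n for word, total in totals.items()}

w2 = [("OMB", 10), ("employer", 5), ("employer", 5), ("wages", 5),
      ("compensation", 5), ("compensation", 5), ("dependent", 5),
      ("wages", 10), ("social", 5), ("security", 5), ("income", 5),
      ("tax", 5), ("federal", 5), ("name", 5), ("address", 5)]
# word_priorities(w2)["wages"] == 1.0 and ["OMB"] ~ 0.6667, as in the text
```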
  • The ranking system calculates the cosine distance of two vectors V1 and V2 as:
  • cos θ = (V1 · V2) / (|V1| × |V2|)
  • where V1 · V2 is the dot product of the two vectors and |V| represents the magnitude of a vector. When the cosine distance nears 0, the vectors are orthogonal; when it nears 1, the vectors point in the same direction, i.e. they are similar.
  • The class whose vector has the maximum cosine value with respect to the form's vector is the class to which the form is classified.
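  • A sketch of the ranking step: the input form and each trained class are represented as word-priority vectors, and the class with the highest cosine value is chosen:

```python
import math

def cosine(v1, v2):
    """cos(theta) = (V1 . V2) / (|V1| * |V2|) over word-priority dicts."""
    dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    mag = (math.sqrt(sum(x * x for x in v1.values())) *
           math.sqrt(sum(x * x for x in v2.values())))
    return dot / mag if mag else 0.0

def classify(form_vector, class_vectors):
    """class_vectors: {class_name: word-priority vector}; return the
    class whose vector is most similar to the form's vector."""
    return max(class_vectors, key=lambda c: cosine(form_vector, class_vectors[c]))
```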
  • The class identification system 630 performs point pattern matching based on the image features collected during image processing. As mentioned earlier, the point pattern matching of documents is performed by creating a string from the points detected in the image and then using the Levenshtein distance to measure the gap between the trained set and the input image.
  • In the preferred embodiment of the CTI classification subsystem, the results of the ranking and the point pattern matching are used to determine the class matching values. If the system is not successful in finding a class match within a defined threshold, the document is marked as unclassified.
  • In the second case, the CRK classification subsystem, the class identification system 630 first identifies a source document as a member of a particular group of classes and then identifies it as a member of a particular individual class. The CRK class identification system 630 performs hierarchical classification with a binary classifier using regularized least squares and a multi-class classifier using K-nearest neighbor; a sketch appears below. An example flow diagram of a CRK class identification system 630 used in classifying income tax documents is shown in FIG. 15.
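  • A sketch of this two-stage flow, using scikit-learn stand-ins: a ridge classifier for the regularized least-squares binary stage and K-nearest-neighbor classifiers for the per-group multi-class stage. The two-group split and parameters are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier      # regularized least squares
from sklearn.neighbors import KNeighborsClassifier

class CRKSketch:
    """Hierarchical classification sketch: a binary classifier routes a
    page to one of two class groups, then a per-group KNN picks the
    individual class (training data is assumed)."""
    def __init__(self, k=5):
        self.router = RidgeClassifier(alpha=1.0)
        self.knn = {g: KNeighborsClassifier(n_neighbors=k) for g in (0, 1)}

    def fit(self, X, groups, classes):
        X, groups, classes = map(np.asarray, (X, groups, classes))
        self.router.fit(X, groups)                    # group 0 vs group 1
        for g in (0, 1):
            mask = groups == g
            self.knn[g].fit(X[mask], classes[mask])
        return self

    def predict_one(self, x):
        group = int(self.router.predict([x])[0])
        return self.knn[group].predict([x])[0]
```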
  • In the third case, the SVM classification subsystem, the class identification system 630 identifies a source document using a support vector machine operating on a set of trained data. If the lookup fails, the source document is marked as unclassified.
  • In the fourth case, the CCS classification subsystem, the class identification system 630 works much like the CTI class identification system 630. The CCS class identification system 630 compares the code vectors for each quadrant of source documents with code vectors in the trained class database 632 using the K-means approach. The trained class database 632 is organized into clusters representing documents in the training set with similar image properties as defined by the feature vectors. The mean point of each cluster within the feature vector space is used to represent each cluster. In addition, each cluster is tagged with all document classes that occurred within the cluster. The distance of the feature vector of a source document from the mean of each cluster is computed, and the K nearest clusters are considered. The document class tags of these clusters are chosen as plausible classes of the source document.
  • The CCS trained class database 632 stores code vectors of all the trained forms, all of which are created by the CCS class training system 634.
  • System 632 is a trained class database. Trained class database 632 is used to support both the processing of jobs and the class training system 634. Trained class database 632 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the trained class database 632 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job. As the system grows “smarter” by recognizing more documents, the trained class database 632 grows. As the machine learning system sees more classification data, its classification accuracy increases.
  • System 634 is a class training system. The class training system 634 adapts to a growing set of documents; additional documents add additional features that must be analyzed. Preferred implementations of the class training system 634 include tuning and optimization to handle noise generated during both the training phase and the testing phase. The training phase is also called the learning phase, since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function over the dataset.
  • The learning technique that is used to bootstrap the system in the preferred implementation is supervised learning. Applications in which training data comprises examples of input vectors along with their corresponding target vectors are known as supervised learning problems. Example input vectors include key words and line patterns of the document layouts. Example target vectors include possible classes of output in the organized document. Supervised learning avoids the unstable states that can be reached by unsupervised learning and reinforcement learning systems.
  • To maintain the system, semi-supervised learning is utilized. In the preferred implementation, data that is flowing through the system is analyzed and those data that the system failed to correctly identify are isolated. These data are passed through a retraining phase, and the training data in the system are updated after appropriate regression testing.
  • In high volume, a fully automated process is utilized. Here, the data that is needed for retraining are automatically identified and fed to the retraining phase. The new training data are automatically injected into a regression test system to ensure correctness. If the regression test passes, the production system is automatically updated with the new training data.
  • The learning system receives documents from the trained class database 632. These documents are not trained and do not have corresponding classification model data in the class database. All such documents are made persistent in the trained class database 632.
  • The trained class database 632 has several tables which contain the document class information as well as image processing information (which is discussed in greater detail below). The following tables are part of the training database:
      • Form class (classification view)
      • Page table (details of the page of the electronic document)
      • Manual classification table (manual work information)
      • Manual training table (trainers' information)
      • Confetti table (confetti information, original text, corrected text, etc.)
  • Class training system 634 utilizes a training process management system that manages the distribution of the training task. Under preferred embodiments, a user, called a “trainer,” logs into the system in which the trainer has privileges at one of three trainer levels:
      • Top tier: add new classes to the system and perform classification and training
      • Middle tier: perform manual classification and training
      • Bottom tier: only perform training (manual text correction).
  • The training process manager directs document processing based on the document state:
      • Unclassified page is scheduled for manual classification
      • Manual classification is done as per policy and form class is assigned
      • Job database is updated with form class information and page/job states are changed so that the page can go to the next state
      • If the form class state is not trained, the form is scheduled for training, else no action is needed
  • After form training, the form class state is changed to “trained, need synch” if allowed by policy. The document class has the following states:
      • Untrained
      • Partially trained
      • Trained, need synch with classification database
      • Trained, synched with classification database
  • Each document that requires training is manually identified and the extracted text is corrected as needed. The trainer follows two independent steps:
      • Manually classifying the form and assigning a class and subclass
      • Manually correcting text extracted by OCR. Manual identification and text correction comprises a number of steps:
      • Receive pages from the training manager which manages the flow of pages between various trainers and implements training policy and restrictions
      • Manual classification user interface (UI) which presents the page and asks the user to classify it
      • Manual text correction UI which presents the page with marked up confetti; the user views the confetti and corrects the text extracted from the confetti
      • Training viewer UI is used to view the training database in a UI; the preferred implementation includes reports and representations of the training database
      • Classification verification UI presents a page and its classification to a trainer
  • All user interfaces are integrated into a single system.
  • The class training system 634 combines the document image, the manually classified information and the corresponding text.
  • New trained data that passes regression testing is inserted by the class training system 634 into the trained class database 632.
  • In the case of the CRK class training system 634, chi-square feature selection attempts to select the most relevant keywords (bag-of-words) for each class:
  • $\chi^2 = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}$
  • Where
      • A=number of times word t co-occurs with class c
      • B=number of times word t occurs without class c
      • C=number of times class c occurs without word t
      • D=number of times neither word t nor class c occur
      • N=total number of words
  • This approach ranks the relevance of each word for a particular class so that a sufficient number of features are obtained.
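  • A direct transcription of this chi-square ranking into code might look like the following sketch; the counts mapping (word to its four co-occurrence counts) is an assumed input.

```python
def chi_square(A, B, C, D):
    """Chi-square relevance of word t for class c from the four co-occurrence counts."""
    N = A + B + C + D
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - B * C) ** 2 / denominator if denominator else 0.0

def top_keywords(counts, n=100):
    """Rank candidate words by relevance; counts maps word -> (A, B, C, D)."""
    return sorted(counts, key=lambda w: chi_square(*counts[w]), reverse=True)[:n]

# A discriminative word ("wages") outranks a common one ("the") for the class.
print(top_keywords({"wages": (50, 5, 10, 935), "the": (55, 445, 5, 495)}, n=1))
```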
  • Term frequency-inverse document frequency (tf-idf) is used to represent each document:
  • $tf_i = \frac{n_i}{\sum_k n_k}, \qquad idf_i = \log\frac{|D|}{|\{d : t_i \in d\}|}$
  • Where
      • $n_i$ = number of occurrences of feature keyword i in the document
      • $\sum_k n_k$ = number of occurrences of all terms in the document
      • $|D|$ = total number of documents in the data
      • $|\{d : t_i \in d\}|$ = number of documents in the data that contain term $t_i$
  • Each vector is normalized into unit Euclidean norm.
  • In the tax document classification example shown in FIG. 15, using these features, four regularized least-squares classifiers are trained for the organizer, brokerage, IRS and misc categories at level 1. Finally, a KNN classifier is used to refine the IRS classes. The cosine distance is used as a similarity measure.
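  • The tf-idf representation can be sketched as follows; the tokenized documents are an assumed input, and the normalization matches the unit Euclidean norm described above.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """tf-idf vectors for tokenized documents, normalized to unit Euclidean norm."""
    D = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))                       # document frequency of each term
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        total = sum(tf.values())
        vec = {t: (tf[t] / total) * math.log(D / df[t]) for t in tf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors

docs = [["wages", "tips", "withheld"], ["interest", "income", "withheld"]]
print(tfidf_vectors(docs)[0])   # "withheld" gets weight 0: it appears in every document
```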
  • System 640 is a voting system. The voting system 640 uses the output of each of the classifier subsystems 630 to choose the best classification result for an image, based on empirical observations of each classifier subsystem behavior on a large training dataset. These empirical observations are encoded into a trained voting decision tree 642. The voting system 640 uses the trained voting decision tree 642 to choose the final classification of an image. The trained decision tree 642 is built using the voting training system 644.
  • System 642 is a trained voting decision tree. The trained voting decision tree 642 is used to support the voting system 640. Trained voting decision tree 642 can be encoded as part of a program, file, relational database, XML document or a combination of these. In preferred implementations, the trained voting decision tree 642 is encoded as a program within a decision making process. As the system grows “smarter” by recognizing more images, the trained voting decision tree 642 evolves, resulting in a system with increasing image identification accuracy.
  • System 644 is a voting training system. The voting training system 644 considers the real classifications of a training dataset and the respective outputs of each of the classifier subsystems 630. Using this data, the voting training system 644 builds a decision tree, giving appropriate weights and preference to the correct results of each of the classification subsystems 630. This approach results in maximized correctness of final classification, especially when each classification subsystem 630 is adept at classifying different, not necessarily disjoint, subsets of documents.
  • FIG. 7 is a grouping system 442 according to a preferred embodiment of the invention. System 442 has a group feature extraction system 710, a working group database 722, a group identification system 730, a trained group database 732 and a group training system 734. The group feature extraction system 710 is connected to the working group database 722 via software within a computer system. The group feature extraction system 710 is connected to the group identification system 730 via software within a computer system. The group identification system 730 is connected to the working group database 722 via software within a computer system. The group identification system 730 is connected to the trained group database 732 via software within a computer system. The group training system 734 is connected to the working group database 722 via software within a computer system. The group training system 734 is connected to the trained group database 732 via software within a computer system.
  • System 710 is a group feature extraction system. Group feature extraction system 710 receives document information including the class identifier and text data for each page. System 710 identifies data features that potentially indicate that a page belongs to a document set. The preferred implementation identifies page numbers and account numbers.
  • System 722 is a working group database. Working group database 722 is used to support both the processing of jobs and the group training system 734. Working group database 722 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the working group database 722 uses a relational database to store pointers to the information pertinent to processing the job.
  • System 730 is a group identification system. Group identification system 730 utilizes the class identifier, the page numbers and the account numbers extracted by system 710 to group pages of a job that belong together. The preferred implementation uses an iterative grouping process that begins by assuming that all pages belong to independent groups. At each iteration step, the process attempts to merge existing groups using a merging confidence. The process terminates when group membership converges and there is no further change to the set of groups.
  • The group identification system 730 uses a merging confidence that is determined from matching and mismatching criteria that are stored in the trained group database 732. Matching criteria between two groups contribute towards an increased confidence to merge the groups, while mismatching criteria contribute towards keeping the groups separate. The final merging confidence is used to decide whether to merge the two groups. This process is repeated for every pair of groups, in each iteration step of the process.
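  • The iterative merge can be sketched as below; merge_confidence stands in for the trained matching/mismatching criteria, and the threshold is illustrative.

```python
def group_pages(pages, merge_confidence, threshold=0.5):
    """Iteratively merge page groups until membership converges.

    merge_confidence(g1, g2) -> float combines matching criteria (shared account
    number, consecutive page numbers, same class) against mismatching criteria.
    """
    groups = [[page] for page in pages]       # start: every page is its own group
    merged = True
    while merged:                             # terminate when no pair merges
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if merge_confidence(groups[i], groups[j]) >= threshold:
                    groups[i].extend(groups.pop(j))
                    merged = True
                    break
            if merged:
                break
    return groups

same_account = lambda g1, g2: 1.0 if g1[0]["acct"] == g2[0]["acct"] else 0.0
print(group_pages([{"acct": "A"}, {"acct": "B"}, {"acct": "A"}], same_account))
```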
  • System 732 is a trained group database. Trained group database 732 is used to support both the processing of jobs and the group training system 734. Trained group database 732 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the trained group database 732 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job. As the system grows “smarter” by recognizing more document group data, the trained group database 732 grows. As the machine learning system sees more data, its group identification accuracy increases.
  • System 734 is a group training system. The group training system 734 extracts matching criteria from a large set of correctly grouped documents and adapts to a growing set of document data. Preferred implementations of the group training system 734 include tuning and optimization to handle noise generated during both the training phase and the testing phase. The training phase is also called the learning phase, since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function over the dataset.
  • FIG. 8 is a data extraction system 452 according to a preferred embodiment of the invention. System 452 has a data feature extraction system 810, a working data database 822, a data identification system 830, a trained data database 832 and a data training system 834. The data feature extraction system 810 is connected to the working data database 822 via software within a computer system. The data feature extraction system 810 is connected to the data identification system 830 via software within a computer system. The data identification system 830 is connected to the working data database 822 via software within a computer system. The data identification system 830 is connected to the trained data database 832 via software within a computer system. The data training system 834 is connected to the working data database 822 via software within a computer system. The data training system 834 is connected to the trained data database 832 via software within a computer system.
  • System 810 is a data feature extraction system. The data feature extraction system 810 constructs an Image Form Model, which is a working representation of the layout of the confetti and text in the document image. The data feature extraction system 810 identifies layout features that potentially carry data. The preferred implementation identifies boxes (illustrated in FIG. 19), check boxes (illustrated in FIG. 20), text, lines and tables. The Image Form Model also contains references to the image features like lines and points that have been identified earlier.
  • The data feature extraction system 810 identifies canonical labels that occur in an image by searching through the extracted text data for corresponding expected labels. In order to be robust to OCR errors, data feature extraction system 810 utilizes inexact string matching algorithms that use Levenshtein distance to identify expected labels. An iterative technique that uses increasingly inexact string comparison on an increasingly narrower search space is utilized. If certain canonical labels are still not found because of severe OCR errors, image identification system 530 is used to find canonical labels using contour matching. The success of this technique is enhanced by the narrowed search for the corresponding missing expected labels.
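  • One way to realize this iterative inexact matching is to relax a Levenshtein tolerance over a narrowing label search space, as in the sketch below; the tolerance steps are illustrative, and the levenshtein function is the one sketched earlier.

```python
def find_expected_labels(confetti_texts, expected_labels):
    """Match OCR'd confetti against expected labels with increasingly inexact comparison."""
    matches, remaining = {}, set(expected_labels)
    for tolerance in (0.0, 0.15, 0.3):        # fraction of label length allowed as errors
        for text in confetti_texts:
            for label in list(remaining):
                if levenshtein(text.lower(), label.lower()) <= tolerance * len(label):
                    matches[label] = text
                    remaining.discard(label)  # narrows the search space for later passes
    return matches, remaining                 # unresolved labels go to contour matching
```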
  • The data feature extraction system 810 identifies data-containing features including boxes, real and virtual, check boxes, label-value pairs, and tables. The data feature extraction system 810 also identifies formatted data that are often not associated with a label, e.g. address blocks (illustrated in FIG. 21), phone numbers and account numbers. The data feature extraction system 810 also identifies regions of text that are not associated with any data, such as disclaimers and other text blocks that contain instructions for the reader rather than extractable data (referred to as instruction blocks and illustrated in FIG. 22).
  • System 822 is a working data database. Working data database 822 is used to support both the processing of jobs and the data training system 834. Working data database 822 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the working data database 822 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job.
  • The working data database 822 consists of a flexible data structure that stores all of the features that the data feature extraction system 810 identifies along with the spatial relationships between them. The most primitive element of the data-structure is a Feature data-structure, which is a recursive data-structure that contains a set of Features. A Feature also maintains references to nearby Features; in the preferred implementation, four sets that correspond to references to Features above, below, to the left, and to the right of the Feature. A Feature provides iterators to traverse the five sets associated with it. A Feature also provides the ability to tag on a confidence metric. In the preferred implementation, the confidence is an integer in the range [0-100]. It is assigned by the algorithms that create the Feature, and is used as an estimate of the accuracy of the extracted data.
  • The primitive Feature data-structure is sub-classed into specific features. At the lowest level are the primitive features: confetti, word, line, and point. At the next level are label and value. Finally, there are features corresponding to each of the data-containing features: box, check-box, label-value pair, and table. There are also features corresponding to the elements of certain composite features like table headers, table rows, and table columns, as well as features corresponding to form-specific items such as address blocks, phone numbers, and instruction blocks.
  • The Feature data-structure supports operations to merge a set of features into another. For example, a label feature and a value feature that correspond to each other are merged into a Label-value pair feature. A set of value features that have been identified as a row of a table are merged into a row feature. A set of labels that have been identified as a table header are merged into a table header feature. In each of these cases, the set of features that were merged into the result are all contained within. They are accessed by enumerating the contained features. As with any feature, the respective algorithm can assign a confidence to the merged feature.
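  • A skeleton of the Feature data-structure in Python might look like the following; the method names are illustrative, and only the pieces described above are shown.

```python
class Feature:
    """Recursive layout feature with spatial neighbor sets and a confidence tag."""

    def __init__(self, kind, confidence=0):
        self.kind = kind                 # "confetti", "word", "label", "value", "table", ...
        self.confidence = confidence     # integer in [0, 100], set by the creating algorithm
        self.contained = []              # features merged into this one
        self.above, self.below = set(), set()
        self.left, self.right = set(), set()

    def __iter__(self):
        """Enumerate the contained features (e.g. the label and value of a pair)."""
        return iter(self.contained)

def merge(kind, features, confidence):
    """Merge a set of features into a new composite feature, e.g. a label-value pair."""
    composite = Feature(kind, confidence)
    composite.contained = list(features)
    return composite

pair = merge("label-value", [Feature("label", 90), Feature("value", 80)], confidence=85)
print([f.kind for f in pair])   # -> ['label', 'value']
```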
  • System 830 is a data identification system. Data identification system 830 utilizes the Image Form Model created by system 810 to search for correlations between labels and values. The preferred implementation uses the classification of a particular page to determine the expected labels. The expected label set is a subset of the universe of labels, which is available in the trained data database 832. System 830 uses the expected label set to search for data in the image form model for the image. The layout features that have been identified in System 810 are used to aid the process of correlating labels with data.
  • The data identification system 830 performs relative location matching by comparing the locations of the identified confetti images with locations of unidentified confetti images, both stored in the working data database 822. FIG. 17 illustrates relative matching of labels.
  • The preferred implementation of data identification system 830 includes the ability to handle errors and noise. In some situations, poor image quality results in certain expected labels to be missing. Data identification system 830 uses relative location matching by comparing the relative location of identified labels and unidentified text in the image form model, with learned data in the trained data database 832.
  • Some images include multiple copies of form data. For example, in the image of a Form W-2 shown in FIG. 24, the data to be extracted is repeated four times. FIG. 50 illustrates four “Wages, tips, other comp.” boxes that appear on a W-2 form; FIG. 51 shows the corresponding data record. The data identification system 830 improves the accuracy of data extraction by utilizing each copy of data on an image with the following process for extracting data from multi-copy forms:
      • 1. Identify if the image is a multi-copy form.
      • 2. Extract data as with a single-copy form to get sets of canonical label-value pairs.
      • 3. Group the extracted data into records corresponding to the layout of the multiple copies in the image:
        • a) Count the number of occurrences of each canonical label extracted in step 2.
        • b) The maximum number of occurrences m determined in step 3a is the number of records.
        • c) Create the m records.
        • d) Seed each record with the corresponding canonical label-value pairs that determined the number of records.
        • e) Set the boundary of each of the records to be the rectilinear convex hull of the canonical label-value pair that seeded it.
        • f) Add the remaining extracted canonical label-value pairs to the records:
          • (i) Sort all extracted canonical label-value pairs in raster order.
          • (ii) For each canonical label-value pair not yet added to a record:
            • If the canonical label-value pair is enclosed by a record, add it to that record and continue.
            • Otherwise, add the canonical label-value pair to a nearby record and extend the boundary of that record to be the rectilinear convex hull of the record and the new canonical label-value pair.
            • If the resulting boundary intersects with another record, backtrack and add the canonical label-value pair to the next record.
  • After the above process, data identification system 830 organizes the data extracted from such multi-copy form images into a set of m records as indicated by the layout. Accuracy of extracted data is improved by using a voting strategy to determine which of the m extracted values to select. In addition, if all extracted data instances are identical, then the extracted data is considered to be correct with high confidence. Conversely, if extracted data instances differ, then the extracted data is flagged.
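  • The record-seeding portion of the multi-copy process (steps 3a-3d) can be sketched as follows; boundary growth and backtracking (steps 3e-3f) are omitted, and the input tuples are assumed to carry bounding boxes.

```python
from collections import Counter

def seed_records(pairs):
    """Group (canonical_label, value, bbox) tuples from a multi-copy form into records.

    m is the maximum occurrence count of any canonical label (steps 3a-3b); one
    record is created per occurrence of the most frequent label (steps 3c-3d).
    """
    counts = Counter(label for label, _, _ in pairs)
    seed_label, m = counts.most_common(1)[0]
    seeds = sorted((p for p in pairs if p[0] == seed_label),
                   key=lambda p: (p[2][1], p[2][0]))          # raster order by bbox
    records = [{"seed": seed, "pairs": [seed]} for seed in seeds]
    assert len(records) == m
    return records

pairs = [("Wages", "57342.14", (10, 10, 80, 20)),
         ("Wages", "57342.14", (10, 110, 80, 120)),
         ("SSN", "432-10-9876", (10, 0, 80, 8))]
print(len(seed_records(pairs)))   # -> 2 records
```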
  • The data identification system 830 extracts data from tables (illustrated in FIG. 23) using a layout based strategy. The strategy addresses the following problems with extracting data from tables.
      • 1. Table headers often have poor OCR relative to the actual data in the tables. This means that it is often the case that the values can be correctly determined by the machine, but the corresponding label cannot.
      • 2. Table formats change.
      • 3. A single table row can span multiple text lines in the image. Conventional approaches to extract tables do not handle such wrapped tables in a robust manner.
      • 4. Tables are often interspersed with instruction blocks, aggregate rows, incomplete rows, and overlapping columns.
      • 5. An image with localized noise can still contain large amounts of extractable table data.
  • The process for extracting tables is given below:
      • 1. Start with the image as confetti (FIG. 25-A).
      • 2. Find neatly aligned columns, which are sets of vertically aligned confetti.
      • 3. Identify labels in the image, then consume all confetti within the label area (FIG. 25-B).
      • 4. Identify potential table headers by grouping labels horizontally, consume all confetti in the header area (FIG. 25-C).
      • 5. Remove instruction blocks from consideration. These areas do not correspond to any extractable data and are identified using heuristics associated with text density, font size and font type (FIG. 27).
      • 6. Remove noisy confetti (confetti with poor OCR, overlaid text, and other situations where the data is bad or does not exist) (FIG. 28).
      • 7. Form horizontal projections of the remaining confetti. Use the gaps in the projection data to identify rows (FIG. 29).
      • 8. Collect rows that are in close proximity as candidate table formations.
      • 9. Grow columns within each candidate table formation using gaps in their vertical projections until an obstruction is hit. Break the table formation at that point (FIG. 30).
      • 10. Associate the table formation with a header if possible (FIG. 31).
      • 11. Associate each column with a label.
      • 12. Identify missing labels by matching the pattern of labels to those in the trained data database 832 (FIG. 32).
  • The data identification system 830 handles wrapped columns as a special case. In step 8 above, if tables break repeatedly at a row count of one, then the rows are partitioned into two sets, the odds and evens. Now steps 7 through 11 operate on each of the two sets to get two interleaved tables. These two interleaved tables are merged to form the extracted table.
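  • Step 7 of the table process, identifying rows from gaps in the horizontal projection, can be sketched as below; the gap threshold is illustrative.

```python
def find_rows(confetti_spans, min_gap=5):
    """Split confetti into table rows at gaps in the horizontal projection.

    confetti_spans: (top, bottom) y-extents of the remaining confetti rectangles.
    """
    spans = sorted(confetti_spans)
    rows = []
    cur_top, cur_bottom = spans[0]
    for top, bottom in spans[1:]:
        if top - cur_bottom >= min_gap:       # a projection gap ends the current row
            rows.append((cur_top, cur_bottom))
            cur_top, cur_bottom = top, bottom
        else:
            cur_bottom = max(cur_bottom, bottom)
    rows.append((cur_top, cur_bottom))
    return rows

print(find_rows([(0, 10), (2, 12), (30, 40), (60, 70)]))
# -> [(0, 12), (30, 40), (60, 70)]
```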
  • System 832 is a trained data database. Trained data database 832 is used to support both the processing of jobs and the data training system 834. Trained data database 832 can be a file system, a relational database, an XML document or a combination of these. In preferred implementations, the trained data database 832 uses a file system to store large blobs (binary large objects) and a relational database to store pointers to the blobs and other information pertinent to processing the job. As the system grows “smarter” by recognizing more document data, the trained data database 832 grows. As the machine learning system sees more data, its data identification accuracy increases.
  • The trained data database 832 contains information that is used to extract data. The trained data database 832 includes:
      • 1. For each type of form, a set of canonical labels associated with each data element that should be extracted from that type of form. Examples of canonical labels for a Form W-2 include the Social Security Number, Taxpayer Name, Wages, and Federal Income Tax Withheld.
      • 2. For each canonical label, a set of expected labels that correspond to learned variations of the canonical label. Examples of variations in expected labels for the Social Security Number canonical label are Social Security Number, Soc. Security No. and SSN.
      • 3. For each type of form, the learned variations in the relative locations of expected labels.
      • 4. For each type of form, the learned variations in the types of data containing features that may occur. The data containing features include boxes, virtual boxes, check boxes, text, lines, and tables.
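  • The shape of these trained data database entries might be represented as follows; the labels shown are the examples given above, not a complete set, and the representation is illustrative.

```python
TRAINED_DATA = {
    "W-2": {
        # 1-2: canonical labels and their learned expected-label variations
        "canonical_labels": {
            "Social Security Number": ["Social Security Number",
                                       "Soc. Security No.", "SSN"],
            "Wages": ["Wages, tips, other comp.",
                      "Wages, tips and other compensation"],
            "Federal Income Tax Withheld": ["Federal income tax withheld"],
        },
        # 3: learned variations in relative label locations (illustrative form)
        "relative_locations": {("Social security wages",
                                "Social security tax withheld"): "left-of"},
        # 4: data-containing feature types that may occur on this form
        "features": ["box", "virtual box", "check box", "text", "line", "table"],
    },
}
```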
  • System 834 is a data training system. The data training system 834 adapts to a growing set of document data; additional document data add additional features that must be analyzed. Preferred implementations of the data training system 834 include tuning and optimization to handle noise generated during both the training phase and the testing phase. The training phase is also called the learning phase, since the parameters and weights are tuned to improve the learning and adaptability of the system by fitting the model that minimizes the error function over the dataset.
  • The invention extracts data from an image via a process of progressive refinement and reduced character set OCR (as illustrated in FIG. 33) in order to overcome the imperfections of OCR or low quality documents. The scanned image is processed by generic OCR which, in this example, produces errors in both the label portion and the value portion of the box. However, using standard techniques, the OCR output for the label portion is correctly identified as “Medicare Tax Withheld”. In this example, the value related to the identified label is known to be a monetary amount, so the part of the image that corresponds to the value is reprocessed by a restricted-character-set OCR. This OCR process is trained to identify only the characters possible in a monetary amount, i.e. the digits [0-9], and certain special characters [$,. ( )−]. The reduced search space greatly increases the accuracy of the restricted-character-set OCR output, and it produces the correct value of 131.52.
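  • With Tesseract, for example, such a restricted-character-set pass can be approximated with a character whitelist, as in this sketch; whitelist support varies by engine version, and the file name and crop coordinates are illustrative.

```python
import pytesseract
from PIL import Image

# Only the characters possible in a monetary amount.
MONEY_CONFIG = r'--psm 7 -c tessedit_char_whitelist=0123456789$.,()-'

def reread_value(image_path, value_box):
    """Re-run OCR on the value region of a box with a restricted character set."""
    region = Image.open(image_path).crop(value_box)   # (left, top, right, bottom)
    return pytesseract.image_to_string(region, config=MONEY_CONFIG).strip()

print(reread_value("w2_page.png", (410, 120, 560, 150)))   # e.g. -> "131.52"
```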
  • The invention extracts data from an image via a process of progressive refinement that utilizes a reduced search space as more is learned about the form being extracted (as illustrated in FIG. 34). In the example shown in FIG. 34, the correct label is identified despite poor OCR. First, the OCR output is used to identify the class of the form, because the classification process is very robust to poor OCR. After the form has been determined to be, for example, a W-2, the label search is constrained to only the labels that are expected on W-2 forms. This greatly reduces the search space, and therefore increases the accuracy of extraction.
  • In general, as more information is known about a form, constraints are added to reduce the search space. This reduction in search space permits prior processes to be rerun, significantly improving the overall extraction accuracy.
  • The invention extracts data from an image via a process of progressive refinement that utilizes data external to the form image being extracted (as illustrated in FIG. 35). In the example shown in FIG. 35, data that was extracted from the 1099-OID form is used to extract data from the 1099-G form. The Recipient's identification number on the 1099-G form is light and washed out, resulting in poor OCR output. In this example, the two forms are in the same job, and both have the same Recipient's name (John Smith). The Recipient's identification number on the 1099-G form can therefore be inferred to be 432-10-9876, the same as the Recipient's identification number on the 1099-OID form.
  • The invention extracts data from an image via a process of progressive refinement that utilizes data not extracted from any image (as illustrated in FIG. 36). In the example shown in FIG. 36, data that is available in a “pro-forma” file is used to identify data on a form. The pro-forma file contains taxpayer information from the previous year's tax return that has been quality checked, including the taxpayer name, taxpayer Social Security Number, spouse name, spouse Social Security Number, dependent names and Social Security Numbers, and other information about the tax forms included in the previous year's tax return. All this information is available to the data extraction process, and is assumed to be accurate. The pro-forma external data enables the verification and correction of low-confidence OCR-extracted data.
  • The invention utilizes a set of known-value databases to augment the results of conventional data extraction methods such as OCR. The known-value databases are obtained from vendors or public sources; the known-value databases are also built from data extracted from forms that have been submitted by users of the data extraction system. Known-value databases, for example, contain information on employers, banks and financial institutions and their corresponding addresses and identification numbers. FIG. 37 shows a 1099-G form in which the payer's name is struck out, making it difficult to OCR correctly. As can be seen in FIG. 38, the payer's name has not been extracted because of the missing label. A known-value database of the issuers of 1099-G forms (which are the revenue departments of the 50 states) provides the payer's name by a simple lookup. This finding is verified by comparing the lookup results against the relevant OCR output.
  • The invention utilizes known constraints between the semantics of extracted data elements to identify potentially incorrectly extracted data. The constraints are specified by subject matter experts (for example, bankers in the case of loan origination forms); the constraints are also determined by analysis of data extracted from forms that have been submitted by users of the data extraction system. For example, FIG. 39 is an image of a W-2 form with a faded digit in the value for box 1, “Wages, tips and other compensation.” As shown in FIG. 40, the extracted value corresponding to the “Wages, tips and other compensation” label is 060.83 (versus the correct value of 9060.83). The extracted value is flagged as incorrect when compared to the extracted value for Federal income tax withheld (106.11), because the constraints for a W-2 form specify that Federal income tax withholdings cannot exceed total wages.
  • The invention utilizes known constraints between the semantics of extracted data elements to correct potentially incorrectly extracted data. The constraints are specified by subject matter experts (for example, Certified Public Accountants in the case of income tax forms); the constraints are also determined by analysis of data extracted from forms that have been submitted by users of the data extraction system. In the above example illustrated in FIG. 39 and FIG. 40, the constraints for a W-2 form specify that, for wages below a threshold amount, in most cases “Wages, tips and other compensation” is equal to “Social security wages” and “Medicare wages and tips.” In this example, the constraints indicate that when “Wages, tips and other compensation” is flagged as incorrect and differs by a single digit from “Social security wages,” then the value from “Social security wages” replaces the value of “Wages, tips and other compensation.”
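  • A sketch of both constraint checks, flagging and single-digit correction, for the W-2 example above is shown below; the rules are paraphrased from the text and are not a complete rule set.

```python
def differs_by_one_digit(a, b):
    """True if b can be obtained from a by changing, inserting or deleting one character."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = sorted((a, b), key=len)
    if len(longer) - len(shorter) != 1:
        return False
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

def check_and_correct_w2(record):
    """Flag wages that violate the withholding constraint; correct single-digit errors."""
    wages = record["Wages, tips and other compensation"]
    if float(record["Federal income tax withheld"]) <= float(wages):
        return record, False                      # constraint satisfied
    ss_wages = record["Social security wages"]
    if differs_by_one_digit(wages, ss_wages):     # e.g. "060.83" vs "9060.83"
        record["Wages, tips and other compensation"] = ss_wages
        return record, False                      # corrected from Social security wages
    return record, True                           # flagged for review

rec = {"Wages, tips and other compensation": "060.83",
       "Social security wages": "9060.83",
       "Federal income tax withheld": "106.11"}
print(check_and_correct_w2(rec))
```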
  • The invention utilizes known constraints in the layout of data elements, to narrow the search space and thereby more accurately extract data. The layout constraints are specified by technical experts; the constraints are also determined by analysis of data extracted from forms that have been submitted by users of the data extraction system. FIG. 41 illustrates the relationship of layout elements in a portion of a W-2 form. In FIG. 41, for example, the label “Social security wages” is to the left of the label “Social security tax withheld.” This layout relationship and others, specified by experts or determined by analysis, are used to infer missing labels and also identify spurious data such as pencil marks, tick marks and other noise.
  • The invention predicts occurrences of instruction blocks based on detected layout patterns from forms that have been submitted by users of the data extraction system. The invention eliminates such instruction blocks from further data extraction, thus simplifying the extraction process and thereby improving the accuracy of data extraction.
  • The invention detects tables using column layout and the expected header layout based on detected layout patterns from forms that have been submitted by users of the data extraction system. Known constraints, in the form of relationships between header elements, are used to predict headers when not correctly detected.
  • The layout of multiple occurrences of a particular extracted artifact, e.g. four occurrences of each expected data element in a W-2, is used to identify the four logical records in the W-2.
  • The mechanism that was used to identify a particular data artifact, e.g. a label identified by correct OCR text versus a predicted label, is used to attach a confidence to the extracted data. In summary, the layout-driven extraction proceeds through the following steps:
      • 1. Infer labels
      • 2. Identify instruction blocks, pencil marks, and other “noise,” and eliminate them from the search space
      • 3. Map canonical and detected labels
      • 4. Detect tables
      • 5. Detect records
      • 6. Attach confidence to extracted data
  • The invention utilizes a layout data structure to extract data from form images. The use of a layout data structure is illustrated in the context of a portion of a W-2 form image shown in FIG. 41. First, the low-level layout graph of confetti is created; its internal representation is partially illustrated in FIG. 42. While the left, right, top, and bottom connection sets exactly map the layout, for brevity only the right and down sets for each confetti are shown in FIG. 42. Second, labels are detected. Third, as illustrated in FIG. 43, the layout graph is modified by identifying the detected labels (shown as light grey blocks). Fourth, the label-value correlations are determined (shown by the dark grey blocks). Note that the illustration shows the right set of each of the features shown. Note also that the layout relations of the contained features do not cross out of the container; this aspect of the data structure significantly improves its efficiency. Also shown are the down sets of each feature. The contained features maintain layout relations within the container, leaving it to the container to maintain external layout relations.
  • The invention extracts data from an image via a process of progressive refinement that utilizes contour matching (as described above). While contour matching on its own is of limited value over a large universe of labels, coupled with the progressive refinement technique, contour matching is robust. As an example, the labels from the 1099-OID form of FIG. 35 are shown in FIG. 44. Since there is significant similarity between the contours for “PAYER's federal identification number” and “RECIPIENT's federal identification number,” it is inappropriate to differentiate these two labels using their contours. However, differentiating “RECIPIENT's name” from “PAYER'S name, street address, city, state, ZIP code and telephone no.” is appropriate. Accordingly, contour matching is used in those cases in which the set of options is small.
  • The invention utilizes contour matching along with text-based label matching as part of the progressive refinement process. Once the 1099-OID form in FIG. 35 is correctly classified, for example, the search space for labels is restricted to labels that occur in a 1099-OID. As part of the progressive refinement process, in this example, all the labels except “RECIPIENT's name” and “Original Issue discount for 2009” were identified by text-based matching. Contour matching is then used to distinguish between these two labels.
  • FIG. 13 is a system diagram of the service control manager 410. System 410 has a main thread 1301, task queues 1302, database client thread controllers 1303, task queues 1304, slave controllers 1305 and SCM queue 1306.
  • The main thread 1301 controls the primary state machine for all the jobs in the system.
  • Task queues 1302 provide message queues for database communication.
  • Database client thread controllers 1303 manage the database server interface.
  • Task queues 1304 provide message queues for communication with slave controllers.
  • Slave controllers 1305 manage various slave processes via the slave controller interface.
  • The SCM queue 1306 provides a mechanism for the various controllers to communicate with the main thread.
  • In the preferred implementation, various threads communicate with each other using message queues. Whenever a new document is received for processing, the main thread is notified and it requests the database client thread to retrieve the job for processing based on the states and the queue of other jobs in the system.
  • In the preferred implementation, once the job is loaded in memory, a finite state machine for that job is created and the job starts to be processed. The main thread puts the job on a particular task queue based on the state machine instructions. For example, if the job needs to be image processed, then the job will be placed on the image processing task queue. If the slave controller for the image processing slave finds an idle image processing slave process, then the job is picked up from that queue and given to the slave process for processing. Once the slave finishes performing its assigned task, it returns the job to the slave controller which puts the job back on the SCM queue 1306. The main thread sequentially picks up the job from the SCM queue 1306 and decides on the next state of the job based on the finite state machine states. Once a job is completed, the finite state machine for the job is closed and the extracted document is returned to the content repository 322 and made available to the client's portal as a finished and processed document.
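  • A toy version of this message-queue flow, with a two-state job state machine, is sketched below; the queue names, states, and slave behavior are illustrative rather than the actual system's.

```python
import queue
import threading

task_queues = {"image_processing": queue.Queue(), "classification": queue.Queue()}
scm_queue = queue.Queue()
NEXT_STATE = {"new": "image_processing",
              "image_processing": "classification",
              "classification": "done"}

def slave(name):
    while True:
        job = task_queues[name].get()     # idle slave picks the job off its queue
        job["state"] = name               # ... the slave's real work happens here ...
        scm_queue.put(job)                # finished job goes back on the SCM queue

def main_thread(jobs):
    for job in jobs:
        task_queues[NEXT_STATE[job["state"]]].put(job)
    done = []
    while len(done) < len(jobs):
        job = scm_queue.get()             # main thread advances the job's state machine
        nxt = NEXT_STATE[job["state"]]
        if nxt == "done":
            done.append(job)              # processed document returned to the repository
        else:
            task_queues[nxt].put(job)
    return done

for name in task_queues:
    threading.Thread(target=slave, args=(name,), daemon=True).start()
print(main_thread([{"id": 1, "state": "new"}, {"id": 2, "state": "new"}]))
```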
  • Alternatively, a single process can implement all the functionality of the slaves outlined in the description of the preferred implementation; the ideas outlined for the preferred implementation remain valid for such an implementation.
  • FIG. 18 is a diagram that depicts the various components of a computerized document data extraction system, according to certain embodiments of the invention. An exemplary document data extraction system may include a host computer 1801 that contains volatile memory, 1802, a persistent storage device such as a hard drive, 1808, a processor, 1803, and a network interface, 1804. Using the network interface, the system computer can interact with databases, 1805, 1806. Although FIG. 18 illustrates a system in which the system computer is separate from the various databases, some or all of the databases may be housed within the host computer, eliminating the need for a network interface. The programmatic processes may be executed on a single host, as shown in FIG. 18, or they may be distributed across multiple hosts.
  • The host computer shown in FIG. 18 may serve as a document data analysis system. The host computer receives electronic documents from multiple users. Workstations may be connected to a graphical display device, 1807, and to input devices such as a mouse, 1809, and a keyboard, 1810. Alternately, the active user's workstation may comprise a handheld device.
  • In some embodiments, the flow charts included in this application describe logical steps that are embodied as computer-executable instructions stored in a computer-readable medium, such as various memories and disks, and that, when executed by a processor, such as a server or server cluster, cause the processor to perform the described logical steps.
  • While text extraction and recognition may be performed with OCR and OCR-like techniques, they are not limited to such; other techniques could be used, including image-recognition-like techniques.
  • As described above, preferred embodiments extract image features from a document and use them to assist in identifying the document category and extracting data from the document. These image features include inherent image features, e.g. lines, line crossings, etc., that are put in place by the document authors (or the authors of an original source or blank document) to organize the document or the like. They were typically not included as a means of identifying the document, even though the inventors have discovered that they can be used as such, especially with the use of machine learning techniques.
  • While many applications can benefit from extracting both image and text features so that the extracted features may be used to classify documents and extract data from those documents, for some applications image features alone may suffice. Specifically, some problem domains may have document categories where the inherent image features are sufficiently distinctive to classify a document and extract data with high enough confidence (even without processing text features).
  • Preferred embodiments of the invention may incorporate classification techniques described in the following patent applications, each of which is hereby incorporated by reference herein in its entirety:
  • U.S. Patent Application Publication No. 2009/0116736, entitled “Systems and Methods to Automatically Classify Electronic Documents Using Extracted Image and Text Features and Using a Machine Learning Subsystem;”
  • U.S. Patent Application Publication No. 2009/0116757, entitled “Systems and Methods for Classifying Electronic Documents by Extracting and Recognizing Text and Image Features Indicative of Document Categories;”
  • U.S. Patent Application Publication No. 2009/0116755, entitled “Systems and Methods for Enabling Manual Classification of Unrecognized Documents to Complete Workflow for Electronic Jobs and to Assist Machine Learning of a Recognition System Using Automatically Extracted Features of Unrecognized Documents;”
  • U.S. Patent Application Publication No. 2009/0116756, entitled “Systems and Methods for Training a Document Classification System Using Documents from a Plurality of Users;”
  • U.S. Patent Application Publication No. 2009/0116746, entitled “Systems and Methods for Parallel Processing of Document Recognition and Classification Using Extracted Image and Text Features;” and
  • U.S. Patent Application Publication No. 2009/0119296, entitled “Systems and Methods for Handling and Distinguishing Binarized, Background Artifacts in the Vicinity of Document Text and Image Features Indicative of a Document Category.”
  • Although the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention can be made without departing from the spirit and scope of the invention. Features of the disclosed embodiments can be combined and rearranged in various ways.

Claims (1)

1. In a document analysis system that receives and processes jobs from a plurality of users, in which each job may contain multiple electronic documents, to classify each document into a corresponding document category and to extract data from the electronic documents, a method of training the document analysis system to automatically extract data from each document, the method comprising:
automatically analyzing images and text features extracted from each received electronic document to associate the electronic document with a corresponding document category;
comparing the extracted text features with a set of text features associated with corresponding category of each received document, in which the set of text features includes a set of characters, words, and phrases;
if the extracted text features are found to consist of the characters, words, and phrases belonging to the set of text features associated with the corresponding electronic document category, storing the extracted text features as the data contained in the corresponding electronic document; and
if the extracted text features are found to include at least one text feature that does not belong to the set of text features associated with the corresponding electronic document category, submitting the unrecognized text features to a training phase in which the text features are recognized as belonging to the set of text features associated with the corresponding electronic document category and then using the now-recognized text features to automatically modify the set of text features associated with the corresponding electronic document category so that the extracting data, regardless of which document category the corresponding document belongs to, improves as the training method is subjected to more and more unrecognized text features and the set of text features is modified accordingly.
US13/007,430 2010-01-15 2011-01-14 Systems and methods for training document analysis system for automatically extracting data from documents Abandoned US20110258150A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/007,430 US20110258150A1 (en) 2010-01-15 2011-01-14 Systems and methods for training document analysis system for automatically extracting data from documents
US13/166,966 US20110249905A1 (en) 2010-01-15 2011-06-23 Systems and methods for automatically extracting data from electronic documents including tables

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US29521010P 2010-01-15 2010-01-15
US13/007,430 US20110258150A1 (en) 2010-01-15 2011-01-14 Systems and methods for training document analysis system for automatically extracting data from documents

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/166,966 Continuation-In-Part US20110249905A1 (en) 2010-01-15 2011-06-23 Systems and methods for automatically extracting data from electronic documents including tables

Publications (1)

Publication Number Publication Date
US20110258150A1 true US20110258150A1 (en) 2011-10-20

Family

ID=44788245

Family Applications (11)

Application Number Title Priority Date Filing Date
US13/007,443 Abandoned US20110255789A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically extracting data from electronic documents containing multiple layout features
US13/007,422 Abandoned US20110255788A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically extracting data from electronic documents using external data
US13/007,434 Abandoned US20110255784A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically extracting data from eletronic documents using multiple character recognition engines
US13/007,481 Abandoned US20110255790A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically grouping electronic document pages
US13/007,452 Active 2031-09-08 US8571317B2 (en) 2010-01-15 2011-01-14 Systems and methods for automatically processing electronic documents using multiple image transformation algorithms
US13/007,399 Abandoned US20110258170A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically correcting data extracted from electronic documents using known constraints for semantics of extracted data elements
US13/007,430 Abandoned US20110258150A1 (en) 2010-01-15 2011-01-14 Systems and methods for training document analysis system for automatically extracting data from documents
US13/007,466 Abandoned US20110255794A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically extracting data by narrowing data search scope using contour matching
US13/007,330 Abandoned US20110258182A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically extracting data from electronic document page including multiple copies of a form
US13/007,407 Abandoned US20110258195A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically reducing data search space and improving data extraction accuracy using known constraints in a layout of extracted data elements
US14/064,935 Active US8897563B1 (en) 2010-01-15 2013-10-28 Systems and methods for automatically processing electronic documents

Family Applications Before (6)

Application Number Title Priority Date Filing Date
US13/007,443 Abandoned US20110255789A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically extracting data from electronic documents containing multiple layout features
US13/007,422 Abandoned US20110255788A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically extracting data from electronic documents using external data
US13/007,434 Abandoned US20110255784A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically extracting data from eletronic documents using multiple character recognition engines
US13/007,481 Abandoned US20110255790A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically grouping electronic document pages
US13/007,452 Active 2031-09-08 US8571317B2 (en) 2010-01-15 2011-01-14 Systems and methods for automatically processing electronic documents using multiple image transformation algorithms
US13/007,399 Abandoned US20110258170A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically correcting data extracted from electronic documents using known constraints for semantics of extracted data elements

Family Applications After (4)

Application Number Title Priority Date Filing Date
US13/007,466 Abandoned US20110255794A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically extracting data by narrowing data search scope using contour matching
US13/007,330 Abandoned US20110258182A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically extracting data from electronic document page including multiple copies of a form
US13/007,407 Abandoned US20110258195A1 (en) 2010-01-15 2011-01-14 Systems and methods for automatically reducing data search space and improving data extraction accuracy using known constraints in a layout of extracted data elements
US14/064,935 Active US8897563B1 (en) 2010-01-15 2013-10-28 Systems and methods for automatically processing electronic documents

Country Status (1)

Country Link
US (11) US20110255789A1 (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361033A (en) * 2014-10-27 2015-02-18 深圳职业技术学院 Automatic cancer-related information collection method and system
US9165207B2 (en) 2013-02-25 2015-10-20 Google Inc. Screenshot orientation detection
US20160085812A1 (en) * 2014-09-24 2016-03-24 Samsung Electronics Co., Ltd. Method of managing content in electronic apparatus and electronic apparatus controlled according to the method
US20170177995A1 (en) * 2014-03-20 2017-06-22 The Regents Of The University Of California Unsupervised high-dimensional behavioral data classifier
AU2016269570B2 (en) * 2015-12-29 2017-12-07 Accenture Global Solutions Limited Document processing
CN109299271A (en) * 2018-10-30 2019-02-01 腾讯科技(深圳)有限公司 Training sample generation, text data, public sentiment event category method and relevant device
US10204143B1 (en) 2011-11-02 2019-02-12 Dub Software Group, Inc. System and method for automatic document management
CN109446522A (en) * 2018-10-22 2019-03-08 东莞市七宝树教育科技有限公司 A kind of examination question automatic classification system and method
CN110175623A (en) * 2019-04-10 2019-08-27 阿里巴巴集团控股有限公司 Desensitization process method and device based on image recognition
US20190294874A1 (en) * 2018-03-23 2019-09-26 Abbyy Production Llc Automatic definition of set of categories for document classification
US10452995B2 (en) 2015-06-29 2019-10-22 Microsoft Technology Licensing, Llc Machine learning classification on hardware accelerators with stacked memory
CN110705251A (en) * 2019-10-14 2020-01-17 支付宝(杭州)信息技术有限公司 Text analysis method and device executed by computer
US10540588B2 (en) 2015-06-29 2020-01-21 Microsoft Technology Licensing, Llc Deep neural network processing on hardware accelerators with stacked memory
CN110851675A (en) * 2019-10-10 2020-02-28 厦门市美亚柏科信息股份有限公司 Data extraction method, device and medium
US10606651B2 (en) 2015-04-17 2020-03-31 Microsoft Technology Licensing, Llc Free form expression accelerator with thread length-based thread assignment to clustered soft processor cores that share a functional circuit
WO2020082187A1 (en) * 2018-10-26 2020-04-30 Element Ai Inc. Sensitive data detection and replacement
US10839302B2 (en) 2015-11-24 2020-11-17 The Research Foundation For The State University Of New York Approximate value iteration with complex returns by bounding
WO2021105805A1 (en) * 2019-11-25 2021-06-03 Vanitha S Robot and method for providing content in a newspaper and other physical documents
WO2021221614A1 (en) * 2020-04-28 2021-11-04 Hewlett-Packard Development Company, L.P. Document orientation detection and correction
WO2022026135A1 (en) * 2020-07-29 2022-02-03 Docusign, Inc. Automated document tagging in a digital management platform
US11288590B2 (en) * 2016-05-24 2022-03-29 International Business Machines Corporation Automatic generation of training sets using subject matter experts on social media
CN116152833A (en) * 2022-12-30 2023-05-23 北京百度网讯科技有限公司 Training method of form restoration model based on image and form restoration method
CN116186543A (en) * 2023-03-01 2023-05-30 深圳崎点数据有限公司 Financial data processing system and method based on image recognition
US11704352B2 (en) 2021-05-03 2023-07-18 Bank Of America Corporation Automated categorization and assembly of low-quality images into electronic documents
US11798258B2 (en) 2021-05-03 2023-10-24 Bank Of America Corporation Automated categorization and assembly of low-quality images into electronic documents
US11881041B2 (en) 2021-09-02 2024-01-23 Bank Of America Corporation Automated categorization and processing of document images of varying degrees of quality

Families Citing this family (229)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8996416B2 (en) * 2002-01-22 2015-03-31 Lavante, Inc. OCR enabled management of accounts payable and/or accounts receivable auditing data
US8069129B2 (en) 2007-04-10 2011-11-29 Ab Initio Technology Llc Editing and compiling business rules
US10453043B2 (en) * 2008-06-25 2019-10-22 Thomson Reuters Global Resources Unlimited Company System and method for online bill payment
CN104679807B (en) 2008-06-30 2018-06-05 起元技术有限责任公司 Data log record in calculating based on figure
US8478706B2 (en) * 2009-01-30 2013-07-02 Ab Initio Technology Llc Processing data using vector fields
US20110255789A1 (en) * 2010-01-15 2011-10-20 Copanion, Inc. Systems and methods for automatically extracting data from electronic documents containing multiple layout features
US8675923B2 (en) * 2010-07-21 2014-03-18 Intuit Inc. Providing feedback about an image of a financial document
US8924395B2 (en) * 2010-10-06 2014-12-30 Planet Data Solutions System and method for indexing electronic discovery data
US8495060B1 (en) * 2010-12-07 2013-07-23 Trend Micro, Inc. Prioritization of reports using content data change from baseline
US8266141B2 (en) * 2010-12-09 2012-09-11 Microsoft Corporation Efficient use of computational resources for interleaving
US20120265846A1 (en) * 2011-04-15 2012-10-18 Springboard Non Profit Consumer Credit Management System and method of coordinating a debt-relief program
US9294307B2 (en) 2011-10-07 2016-03-22 Microsoft Technology Licensing, Llc Synchronization of conversation data
US9208146B2 (en) * 2012-01-17 2015-12-08 Sin El Gim System for providing universal communication that employs a dictionary database
US9715625B2 (en) 2012-01-27 2017-07-25 Recommind, Inc. Hierarchical information extraction using document segmentation and optical character recognition correction
JP5454827B1 (en) * 2012-02-24 2014-03-26 日本電気株式会社 Document evaluation apparatus, document evaluation method, and program
US9330323B2 (en) * 2012-04-29 2016-05-03 Hewlett-Packard Development Company, L.P. Redigitization system and service
US9613267B2 (en) * 2012-05-31 2017-04-04 Xerox Corporation Method and system of extracting label:value data from a document
JP2014036314A (en) * 2012-08-08 2014-02-24 Canon Inc Scan service system, scan service method, and scan service program
WO2014022919A1 (en) * 2012-08-10 2014-02-13 Transaxy Inc. System for entering data into a data processing system
US9147275B1 (en) 2012-11-19 2015-09-29 A9.Com, Inc. Approaches to text editing
US9043349B1 (en) * 2012-11-29 2015-05-26 A9.Com, Inc. Image-based character recognition
US9703822B2 (en) 2012-12-10 2017-07-11 Ab Initio Technology Llc System for transform generation
US8885951B1 (en) 2012-12-14 2014-11-11 Tony Cristofano System and method for data identification and extraction of forms
US9430453B1 (en) * 2012-12-19 2016-08-30 Emc Corporation Multi-page document recognition in document capture
DE102012025351B4 (en) * 2012-12-21 2020-12-24 Docuware Gmbh Processing of an electronic document
US9158744B2 (en) * 2013-01-04 2015-10-13 Cognizant Technology Solutions India Pvt. Ltd. System and method for automatically extracting multi-format data from documents and converting into XML
US9342930B1 (en) 2013-01-25 2016-05-17 A9.Com, Inc. Information aggregation for recognized locations
US9104710B2 (en) * 2013-03-15 2015-08-11 Src, Inc. Method for cross-domain feature correlation
CN103177258B (en) * 2013-03-29 2016-08-17 河南理工大学 Method for automatically extracting geographic feature lines from vector contour line data
US8947745B2 (en) 2013-07-03 2015-02-03 Symbol Technologies, Inc. Apparatus and method for scanning and decoding information in an identified location in a document
CN104298982B (en) * 2013-07-16 2019-03-08 深圳市腾讯计算机系统有限公司 Character recognition method and device
US10943689B1 (en) 2013-09-06 2021-03-09 Labrador Diagnostics Llc Systems and methods for laboratory testing and result management
CA2924826A1 (en) 2013-09-27 2015-04-02 Ab Initio Technology Llc Evaluating rules applied to data
US9286372B2 (en) 2013-11-06 2016-03-15 Sap Se Content management with RDBMS
US10114800B1 (en) * 2013-12-05 2018-10-30 Intuit Inc. Layout reconstruction using spatial and grammatical constraints
US9529874B2 (en) 2013-12-19 2016-12-27 International Business Machines Corporation Verification of transformed content
US20150178856A1 (en) * 2013-12-20 2015-06-25 Alfredo David Flores System and Method for Collecting and Submitting Tax Related Information
RU2641225C2 (en) * 2014-01-21 2018-01-16 Общество с ограниченной ответственностью "Аби Девелопмент" Method for detecting the need for standard training to verify recognized text
US9355313B2 (en) 2014-03-11 2016-05-31 Microsoft Technology Licensing, Llc Detecting and extracting image document components to create flow document
US9760953B1 (en) 2014-03-12 2017-09-12 Intuit Inc. Computer implemented methods systems and articles of manufacture for identifying tax return preparation application questions based on semantic dependency
US10387969B1 (en) 2014-03-12 2019-08-20 Intuit Inc. Computer implemented methods systems and articles of manufacture for suggestion-based interview engine for tax return preparation application
US20210005324A1 (en) * 2018-08-08 2021-01-07 Hc1.Com Inc. Methods and systems for a health monitoring command center and workforce advisor
US10922766B2 (en) 2014-05-11 2021-02-16 Zoccam Technologies, Inc. Systems and methods for database management of transaction information and payment data
US10922767B2 (en) * 2014-05-11 2021-02-16 Zoccam Technologies, Inc. Systems and methods for database management of transaction information and payment instruction data
US9378435B1 (en) * 2014-06-10 2016-06-28 David Prulhiere Image segmentation in optical character recognition using neural networks
US20160034756A1 (en) * 2014-07-30 2016-02-04 Lexmark International, Inc. Coarse Document Classification
US11430072B1 (en) 2014-07-31 2022-08-30 Intuit Inc. System and method of generating estimates used to calculate taxes
US10867355B1 (en) 2014-07-31 2020-12-15 Intuit Inc. Computer implemented methods systems and articles of manufacture for preparing electronic tax return with assumption data
TWI536798B (en) * 2014-08-11 2016-06-01 虹光精密工業股份有限公司 Image filing method
US10540725B1 (en) 2014-08-18 2020-01-21 Intuit Inc. Methods systems and articles of manufacture for handling non-standard screen changes in preparing an electronic tax return
US10977743B1 (en) 2014-08-18 2021-04-13 Intuit Inc. Computer implemented methods systems and articles of manufacture for instance and suggestion differentiation during preparation of electronic tax return
US10970793B1 (en) 2014-08-18 2021-04-06 Intuit Inc. Methods systems and articles of manufacture for tailoring a user experience in preparing an electronic tax return
US11861734B1 (en) 2014-08-18 2024-01-02 Intuit Inc. Methods systems and articles of manufacture for efficiently calculating a tax return in a tax return preparation application
US11354755B2 (en) * 2014-09-11 2022-06-07 Intuit Inc. Methods systems and articles of manufacture for using a predictive model to determine tax topics which are relevant to a taxpayer in preparing an electronic tax return
US10013721B1 (en) 2014-10-31 2018-07-03 Intuit Inc. Identification of electronic tax return errors based on declarative constraints
US10796381B1 (en) 2014-10-31 2020-10-06 Intuit Inc. Systems and methods for determining impact correlations from a tax calculation graph of a tax preparation system
US10169826B1 (en) 2014-10-31 2019-01-01 Intuit Inc. System and method for generating explanations for tax calculations
US10255641B1 (en) 2014-10-31 2019-04-09 Intuit Inc. Predictive model based identification of potential errors in electronic tax return
US11392629B2 (en) 2014-11-18 2022-07-19 Oracle International Corporation Term selection from a document to find similar content
US10387970B1 (en) 2014-11-25 2019-08-20 Intuit Inc. Systems and methods for analyzing and generating explanations for changes in tax return results
US10235722B1 (en) 2014-11-26 2019-03-19 Intuit Inc. Systems and methods for analyzing and determining estimated taxes
US10296984B1 (en) 2014-11-26 2019-05-21 Intuit Inc. Systems, methods and articles of manufacture for determining relevancy of tax topics in a tax preparation system
US10235721B1 (en) 2014-11-26 2019-03-19 Intuit Inc. System and method for automated data gathering for tax preparation
US11222384B1 (en) 2014-11-26 2022-01-11 Intuit Inc. System and method for automated data estimation for tax preparation
US10157426B1 (en) 2014-11-28 2018-12-18 Intuit Inc. Dynamic pagination of tax return questions during preparation of electronic tax return
US10572952B1 (en) 2014-12-01 2020-02-25 Intuit Inc. Computer implemented methods systems and articles of manufacture for cross-field validation during preparation of electronic tax return
US9430766B1 (en) 2014-12-09 2016-08-30 A9.Com, Inc. Gift card recognition using a camera
US9881079B2 (en) * 2014-12-24 2018-01-30 International Business Machines Corporation Quantification based classifier
EP3149659A4 (en) 2015-02-04 2018-01-10 Vatbox, Ltd. A system and methods for extracting document images from images featuring multiple documents
US10891323B1 (en) * 2015-02-10 2021-01-12 West Corporation Processing and delivery of private electronic documents
CN107430610B (en) * 2015-02-13 2021-08-03 澳大利亚国家Ict有限公司 Learning from distributed data
IL237548B (en) 2015-03-04 2020-05-31 Au10Tix Ltd Methods for categorizing input images for use e.g. as a gateway to authentication systems
US10872384B1 (en) 2015-03-30 2020-12-22 Intuit Inc. System and method for generating explanations for year-over-year tax changes
US10140666B1 (en) 2015-03-30 2018-11-27 Intuit Inc. System and method for targeted data gathering for tax preparation
US10740853B1 (en) 2015-04-28 2020-08-11 Intuit Inc. Systems for allocating resources based on electronic tax return preparation program user characteristics
US11113771B1 (en) 2015-04-28 2021-09-07 Intuit Inc. Systems, methods and articles for generating sub-graphs of a tax calculation graph of a tax preparation system
US10664924B1 (en) 2015-04-30 2020-05-26 Intuit Inc. Computer-implemented methods, systems and articles of manufacture for processing sensitive electronic tax return data
US10664925B2 (en) 2015-06-30 2020-05-26 Intuit Inc. Systems, methods and articles for determining tax recommendations
US10607298B1 (en) 2015-07-30 2020-03-31 Intuit Inc. System and method for indicating sections of electronic tax forms for which narrative explanations can be presented
US10402913B2 (en) 2015-07-30 2019-09-03 Intuit Inc. Generation of personalized and hybrid responses to queries submitted from within tax return preparation system during preparation of electronic tax return
US10025565B2 (en) 2015-08-19 2018-07-17 Integrator Software Integrated software development environments, systems, methods, and memory models
US10043218B1 (en) 2015-08-19 2018-08-07 Basil M. Sabbah System and method for a web-based insurance communication platform
US11379929B2 (en) * 2015-08-26 2022-07-05 Hrb Innovations, Inc. Advice engine
US10127264B1 (en) 2015-09-17 2018-11-13 Ab Initio Technology Llc Techniques for automated data analysis
US10387366B2 (en) * 2015-10-08 2019-08-20 Via Alliance Semiconductor Co., Ltd. Neural network unit with shared activation function units
US10740854B1 (en) 2015-10-28 2020-08-11 Intuit Inc. Web browsing and machine learning systems for acquiring tax data during electronic tax return preparation
US10558880B2 (en) 2015-11-29 2020-02-11 Vatbox, Ltd. System and method for finding evidencing electronic documents based on unstructured data
US10509811B2 (en) 2015-11-29 2019-12-17 Vatbox, Ltd. System and method for improved analysis of travel-indicating unstructured electronic documents
US11138372B2 (en) 2015-11-29 2021-10-05 Vatbox, Ltd. System and method for reporting based on electronic documents
US10387561B2 (en) 2015-11-29 2019-08-20 Vatbox, Ltd. System and method for obtaining reissues of electronic documents lacking required data
CN105424726B (en) * 2016-01-12 2018-06-22 苏州富鑫林光电科技有限公司 Luminescent panel detection method based on machine vision
US20170213294A1 (en) * 2016-01-27 2017-07-27 Intuit Inc. Methods, systems and computer program products for calculating an estimated result of a tax return
US10475131B1 (en) 2016-01-27 2019-11-12 Intuit Inc. Methods, systems and computer program products for calculating an estimated result of a tax return
US11087409B1 (en) 2016-01-29 2021-08-10 Ocrolus, LLC Systems and methods for generating accurate transaction data and manipulation
US9508043B1 (en) 2016-02-05 2016-11-29 International Business Machines Corporation Extracting data from documents using proximity of labels and data and font attributes
US11030183B2 (en) * 2016-03-14 2021-06-08 Dr Holdco 2, Inc. Automatic content-based append detection
EP3458971A4 (en) * 2016-05-18 2019-11-06 Vatbox, Ltd. System and method for automatically monitoring requests indicated in electronic documents
US10410295B1 (en) 2016-05-25 2019-09-10 Intuit Inc. Methods, systems and computer program products for obtaining tax data
US11176620B1 (en) 2016-06-28 2021-11-16 Intuit Inc. Systems and methods for generating an error report listing errors in the preparation of a payroll tax form
US11049190B2 (en) 2016-07-15 2021-06-29 Intuit Inc. System and method for automatically generating calculations for fields in compliance forms
US11222266B2 (en) 2016-07-15 2022-01-11 Intuit Inc. System and method for automatic learning of functions
US10140277B2 (en) 2016-07-15 2018-11-27 Intuit Inc. System and method for selecting data sample groups for machine learning of context of data fields for various document types and/or for test data generation for quality assurance systems
US10579721B2 (en) 2016-07-15 2020-03-03 Intuit Inc. Lean parsing: a natural language processing system and method for parsing domain-specific languages
US10725896B2 (en) 2016-07-15 2020-07-28 Intuit Inc. System and method for identifying a subset of total historical users of a document preparation system to represent a full set of test scenarios based on code coverage
US9984471B2 (en) * 2016-07-26 2018-05-29 Intuit Inc. Label and field identification without optical character recognition (OCR)
US10013643B2 (en) * 2016-07-26 2018-07-03 Intuit Inc. Performing optical character recognition using spatial information of regions within a structured document
US10796231B2 (en) 2016-07-26 2020-10-06 Intuit Inc. Computer-implemented systems and methods for preparing compliance forms to meet regulatory requirements
US10872315B1 (en) 2016-07-27 2020-12-22 Intuit Inc. Methods, systems and computer program products for prioritization of benefit qualification questions
US10762472B1 (en) 2016-07-27 2020-09-01 Intuit Inc. Methods, systems and computer program products for generating notifications of benefit qualification change
US11055794B1 (en) 2016-07-27 2021-07-06 Intuit Inc. Methods, systems and computer program products for estimating likelihood of qualifying for benefit
US11087411B2 (en) 2016-07-27 2021-08-10 Intuit Inc. Computerized tax return preparation system and computer generated user interfaces for tax topic completion status modifications
US10769592B1 (en) 2016-07-27 2020-09-08 Intuit Inc. Methods, systems and computer program products for generating explanations for a benefit qualification change
CN106294590B (en) * 2016-07-29 2019-05-31 重庆邮电大学 Social network spam user filtering method based on semi-supervised learning
US10360507B2 (en) 2016-09-22 2019-07-23 nference, inc. Systems, methods, and computer readable media for visualization of semantic information and inference of temporal signals indicating salient associations between life science entities
US10542017B1 (en) * 2016-10-13 2020-01-21 Symantec Corporation Systems and methods for personalizing security incident reports
US10664926B2 (en) * 2016-10-26 2020-05-26 Intuit Inc. Methods, systems and computer program products for generating and presenting explanations for tax questions
US11138676B2 (en) 2016-11-29 2021-10-05 Intuit Inc. Methods, systems and computer program products for collecting tax data
JP2018112839A (en) * 2017-01-10 2018-07-19 富士通株式会社 Image processing program, image recognition program, image processing device, image recognition device, image recognition method, and image processing method
US11295396B1 (en) 2017-01-30 2022-04-05 Intuit Inc. Computer-implemented methods systems and articles of manufacture for image-initiated preparation of electronic tax return
US11176621B1 (en) 2017-01-30 2021-11-16 Intuit Inc. Computer-implemented methods systems and articles of manufacture for addressing optical character recognition triggered import errors during preparation of electronic tax return
US10977744B1 (en) * 2017-01-30 2021-04-13 Intuit Inc. Computer-implemented methods systems and articles of manufacture for validating electronic tax return data
US20180285982A1 (en) * 2017-03-28 2018-10-04 Intuit Inc. Automated field-mapping of account names for form population
CN107153689A (en) * 2017-04-29 2017-09-12 安徽富驰信息技术有限公司 Case search method based on topic similarity
US10504220B2 (en) 2017-05-25 2019-12-10 General Electric Company Neural network feature recognition system
US10902284B2 (en) 2017-05-31 2021-01-26 Hcl Technologies Limited Identifying optimum pre-process techniques for text extraction
US10474890B2 (en) * 2017-07-13 2019-11-12 Intuit, Inc. Simulating image capture
CN107741843A (en) * 2017-10-10 2018-02-27 中国航发控制系统研究所 Inspection method and inspection device for embedded software specifications
AU2018355543B2 (en) * 2017-10-27 2021-01-21 Intuit Inc. System and method for identifying a subset of total historical users of a document preparation system to represent a full set of test scenarios based on statistical analysis
US10706228B2 (en) * 2017-12-01 2020-07-07 International Business Machines Corporation Heuristic domain targeted table detection and extraction technique
WO2019106639A1 (en) * 2017-12-03 2019-06-06 Seedx Technologies Inc. Systems and methods for sorting of seeds
US11503757B2 (en) * 2017-12-03 2022-11-22 Seedx Technologies Inc. Systems and methods for sorting of seeds
EP3707642A1 (en) 2017-12-03 2020-09-16 Seedx Technologies Inc. Systems and methods for sorting of seeds
US11544799B2 (en) * 2017-12-05 2023-01-03 Sureprep, Llc Comprehensive tax return preparation system
US11238540B2 (en) 2017-12-05 2022-02-01 Sureprep, Llc Automatic document analysis, filtering, and matching system
US11314887B2 (en) 2017-12-05 2022-04-26 Sureprep, Llc Automated document access regulation system
AU2018100324B4 (en) * 2017-12-18 2018-07-19 LIS Pty Ltd Image Analysis
WO2019160608A1 (en) * 2018-02-16 2019-08-22 Munich Reinsurance America, Inc. Computer-implemented methods, computer-readable media, and systems for identifying causes of loss
US11030705B1 (en) * 2018-02-28 2021-06-08 Intuit Inc. Quick serve tax application
US10489644B2 (en) 2018-03-15 2019-11-26 Sureprep, Llc System and method for automatic detection and verification of optical character recognition data
US11048762B2 (en) * 2018-03-16 2021-06-29 Open Text Holdings, Inc. User-defined automated document feature modeling, extraction and optimization
US10762142B2 (en) 2018-03-16 2020-09-01 Open Text Holdings, Inc. User-defined automated document feature extraction and optimization
CN109086327B (en) * 2018-07-03 2022-05-17 中国科学院信息工程研究所 Method and device for rapidly generating webpage visual structure graph
US10586133B2 (en) * 2018-07-23 2020-03-10 Scribe Fusion, LLC System and method for processing character images and transforming font within a document
WO2020033409A1 (en) * 2018-08-06 2020-02-13 Walmart Apollo, Llc Artificial intelligence system and method for auto-naming customer tree nodes in a data structure
US10853638B2 (en) * 2018-08-31 2020-12-01 Accenture Global Solutions Limited System and method for extracting structured information from image documents
US11763321B2 (en) 2018-09-07 2023-09-19 Moore And Gasperecz Global, Inc. Systems and methods for extracting requirements from regulatory content
US20200110795A1 (en) * 2018-10-05 2020-04-09 Adobe Inc. Facilitating auto-completion of electronic forms with hierarchical entity data models
US11640859B2 (en) 2018-10-17 2023-05-02 Tempus Labs, Inc. Data based cancer research and treatment systems and methods
US10395772B1 (en) 2018-10-17 2019-08-27 Tempus Labs Mobile supplementation, extraction, and analysis of health records
CN113272806A (en) * 2018-11-07 2021-08-17 艾利文Ai有限公司 Removing sensitive data from a file for use as a training set
US11450069B2 (en) 2018-11-09 2022-09-20 Citrix Systems, Inc. Systems and methods for a SaaS lens to view obfuscated content
US10902200B2 (en) 2018-11-12 2021-01-26 International Business Machines Corporation Automated constraint extraction and testing
JP6929823B2 (en) * 2018-11-16 2021-09-01 株式会社東芝 Reading system, reading method, program, storage medium, and mobile object
JP7154982B2 (en) * 2018-12-06 2022-10-18 キヤノン株式会社 Information processing device, control method, and program
EP3895068A4 (en) * 2018-12-12 2022-07-13 Hewlett-Packard Development Company, L.P. Scanning devices with zonal ocr user interfaces
US20200250766A1 (en) * 2019-02-06 2020-08-06 Teachers Insurance And Annuity Association Of America Automated customer enrollment using mobile communication devices
US10402641B1 (en) 2019-03-19 2019-09-03 Capital One Services, Llc Platform for document classification
RU2702967C1 (en) * 2019-03-28 2019-10-14 Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк) Method and system for checking an electronic set of documents
US11373029B2 (en) * 2019-04-01 2022-06-28 Hyland Uk Operations Limited System and method integrating machine learning algorithms to enrich documents in a content management system
US11783005B2 (en) 2019-04-26 2023-10-10 Bank Of America Corporation Classifying and mapping sentences using machine learning
US11423220B1 (en) 2019-04-26 2022-08-23 Bank Of America Corporation Parsing documents using markup language tags
US11055822B2 (en) 2019-05-03 2021-07-06 International Business Machines Corporation Artificially intelligent, machine learning-based, image enhancement, processing, improvement and feedback algorithms
US11163956B1 (en) 2019-05-23 2021-11-02 Intuit Inc. System and method for recognizing domain specific named entities using domain specific word embeddings
US11310250B2 (en) 2019-05-24 2022-04-19 Bank Of America Corporation System and method for machine learning-based real-time electronic data quality checks in online machine learning and AI systems
JP7379876B2 (en) * 2019-06-17 2023-11-15 株式会社リコー Character recognition device, document file generation method, and document file generation program
JP2022537300A (en) 2019-06-21 2022-08-25 エヌフェレンス,インコーポレイテッド Systems and methods for computing using personal healthcare data
US11487902B2 (en) 2019-06-21 2022-11-01 nference, inc. Systems and methods for computing with private healthcare data
WO2021003378A1 (en) * 2019-07-02 2021-01-07 Insurance Services Office, Inc. Computer vision systems and methods for blind localization of image forgery
EP3999929A4 (en) * 2019-07-16 2023-06-21 nference, inc. Systems and methods for populating a structured database based on an image representation of a data table
CN110414480A (en) * 2019-08-09 2019-11-05 威盛电子股份有限公司 Training image production method and electronic device
WO2021035224A1 (en) * 2019-08-22 2021-02-25 Tempus Labs, Inc. Unsupervised learning and prediction of lines of therapy from high-dimensional longitudinal medications data
US11556711B2 (en) 2019-08-27 2023-01-17 Bank Of America Corporation Analyzing documents using machine learning
US11449559B2 (en) 2019-08-27 2022-09-20 Bank Of America Corporation Identifying similar sentences for machine learning
US11423231B2 (en) 2019-08-27 2022-08-23 Bank Of America Corporation Removing outliers from training data for machine learning
US11526804B2 (en) 2019-08-27 2022-12-13 Bank Of America Corporation Machine learning model training for reviewing documents
US11392628B1 (en) * 2019-09-09 2022-07-19 Ciitizen, Llc Custom tags based on word embedding vector spaces
US11941706B2 (en) 2019-09-16 2024-03-26 K1X, Inc. Machine learning system for summarizing tax documents with non-structured portions
RU2739342C1 (en) * 2019-09-17 2020-12-23 Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк) Method and system for intelligent document processing
US11042562B2 (en) * 2019-10-11 2021-06-22 Sap Se Scalable data extractor
US11126837B1 (en) 2019-10-17 2021-09-21 Automation Anywhere, Inc. Computerized recognition of checkboxes in digitized documents
US20210134407A1 (en) * 2019-10-30 2021-05-06 Veda Data Solutions, Inc. Efficient crawling using path scheduling, and applications thereof
CN112783825B (en) * 2019-11-04 2024-01-02 富泰华工业(深圳)有限公司 Data archiving method, device, computer device and storage medium
RU2737720C1 (en) * 2019-11-20 2020-12-02 Общество с ограниченной ответственностью "Аби Продакшн" Retrieving fields using neural networks without using templates
US11182604B1 (en) 2019-11-26 2021-11-23 Automation Anywhere, Inc. Computerized recognition and extraction of tables in digitized documents
US11528267B2 (en) 2019-12-06 2022-12-13 Bank Of America Corporation System for automated image authentication and external database verification
US11210507B2 (en) 2019-12-11 2021-12-28 Optum Technology, Inc. Automated systems and methods for identifying fields and regions of interest within a document image
US11227153B2 (en) 2019-12-11 2022-01-18 Optum Technology, Inc. Automated systems and methods for identifying fields and regions of interest within a document image
US11544415B2 (en) 2019-12-17 2023-01-03 Citrix Systems, Inc. Context-aware obfuscation and unobfuscation of sensitive content
US11539709B2 (en) * 2019-12-23 2022-12-27 Citrix Systems, Inc. Restricted access to sensitive content
CN111199538B (en) * 2019-12-25 2022-11-25 杭州中威电子股份有限公司 Privacy protection degree evaluation method for multilayer compressed sensing image
CN111222340B (en) * 2020-01-15 2021-12-07 东华大学 Breast electronic medical record entity recognition system based on multi-standard active learning
US11481691B2 (en) 2020-01-16 2022-10-25 Hyper Labs, Inc. Machine learning-based text recognition system with fine-tuning model
US11582266B2 (en) 2020-02-03 2023-02-14 Citrix Systems, Inc. Method and system for protecting privacy of users in session recordings
US11783128B2 (en) 2020-02-19 2023-10-10 Intuit Inc. Financial document text conversion to computer readable operations
JP2021144512A (en) * 2020-03-12 2021-09-24 富士フイルムビジネスイノベーション株式会社 Document processing device and program
US20210294851A1 (en) * 2020-03-23 2021-09-23 UiPath, Inc. System and method for data augmentation for document understanding
CN111539438B (en) * 2020-04-28 2024-01-12 北京百度网讯科技有限公司 Text content recognition method and device, and electronic device
US11341318B2 (en) 2020-07-07 2022-05-24 Kudzu Software Llc Interactive tool for modifying an automatically generated electronic form
US11403455B2 (en) 2020-07-07 2022-08-02 Kudzu Software Llc Electronic form generation from electronic documents
US11495014B2 (en) * 2020-07-22 2022-11-08 Optum, Inc. Systems and methods for automated document image orientation correction
US11335110B2 (en) * 2020-08-05 2022-05-17 Verizon Patent And Licensing Inc. Systems and methods for processing a table of information in a document
WO2022041163A1 (en) 2020-08-29 2022-03-03 Citrix Systems, Inc. Identity leak prevention
CN112085019A (en) * 2020-08-31 2020-12-15 深圳思谋信息科技有限公司 Character recognition model generation system, method, and device, and computer equipment
US10956673B1 (en) * 2020-09-10 2021-03-23 Moore & Gasperecz Global Inc. Method and system for identifying citations within regulatory content
US11295175B1 (en) * 2020-09-25 2022-04-05 International Business Machines Corporation Automatic document separation
US20220108772A1 (en) * 2020-10-01 2022-04-07 Gsi Technology Inc. Functional protein classification for pandemic research
US20220121881A1 (en) * 2020-10-19 2022-04-21 Fulcrum Global Technologies Inc. Systems and methods for enabling relevant data to be extracted from a plurality of documents
US11314922B1 (en) 2020-11-27 2022-04-26 Moore & Gasperecz Global Inc. System and method for generating regulatory content requirement descriptions
US20220147814A1 (en) 2020-11-09 2022-05-12 Moore & Gasperecz Global Inc. Task specific processing of regulatory content
KR102500725B1 (en) * 2020-11-17 2023-02-16 주식회사 한글과컴퓨터 Electronic apparatus that generates a summary of an electronic document based on core keywords, and operating method thereof
US11501550B2 (en) 2020-11-24 2022-11-15 International Business Machines Corporation Optical character recognition segmentation
CN112329777B (en) * 2021-01-06 2021-05-04 平安科技(深圳)有限公司 Character recognition method, device, equipment and medium based on direction detection
WO2022150838A1 (en) * 2021-01-08 2022-07-14 Schlumberger Technology Corporation Exploration and production document content and metadata scanner
AU2021428503A1 (en) * 2021-02-18 2023-09-21 Xero Limited Systems and methods for generating document numerical representations
US11860950B2 (en) * 2021-03-30 2024-01-02 Sureprep, Llc Document matching and data extraction
US11630644B2 (en) 2021-05-27 2023-04-18 Bank Of America Corporation Service for configuring custom software
CN113361253B (en) * 2021-05-28 2024-04-09 北京金山数字娱乐科技有限公司 Recognition model training method and device
US11893012B1 (en) 2021-05-28 2024-02-06 Amazon Technologies, Inc. Content extraction using related entity group metadata from reference objects
WO2023059759A1 (en) * 2021-10-06 2023-04-13 Schlumberger Technology Corporation Well completion selection and design using data insights
CN113762224B (en) * 2021-11-09 2022-04-29 四川野马科技有限公司 Engineering cost deliverable quality inspection system and method
KR20230080113A (en) * 2021-11-29 2023-06-07 신은영 System and method for automatically extracting the location and type of questions from learning content in electronic document form
US11881042B2 (en) 2021-11-30 2024-01-23 International Business Machines Corporation Semantic template matching
US20230196469A1 (en) * 2021-12-17 2023-06-22 Get Heal, Inc. System and Method for Processing Insurance Cards
US11687700B1 (en) * 2022-02-01 2023-06-27 International Business Machines Corporation Generating a structure of a PDF-document
US11863615B2 (en) 2022-03-18 2024-01-02 T-Mobile Usa, Inc. Content management systems providing zero recovery time objective
US11934421B2 (en) 2022-06-03 2024-03-19 Cognizant Technology Solutions India Pvt. Ltd. Unified extraction platform for optimized data extraction and processing
US11823477B1 (en) 2022-08-30 2023-11-21 Moore And Gasperecz Global, Inc. Method and system for extracting data from tables within regulatory content
SE2251012A1 (en) * 2022-08-31 2024-03-01 Seamless Distrib Systems Ab System and method for form-filling by character recognition of identity documents
CN116052193B (en) * 2023-04-03 2023-06-30 杭州实在智能科技有限公司 Dynamic form picking and matching method and system for RPA interfaces

Family Cites Families (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5848184A (en) * 1993-03-15 1998-12-08 Unisys Corporation Document page analyzer and method
US5604824A (en) * 1994-09-22 1997-02-18 Houston Advanced Research Center Method and apparatus for compression and decompression of documents and the like using splines and spline-wavelets
US5963966A (en) * 1995-11-08 1999-10-05 Cybernet Systems Corporation Automated capture of technical documents for electronic review and distribution
US5850480A (en) * 1996-05-30 1998-12-15 Scan-Optics, Inc. OCR error correction methods and apparatus utilizing contextual comparison
US6507662B1 (en) * 1998-09-11 2003-01-14 Quid Technologies Llc Method and system for biometric recognition based on electric and/or magnetic properties
US7149347B1 (en) * 2000-03-02 2006-12-12 Science Applications International Corporation Machine learning of document templates for data extraction
US6668085B1 (en) * 2000-08-01 2003-12-23 Xerox Corporation Character matching process for text converted from images
US6735337B2 (en) * 2001-02-02 2004-05-11 Shih-Jong J. Lee Robust method for automatic reading of skewed, rotated or partially obscured characters
US20040013302A1 (en) * 2001-12-04 2004-01-22 Yue Ma Document classification and labeling using layout graph matching
US7142728B2 (en) * 2002-05-17 2006-11-28 Science Applications International Corporation Method and system for extracting information from a document
US20040059462A1 (en) * 2002-09-20 2004-03-25 Norris Michael O. Hand held OCR apparatus and method
US20040064404A1 (en) * 2002-10-01 2004-04-01 Menachem Cohen Computer-based method for automatic remote coding of debtor credit databases with bankruptcy filing information
US7142713B1 (en) * 2002-10-24 2006-11-28 Foundationip, Llc Automated docketing system
US20040162831A1 (en) * 2003-02-06 2004-08-19 Patterson John Douglas Document handling system and method
US20060122983A1 (en) * 2004-12-03 2006-06-08 King Martin T Locating electronic instances of documents based on rendered instances, document fragment digest generation, and digest based document fragment determination
US7546259B1 (en) * 2004-05-28 2009-06-09 Thomson Financial Llc Apparatus, method and system for a securities tracking management system
WO2006017229A2 (en) * 2004-07-12 2006-02-16 Kyos Systems Inc. Forms based computer interface
US7751624B2 (en) * 2004-08-19 2010-07-06 Nextace Corporation System and method for automating document search and report generation
US8510283B2 (en) * 2006-07-31 2013-08-13 Ricoh Co., Ltd. Automatic adaption of an image recognition system to image capture devices
US8086038B2 (en) * 2007-07-11 2011-12-27 Ricoh Co., Ltd. Invisible junction features for patch recognition
US8276088B2 (en) * 2007-07-11 2012-09-25 Ricoh Co., Ltd. User interface for three-dimensional navigation
US20070061319A1 (en) * 2005-09-09 2007-03-15 Xerox Corporation Method for document clustering based on page layout attributes
US20070067278A1 (en) * 2005-09-22 2007-03-22 Gtess Corporation Data file correlation system and method
US8176004B2 (en) * 2005-10-24 2012-05-08 Capsilon Corporation Systems and methods for intelligent paperless document management
US8233751B2 (en) * 2006-04-10 2012-07-31 Patel Nilesh V Method and system for simplified recordkeeping including transcription and voting based verification
US7697758B2 (en) * 2006-09-11 2010-04-13 Google Inc. Shape clustering and cluster-level manual identification in post optical character recognition processing
US7650035B2 (en) * 2006-09-11 2010-01-19 Google Inc. Optical character recognition based on shape clustering and multiple optical character recognition processes
US7849398B2 (en) * 2007-04-26 2010-12-07 Xerox Corporation Decision criteria for automated form population
US8538184B2 (en) * 2007-11-06 2013-09-17 Gruntworx, Llc Systems and methods for handling and distinguishing binarized, background artifacts in the vicinity of document text and image features indicative of a document category
US8250469B2 (en) * 2007-12-03 2012-08-21 Microsoft Corporation Document layout extraction
US8392816B2 (en) * 2007-12-03 2013-03-05 Microsoft Corporation Page classifier engine
US8000956B2 (en) * 2008-02-08 2011-08-16 Xerox Corporation Semantic compatibility checking for automatic correction and discovery of named entities
US7480411B1 (en) * 2008-03-03 2009-01-20 International Business Machines Corporation Adaptive OCR for books
US8756229B2 (en) * 2009-06-26 2014-06-17 Quantifind, Inc. System and methods for units-based numeric information retrieval
US20110255789A1 (en) * 2010-01-15 2011-10-20 Copanion, Inc. Systems and methods for automatically extracting data from electronic documents containing multiple layout features

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5642288A (en) * 1994-11-10 1997-06-24 Documagix, Incorporated Intelligent document recognition and handling
US20010042083A1 (en) * 1997-08-15 2001-11-15 Takashi Saito User-defined search template for extracting information from documents
US5999664A (en) * 1997-11-14 1999-12-07 Xerox Corporation System for searching a corpus of document images by user specified document layout components
US6976207B1 (en) * 1999-04-28 2005-12-13 Ser Solutions, Inc. Classification method and apparatus
US20060212413A1 (en) * 1999-04-28 2006-09-21 Pal Rujan Classification method and apparatus
US7305612B2 (en) * 2003-03-31 2007-12-04 Siemens Corporate Research, Inc. Systems and methods for automatic form segmentation for raster-based passive electronic documents
US20060190489A1 (en) * 2005-02-23 2006-08-24 Janet Vohariwatt System and method for electronically processing document images
US20070118391A1 (en) * 2005-10-24 2007-05-24 Capsilon Fsg, Inc. Business Method Using The Automated Processing of Paper and Unstructured Electronic Documents
US20070168382A1 (en) * 2006-01-03 2007-07-19 Michael Tillberg Document analysis system for integration of paper records into a searchable electronic database
US20080062472A1 (en) * 2006-09-12 2008-03-13 Morgan Stanley Document handling
US20100145902A1 (en) * 2008-12-09 2010-06-10 Ita Software, Inc. Methods and systems to train models to extract and integrate information from data sources

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wong, K., et al., "Document Analysis System," IBM J. Res. Develop., Vol. 26, No. 6, November 1982, pp. 647-656. *
Wu, V., et al., "TextFinder: An Automatic System to Detect and Recognize Text in Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 11, November 1999, pp. 1224-1229. *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10204143B1 (en) 2011-11-02 2019-02-12 Dub Software Group, Inc. System and method for automatic document management
US9165207B2 (en) 2013-02-25 2015-10-20 Google Inc. Screenshot orientation detection
US20170177995A1 (en) * 2014-03-20 2017-06-22 The Regents Of The University Of California Unsupervised high-dimensional behavioral data classifier
US10489707B2 (en) * 2014-03-20 2019-11-26 The Regents Of The University Of California Unsupervised high-dimensional behavioral data classifier
US20160085812A1 (en) * 2014-09-24 2016-03-24 Samsung Electronics Co., Ltd. Method of managing content in electronic apparatus and electronic apparatus controlled according to the method
CN104361033A (en) * 2014-10-27 2015-02-18 深圳职业技术学院 Automatic cancer-related information collection method and system
US10606651B2 (en) 2015-04-17 2020-03-31 Microsoft Technology Licensing, Llc Free form expression accelerator with thread length-based thread assignment to clustered soft processor cores that share a functional circuit
US10452995B2 (en) 2015-06-29 2019-10-22 Microsoft Technology Licensing, Llc Machine learning classification on hardware accelerators with stacked memory
US10540588B2 (en) 2015-06-29 2020-01-21 Microsoft Technology Licensing, Llc Deep neural network processing on hardware accelerators with stacked memory
US10839302B2 (en) 2015-11-24 2020-11-17 The Research Foundation For The State University Of New York Approximate value iteration with complex returns by bounding
AU2016269570B2 (en) * 2015-12-29 2017-12-07 Accenture Global Solutions Limited Document processing
US10713431B2 (en) 2015-12-29 2020-07-14 Accenture Global Solutions Limited Digital document processing based on document source or document type
US11288590B2 (en) * 2016-05-24 2022-03-29 International Business Machines Corporation Automatic generation of training sets using subject matter experts on social media
US20190294874A1 (en) * 2018-03-23 2019-09-26 Abbyy Production Llc Automatic definition of set of categories for document classification
CN109446522A (en) * 2018-10-22 2019-03-08 东莞市七宝树教育科技有限公司 Automatic classification system and method for examination questions
AU2019366169B2 (en) * 2018-10-26 2023-03-30 Servicenow Canada Inc. Sensitive data detection and replacement
WO2020082187A1 (en) * 2018-10-26 2020-04-30 Element Ai Inc. Sensitive data detection and replacement
CN109299271A (en) * 2018-10-30 2019-02-01 腾讯科技(深圳)有限公司 Methods for training sample generation, text data processing, and public opinion event classification, and related device
CN110175623A (en) * 2019-04-10 2019-08-27 阿里巴巴集团控股有限公司 Desensitization processing method and device based on image recognition
CN110851675A (en) * 2019-10-10 2020-02-28 厦门市美亚柏科信息股份有限公司 Data extraction method, device and medium
CN110705251A (en) * 2019-10-14 2020-01-17 支付宝(杭州)信息技术有限公司 Text analysis method and device executed by computer
WO2021105805A1 (en) * 2019-11-25 2021-06-03 Vanitha S Robot and method for providing content in a newspaper and other physical documents
WO2021221614A1 (en) * 2020-04-28 2021-11-04 Hewlett-Packard Development Company, L.P. Document orientation detection and correction
WO2022026135A1 (en) * 2020-07-29 2022-02-03 Docusign, Inc. Automated document tagging in a digital management platform
US11704352B2 (en) 2021-05-03 2023-07-18 Bank Of America Corporation Automated categorization and assembly of low-quality images into electronic documents
US11798258B2 (en) 2021-05-03 2023-10-24 Bank Of America Corporation Automated categorization and assembly of low-quality images into electronic documents
US11881041B2 (en) 2021-09-02 2024-01-23 Bank Of America Corporation Automated categorization and processing of document images of varying degrees of quality
CN116152833A (en) * 2022-12-30 2023-05-23 北京百度网讯科技有限公司 Image-based form restoration model training method and form restoration method
CN116186543A (en) * 2023-03-01 2023-05-30 深圳崎点数据有限公司 Financial data processing system and method based on image recognition

Also Published As

Publication number Publication date
US20110258170A1 (en) 2011-10-20
US20110255782A1 (en) 2011-10-20
US20110255784A1 (en) 2011-10-20
US8571317B2 (en) 2013-10-29
US20110255790A1 (en) 2011-10-20
US20110258195A1 (en) 2011-10-20
US20110255794A1 (en) 2011-10-20
US20110255788A1 (en) 2011-10-20
US20110258182A1 (en) 2011-10-20
US8897563B1 (en) 2014-11-25
US20110255789A1 (en) 2011-10-20

Similar Documents

Publication Publication Date Title
US8897563B1 (en) Systems and methods for automatically processing electronic documents
US20110249905A1 (en) Systems and methods for automatically extracting data from electronic documents including tables
US8538184B2 (en) Systems and methods for handling and distinguishing binarized, background artifacts in the vicinity of document text and image features indicative of a document category
US11676411B2 (en) Systems and methods for neuronal visual-linguistic data retrieval from an imaged document
US11816165B2 (en) Identification of fields in documents with neural networks without templates
US11676185B2 (en) System and methods of an expense management system based upon business document analysis
AU2020200251B2 (en) Label and field identification without optical character recognition (OCR)
US8843494B1 (en) Method and system for using keywords to merge document clusters
US9659213B2 (en) System and method for efficient recognition of handwritten characters in documents
US11379690B2 (en) System to extract information from documents
Mehri Historical document image analysis: a structural approach based on texture
US11961094B2 (en) Fraud detection via automated handwriting clustering
US20220156756A1 (en) Fraud detection via automated handwriting clustering
Yin et al. The Image Preprocessing and Check of Amount for VAT Invoices
Campbell Computational Analysis of Documents
Zhu Content recognition and context modeling for document analysis and retrieval

Legal Events

Date Code Title Description
AS Assignment

Owner name: COPANION, INC., MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NEOGI, DEPANKAR;LADD, STEVEN K.;WELLING, GIRISH;AND OTHERS;SIGNING DATES FROM 20110228 TO 20110527;REEL/FRAME:026651/0929

AS Assignment

Owner name: GRUNTWORX, LLC, NORTH CAROLINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:COPANION, INC.;REEL/FRAME:027676/0596

Effective date: 20110707

AS Assignment

Owner name: GRUNTWORX, LLC, NORTH CAROLINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:COPANION, INC.;REEL/FRAME:028157/0982

Effective date: 20110727

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION