US 20040042667A1
(19) United States
(12) Patent Application Publication (10) Pub. No.: US 2004/0042667 A1
Lee et al. (43) Pub. Date: Mar. 4, 2004
(54) EXTRACTING INFORMATION FROM SYMBOLICALLY COMPRESSED DOCUMENT IMAGES
(76) Inventors: Dar-Shyang Lee, Union City, CA (US);
Jonathan J. Hull, San Carlos, CA (US)
Correspondence Address:
BLAKELY, SOKOLOFF, TAYLOR & ZAFMAN
LLP
Seventh Floor
12400 Wilshire Boulevard
Los Angeles, CA 90025 (US)
(21) Appl. No.: 10/676,881
(22) Filed: Sep. 30, 2003
Related U.S. Application Data
(62) Division of application No. 09/289,772, filed on Apr. 8, 1999, now Pat. No. 6,658,151.
Publication Classification
(51) Int. Cl.7 G06K 9/68
(52) U.S. Cl. 382/230; 382/218; 382/243
(57) ABSTRACT
A method and apparatus for extracting information from symbolically compressed document images. A deciphering module generates first and second text strings by deciphering respective sequences of template identifiers in first and second symbolically compressed document images. A conditional n-gram module receives the first and second text strings from the deciphering module and extracts n-gram terms therefrom based on a predicate condition. A comparison module generates a measure of similarity between the first and second symbolically compressed document images based on the n-gram terms extracted by the conditional n-gram module.
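The conditional n-gram and comparison steps described in the abstract can be illustrated with a minimal sketch. It assumes the deciphering step has already produced two text strings. The abstract does not specify the predicate condition or the similarity measure, so the predicate used here (keep only characters that follow a space) and the cosine measure are illustrative assumptions, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Callable

# Hypothetical predicate type: decides whether the character at index i is kept.
Predicate = Callable[[str, int], bool]


def follows_space(text: str, i: int) -> bool:
    """Assumed example predicate: keep a character only if it follows a space."""
    return i > 0 and text[i - 1] == " "


def conditional_ngrams(text: str, n: int, predicate: Predicate) -> Counter:
    """Form n-grams from only those characters that satisfy the predicate."""
    kept = [ch for i, ch in enumerate(text) if predicate(text, i)]
    return Counter("".join(kept[i:i + n]) for i in range(len(kept) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Assumed similarity measure: cosine of the two n-gram count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def document_similarity(text_a: str, text_b: str, n: int = 2) -> float:
    """Compare two deciphered text strings via their conditional n-grams."""
    grams_a = conditional_ngrams(text_a, n, follows_space)
    grams_b = conditional_ngrams(text_b, n, follows_space)
    return cosine_similarity(grams_a, grams_b)
```

In this sketch, two strings whose kept characters form the same sequence score close to 1, and strings that share no conditional n-grams score 0. Because the predicate is passed in as a function, a different condition can be substituted without changing the comparison step.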