US20050192994A1 - Data compression method and apparatus - Google Patents

Data compression method and apparatus

Info

Publication number: US20050192994A1 (granted as US7720878B2)
Application number: US 11/110,554
Inventors: Donald Caldwell, Kenneth Church, Glenn Fowler
Original assignee: Individual
Legal status: Granted; Expired - Fee Related

Classifications

    • H03M7/30 Compression; expansion; suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression using adaptive string matching, e.g. the Lempel-Ziv method
    • G06F16/20 Information retrieval; database structures therefor; file system structures therefor, of structured data, e.g. relational data
    • Y10S707/99941 Data processing: database schema or data structure
    • Y10S707/99942 Data processing: manipulating data structure, e.g. compression, compaction, compilation


Abstract

An improved data compression method and apparatus is provided, particularly with regard to the compression of data in tabular form such as database records. The present invention achieves improved compression ratios by utilizing metadata to transform the data in a manner that optimizes known compression techniques. In one embodiment of the invention, a schema is generated which is utilized to reorder and partition the data into low entropy and high entropy portions which are separately compressed by conventional compression methods. The high entropy portion is further reordered and partitioned to take advantage of row and column dependencies in the data. The present invention enables not only much greater compression ratios but also greater speed than is achieved by compressing the untransformed data.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application claims priority to Provisional Application Ser. No. 60/111,781, filed on Dec. 10, 1998, the content of which is incorporated by reference herein.
  • FIELD OF THE INVENTION
  • The present invention relates to data compression systems and methods.
  • BACKGROUND OF THE INVENTION
  • Data compression systems, which encode a digital data stream into compressed digital code signals and which decode the compressed digital code signals back into the original data, are known in the prior art. The methods utilized in data compression systems serve to reduce the amount of storage space required to hold the digital information and/or result in a savings in the amount of time required to transmit a given amount of information. For example, the extensive transactional records accounted for by companies such as banks and telephone companies are often stored for archival purposes in massive computer databases. This storage space is conserved, resulting in a significant monetary savings, if the data is compressed prior to storage and decompressed from the stored compressed files for later use.
  • Various methods and systems are known in the art for compressing and subsequently reconstituting data. For example, a compression scheme used pervasively on the Internet today is “gzip,” designed by Jean-Loup Gailly. See “DEFLATE Compressed Data Format Specification version 1.3”, RFC 1951, Network Working Group, May 1996; “GZIP file format specification version 4.3,” RFC 1952, Network Working Group, May 1996. Gzip utilizes a variation of the well-known LZ77 (Lempel-Ziv 1977) compression technique, which replaces duplicated strings of bytes within a frame of a pre-defined distance with a pointer to the original string. Gzip also uses Huffman coding on the block of bytes and stores the Huffman code tree with the compressed data block. Gzip normally achieves a compression ratio of about 2:1 or 3:1, the compression ratio being the size of the clear text relative to the size of the compressed text.
  • Gzip is a popular but suboptimal compression scheme. Nevertheless, the inventors, while conducting experiments on compressing massive data sets of telephone call detail records, managed to achieve compression ratios of around 15:1 when using gzip. The substantial reduction in size effected by merely using a conventional compression technique such as gzip suggested to the inventors that additional improvements to the compression ratio could be devised by a careful analysis of the structure of the data itself.
  • SUMMARY OF THE INVENTION
  • It is an object of the invention to provide an improved data compression method and apparatus, particularly with regard to the compression of data in tabular form such as database records. The present invention achieves improved compression ratios by utilizing metadata to transform the data in a manner that optimizes known compression techniques. The metadata not only leads to better compression results, it can be maintained by an automatic procedure. In one embodiment of the invention, a schema is generated which is utilized to reorder and partition the data into low entropy and high entropy portions which are separately compressed by conventional compression methods. The high entropy portion is further reordered and partitioned to take advantage of row and column dependencies in the data. The present invention takes advantage of the fact that some fields have more information than others, and some interactions among fields are important, but most are not. Parsimony dictates that unless the interactions are important, it is generally better to model the first order effects and ignore the higher order interactions. Through the proper analysis of such interactions, the present invention enables improvements in both space and time over conventional compression techniques.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIGS. 1A and 1B are block diagrams of a compression system in accordance with an embodiment of the present invention.
  • FIG. 2 is a flowchart setting forth a method of generating a schema in accordance with an embodiment of the present invention.
  • FIGS. 3A and 3B set forth an illustration of difference encoding (DIFE).
  • FIG. 4 sets forth pseudocode of a design for the PZIP compressor in accordance with an embodiment of the present invention.
  • FIG. 5 sets forth programming code from the PZIP compressor in accordance with an embodiment of the present invention.
  • FIG. 6 sets forth programming code from the PZIP decompressor in accordance with an embodiment of the present invention.
  • FIG. 7 sets forth an illustration of the data layout of PZIP.
  • FIG. 8 is an example of an induced schema partition file in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION
  • With reference to FIGS. 1A and 1B, a simple block diagram of a compression system in accordance with an embodiment of the present invention is shown. The input data can be any stream or sequence of digital data character signals that contains information in some tabular form. Data processing and communication systems conventionally process characters of the alphabets over which compression is to be effected as bytes or binary digits in a convenient code such as the ASCII format. For example, input characters may be received in the form of eight-bit bytes over an alphabet of 256 characters. The input data to be compressed should be arranged in the form of a table of information of a known or readily ascertainable geometry. This encompasses most, if not all, forms of computer information records, such as spreadsheets and database records (of both fixed and variable length) as well as the data constructs utilized in most popular programming languages. Although the units of information shall be referred to herein as “records” or as “rows” and “columns,” these terms are not meant to limit the nature of the present invention to the processing of traditional spreadsheet or database constructs.
  • In FIG. 1A, the input data 100 is processed and transformed into one or more streams of compressed data 140. The input data is initially arranged at 110 in accordance with what the inventors refer to as a “schema.” The schema 120 represents coded instructions for partitioning and reordering the data in a manner that optimizes the compression of the input data. Methods for devising such a schema are provided below. After the input data is rearranged and partitioned in accordance with the schema, the resulting data streams are, either concurrently or subsequently, compressed at 130 using any of a number of known compression schemes. The compressed data signals 140 may then be stored in electronic files on some storage medium or may be transmitted to a remote location for decoding/decompression. FIG. 1B demonstrates the corresponding decompression of the compressed data 140 into a copy of the input data 180. The compressed data is first decompressed at 150 using the analogue to whatever compression method was utilized at 130. The resulting data is then reordered and combined using the schema 120 to recreate the input data at 180.
  • The particular compression method used at 130 (and the corresponding decompression method at 150) does not matter for purposes of the present invention, although the particular method utilized will affect the nature of the schema used to optimize the compression results. The inventors have performed experiments with the gzip compression method, described above, although a richer set of compression methods may also be used, such as vdelta, Huffman coding, run-length encoding, etc.
  • The results will depend on the schema chosen for transforming the data prior to compression. The present invention emanates from the recognition that data prior to compression is often presented in a form that is suboptimal for conventional compression techniques. Transforming the data prior to compression is a method not unlike that of taking a log before performing linear regression. Data compression, like linear regression, is a practical—but imperfect—modeling tool. Transforms make it easier to capture the generalizations of interest, while making it less likely that the tool will be misled by irrelevant noise, outliers, higher order interactions, and so on. The entropy of a data set, which Shannon demonstrated was the space required by the best possible encoding scheme, does not depend on the enumeration order or any other invertible transform. Nevertheless, in accordance with the present invention, such transforms can make a significant difference for most practical coders/predictors.
  • The invention has the advantage that existing data interfaces can be preserved by embedding data transformations within the compressor. Applications can deal with the schema unchanged, while the compressor can deal with transformed data that better suits its algorithms. With improved compression rates a good implementation can trade the extra time spent transforming data against the IO time saved by moving less data.
  • A schema transform that is especially useful for most tabular data files is transposing data originally in row major order into columns of fields that are compressed separately. Data files containing tables of records, such as the following simple example from The Awk Programming Language, Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger, Addison Wesley, 1988, are often stored in row major order.
    Name Rate Hours
    Beth 4.00 0
    Dan 3.75 0
    Kathy 4.00 10
    Mark 5.00 20
    Mary 5.50 22
    Susie 4.25 18
  • As a further example, the following C code outputs a series of employee records in row major order:

    struct employee {
        char name[30];
        int age;
        int salary;
    } employees[1000];

    fwrite(employees, sizeof(employees), 1, stdout);

    The result is that the employee names, ages, and salaries are interleaved in the data stream but the records themselves are sequential. Row major order is extremely common and is favored by nearly all commercial databases including those offered by Informix, Oracle, and Sybase.
  • Row major order, however, is often suboptimal for compression purposes. In fact, the inventors have determined that, as a general rule of thumb, it is better to compress two columns of fields separately when the columns contain data that is independent. Consider the following example. Let X be a sequence of a million bits, N = 2^20, generated by a binomial process with a probability of P_X. Let Y be another sequence of N bits, generated by the same process, but with a parameter P_Y. The question is whether X and Y should be compressed separately or whether they should be combined, for example by interleaving the data into row major order and compressing the columns together. Using P_X = 0.5 and P_Y = 0 in a Monte Carlo experiment with the gzip compressor, the inventors found that gzip required approximately 1.0003 bits per symbol to store X and 0.0012 bits per symbol to store Y. The combined space required by gzip of 1.0015 bits per symbol is close to the true entropy of the sets, namely
    H(X) = −0.5 log2(0.5) − 0.5 log2(0.5) = 1 bit per symbol
    H(Y) = −0·log2(0) − 1·log2(1) = 0 bits per symbol (taking 0·log2(0) = 0 by convention)
    This is a good (but not perfect) compression result. However, when X and Y are interleaved, as they would be if they were in row major order, gzip requires approximately 1.44 bits per symbol, which is worse than column major order. This result is reversed if there is an obvious dependency between X and Y, e.g. where X is as above but Y mirrors the bits of X with a probability of P_Y and is the logical negation of X with a probability of 1 − P_Y. Accordingly, as a general rule (with possible exceptional cases), two vectors should be combined if there is an obvious dependency; otherwise, if the two vectors are independent, compression will not be improved by combining them and could possibly degrade. Thus, the common usage of row major order presents a practical opportunity for improvement in compression.
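  • The following is a minimal sketch of such a Monte Carlo comparison (this is not code from the patent; the use of zlib's compress2(), one byte per bit, and the buffer names are illustrative assumptions). It compresses the two columns separately and then interleaved, mimicking row major order:

    /* Build with: cc xy.c -lz */
    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    #define N (1L << 20)  /* 2^20 samples per column */

    /* Return the DEFLATE-compressed size of buf[0..len-1] in bytes. */
    static uLong compressed_size(const unsigned char *buf, uLong len)
    {
        uLong bound = compressBound(len);
        unsigned char *out = malloc(bound);
        uLongf outlen = bound;
        if (compress2(out, &outlen, buf, len, 6) != Z_OK)  /* 6: zlib default level */
            outlen = 0;
        free(out);
        return outlen;
    }

    int main(void)
    {
        unsigned char *x = malloc(N), *y = malloc(N), *xy = malloc(2 * N);
        for (long i = 0; i < N; i++) {
            x[i] = (unsigned char)(rand() & 1);  /* X: coin flips, P_X = 0.5 */
            y[i] = 0;                            /* Y: constant, P_Y = 0     */
            xy[2 * i] = x[i];                    /* interleaved, simulating  */
            xy[2 * i + 1] = y[i];                /* row major order          */
        }
        printf("separate:    %lu + %lu bytes\n",
               compressed_size(x, N), compressed_size(y, N));
        printf("interleaved: %lu bytes\n", compressed_size(xy, 2 * N));
        return 0;
    }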
  • In order to create an optimal schema, a given space of possible schemas must be searched for the one that leads to the best compression; i.e. it is a matter of deciding which interactions in the data are important—and which are not. Compression results can be improved by searching a larger space of possible schemas. For example, different column permutations may be tried in order to take advantage of dependencies among non-adjacent columns. Transforms can be used that remove redundancies in columns by replacing values in one column with default values (nulls) if they are the same as the values in another column.
  • The generation of an optimal schema can be relegated to a machine learning task. Dynamic programming can be utilized to determine the schema that leads to the best compression, given a data sample and the space of possible schemas. In one embodiment of the present invention, an optimal schema can be generated by the method set forth in FIG. 2. A representative sample of the data to be compressed is chosen. The data is first divided into two classes: that portion which has low information content, which can be dealt with as a whole, and a smaller portion containing high information content, which is processed further. In other words, the low entropy columns are separated from the high entropy columns; this is accomplished by counting the rate of change of the columns and separating based on a previously chosen threshold. The low entropy columns then no longer need be included in the processing and can be designated to be run-length encoded and compressed as a whole. By handling the highly redundant data (mostly default values) separately, the method not only improves the compression ratio but can also increase the speed of compression.
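  • A minimal sketch of this first split, assuming fixed-length records (the function name and threshold convention are illustrative assumptions; the change-counting follows the "frequency" definition given later in this description):

    #include <stddef.h>

    /*
     * data holds nrec records of reclen bytes each. On return,
     * is_high[j] is nonzero if byte column j changed from record to
     * record more often than threshold_pct percent of the time.
     */
    void classify_columns(const unsigned char *data, size_t nrec,
                          size_t reclen, double threshold_pct, int *is_high)
    {
        for (size_t j = 0; j < reclen; j++) {
            size_t changes = 0;
            for (size_t i = 1; i < nrec; i++)
                if (data[i * reclen + j] != data[(i - 1) * reclen + j])
                    changes++;
            double freq = (nrec > 1) ? 100.0 * changes / (nrec - 1) : 0.0;
            is_high[j] = freq > threshold_pct;
        }
    }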
  • The high information content data, on the other hand, can be further transformed to take advantage of both row and column dependencies. In a preferred embodiment of the invention, the high entropy data is fed to a dynamic programming module that searches for combinations of columns which minimize the compressed size of the data. This can be made more feasible by a preprocessing step that prunes the initial search space. An optimal ordering of the columns, for example, can be generated in a bottom-up fashion by an initial transformation Π. This Π-transformed training data is used to build a matrix containing the sizes of the compressed subintervals of the training data. This matrix is used to dynamically generate the optimal partition of the schema.
  • Accordingly, let the number of high entropy columns chosen be n. Let P be a sequence of intervals, denoted as a set of endpoints P = (i_1, i_2, . . . , i_p) such that i_1 < i_2 < . . . < i_p = n. Each i_j is the end of one interval, i.e. interval 1 is columns 1 through i_1, interval 2 is columns i_1 + 1 through i_2, etc. Let H̃_G be an n×n matrix where each cell contains the compressed size of an interval of the Π-transformed data. That is, for 1 ≤ i ≤ j ≤ n, let H̃_G[i,j] be the size of columns i through j after compression. The task is to find the schema such that compression of the fields minimizes space. Consider any j and the two subpartitions
    P_1 = (i_1, . . . , i_j) and P_2 = (i_{j+1}, . . . , i_p)
    If P is optimal, it follows that P_1 is an optimal partition of columns 1 through i_j and P_2 is an optimal partition of columns i_j + 1 through i_p. (Otherwise it would be possible to improve upon P, which would violate the principle of optimality.) Therefore, the following scheme can be used to compute the cost of the optimal partition P. Let H̃_DP[i,j] be the size after compressing using the best decomposition of columns i through j seen so far. The goal is to compute H̃_DP[1,n] and then to compute P, the optimal partition. By the principle of optimality:
    M[i,j] = MIN_{i ≤ k < j} { H̃_DP[i,k] + H̃_DP[k+1,j] }
    H̃_DP[i,j] = min { M[i,j], H̃_G[i,j] }
    where M is used as a scratch pad. This produces the optimal cross entropy for the training data. The schema with this optimal cross entropy can be calculated by performing the additional step of saving the best partition achieved so far during the execution of the program. This is accomplished by executing the following step to populate Partition:
    if H̃_DP[i,j] = M[i,j] then Partition[i,j] = ARGMIN_{i ≤ k < j} { H̃_DP[i,k] + H̃_DP[k+1,j] }, else Partition[i,j] = j
    where ARGMIN finds the k for which the combined compressed size of the two subintervals of i through j split at k is minimal. The optimal partition can then be recursively recovered from the Partition array.
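  • A compact sketch of this dynamic program follows (not from the patent; the 0-based column indexing, the bound MAXN, and the names H_G, H_DP, and part are illustrative assumptions, and H_G must already hold the compressed sizes of the column intervals):

    #include <stdio.h>

    #define MAXN 64                /* illustrative bound on n */

    long H_G[MAXN][MAXN];   /* input: compressed size of columns i..j   */
    long H_DP[MAXN][MAXN];  /* best compressed size found for i..j      */
    int  part[MAXN][MAXN];  /* split point k, or j if i..j is one class */

    void induce_partition(int n)
    {
        for (int len = 1; len <= n; len++)
            for (int i = 0; i + len - 1 < n; i++) {
                int j = i + len - 1;
                long best = H_G[i][j];       /* compress i..j as one class */
                int bestk = j;
                for (int k = i; k < j; k++)  /* try every split point */
                    if (H_DP[i][k] + H_DP[k + 1][j] < best) {
                        best = H_DP[i][k] + H_DP[k + 1][j];
                        bestk = k;
                    }
                H_DP[i][j] = best;
                part[i][j] = bestk;
            }
    }

    /* Recursively recover the classes of the optimal partition of i..j. */
    void print_partition(int i, int j)
    {
        if (part[i][j] == j)
            printf("class: columns %d-%d\n", i, j);
        else {
            print_partition(i, part[i][j]);
            print_partition(part[i][j] + 1, j);
        }
    }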
  • The first implementation of a partition inducer by the inventors was terribly slow. It was originally assumed that the compression routine gzip would eventually see all of the uncompressed data, only in a different order. The presence of fixed byte values in some sample data opened up other possibilities. Using run length encoding, fixed values trivially compress to a byte value and repetition count. With many fixed columns, they could be moved out of the inner program loop and noticeably improve run times. Experiments were done to determine how run length encoding performs with column values that change at a low rate from record to record. Data samples from a few AT&T systems show that the number of low frequency columns tends to increase with record size (define the “frequency” of a column to be the percentage of record to record value changes for the column with respect to the total number of records sampled: a frequency of 0 means the common value is fixed; a frequency of 100 means the column value changes with each record). This is because many long record schemas are typed, with each type populating a different set of fields. If the typed records cluster in any way, then the number of low frequency columns naturally increases. Run length encoding can be inefficient, though, if there is correlation between two low frequency columns. For example, if two column values always change at the same time, then run length encoding ends up duplicating the run length count. A difference encoding (DIFE) was found by the inventors to do much better when there are correlated columns. DIFE maintains a pattern record of current low frequency column values and emits a count for the number of records the pattern repeats. It then emits a sequence of <column-number+1, byte-value> pairs, terminated by <0>, that modify the pattern record. This is followed by more repeat count and byte-value groups, and continues until a 0 repeat count is emitted after the last record. For example, the DIFE encoding for the 5 byte fixed record data set forth in FIG. 3A is the compressed data set forth in FIG. 3B. Experimental data shows that DIFE preprocessing before gzip uses less space than gzip alone for columns with frequency ≤10%. Although DIFE was formulated to decrease the load on the inner loop, which it did, it also boosted the compression rates for most of the data tested.
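  • A sketch of DIFE-style encoding follows (the byte-level framing here, with single-byte counts and column numbers and an all-zero initial pattern, is an illustrative assumption; the patent's actual layout is shown in FIGS. 3A and 3B):

    #include <stdio.h>
    #include <string.h>

    /* Encode nrec records of reclen low frequency bytes to out. */
    void dife_encode(const unsigned char *data, size_t nrec, size_t reclen,
                     FILE *out)
    {
        unsigned char pat[256] = { 0 };  /* pattern starts at a known default;
                                            assumes reclen <= 256 */
        size_t run = 0;                  /* records matching current pattern */

        for (size_t i = 0; i < nrec; i++) {
            const unsigned char *rec = data + i * reclen;
            if (memcmp(rec, pat, reclen) != 0) {
                fputc((int)run, out);    /* repeat count for the old pattern */
                for (size_t j = 0; j < reclen; j++)
                    if (rec[j] != pat[j]) {
                        fputc((int)(j + 1), out);  /* column number + 1 */
                        fputc(rec[j], out);        /* new byte value    */
                        pat[j] = rec[j];
                    }
                fputc(0, out);           /* <0> terminates the change list */
                run = 0;
            }
            run++;
        }
        fputc((int)run, out);            /* count for the final pattern    */
        fputc(0, out);                   /* empty change list              */
        fputc(0, out);                   /* 0 repeat count ends the stream */
    }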
  • FIG. 4 sets forth an initial compression design of an implementation of the present invention which the inventors have called “PZIP.” PZIP was built on top of the SFIO IO library, as described in D. G. Korn and K. P. Vo, SFIO: Safe/Fast String/File IO, Proc. of Summer '91 USENIX Conf. in Nashville, Tenn., 1991. Sfio provides an interface similar to the UNIX stdio, but also supports user discipline functions that can access IO buffers both before and after read and write operations. A gzip sfio discipline “sfdcgzip” was implemented using the zlib read and write routines. The call
      • sfdcgzip(op, SFGZ_NOCRC);
        pushes the gzip discipline with crc32() disabled on the output stream op, and all data written to op is compressed via zlib. The discipline also intercepts sfio output flush calls (i.e. sfsync(op);) and translates these to zlib full table flush calls. This greatly simplified the pzip compressor coding. Debugging was done by omitting the sfdcgzip() call; the full working version simply enabled the call again. The portion of the compressor code in FIG. 5 illustrates the ease of this approach. A speed increase was achieved by modifying the zlib routines to allow crc32() checking to be disabled, and PZIP was modified to disable it by default. The crc32() routine was found by the inventors to account for over 20% of the run time. It turns out that the checksum is computed on the uncompressed data, so the percentage of time spent in crc32() increases with the compression rate. Since PZIP embeds partition counts throughout its data and has enough information to count to the last byte, crc32() was seen as overkill, especially since the disk and memory hardware on most modern systems already have checksums.
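  • A minimal sketch of the per-window flush such a discipline performs (deflate() and Z_FULL_FLUSH are standard zlib calls; the function name and buffer handling are illustrative assumptions, and the z_stream is assumed to have been initialized with deflateInit()):

    #include <zlib.h>

    /* Compress one window and emit a full flush point, as the sfdcgzip
     * discipline does when it intercepts sfsync(op). Z_FULL_FLUSH resets
     * the compression tables so each window decompresses independently. */
    int write_window(z_stream *zs, unsigned char *win, unsigned len,
                     unsigned char *out, unsigned outcap)
    {
        zs->next_in = win;
        zs->avail_in = len;
        zs->next_out = out;
        zs->avail_out = outcap;
        return deflate(zs, Z_FULL_FLUSH);
    }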
  • The PZIP data layout is illustrated in FIG. 7. The compressor arranges the data for fast decompression: each window contains the high frequency data first, then the DIFE low frequency data. The number of records per window may vary but never exceeds the maximum. This way the decompressor can preallocate all internal buffer space before reading the first window. Migration from gzip to PZIP is simplified by the fact that when the PZIP uncompressor encounters a gzip file without PZIP headers (pzip data is eventually passed to gzip, so PZIP files can identify themselves as gzip compressed data) it simply copies the compressed data with no further interpretation. Thus a project could convert to PZIP compression and still access old gzip data. Old data can be converted from gzip to PZIP during off hours (to regain space) while newly arrived data can be directly PZIPped.
  • As for decompression, most decompression time is spent in the submatrix inner loop. The loop can be sped up by combining the low frequency DIFE decoding with the high frequency partition matrix reassembly. The high frequency submatrices for a single window are laid out end to end in a contiguous buffer buf. The inner loop selects column values from buf to reassemble records and write them to the output stream. Conditionals slow down tight inner loops. Tables computed before the inner loop can eliminate some conditionals. PZIP uses two tables to eliminate all but the loop termination conditional. These are the pointer array mix[i], which points to the first value for column i in buf, and the integer array inc[i], which, when added to mix[i], points to the next value for column i. In the following inner loop code, pat is the current low frequency pattern buffer, the same size as the record size. High frequency columns in the pattern buffer are initialized to a default value. This allows each reassembled output record to be initialized by a single memcpy(), which would be much more efficient than separate memcpy() calls for each contiguous low frequency column range, especially considering that 90% of the columns are usually low frequency. A final data optimization separates the DIFE repetition counts and column offsets from the changed column values so they can be gzip compressed with different tables. The low frequency column values are placed in the buffer val. See FIG. 6. Decompression can be sped up if only a portion of the high frequency columns are required; this happens often when only a few fields are present in a data query. In this case high_freq_cols can be set to the required number of high frequency columns, and mix and inc can be precomputed according to this lower number.
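  • A sketch of that inner loop follows (this is not the patent's FIG. 6 code; the hicols table mapping loop slots to record byte positions, the fixed-size buffer, and all types are illustrative assumptions, while mix, inc, pat, and buf follow the description above):

    #include <stdio.h>
    #include <string.h>

    void reassemble(const unsigned char **mix, const int *inc,
                    const int *hicols, int nhi,
                    const unsigned char *pat, size_t reclen,
                    size_t nrec, FILE *out)
    {
        unsigned char rec[4096];  /* assumes reclen <= 4096 */

        for (size_t r = 0; r < nrec; r++) {
            /* One memcpy seeds the record with the low frequency pattern
             * and the high frequency default values. */
            memcpy(rec, pat, reclen);
            /* Table-driven loop with no per-column conditionals: mix[c]
             * points at the next value for column c in buf, and inc[c]
             * advances it to the value for the following record. */
            for (int c = 0; c < nhi; c++) {
                rec[hicols[c]] = *mix[c];
                mix[c] += inc[c];
            }
            fwrite(rec, reclen, 1, out);
        }
    }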
  • PZIP induces partitions using a separate program which the inventors have named “PIN.” PIN takes as input a 4 Mb window of training data (a sample of the actual data) and the fixed record size n. It implements a dynamic program, as described above, and, using gzip as a compressor to test compression rates, produces an optimal partition and corresponding compression rate for the training data. The partition is written to a file with the following syntax: the first line is the fixed record size; each remaining line describes one class in the induced schema partition. A class description is a space-separated list of the column positions in the class counting from 0, where i-j is shorthand for the column positions i through j inclusive. The partition file is simply a list of record byte position groups that, for the purposes of PZIP, correspond to the byte columns that should be compressed separately. See FIG. 8 for an example of such a partition file.
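  • For illustration only (the actual induced file used by the inventors is shown in FIG. 8), a hypothetical partition file for 752 byte records might read as follows, where the first line is the record size, each later line is one class, and columns not listed are handled implicitly as a final class:

    752
    330 331
    6-15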
  • To handle data of all sizes, PZIP must operate on one section, or window, at a time. Window size affects both compression and decompression performance. At some point the PZIP compressor must perform submatrix transformations to prepare the partitions for gzip, and a similar operation does the inverse for decompression. This means a buffer to buffer copy, with linear access to one and random access to the other. Random buffer access can be expensive on some systems. Most virtual memory computers provide levels of increasingly slower memory access: local cache, ˜10-100 Kb, local memory, ˜1-4 Mb, and shared memory, >4 Mb. The access time difference between levels can be up to an order of magnitude on some systems. Timing experiments show that 4 Mb is a reasonable compromise for the inventors' local machines (SGI, Sun, Alpha, and Intel). There are valid concerns about basing the calculations on such a small, localized amount of data, especially since many systems using PZIP could deal with multiple gigabytes of new data each day. Some form of sampling over a large span of records might be preferable. But localized record correlations are exactly what PZIP exploits. A partition group with the same value from record to record is the best possible situation for PZIP.
  • The implementation of PIN is divided into four steps. The first step reads the training data window, computes the column frequencies, and generates a submatrix from the high frequency columns. The next step determines a permutation of the column positions from which an optimal partition will be determined. An optimal solution to this step is NP-complete, so a suboptimal heuristic search is done. Let T be the m×n matrix (t_ij) representing the training data, with m rows (the number of records) and n columns (the fixed record length), where t_ij is byte j of record i, and |T| = mn. A compressor ζ applied to T, ζ(T), reads the elements of T in row major order and produces a byte vector with |ζ(T)| elements and a compression rate of |T|/|ζ(T)|.
    Let T_{p,q} be the submatrix of T consisting of columns p through q inclusive.
    The search starts by computing |ζ(T_{i,j})| for all high frequency column pairs, where T_{i,j} is the submatrix formed by columns i and j. All columns x for which |ζ(T_{x})| ≤ min(|ζ(T_{x,i})|, |ζ(T_{i,x})|) are placed in singleton partition classes and are not considered further in this step. Notice that both T_{x,i} and T_{i,x} are to be tested. This is because the gzip matching and encoding algorithm is biased by its scan direction. The size difference is not much per pair, but could amount to a byte or so per compressed record when all columns are considered. The remaining pairs are sorted by |ζ(T_{i,j})| from lowest to highest, and a partition is formed from the singletons and the lowest size pairs, possibly splitting a pair to fill out the partition. Next the classes are expanded, starting from the smallest (compressed) size. A class is expanded by selecting in order columns from the remaining classes, and keeping the one that minimizes the sizes of the two new classes (the one expanded and the one reduced by the donated column), if any. A column is added by checking the compressed size of the class for all positions within the class, e.g., |ζ(T_{x,i, . . . , j})| through |ζ(T_{i, . . . , j,x})|. Once a class is expanded it is not considered again, for expansion or for contribution to another class, until all the remaining classes have been considered for expansion. This process continues until there are no more expansions, and produces a column permutation, i.e., the columns in the heuristic partition classes taken in order, for the next step. The final step determines an optimal partition on the permutation produced by the heuristic search using the dynamic program described above: namely, an optimal partition of T for a given permutation has p classes of column ranges (i_k, j_k) that minimize
    Σ_{k=1..p} |ζ(T_{i_k, j_k})|
    subject to:
      i_1 = 1
      i_k ≤ j_k
      i_k = j_{k−1} + 1
      j_p = n
    where the conditions ensure that each column is a member of exactly one class. This optimization can be solved by dynamic programming in O(n^3) time.
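  • As an illustration of how the interval costs |ζ(T_{i,j})| feeding the dynamic program can be obtained (the helper compressed_size() is the illustrative zlib-based routine from the earlier sketch, and the buffer handling is an assumption, not the patent's code):

    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    /* Compressed size of columns i..j (0-based, inclusive) of a window of
     * nrec records, each reclen bytes, read out in row major order. */
    uLong submatrix_cost(const unsigned char *data, size_t nrec,
                         size_t reclen, int i, int j)
    {
        size_t w = (size_t)(j - i + 1);
        unsigned char *buf = malloc(nrec * w);
        for (size_t r = 0; r < nrec; r++)
            memcpy(buf + r * w, data + r * reclen + (size_t)i, w);
        uLong size = compressed_size(buf, nrec * w);
        free(buf);
        return size;
    }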
  • PIN benefits from the removal of low frequency columns since they can be ignored when inducing an optimal partition. For example, a project with 781 byte records had only 81 columns with frequency greater than 10%. This reduces the 4 Mb window PIN run times from 8 hours on an SGI Challenge to under 10 minutes on an Intel Pentium II personal computer. Moreover, an unexpected discovery came out of the timing analysis. The gzip command provides levels of compression ranging from 0 (best time, worst space) to 9 (worst time, best space), the default level being 9. It turns out that the zlib default level is 6. When this was discovered PZIP was immediately run with level 9, and the result was little or no compression improvement, and much worse run time. So data partitioning has the added benefit of being able to run gzip at a lower compression level, saving time at little cost to space.
  • FIG. 8 illustrates the format of an induced schema utilized by the inventors in experiments on compressing telephone call detail records. The records each have a length of 752 bytes. The first line of the schema has been utilized to specify the size of the record and can be used to validate the input data. Each subsequent line, denoted field1, field2, etc., represents and identifies portions of each data record that should be extracted and compressed separately. For example, field 5 indicates that the bytes at positions 331 and 330 out of the 752 bytes in each record should be taken and compressed together separately from the rest of the data file. Approximately 10% of the 752 bytes of each record are set forth as separate fields for compression; the rest are processed as a unit (designated implicitly as field 18). The induced schema, generated by the above process, looks similar to a standard database schema, but there are interesting differences. The induced schema emphasizes what is probable over what is possible. For example, telephone number records are usually 10 digits long, but there are some exceptional international numbers that can consume up to 16 digits. A database schema for such records would usually need to allocate 16 bytes in all cases to accommodate these exceptions, but the induced schema of the present invention tends to split the telephone numbers into two fields, a 10 digit column for the general case, and a 6 digit column for the exceptions.
  • This subtle reshuffling of the data can have a dramatic effect on the results of even a suboptimal compression scheme such as gzip. Experiments conducted by the inventors on telephone call detail have yielded compression ratios of 30:1, in comparison to compression ratios of 14-15:1 when the data is left in row major order (and 16-18:1 when the data is merely transformed into column major order). Significant improvement can be had with other applications, although the worst case for the present invention would be a random table, i.e. a table whose shortest description is itself. In that case, no improvement is possible.
  • One might expect that reordering the data into columns would cost extra time, since the output clear text is in row major order. In fact, the inventors have found that rearranging the data into columns can result in faster compression times. It is believed that the time to rearrange the columns is small compared with the times for encoding/decoding and for disk input/output. Accordingly, the present invention improves both of these bottlenecks, an improvement which can more than offset the time spent rearranging the columns. Notably, if queries posed to the database typically require only a few fields containing high entropy columns, decoding time can be improved further, and dramatically, by decompressing only the columns of interest, as sketched below.
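  A minimal sketch of per-field compression and selective decompression (fields are modeled as (start, end) byte ranges, a simplification of the schema; the names are illustrative):

      import zlib

      def compress_fields(records, fields):
          """Compress each field's byte stream separately: all of field k
          across every record forms one zlib stream."""
          return [zlib.compress(b"".join(r[a:b] for r in records), 6)
                  for (a, b) in fields]

      def decompress_field(blobs, fields, k, nrecords):
          """Decode only field k; the other compressed blobs are untouched."""
          a, b = fields[k]
          width = b - a
          data = zlib.decompress(blobs[k])
          return [data[i * width:(i + 1) * width] for i in range(nrecords)]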
  • Moreover, although the techniques could be utilized for variable length record data, the present invention highlights certain advantages that fixed length record data has over variable length record data. Fixed length record data is often viewed as a waste of space, too sparse for production use; much effort is then put into optimizing the data schema, complicating the data interface in the process. PZIP shows that in many cases this view of fixed length data is wrong. In fact, variable length data may become more compressible when converted to a sparse, fixed length format. Intense semantic schema analysis can be replaced by an automated record partition, resulting in compression space improvements of 2 to 10 times and decompression speed improvements of 2 to 3 times over gzip for a large class of data.
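  A toy illustration of the conversion alluded to above (the fixed width and padding byte are assumptions):

      def to_fixed(records, width, pad=b" "):
          """Pad variable-length byte records into a sparse fixed-length
          layout; the resulting constant-width columns are what the
          partitioning described above can then exploit."""
          if any(len(r) > width for r in records):
              raise ValueError("record longer than the fixed width")
          return [r + pad * (width - len(r)) for r in records]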
  • The foregoing Detailed Description is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention.

Claims (16)

1. A method for improving compression of a stream of data comprising:
transforming the data in accordance with a schema to form a first portion and a second portion; and
separately transforming the first portion and the second portion to form a transformed output that includes the transformed first portion and the transformed second portion.
2. The method of claim 1 wherein the first portion contains low entropy data and said second portion contains high entropy data.
3. The method of claim 1 wherein the data is tabular data, and the transformation step further comprises the step of reordering the data into column major order.
4. The method of claim 3 wherein the transformation step further comprises the step of partitioning the data into columns which are separately compressed.
5. A method for retrieving a stream of data from a stream of compressed data which has been compressed in accordance with claim 1, the method comprising:
decompressing the compressed data to form a first decompressed portion and a second decompressed portion; and
transforming the first decompressed portion and the second decompressed portion in accordance with a provided schema to combine the first decompressed portion and the second decompressed portion and obtain thereby said stream of data.
6. A method for generating a schema for improving compression of a stream of data comprising:
separating a sample of the data into a first portion of low entropy and a second portion of high entropy;
partitioning the second portion into columns; and
identifying combinations of columns that minimize the compressed size of the sample.
7. An apparatus for improved compression of a stream of data comprising:
means for transforming the data in accordance with a schema to form a first portion and a second portion; and
means for compressing the first portion separately and for separately compressing the second portion.
8. The apparatus of claim 7 wherein the transforming means causes said first portion to contain low entropy data and said second portion to contain high entropy data.
9. The apparatus of claim 7 wherein the transforming means comprises means for reordering the data into column major order.
10. The apparatus of claim 9 wherein the transforming means comprises means for partitioning the data into columns which are separately compressed.
11. The method of claim 1 further comprising a step of receiving said schema.
12. The method of claim 1 further comprising a step of developing said schema from said data.
13. The method of claim 2 further comprising a step of developing said schema from said data by:
partitioning a subset of the data of the second portion into columns; and
identifying combinations of columns that minimize the compressed size of the subset.
14. The method of claim 1 wherein the transformation step comprises a step of partitioning the data into a first portion and second portion based on entropy of the data.
15. The method of claim 1 where the schema pertains to entropy of the data.
16. The method of claim 1 where the schema informs how to reorder the data to enable partitioning of the data into a low entropy portion and a high entropy portion.
US11/110,554 1998-12-10 2005-04-20 Data compression method and apparatus Expired - Fee Related US7720878B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/110,554 US7720878B2 (en) 1998-12-10 2005-04-20 Data compression method and apparatus

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US11178198P 1998-12-10 1998-12-10
US09/383,889 US6959300B1 (en) 1998-12-10 1999-08-26 Data compression method and apparatus
US11/110,554 US7720878B2 (en) 1998-12-10 2005-04-20 Data compression method and apparatus

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/383,889 Continuation US6959300B1 (en) 1998-12-10 1999-08-26 Data compression method and apparatus

Publications (2)

Publication Number Publication Date
US20050192994A1 true US20050192994A1 (en) 2005-09-01
US7720878B2 US7720878B2 (en) 2010-05-18

Family

ID=34890129

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/383,889 Expired - Lifetime US6959300B1 (en) 1998-12-10 1999-08-26 Data compression method and apparatus
US11/110,554 Expired - Fee Related US7720878B2 (en) 1998-12-10 2005-04-20 Data compression method and apparatus

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US09/383,889 Expired - Lifetime US6959300B1 (en) 1998-12-10 1999-08-26 Data compression method and apparatus

Country Status (1)

Country Link
US (2) US6959300B1 (en)

Cited By (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050114369A1 (en) * 2003-09-15 2005-05-26 Joel Gould Data profiling
US20070294268A1 (en) * 2006-06-16 2007-12-20 Business Objects, S.A. Apparatus and method for processing data corresponding to multiple cobol data record schemas
US20070294677A1 (en) * 2006-06-16 2007-12-20 Business Objects, S.A. Apparatus and method for processing cobol data record schemas having disparate formats
US20090204630A1 (en) * 2008-02-13 2009-08-13 Yung-Hsiao Lai Digital video apparatus and related method for generating index information
US20090249023A1 (en) * 2008-03-28 2009-10-01 International Business Machines Corporation Applying various hash methods used in conjunction with a query with a group by clause
US20090254521A1 (en) * 2008-04-04 2009-10-08 International Business Machines Corporation Frequency partitioning: entropy compression with fixed size fields
US20100042587A1 (en) * 2008-08-15 2010-02-18 International Business Machines Corporation Method for Laying Out Fields in a Database in a Hybrid of Row-Wise and Column-Wise Ordering
US20100274773A1 (en) * 2009-04-27 2010-10-28 Dnyaneshwar Pawar Nearstore compression of data in a storage system
US20110153373A1 (en) * 2009-12-22 2011-06-23 International Business Machines Corporation Two-layer data architecture for reservation management systems
US20110173167A1 (en) * 2003-07-17 2011-07-14 Binh Dao Vo Method and apparatus for windowing in entropy encoding
US8046496B1 (en) * 2007-12-12 2011-10-25 Narus, Inc. System and method for network data compression
US20130024432A1 (en) * 2011-07-20 2013-01-24 Symantec Corporation Method and system for storing data in compliance with a compression handling instruction
US8370326B2 (en) 2009-03-24 2013-02-05 International Business Machines Corporation System and method for parallel computation of frequency histograms on joined tables
US8442988B2 (en) 2010-11-04 2013-05-14 International Business Machines Corporation Adaptive cell-specific dictionaries for frequency-partitioned multi-dimensional data
US20150100556A1 (en) * 2012-05-25 2015-04-09 Clarion Co., Ltd. Data Compression/Decompression Device
US20150149739A1 (en) * 2013-11-25 2015-05-28 Research & Business Foundation Sungkyunkwan University Method of storing data in distributed manner based on technique of predicting data compression ratio, and storage device and system using same
WO2015084760A1 (en) * 2013-12-02 2015-06-11 Qbase, LLC Design and implementation of clustered in-memory database
US20150207742A1 (en) * 2014-01-22 2015-07-23 Wipro Limited Methods for optimizing data for transmission and devices thereof
WO2015137979A1 (en) * 2014-03-14 2015-09-17 Hewlett-Packard Development Company, Lp Column store database compression
US9177254B2 (en) 2013-12-02 2015-11-03 Qbase, LLC Event detection through text analysis using trained event template models
US9177262B2 (en) 2013-12-02 2015-11-03 Qbase, LLC Method of automated discovery of new topics
US9201744B2 (en) 2013-12-02 2015-12-01 Qbase, LLC Fault tolerant architecture for distributed computing systems
US20150347087A1 (en) * 2014-05-27 2015-12-03 International Business Machines Corporation Reordering of database records for improved compression
US9208204B2 (en) 2013-12-02 2015-12-08 Qbase, LLC Search suggestions using fuzzy-score matching and entity co-occurrence
US9223875B2 (en) 2013-12-02 2015-12-29 Qbase, LLC Real-time distributed in memory search architecture
US9223833B2 (en) 2013-12-02 2015-12-29 Qbase, LLC Method for in-loop human validation of disambiguated features
US9230041B2 (en) 2013-12-02 2016-01-05 Qbase, LLC Search suggestions of related entities based on co-occurrence and/or fuzzy-score matching
US9239875B2 (en) 2013-12-02 2016-01-19 Qbase, LLC Method for disambiguated features in unstructured text
US9317565B2 (en) 2013-12-02 2016-04-19 Qbase, LLC Alerting system based on newly disambiguated features
US9323748B2 (en) 2012-10-22 2016-04-26 Ab Initio Technology Llc Profiling data with location information
US9336280B2 (en) 2013-12-02 2016-05-10 Qbase, LLC Method for entity-driven alerts based on disambiguated features
US9348573B2 (en) 2013-12-02 2016-05-24 Qbase, LLC Installation and fault handling in a distributed system utilizing supervisor and dependency manager nodes
WO2016078379A1 (en) * 2014-11-17 2016-05-26 华为技术有限公司 Method and device for compressing stream data
US9355152B2 (en) 2013-12-02 2016-05-31 Qbase, LLC Non-exclusionary search within in-memory databases
US9361317B2 (en) 2014-03-04 2016-06-07 Qbase, LLC Method for entity enrichment of digital content to enable advanced search functionality in content management systems
US9424524B2 (en) 2013-12-02 2016-08-23 Qbase, LLC Extracting facts from unstructured text
US9424294B2 (en) 2013-12-02 2016-08-23 Qbase, LLC Method for facet searching and search suggestions
US9449057B2 (en) 2011-01-28 2016-09-20 Ab Initio Technology Llc Generating data pattern information
US9544361B2 (en) 2013-12-02 2017-01-10 Qbase, LLC Event detection through text analysis using dynamic self evolving/learning module
US9542477B2 (en) 2013-12-02 2017-01-10 Qbase, LLC Method of automated discovery of topics relatedness
US9547701B2 (en) 2013-12-02 2017-01-17 Qbase, LLC Method of discovering and exploring feature knowledge
US9619571B2 (en) 2013-12-02 2017-04-11 Qbase, LLC Method for searching related entities through entity co-occurrence
US9659108B2 (en) 2013-12-02 2017-05-23 Qbase, LLC Pluggable architecture for embedding analytics in clustered in-memory databases
CN106934066A (en) * 2017-03-31 2017-07-07 联想(北京)有限公司 A kind of metadata processing method, device and storage device
US9710517B2 (en) 2013-12-02 2017-07-18 Qbase, LLC Data record compression with progressive and/or selective decomposition
US20170374140A1 (en) * 2015-02-09 2017-12-28 Samsung Electronics Co., Ltd. Method and apparatus for transmitting and receiving information between servers in contents transmission network system
US9892026B2 (en) 2013-02-01 2018-02-13 Ab Initio Technology Llc Data records selection
US9922032B2 (en) 2013-12-02 2018-03-20 Qbase, LLC Featured co-occurrence knowledge base from a corpus of documents
US9971798B2 (en) 2014-03-07 2018-05-15 Ab Initio Technology Llc Managing data profiling operations related to data type
US9984427B2 (en) 2013-12-02 2018-05-29 Qbase, LLC Data ingestion module for event detection and increased situational awareness
CN111222624A (en) * 2018-11-26 2020-06-02 深圳云天励飞技术有限公司 Parallel computing method and device
CN111858391A (en) * 2020-06-16 2020-10-30 中国人民解放军空军研究院航空兵研究所 Method for optimizing compressed storage format in data processing process
US11068540B2 (en) 2018-01-25 2021-07-20 Ab Initio Technology Llc Techniques for integrating validation results in data profiling and related systems and methods
US11227334B2 (en) * 2012-03-14 2022-01-18 Nasdaq Technology Ab Method and system for facilitating access to recorded data
US11487732B2 (en) 2014-01-16 2022-11-01 Ab Initio Technology Llc Database key identification

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6959300B1 (en) * 1998-12-10 2005-10-25 At&T Corp. Data compression method and apparatus
US6741983B1 (en) * 1999-09-28 2004-05-25 John D. Birdwell Method of indexed storage and retrieval of multidimensional information
US8959582B2 (en) 2000-03-09 2015-02-17 Pkware, Inc. System and method for manipulating and managing computer archive files
US6879988B2 (en) 2000-03-09 2005-04-12 Pkware System and method for manipulating and managing computer archive files
US20050015608A1 (en) 2003-07-16 2005-01-20 Pkware, Inc. Method for strongly encrypting .ZIP files
US6785868B1 (en) * 2000-05-31 2004-08-31 Palm Source, Inc. Method and apparatus for managing calendar information from a shared database and managing calendar information from multiple users
CN1290027C (en) * 2001-08-27 2006-12-13 皇家飞利浦电子股份有限公司 Cache method
ATE377897T1 (en) * 2003-02-14 2007-11-15 Research In Motion Ltd SYSTEM AND METHOD FOR COMPACT MESSAGING IN NETWORK COMMUNICATIONS
US7100005B2 (en) * 2003-06-17 2006-08-29 Agilent Technologies, Inc. Record storage and retrieval solution
US20050091279A1 (en) * 2003-09-29 2005-04-28 Rising Hawley K.Iii Use of transform technology in construction of semantic descriptions
US8886617B2 (en) 2004-02-20 2014-11-11 Informatica Corporation Query-based searching using a virtual table
US7243110B2 (en) * 2004-02-20 2007-07-10 Sand Technology Inc. Searchable archive
US7668209B2 (en) * 2005-10-05 2010-02-23 Lg Electronics Inc. Method of processing traffic information and digital broadcast system
US7984477B2 (en) * 2007-03-16 2011-07-19 At&T Intellectual Property I, L.P. Real-time video compression
US7987161B2 (en) * 2007-08-23 2011-07-26 Thomson Reuters (Markets) Llc System and method for data compression using compression hardware
US8813041B2 (en) * 2008-02-14 2014-08-19 Yahoo! Inc. Efficient compression of applications
US8149147B2 (en) 2008-12-30 2012-04-03 Microsoft Corporation Detecting and reordering fixed-length records to facilitate compression
US8356060B2 (en) * 2009-04-30 2013-01-15 Oracle International Corporation Compression analyzer
US9667269B2 (en) 2009-04-30 2017-05-30 Oracle International Corporation Technique for compressing XML indexes
US8583692B2 (en) * 2009-04-30 2013-11-12 Oracle International Corporation DDL and DML support for hybrid columnar compressed tables
US8935223B2 (en) * 2009-04-30 2015-01-13 Oracle International Corporation Structure of hierarchical compressed data structure for tabular data
US8396960B2 (en) * 2009-05-08 2013-03-12 Canon Kabushiki Kaisha Efficient network utilization using multiple physical interfaces
US8880716B2 (en) * 2009-05-08 2014-11-04 Canon Kabushiki Kaisha Network streaming of a single data stream simultaneously over multiple physical interfaces
US8325601B2 (en) * 2009-05-08 2012-12-04 Canon Kabushiki Kaisha Reliable network streaming of a single data stream over multiple physical interfaces
US8296517B2 (en) 2009-08-19 2012-10-23 Oracle International Corporation Database operation-aware striping technique
US8832142B2 (en) 2010-08-30 2014-09-09 Oracle International Corporation Query and exadata support for hybrid columnar compressed data
US8356109B2 (en) 2010-05-13 2013-01-15 Canon Kabushiki Kaisha Network streaming of a video stream over multiple communication channels
US8488894B2 (en) 2010-11-12 2013-07-16 Dynacomware Taiwan Inc. Method and system for dot-matrix font data compression and decompression
US8572042B2 (en) * 2011-01-25 2013-10-29 Andrew LEPPARD Manipulating the actual or effective window size in a data-dependant variable-length sub-block parser
DE102012211031B3 (en) * 2012-06-27 2013-11-28 Siemens Aktiengesellschaft Method for coding a data stream
US9495466B2 (en) 2013-11-27 2016-11-15 Oracle International Corporation LIDAR model with hybrid-columnar format and no indexes for spatial searches
US9990308B2 (en) 2015-08-31 2018-06-05 Oracle International Corporation Selective data compression for in-memory databases
EP3776432A4 (en) 2018-04-02 2021-12-08 The Nielsen Company (US), LLC Processor systems to estimate audience sizes and impression counts for different frequency intervals
US11609889B1 (en) 2021-09-17 2023-03-21 International Business Machines Corporation Reordering datasets in a table for increased compression ratio

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5678043A (en) * 1994-09-23 1997-10-14 The Regents Of The University Of Michigan Data compression and encryption system and method representing records as differences between sorted domain ordinals that represent field values
US6216125B1 (en) * 1998-07-02 2001-04-10 At&T Corp. Coarse indexes for a data warehouse

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5448733A (en) * 1993-07-16 1995-09-05 International Business Machines Corp. Data search and compression device and method for searching and compressing repeating data
US6012062A (en) * 1996-03-04 2000-01-04 Lucent Technologies Inc. System for compression and buffering of a data stream with data extraction requirements
US6216213B1 (en) * 1996-06-07 2001-04-10 Motorola, Inc. Method and apparatus for compression, decompression, and execution of program code
US6092071A (en) * 1997-11-04 2000-07-18 International Business Machines Corporation Dedicated input/output processor method and apparatus for access and storage of compressed data
US6014671A (en) * 1998-04-14 2000-01-11 International Business Machines Corporation Interactive retrieval and caching of multi-dimensional data using view elements
US6959300B1 (en) * 1998-12-10 2005-10-25 At&T Corp. Data compression method and apparatus

Cited By (93)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110173167A1 (en) * 2003-07-17 2011-07-14 Binh Dao Vo Method and apparatus for windowing in entropy encoding
US8200680B2 (en) * 2003-07-17 2012-06-12 At&T Intellectual Property Ii, L.P. Method and apparatus for windowing in entropy encoding
US9323802B2 (en) 2003-09-15 2016-04-26 Ab Initio Technology, Llc Data profiling
US8868580B2 (en) * 2003-09-15 2014-10-21 Ab Initio Technology Llc Data profiling
US20050114369A1 (en) * 2003-09-15 2005-05-26 Joel Gould Data profiling
US20070294268A1 (en) * 2006-06-16 2007-12-20 Business Objects, S.A. Apparatus and method for processing data corresponding to multiple cobol data record schemas
US8656374B2 (en) 2006-06-16 2014-02-18 Business Objects Software Ltd. Processing cobol data record schemas having disparate formats
US7640261B2 (en) * 2006-06-16 2009-12-29 Business Objects Software Ltd. Apparatus and method for processing data corresponding to multiple COBOL data record schemas
US20070294677A1 (en) * 2006-06-16 2007-12-20 Business Objects, S.A. Apparatus and method for processing cobol data record schemas having disparate formats
US8516157B1 (en) * 2007-12-12 2013-08-20 Narus, Inc. System and method for network data compression
US8046496B1 (en) * 2007-12-12 2011-10-25 Narus, Inc. System and method for network data compression
US20090204630A1 (en) * 2008-02-13 2009-08-13 Yung-Hsiao Lai Digital video apparatus and related method for generating index information
US8108401B2 (en) 2008-03-28 2012-01-31 International Business Machines Corporation Applying various hash methods used in conjunction with a query with a group by clause
US20090249023A1 (en) * 2008-03-28 2009-10-01 International Business Machines Corporation Applying various hash methods used in conjunction with a query with a group by clause
US20090254521A1 (en) * 2008-04-04 2009-10-08 International Business Machines Corporation Frequency partitioning: entropy compression with fixed size fields
US7827187B2 (en) 2008-04-04 2010-11-02 International Business Machines Corporation Frequency partitioning: entropy compression with fixed size fields
US8099440B2 (en) 2008-08-15 2012-01-17 International Business Machines Corporation Method for laying out fields in a database in a hybrid of row-wise and column-wise ordering
US20100042587A1 (en) * 2008-08-15 2010-02-18 International Business Machines Corporation Method for Laying Out Fields in a Database in a Hybrid of Row-Wise and Column-Wise Ordering
US8370326B2 (en) 2009-03-24 2013-02-05 International Business Machines Corporation System and method for parallel computation of frequency histograms on joined tables
US9319489B2 (en) 2009-04-27 2016-04-19 Netapp, Inc. Nearstore compression of data in a storage system
US20100274773A1 (en) * 2009-04-27 2010-10-28 Dnyaneshwar Pawar Nearstore compression of data in a storage system
US8554745B2 (en) * 2009-04-27 2013-10-08 Netapp, Inc. Nearstore compression of data in a storage system
US20110153373A1 (en) * 2009-12-22 2011-06-23 International Business Machines Corporation Two-layer data architecture for reservation management systems
US8805711B2 (en) * 2009-12-22 2014-08-12 International Business Machines Corporation Two-layer data architecture for reservation management systems
US8442988B2 (en) 2010-11-04 2013-05-14 International Business Machines Corporation Adaptive cell-specific dictionaries for frequency-partitioned multi-dimensional data
US9449057B2 (en) 2011-01-28 2016-09-20 Ab Initio Technology Llc Generating data pattern information
US9652513B2 (en) 2011-01-28 2017-05-16 Ab Initio Technology, Llc Generating data pattern information
US9766812B2 (en) * 2011-07-20 2017-09-19 Veritas Technologies Llc Method and system for storing data in compliance with a compression handling instruction
US20130024432A1 (en) * 2011-07-20 2013-01-24 Symantec Corporation Method and system for storing data in compliance with a compression handling instruction
US11227334B2 (en) * 2012-03-14 2022-01-18 Nasdaq Technology Ab Method and system for facilitating access to recorded data
US20220114669A1 (en) * 2012-03-14 2022-04-14 Nasdaq Technology Ab Method and system for facilitating access to recorded data
US11699285B2 (en) * 2012-03-14 2023-07-11 Nasdaq Technology Ab Method and system for facilitating access to recorded data
US10116325B2 (en) * 2012-05-25 2018-10-30 Clarion Co., Ltd. Data compression/decompression device
US20150100556A1 (en) * 2012-05-25 2015-04-09 Clarion Co., Ltd. Data Compression/Decompression Device
US9323748B2 (en) 2012-10-22 2016-04-26 Ab Initio Technology Llc Profiling data with location information
US9990362B2 (en) 2012-10-22 2018-06-05 Ab Initio Technology Llc Profiling data with location information
US9569434B2 (en) 2012-10-22 2017-02-14 Ab Initio Technology Llc Profiling data with source tracking
US9323749B2 (en) 2012-10-22 2016-04-26 Ab Initio Technology Llc Profiling data with location information
US9892026B2 (en) 2013-02-01 2018-02-13 Ab Initio Technology Llc Data records selection
US11163670B2 (en) 2013-02-01 2021-11-02 Ab Initio Technology Llc Data records selection
US10241900B2 (en) 2013-02-01 2019-03-26 Ab Initio Technology Llc Data records selection
US9606750B2 (en) * 2013-11-25 2017-03-28 Research And Business Foundation Sungkyunkwan University Method of storing data in distributed manner based on technique of predicting data compression ratio, and storage device and system using same
US20150149739A1 (en) * 2013-11-25 2015-05-28 Research & Business Foundation Sungkyunkwan University Method of storing data in distributed manner based on technique of predicting data compression ratio, and storage device and system using same
US9355152B2 (en) 2013-12-02 2016-05-31 Qbase, LLC Non-exclusionary search within in-memory databases
US9208204B2 (en) 2013-12-02 2015-12-08 Qbase, LLC Search suggestions using fuzzy-score matching and entity co-occurrence
US9348573B2 (en) 2013-12-02 2016-05-24 Qbase, LLC Installation and fault handling in a distributed system utilizing supervisor and dependency manager nodes
WO2015084760A1 (en) * 2013-12-02 2015-06-11 Qbase, LLC Design and implementation of clustered in-memory database
US9317565B2 (en) 2013-12-02 2016-04-19 Qbase, LLC Alerting system based on newly disambiguated features
US9177254B2 (en) 2013-12-02 2015-11-03 Qbase, LLC Event detection through text analysis using trained event template models
US9177262B2 (en) 2013-12-02 2015-11-03 Qbase, LLC Method of automated discovery of new topics
US9424524B2 (en) 2013-12-02 2016-08-23 Qbase, LLC Extracting facts from unstructured text
US9424294B2 (en) 2013-12-02 2016-08-23 Qbase, LLC Method for facet searching and search suggestions
US9430547B2 (en) 2013-12-02 2016-08-30 Qbase, LLC Implementation of clustered in-memory database
US9239875B2 (en) 2013-12-02 2016-01-19 Qbase, LLC Method for disambiguated features in unstructured text
US9507834B2 (en) 2013-12-02 2016-11-29 Qbase, LLC Search suggestions using fuzzy-score matching and entity co-occurrence
US9544361B2 (en) 2013-12-02 2017-01-10 Qbase, LLC Event detection through text analysis using dynamic self evolving/learning module
US9542477B2 (en) 2013-12-02 2017-01-10 Qbase, LLC Method of automated discovery of topics relatedness
US9547701B2 (en) 2013-12-02 2017-01-17 Qbase, LLC Method of discovering and exploring feature knowledge
US9230041B2 (en) 2013-12-02 2016-01-05 Qbase, LLC Search suggestions of related entities based on co-occurrence and/or fuzzy-score matching
US9223833B2 (en) 2013-12-02 2015-12-29 Qbase, LLC Method for in-loop human validation of disambiguated features
US9613166B2 (en) 2013-12-02 2017-04-04 Qbase, LLC Search suggestions of related entities based on co-occurrence and/or fuzzy-score matching
US9619571B2 (en) 2013-12-02 2017-04-11 Qbase, LLC Method for searching related entities through entity co-occurrence
US9626623B2 (en) 2013-12-02 2017-04-18 Qbase, LLC Method of automated discovery of new topics
US9223875B2 (en) 2013-12-02 2015-12-29 Qbase, LLC Real-time distributed in memory search architecture
US9659108B2 (en) 2013-12-02 2017-05-23 Qbase, LLC Pluggable architecture for embedding analytics in clustered in-memory databases
US9201744B2 (en) 2013-12-02 2015-12-01 Qbase, LLC Fault tolerant architecture for distributed computing systems
US9710517B2 (en) 2013-12-02 2017-07-18 Qbase, LLC Data record compression with progressive and/or selective decomposition
US9720944B2 (en) 2013-12-02 2017-08-01 Qbase Llc Method for facet searching and search suggestions
US9336280B2 (en) 2013-12-02 2016-05-10 Qbase, LLC Method for entity-driven alerts based on disambiguated features
US9984427B2 (en) 2013-12-02 2018-05-29 Qbase, LLC Data ingestion module for event detection and increased situational awareness
US9785521B2 (en) 2013-12-02 2017-10-10 Qbase, LLC Fault tolerant architecture for distributed computing systems
US9922032B2 (en) 2013-12-02 2018-03-20 Qbase, LLC Featured co-occurrence knowledge base from a corpus of documents
US9916368B2 (en) 2013-12-02 2018-03-13 QBase, Inc. Non-exclusionary search within in-memory databases
US9910723B2 (en) 2013-12-02 2018-03-06 Qbase, LLC Event detection through text analysis using dynamic self evolving/learning module
US11487732B2 (en) 2014-01-16 2022-11-01 Ab Initio Technology Llc Database key identification
US20150207742A1 (en) * 2014-01-22 2015-07-23 Wipro Limited Methods for optimizing data for transmission and devices thereof
US9361317B2 (en) 2014-03-04 2016-06-07 Qbase, LLC Method for entity enrichment of digital content to enable advanced search functionality in content management systems
US9971798B2 (en) 2014-03-07 2018-05-15 Ab Initio Technology Llc Managing data profiling operations related to data type
WO2015137979A1 (en) * 2014-03-14 2015-09-17 Hewlett-Packard Development Company, Lp Column store database compression
US20150347426A1 (en) * 2014-05-27 2015-12-03 International Business Machines Corporation Reordering of database records for improved compression
US9798727B2 (en) * 2014-05-27 2017-10-24 International Business Machines Corporation Reordering of database records for improved compression
US20150347087A1 (en) * 2014-05-27 2015-12-03 International Business Machines Corporation Reordering of database records for improved compression
US9910855B2 (en) * 2014-05-27 2018-03-06 International Business Machines Corporation Reordering of database records for improved compression
US10218381B2 (en) 2014-11-17 2019-02-26 Huawei Technologies Co., Ltd. Method and device for compressing flow data
CN105680868A (en) * 2014-11-17 2016-06-15 华为技术有限公司 Method and equipment for compressing streaming data
US9768801B1 (en) 2014-11-17 2017-09-19 Huawei Technologies Co., Ltd. Method and device for compressing flow data
WO2016078379A1 (en) * 2014-11-17 2016-05-26 华为技术有限公司 Method and device for compressing stream data
US10560515B2 (en) * 2015-02-09 2020-02-11 Samsung Electronics Co., Ltd. Method and apparatus for transmitting and receiving information between servers in contents transmission network system
US20170374140A1 (en) * 2015-02-09 2017-12-28 Samsung Electronics Co., Ltd. Method and apparatus for transmitting and receiving information between servers in contents transmission network system
CN106934066A (en) * 2017-03-31 2017-07-07 联想(北京)有限公司 A kind of metadata processing method, device and storage device
US11068540B2 (en) 2018-01-25 2021-07-20 Ab Initio Technology Llc Techniques for integrating validation results in data profiling and related systems and methods
CN111222624A (en) * 2018-11-26 2020-06-02 深圳云天励飞技术有限公司 Parallel computing method and device
CN111858391A (en) * 2020-06-16 2020-10-30 中国人民解放军空军研究院航空兵研究所 Method for optimizing compressed storage format in data processing process

Also Published As

Publication number Publication date
US7720878B2 (en) 2010-05-18
US6959300B1 (en) 2005-10-25

Similar Documents

Publication Publication Date Title
US7720878B2 (en) Data compression method and apparatus
US7827187B2 (en) Frequency partitioning: entropy compression with fixed size fields
Pibiri et al. Techniques for inverted index compression
US4814746A (en) Data compression method
US6597812B1 (en) System and method for lossless data compression and decompression
US7103608B1 (en) Method and mechanism for storing and accessing data
US8120516B2 (en) Data compression using a stream selector with edit-in-place capability for compressed data
Crochemore et al. A subquadratic sequence alignment algorithm for unrestricted scoring matrices
US5561421A (en) Access method data compression with system-built generic dictionaries
US8077059B2 (en) Database adapter for relational datasets
US20020152219A1 (en) Data interexchange protocol
US8933829B2 (en) Data compression using dictionary encoding
US20010051941A1 (en) Searching method of block sorting lossless compressed data, and encoding method suitable for searching data in block sorting lossless compressed data
EP0127815A2 (en) Data compression method
US5815096A (en) Method for compressing sequential data into compression symbols using double-indirect indexing into a dictionary data structure
US8660187B2 (en) Method for treating digital data
US5394143A (en) Run-length compression of index keys
KR20080026772A (en) Method for a compression compensating restoration rate of a lempel-ziv compression method
JPH10261969A (en) Data compression method and its device
JPH0546357A (en) Compressing method and restoring method for text data
JPH05241776A (en) Data compression system
Zhang Transform based and search aware text compression schemes and compressed domain text retrieval
JPH0628149A (en) Method for compressing plural kinds of data
CN111488439B (en) System and method for saving and analyzing log data
JPH05152971A (en) Data compressing/restoring method

Legal Events

Date Code Title Description
STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552)

Year of fee payment: 8

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20220518