US20120158709A1 - Methods and Apparatus for Incrementally Computing Similarity of Data Sources - Google Patents
- Publication number
- US20120158709A1
- Authority
- US
- United States
- Prior art keywords
- data
- block
- blocks
- dataset
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/214—Database migration support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
- G06F16/1748—De-duplication implemented within the file system, e.g. based on file segments
Definitions
- At least one embodiment of the present invention pertains to determining data similarity, and more particularly, to methods and apparatus for incremental determination of a similarity value based on a subset of frequency-weighted blocks of a dataset.
- de-duplication has attempted to address some of the burden of managing large amounts of data by eliminating redundant data to improve storage utilization.
- In the de-duplication process, duplicate data on a logical storage device is deleted, leaving only one copy of the data, along with references to that one copy of the data.
- De-duplication is able to reduce the required storage capacity since only the unique data is stored. Each subsequent instance of duplicated data is simply referenced back to the one saved copy.
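A minimal sketch of the de-duplication idea described above, assuming that duplicate blocks are detected by comparing SHA-256 digests. The function names `deduplicate` and `restore` are illustrative and do not come from the source.

```python
import hashlib


def deduplicate(blocks):
    """Keep one saved copy of each distinct block plus a list of references."""
    store = {}        # digest -> the single saved copy of that block
    references = []   # one digest per original block, in original order
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # only the first copy is stored
        references.append(digest)         # later copies become references
    return store, references


def restore(store, references):
    """Rebuild the original block sequence from the references."""
    return [store[digest] for digest in references]


blocks = [b"alpha", b"beta", b"alpha", b"alpha"]
store, references = deduplicate(blocks)
```

Here four blocks are reduced to two stored copies, and `restore(store, references)` reproduces the original sequence.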
- VM Virtual Machine
- a VM is normally represented as a set of files, including one or more configuration files and one or more disk image files.
- a configuration file stores configuration (settings) of the virtual machine.
- a virtual machine disk image file represents the operating system and data contained within the virtual machine and itself typically includes numerous individual files.
- a single VM disk image file can exceed several gigabytes of storage space, and a single logical storage device can contain numerous VMs, up to the capacity of the storage device.
- this technique is time consuming and processor intensive, especially for larger files because every block of the file is processed.
- A limitation of the Broder technique of computing file similarity is its inability to efficiently re-compute the similarity of two files previously compared.
- Re-computing files' similarity is appropriate after data blocks on one or both of the files change. A change can occur after an existing block is removed or modified, or a new block is created in the file.
- Previous techniques lacked the ability to incrementally adjust the previously computed data similarity without re-computing the Broder equation, which involves at least resorting all of the data blocks of the files and introduces the problems associated with the brute force method.
- the similarity is determined based on comparing a subset of sorted frequency-weighted blocks from one dataset to a subset of sorted frequency-weighted blocks from another dataset.
- data blocks of a dataset are used to compute unique, frequency-weighted hash values.
- the frequency-weight of a particular hash value is based on a summation of other hash values of the dataset equaling the particular hash value.
- the similarity value is re-determined without resorting or hashing the blocks of a dataset other than the blocks of the subset, resulting in an increased performance of the similarity comparison.
- blocks of a dataset are excluded based on a block-filtering rule to increase the accuracy of the similarity comparison.
- the solution presented herein overcomes the time-consuming computation of performing a baseline similarity comparison when re-determining (updating) a similarity between two datasets by incrementally updating only a portion of the total number of blocks of a dataset.
- the technique introduced herein also overcomes the problem of poor accuracy of the similarity comparison result by filtering undesirable data blocks from the comparison using block-filtering rules and by using block-frequencies to increase the accuracy of the similarity comparison.
- FIG. 1 a illustrates a network storage environment in which the present invention can be implemented.
- FIG. 1 b illustrates a virtual machine represented as a configuration file and a data image file.
- FIG. 2 is a high-level block diagram showing an example of the hardware architecture of a computer that can perform a similarity comparison.
- FIG. 3 is a low-level block diagram showing example modules of a processor to implement various functions of the present invention.
- FIG. 4 a illustrates an example of various steps of a similarity comparison of data blocks from two different files.
- FIG. 4 b illustrates an example similarity determination based on frequency-weighted data blocks from two different files.
- FIG. 4 c illustrates an example of re-determining the similarity value based on creating a new data block, updating of an existing block, or removing a data block within a subset of sorted frequency-weighted data blocks of a file.
- FIG. 5 is a flow diagram illustrating a process for identifying a least similar virtual machine based on sorted frequency-weighted data blocks of multiple virtual machines and migrating the virtual machine to a server.
- FIG. 1 a shows a network configuration in which the techniques introduced here can be implemented. It is noted that the network environment described here is for illustration of one type of a configuration in which the techniques can be implemented, and that other network storage configurations and schemes can be used for implementing the techniques introduced herein.
- FIG. 1 a shows a network data storage environment 100 , which includes a server system 118 , and a data warehouse 102 containing a first data store 104 and an optional second data store 110 .
- Each data store contains files 108 / 114 that are accessible, over a switching fabric 116 , to the server 118 .
- a file contains data that may be stored at the block level at a data store 104 / 110 .
- a block is a sequence of bytes or bits, having a nominal length (a block size). Data thus structured are said to be blocked. Blocked data are typically read a whole block at a time.
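As a small illustration of blocked data, the sketch below splits a byte string into blocks of a nominal size. The 4096-byte default and the name `split_into_blocks` are assumptions made for the example; the source does not fix a block size.

```python
def split_into_blocks(data: bytes, block_size: int = 4096):
    """Split data into consecutive blocks of block_size bytes.

    The final block may be shorter when len(data) is not a multiple of
    block_size.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


example_blocks = split_into_blocks(b"abcdefghij", block_size=4)
```

With a block size of 4, the ten input bytes yield two full blocks and one short trailing block.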
- the switching fabric 116 connects together server 118 and data stores 104 / 110 .
- the server 118 is connected, via a network 120 , to a client 122 .
- the first and/or second data stores 104 / 110 can optionally be located via network 120 , as illustrated by data store 124 .
- the environment 100 can be utilized to perform aspects of the invention.
- the environment 100 is used to identify a least similar file (or dataset) of the files 108 to free available space on the first data store 104 , for example.
- the least similar file (or dataset) is identified on the first data store, because removing that least similar file will provide the most free-space on the first data store due to data de-duplication or other storage techniques.
- the least similar file is migrated to a second data store 110 having files most similar to the least similar file.
- the server 118 may be, for example, a standard computing system such as a personal computer (PC) or server-class computer, equipped with an operating system. Alternatively, the server 118 can be one of the FAS family of storage server products available from NetApp®, Inc of Sunnyvale, Calif. The server 118 may perform various functions and management operations on the files 108 , 114 , and 128 , such as computing a similarity comparison and performing data migrations of the files between data stores 104 , 110 , and 124 .
- PC personal computer
- the switching fabric 116 connects the server 118 to the data stores 104 / 110 of the data warehouse 102 .
- the switching fabric can utilize any connection method known in the art, such as Fiber Channel, iSCSI, PCI Express, HyperTransport, or QuickPath.
- the switching fabric 116 can be a computer bus.
- Data warehouse 102 is an aggregation of data stores.
- a data store, such as the first data store 104 , stores files 108 .
- a data store can be a logical storage device that provides an area of usable storage capacity on one or more physical disk drive components.
- a logical storage device can contain one or more non-volatile mass storage devices or portions thereof.
- a data store, such as the first data store 104 , can be storage provided from a storage system, such as those available from NetApp, Inc of Sunnyvale, Calif.
- the data stores 104 , 110 , and 124 can make available, to the client 122 and server 118 , some or all of the storage space of each respective storage system.
- each of the non-volatile mass storage devices 104 , 110 , and 124 can be implemented as one or more disks (e.g., a RAID group) or any other suitable mass storage device(s).
- some or all of the storage space can be other types of storage, such as flash memory, SSDs, tape storage, etc.
- the server 118 and client 122 can communicate with the data stores 104 , 110 , and 124 according to well-known protocols, such as the SCSI protocol or the Fibre Channel Protocol (FCP), to make data stored in the data stores 104 and 110 available to the server 118 and/or client 122 .
- Files 108 , 114 , and 128 are electronic files that store data for use by the server 118 and/or client 122 .
- Each file of the files 108 can include any data capable of electronic storage including, for example, text, binary data, database entries, configurations, system information, graphics, disk images, and/or virtual disk images, etc.
- the number of files 108 is variably dependent on the storage capacity of the data store.
- the server 118 can optionally connect, via the computer network 120 , to the client 122 and data store 124 to allow for remote management of files.
- Network 120 can be, for example, a local area network (LAN), wide area network (WAN), or a global area network, such as the Internet, and can make use of any conventional or non-conventional network technologies.
- the client 122 may be a standard computing device, such as a personal computer, laptop computer, smart phone or other computing system capable of connecting to the network 120 .
- the client may perform various functions and management operations, such as the similarity comparisons and data migrations described within this application.
- any other suitable numbers of servers, clients, files, networks, and/or data stores may be employed.
- FIG. 1 b illustrates a virtualization environment and provides context for the technique and system introduced here.
- the virtualization environment may be embodied in a physical host system 130 , such as server 118 , for example. However, it is noted that a separate server or multiple servers can implement the virtualization environment.
- a guest virtual machine 132 operates logically on top of a hypervisor 134 within a physical host system 130 .
- Hypervisor 134 is a software layer that typically provides the virtualization, i.e., virtualization of physical processors, memory and peripheral devices. In certain embodiments, the hypervisor 134 may operate logically on top of a host operating system 136 ; in others, it may operate directly (logically) on top of the host hardware.
- the host operating system 136 can be a conventional operating system, such as Windows, UNIX or Linux.
- the physical host system 130 can be a conventional personal computer (PC), server-class computer, or potentially even a handheld device.
- the physical host system 130 includes various computer hardware, including a set of storage devices (not shown). Alternatively, one or more of the storage devices 104 and/or 110 may be external to the physical host system 130 .
- the virtualization environment can be, for example, a virtualization environment provided by VMware® or Xen®.
- the virtualization environment represents the virtual machine 132 in the form of two types of files, a configuration file 138 and at least one data image file 140 .
- the configuration file 138 contains the configuration (settings) of the virtual machine 132 .
- Each data image file 140 contains data blocks contained within the virtual machine 132 and itself includes numerous individual files, VF 1 , VF 2 , . . . , VF N .
- the data image file 140 is formatted according to the particular virtualization environment being used.
- the technique and system introduced here enable a data image 140 to be compared at the data block level for a degree of similarity with another data image file (not shown). Additionally, one or more of the individual files VF 1 of the data image 140 may be compared at the data block level to compute a degree of similarity with another individual file VF N .
- a virtual machine can be a virtual storage server such as used in a network storage environment, or an independent functional module or portion of a virtual storage server.
- a virtual machine data image from a virtual machine snapshot backup can be a data image of a virtual storage server.
- FIG. 2 is a diagram illustrating an example of the internal architecture 200 of a server 118 , 130 and/or client 122 that can implement one or more features of the invention.
- the client/server architecture 200 is a computer system that includes a processor subsystem 202 that further includes one or more processors.
- the client/server architecture 200 further includes a memory 204 , a network adapter 210 , a storage adapter 211 (optional), a filtering module 212 , a hashing module 214 , a migration module 216 , and a similarity comparator module 218 , each interconnected by an interconnect 222 and powered by a power supply 220 .
- the client/server architecture 200 can be embodied as a single- or multi-processor storage system executing the server 118 or client 122 that preferably implements a high-level module, such as a storage manager, to logically organize the information as a hierarchical structure of named directories, files 108 and 114 (including virtual machines) on the data stores 104 , 110 , and 124 .
- the memory 204 illustratively comprises storage locations that are addressable by the processors 202 and components 210 through 222 for storing software program code and data structures associated with the present invention.
- the processor 202 and components may, in turn, comprise processing elements and/or logic circuitry configured to execute the software code and manipulate the data structures.
- the operating system 206 , portions of which are typically resident in memory and executed by the processor(s) 202 , functionally organizes the client/server architecture 200 by (among other things) configuring the processor(s) 202 to invoke storage and file related operations in support of the present invention. It will be apparent to those skilled in the art that other processing and memory implementations, including various computer readable storage media, may be used for storing and executing program instructions pertaining to the technique introduced here.
- the network adapter 210 includes one or more ports to couple the client/server architecture 200 of the server 118 and/or client 122 over the network 120 , such as a wide area network, virtual private network implemented over a public network (Internet) or a shared local area network. Additionally, the network adapter 210 , or a separate additional adapter, is further configured to connect, via the network 120 , to the data store 124 .
- the network adapter 210 thus can include the mechanical, electrical and signaling circuitry needed to connect the client/server architecture 200 to the network 120 .
- the network 120 can be embodied as an Ethernet network or a Fibre Channel (FC) network, for example.
- the server 118 and the client 122 can communicate, via the network 120 , by exchanging discrete frames or packets of data according to pre-defined protocols, such as TCP/IP.
- the storage adapter 211 cooperates with the operating system 206 to access information requested by the server 118 .
- the information may be stored on any type of attached array of writable storage media, such as magnetic disk or tape, optical disk (e.g., CD-ROM or DVD), flash memory, solid-state disk (SSD), electronic random access memory (RAM), micro-electro mechanical and/or any other similar media adapted to store information, including data and parity information.
- the information is stored on a non-volatile mass storage device within a data store 104 or 110 .
- the operating system 206 facilitates the server's and the client's access to data stored within the data stores 104 and 110 .
- the operating system 206 implements a write-anywhere file system that cooperates with one or more virtualization modules to “virtualize” the storage space provided by the data stores 104 and 110 .
- the operating system 206 is a version of the Data ONTAP® operating system available from NetApp, Inc. implementing the Write Anywhere File Layout (WAFL®) file system.
- WAFL® Write Anywhere File Layout
- the filtering module 212 contains logic to filter data blocks from the comparison by the similarity comparator module 218 .
- the filtering module 212 can selectively filter certain types of data blocks through the use of block-filtering rules.
- a block-filtering rule contains programmable logic, alterable by an end-user, to selectively allow or disallow certain data blocks for comparison, based on the data represented by the data block. For example, data blocks representing free space on a virtual machine image file can be filtered based on a free-space filtering rule to provide increased efficiency in computing a similarity comparison between files. Similarly, data blocks representing portions of an operating system page-file can be automatically filtered (omitted) from a similarity analysis to increase the accuracy of a similarity comparison of virtual machine image files.
- a block-filtering rule can selectively allow a certain type of data block for similarity comparison.
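A block-filtering rule can be pictured as a predicate over a block's contents. The sketch below assumes that free space appears as blocks consisting entirely of zero bytes; that assumption, and the names `is_free_space` and `apply_filter_rules`, are illustrative rather than taken from the source.

```python
def is_free_space(block: bytes) -> bool:
    """Hypothetical free-space rule: the block is non-empty and all zeros."""
    return len(block) > 0 and block.count(0) == len(block)


def apply_filter_rules(blocks, exclude_rules):
    """Drop every block matched by at least one exclusion rule."""
    return [
        block for block in blocks
        if not any(rule(block) for rule in exclude_rules)
    ]


candidate_blocks = [b"\x00\x00\x00\x00", b"data", b"\x00x\x00\x00"]
kept_blocks = apply_filter_rules(candidate_blocks, [is_free_space])
```

Because the rules are plain callables, a user-defined rule (for example, one that recognizes page-file blocks) can be appended to the list without changing the filtering logic.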
- the hashing module 214 generates a hash value for each of the data blocks for comparing during the similarity comparison, described below.
- the hashing module 214 determines a hash value of a data block, based on a hashing algorithm. Creating a hash value for a data block simplifies the subsequent similarity comparison by converting a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array.
- the values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.
- Hash functions are mostly used to speed up table lookup or data comparison tasks, such as detecting duplicated or similar records in a large file.
- any of various hashing algorithms can be utilized by the hashing module 214 , such as MD2, MD4, MD5, CRC, SHA, SHA256, or other mathematical algorithms capable of implementing a hashing function.
- the present invention may operate without the use of hashing algorithms by, for example, simply comparing the layout of bits of one data block to the layout of bits of another data block.
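The two comparison strategies described above can be sketched with Python's standard `hashlib` module. The choice of SHA-256 as the default and the helper names `hash_block` and `blocks_equal_bitwise` are assumptions for illustration.

```python
import hashlib


def hash_block(block: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal digest of a block under the named algorithm."""
    return hashlib.new(algorithm, block).hexdigest()


def blocks_equal_bitwise(a: bytes, b: bytes) -> bool:
    """Hash-free alternative: compare the raw contents of two blocks."""
    return a == b


digest_one = hash_block(b"block contents")
digest_two = hash_block(b"block contents")
digest_other = hash_block(b"other contents")
```

Identical blocks always produce identical digests, so comparing the short digests stands in for comparing the full block contents.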
- the migration module 216 is configured to initiate data migrations between the data stores 104 , 110 and 124 . In one embodiment, the migration module 216 is configured to initiate a data migration of a file 108 from data store 104 to data store 110 or 124 .
- the similarity comparator module 218 is configured to generate a similarity value that expresses the degree of similarity between files.
- the similarity comparator module 218 can be a processor 202 , programmed by the operating system 206 or other software stored in memory 204 .
- the similarity comparator module 218 can be special-purpose hardwired circuitry.
- FIG. 3 illustrates the inter-operation of modules, operating at least in part in the processor(s) 202 , to migrate a file based on the file's similarity to other files.
- the similarity comparator module 218 receives files 108 for comparison from, for example, data store 1 .
- files can be filtered based on a block-filtering rule of the filtering module 212 to allow/disallow certain files from comparison.
- hash values are generated from the files' data blocks by a hashing algorithm of the hashing module 214 .
- FIG. 4 a illustrates a file 130 containing data blocks 402 that are used to generate a series of hash values 404 .
- Each individual hash value of the series of hash values 404 is then passed to the similarity comparator 218 where the hash values are sorted.
- the similarity comparator 218 utilizes a sorting algorithm to create a list of unique hash values 406 from the hash values 404 , and preferably lexicographically sorts each unique hash value, as illustrated at 410 .
- the term “lexicographical sort” refers to the ordering used in creating a dictionary. To lexicographically sort two hash values, the first characters in each hash value are compared. If the characters are identical, then the second characters in each hash value are compared. If the second characters are identical the third, fourth, and remaining characters are compared until two non-identical characters are encountered.
- the hash value with the character having the smaller value is placed first in the lexicographical ordering. For example, if hash values “B78Q64” and “B78MT3” are compared, the determination of lexicographical order is based on the fourth characters, “Q” and “M”. This is because each hash value contains the initial three characters “B78.” Since “M” has an ASCII value that is less than “Q”, the hash value “B78MT3” would be placed before hash value “B78Q64” in the lexicographical order.
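The ordering in this example can be checked directly: Python compares strings character by character using code points, which for ASCII text matches the comparison described above. The extra value "A10000" and the variable names are made up for the illustration.

```python
# Hash values as they might arrive from the hashing step, with one duplicate.
hash_values = ["B78Q64", "B78MT3", "A10000", "B78Q64"]

# Removing duplicates and sorting yields the sorted, unique list.
unique_sorted = sorted(set(hash_values))

# The first three characters match, so the fourth decides: "M" < "Q".
first_difference = next(
    i for i, (x, y) in enumerate(zip("B78Q64", "B78MT3")) if x != y
)
```

`first_difference` is the zero-based index 3, that is, the fourth character, and "B78MT3" sorts ahead of "B78Q64".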
- the process of creating hash values and sorting the data blocks is referred to in this description as creating a baseline. Creating a baseline is additionally performed, as described above, for file 2 . Generating the baseline is costly in terms of time and processing power because, as explained above, every block of each file to compare must be individually hashed and sorted to create the sorted, unique hash values 410 .
- weights are applied to the data blocks (or hash values of the data blocks).
- the weight can be a number or sum of numbers associated with one or more of the data blocks (or hash values) to affect a degree of accuracy of the similarity.
- the weight can be an average distribution of a particular block in the data store.
- a series of block-frequency numbers 408 is utilized in the similarity comparison to increase the accuracy of the similarity comparison.
- the series of block-frequency numbers 408 is generated by the similarity comparator module 218 .
- a block-frequency number 409 represents the number of occurrences of a unique data block (optionally represented as a hash value 405 ) within the data blocks 402 of a file.
- data block 15 ( 405 ) may be repeated 80 times within the data blocks 402 of file 1 ( 130 ).
- the value 80, therefore, is recorded as a block-frequency number 409 associated with data block 15 ( 405 ).
- This step can be repeated for all of the unique data blocks 406 to create the series of block-frequency numbers 408 .
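Counting block frequencies amounts to tallying how often each hash value occurs. The sketch below uses `collections.Counter`; the short string labels standing in for hash values and the count chosen for "11" are illustrative, while the 80 occurrences of "15" mirror the example above.

```python
from collections import Counter


def block_frequencies(hash_values):
    """Return (hash value, block-frequency number) pairs, sorted by hash."""
    counts = Counter(hash_values)      # occurrences of each unique hash
    return sorted(counts.items())      # lexicographic order on the hash


# Block "15" repeated 80 times, block "11" a hypothetical 3 times.
frequencies = block_frequencies(["15"] * 80 + ["11"] * 3)
```

The result pairs every unique hash value with its block-frequency number, already in sorted order for the selection step.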
- the similarity comparator module 218 selects a first portion 412 and 428 of the sorted, unique hash values 410 and 426 , respectively.
- the first portions 412 and 428 each contain a predetermined number (k) of data blocks of the respective file for use in the similarity comparison.
- the number (k) is selectable based on a desired accuracy of the similarity comparison. As can be seen from Equation 1, selecting a high value for k yields higher accuracy in the similarity comparison and selecting a lower value for k yields lower accuracy in the similarity comparison. It should be noted that as greater values are chosen for k, the greater is the time-commitment and performance cost on the processor 202 performing the similarity comparison. Therefore, there is a cost associated with choosing high k values to increase accuracy of the similarity comparison.
- the similarity comparator 218 determines the degree of similarity of the files, represented as a percentage 444 and based on Equation 1, by matching identical hash values 432 from the selected portion of file 1 ( 412 ) and from the selected portion of file 2 ( 428 ).
- FIG. 4 b illustrates that the hash values 11 , 12 , and 15 ( 432 ) are common to each selected portion of file 1 ( 412 ) and file 2 ( 428 ), where the value of ‘k’ is five.
- a summation 436 of the hash-values' corresponding block-frequency values is determined, based on the hash value having a lesser block frequency number of the matching pair.
- matching hash values 11 , 12 , and 15 have lesser block frequency numbers 25 , 35 , and 55 , respectively.
- the numerator 436 is divided by a denominator 438 , as shown in Equation 1.
- the denominator 438 is preferably the greater of two summations: the block-frequency numbers of the data blocks (optionally represented as hash values) within the selected portion 412 , and those within the selected portion 428 .
- FIG. 4 b illustrates a denominator of 377 ( 442 ) which is the summation of the five selected block-frequency numbers 412 of file 1 , which is larger than the summation of the five selected block-frequency numbers 428 of file 2 .
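Equation 1 is referenced but not reproduced in this text. Based only on the numerator and denominator described above, one plausible reconstruction is given below, followed by the arithmetic for the FIG. 4 b numbers. The symbols are assumptions: $A_k$ and $B_k$ denote the selected portions of file 1 and file 2, and $f_A(h)$, $f_B(h)$ denote the block-frequency numbers of hash value $h$ in each file.

```latex
\[
\text{similarity}(A, B) \;=\;
\frac{\displaystyle\sum_{h \,\in\, A_k \cap B_k} \min\bigl(f_A(h),\, f_B(h)\bigr)}
     {\displaystyle\max\Bigl(\sum_{h \in A_k} f_A(h),\; \sum_{h \in B_k} f_B(h)\Bigr)}
\]

\[
\frac{25 + 35 + 55}{377} \;=\; \frac{115}{377} \;\approx\; 0.305
\quad (\text{about } 30.5\%)
\]
```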
- a person having ordinary skill in the art will understand that other values for the numerator 436 and/or the denominator 438 can be selected based on the desired accuracy of the similarity comparison.
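The comparison described above can be sketched as follows. This is an interpretation of the text, not the patent's own code: the function names are illustrative, and apart from the matching hashes 11, 12, and 15, their lesser frequencies 25, 35, and 55, the 80 occurrences of block 15 in file 1, and the file 1 total of 377, every number in the sample data is invented so that the totals agree with the description.

```python
def select_portion(frequencies, k):
    """Take the first k entries of the sorted, unique hash values."""
    return dict(sorted(frequencies.items())[:k])


def similarity(portion_a, portion_b):
    """Frequency-weighted similarity of two selected portions.

    Numerator: for each hash present in both portions, the lesser of the
    two block-frequency numbers. Denominator: the greater of the two
    portions' frequency totals.
    """
    common = portion_a.keys() & portion_b.keys()
    numerator = sum(min(portion_a[h], portion_b[h]) for h in common)
    denominator = max(sum(portion_a.values()), sum(portion_b.values()))
    return numerator / denominator if denominator else 0.0


# Hypothetical per-file frequencies (hash value -> block-frequency number).
file1 = {11: 25, 12: 40, 13: 152, 14: 80, 15: 80, 90: 7}
file2 = {10: 60, 11: 30, 12: 35, 15: 55, 16: 90, 95: 4}

portion1 = select_portion(file1, k=5)   # frequencies sum to 377
portion2 = select_portion(file2, k=5)   # frequencies sum to 270
value = similarity(portion1, portion2)  # 115 / 377
```

Only the first k entries of each file enter the calculation, which is why raising k costs more processing while sampling more of each file.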
- After the similarity comparator module 218 performs the similarity comparison, other similarity comparisons can be performed on files 108 and files 114 to determine the most suitable data store, such as a logical storage device, to which to migrate file 1 . Based on the operations of data de-duplication, available space on a data store is optimized by storing together files having the most similarity; therefore, it may be advantageous to identify a least similar file (or dataset) of a data store 104 by performing similarity comparisons on all or a portion of files (or datasets) located at a data store 104 . The file (or dataset) having the lowest similarity value of all or a portion of the files (or datasets) at a data store 104 is identified as the least similar file (or dataset).
- Migration module 216 can migrate ( 304 ) files (or datasets) having the lowest similarity value to a separate data store 110 having files 114 (or datasets) more similar to the least similar files. This optimizes each data store by maximizing the available space at the data store after de-duplication.
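A sketch of how a least similar file and a migration target might be chosen. The text does not say how pairwise similarity values are combined into one value per file, so the sketch assumes the mean of a file's pairwise similarities; that choice, the function names, and the sample data are all hypothetical.

```python
def similarity(a, b):
    """Frequency-weighted similarity of two portions (hash -> frequency)."""
    numerator = sum(min(a[h], b[h]) for h in a.keys() & b.keys())
    denominator = max(sum(a.values()), sum(b.values()))
    return numerator / denominator if denominator else 0.0


def mean_similarity(portion, others):
    """Assumed aggregate: average similarity of a portion to other portions."""
    values = [similarity(portion, other) for other in others]
    return sum(values) / len(values) if values else 0.0


def least_similar(files):
    """Name of the file with the lowest mean similarity to its neighbours."""
    return min(
        files,
        key=lambda name: mean_similarity(
            files[name], [p for n, p in files.items() if n != name]
        ),
    )


def best_target(portion, stores):
    """Name of the data store whose files are most similar to the portion."""
    return max(
        stores,
        key=lambda s: mean_similarity(portion, list(stores[s].values())),
    )


store_1 = {"vm_a": {1: 10, 2: 10}, "vm_b": {1: 10, 2: 8}, "vm_c": {7: 5, 8: 5}}
other_stores = {"store_2": {"vm_d": {7: 5, 8: 4}}, "store_3": {"vm_e": {1: 1}}}

candidate = least_similar(store_1)
target = best_target(store_1[candidate], other_stores)
```

In this data, `vm_c` shares no blocks with its neighbours and is selected, and `store_2` is chosen because `vm_d` overlaps with it the most.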
- FIG. 4 c illustrates a process of updating the similarity of a file by updating the sorted, unique hash values 410 and block frequency numbers 411 without having to re-determine the entire baseline determination previously described.
- An alteration to a file affects the sorted, unique hash values 410 only if a data block within the selected portion 412 is affected. If the alteration does not affect the selected portion 412 , the values used for the similarity comparison remain unchanged from the previously computed similarity comparison.
- a second portion 414 of the sorted, unique hash values and block frequency numbers is selected to increase the accuracy of the similarity comparison after the file is modified. Similar to the first selected portion 412 , the number of data blocks (or hash values) chosen for the second portion 414 is based on the desired accuracy of the similarity comparison. The higher the number chosen, the greater is the accuracy but the greater is the performance cost in generating the comparison. Preferably, the size of the second portion 414 will be a single multiple of the k-value selected in the first selected portion 412 .
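- if a modification adds another occurrence of a data block already represented within the selected portion 412 , the block frequency number associated with that data block is increased.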
- if a modification removes a data block 456 already represented by multiple occurrences within the selected portion 412 , the block frequency number associated with that data block is decremented.
- the similarity comparison between the modified file and another file is determined based on the updated sorted, unique hash values and Equation 1, without resorting or rehashing the entire list of weighted hash values 410 , which reduces the time and processing required to perform and incrementally update the similarity comparison.
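One way to picture the incremental update is to keep a small window of the lowest sorted hash values and adjust only that window when a block changes. This is an interpretation, not the patent's stated procedure: treating the window as the first and second portions together, the function names, and the sample values are all assumptions. A modified block can be handled as a removal of the old hash followed by an addition of the new one.

```python
def apply_change(window, hash_value, delta):
    """Apply one block change to the window; return True if it was affected.

    window maps hash value -> block frequency for a sorted prefix of the
    file's unique hash values. delta is +1 for an added block and -1 for a
    removed block.
    """
    if hash_value in window:
        window[hash_value] += delta
        if window[hash_value] <= 0:
            del window[hash_value]       # last occurrence was removed
        return True
    if delta > 0 and window and hash_value < max(window):
        window[hash_value] = delta       # new unique block inside the range
        return True
    return False                         # change lies beyond the window


def selected_portion(window, k):
    """The first k sorted entries, as used in the similarity comparison."""
    return dict(sorted(window.items())[:k])


window = {"11": 25, "12": 40, "15": 80}
results = [
    apply_change(window, "12", +1),   # existing block gains an occurrence
    apply_change(window, "15", -1),   # existing block loses an occurrence
    apply_change(window, "13", +1),   # new block that sorts inside the window
    apply_change(window, "99", +1),   # new block that sorts after the window
]
portion = selected_portion(window, k=2)
```

No other blocks are rehashed or resorted. The selected portion remains valid only while the window still holds at least k entries; once it shrinks below that, a fresh baseline would be needed.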
- FIG. 5 is a flow chart illustrating an example of the process of selecting a file to migrate from a first logical storage device to a second logical storage device.
- the steps of FIG. 5 discuss migrating the data representing a virtual machine from one storage device to another; however, any data source can be utilized.
- the data blocks of each of two virtual machines are identified for use in computing a first degree of similarity between the two virtual machines.
- Step 504 removes any undesirable blocks from the data blocks of the virtual machines to increase the accuracy of the similarity by use of a block filter rule.
- a block filter rule is a predefined (but alterable) set of one or more criteria to exclude (or include) a data block from the similarity comparison, based on one or more characteristics of the data block.
- One block filter rule, for example, can exclude comparing free space on the virtual machines.
- Another block filter rule, for example, can exclude a page-file of the virtual machine.
- the block filter rule is user-defined such that a user of the system can identify a type of data block to exclude (or include) in the comparison analysis.
- the non-excluded data blocks are used to generate hash values that are lexicographically sorted in step 508 .
- Step 510 includes determining a first similarity value of a first virtual machine to another virtual machine.
- the first similarity value associated with the first virtual machine is then compared, in step 512 , to a second similarity value between the first virtual machine and a virtual machine on a separate, second storage device. If it is determined in step 513 that the second similarity value is greater than the first similarity value (or exceeds a predetermined threshold value), the first virtual machine is migrated to the second storage device in step 514 , so that a greater amount of space can be retrieved from the first storage device.
- the least similar virtual machine is identified on the first storage device, because removing that least similar virtual machine will provide the most free-space on the first storage device due to data de-duplication or other storage techniques. In order to further save additional storage space on other storage devices, the least similar virtual machine is migrated to a storage server having virtual machines most similar to the least similar virtual machine.
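The steps of FIG. 5 can be strung together in a compact sketch: filter blocks, hash and sort them, compute two similarity values, and decide whether to migrate. The helper names, the SHA-256 choice, the all-zero free-space rule, and the toy virtual machine contents are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def baseline(blocks, k, exclude=lambda block: False):
    """Filter, hash, count, and sort blocks; keep the first k unique hashes."""
    kept = [b for b in blocks if not exclude(b)]            # block filter rule
    hashes = [hashlib.sha256(b).hexdigest() for b in kept]  # hash each block
    return dict(sorted(Counter(hashes).items())[:k])        # sort, select k


def similarity(a, b):
    """Frequency-weighted similarity of two selected portions."""
    numerator = sum(min(a[h], b[h]) for h in a.keys() & b.keys())
    denominator = max(sum(a.values()), sum(b.values()))
    return numerator / denominator if denominator else 0.0


def should_migrate(first, second, threshold=None):
    """Migrate when the second value beats the first or an optional threshold."""
    return second > first or (threshold is not None and second > threshold)


def is_zero_block(block):
    """Hypothetical free-space rule: the block is entirely zero bytes."""
    return len(block) > 0 and block.count(0) == len(block)


vm_1 = baseline([b"os"] * 3 + [b"app1", b"\x00\x00"], 5, is_zero_block)
vm_local = baseline([b"other"] * 2, 5, is_zero_block)
vm_remote = baseline([b"os"] * 3 + [b"app2"], 5, is_zero_block)

first_value = similarity(vm_1, vm_local)     # same storage device
second_value = similarity(vm_1, vm_remote)   # second storage device
migrate = should_migrate(first_value, second_value)
```

With this toy data the first virtual machine shares nothing with its local neighbour and three of four blocks with the remote one, so migration is indicated.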
- ASICs application-specific integrated circuits
- PLDs programmable logic devices
- FPGAs field-programmable gate arrays
- Machine-readable medium includes any mechanism that can store information in a form accessible by a machine (a machine may be, for example, a computer, network device, cellular phone, personal digital assistant (PDA), manufacturing tool, any device with one or more processors, etc.).
- a machine-accessible medium includes recordable/non-recordable media (e.g., read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; etc.), etc.
- logic can include, for example, special-purpose hardwired circuitry, software and/or firmware in conjunction with programmable circuitry, or a combination thereof.
Abstract
Description
- At least one embodiment of the present invention pertains to determining data similarity, and more particularly, to methods and apparatus for incremental determination of a similarity value based on a subset of frequency-weighted blocks of a dataset.
- The exponential growth of digital information, credited to faster processors, lower cost of digital data storage, increasing availability of high data rate access, and development of new applications has increased the demand for computer storage. This increased dependence on computer data and data storage creates a need for more efficient data analysis technology.
- With the increasing availability of low-cost, high-volume data storage devices, an increasing amount of data can be stored on an individual logical storage device, such as a physical disk drive, tape drive, or optical drive. Consumer hard drives, for example, have recently exceeded a terabyte of data storage capacity to meet the increasing demands for electronic storage. However, efficiently managing large amounts of data is burdensome and costly.
- Technologies, such as de-duplication, have attempted to address some of the burden of managing large amounts of data by eliminating redundant data to improve storage utilization. In the de-duplication process, duplicate data on a logical storage device is deleted, leaving only one copy of the data, along with references to that one copy of the data. De-duplication is able to reduce the required storage capacity since only the unique data is stored. Each subsequent instance of duplicated data is simply referenced back to the one saved copy.
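The de-duplication idea described above can be illustrated with a minimal sketch. The function name `deduplicate`, the use of SHA-256 digests as block identifiers, and the sample blocks are all assumptions made for illustration; the specification does not prescribe them.

```python
import hashlib

def deduplicate(blocks):
    """Store each distinct block once and keep a reference per original block."""
    store = {}   # digest -> the single saved copy of the block
    refs = []    # one digest per original block, in original order
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only the first copy is kept
        refs.append(digest)              # later copies become references
    return store, refs

# Hypothetical data: four blocks, two of them distinct.
blocks = [b"alpha", b"beta", b"alpha", b"alpha"]
store, refs = deduplicate(blocks)
print(len(blocks), "blocks stored as", len(store), "unique copies")

# The original sequence can be rebuilt from the references.
restored = [store[d] for d in refs]
```

The more blocks two files share, the fewer unique copies the store holds, which is why placing similar files on the same device tends to save space.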
- To maximize the benefits of de-duplication, it is advantageous to aggregate, to a single logical storage device, data files having maximum similarity to one another. However, it is time-consuming and computationally intensive to compare each data block of one file, for example, to each data block of another file to determine the similarity between the two files. The computational complexity increases further with larger files that may be associated with a Virtual Machine (VM).
- A VM is normally represented as a set of files, including one or more configuration files and one or more disk image files. A configuration file stores configuration (settings) of the virtual machine. A virtual machine disk image file represents the operating system and data contained within the virtual machine and itself typically includes numerous individual files. A single VM disk image file can exceed several gigabytes of storage space, and a single logical storage device can contain numerous VMs, up to the capacity of the storage device.
- Previous efforts to determine similarity between files relied on a “brute force” method. The brute force method utilizes set similarity based on determining both an intersection and union of all data blocks of each file undergoing comparison. For example, to determine similarity between VM ‘A’ and VM ‘B’, the following “brute force” equation has been utilized: S(A,B)=|A∩B|/|A∪B|, where ‘A’ is the set of data blocks (or corresponding hash values) of VM ‘A’, ‘B’ is the set of data blocks (or corresponding hash values) of VM ‘B’, ‘∩’ is the intersection operator, and ‘∪’ is the union operator. However, this technique is time consuming and processor intensive, especially for larger files because every block of the file is processed.
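The brute force equation S(A,B)=|A∩B|/|A∪B| can be sketched as follows. The function name, the short string stand-ins for hash values, and the choice to return 0.0 when both sets are empty are assumptions for illustration only.

```python
def brute_force_similarity(blocks_a, blocks_b):
    """Set similarity |A ∩ B| / |A ∪ B| over all block hash values."""
    a, b = set(blocks_a), set(blocks_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty inputs are treated as dissimilar
    return len(a & b) / len(union)

# Hypothetical hash values for two virtual machines.
vm_a = ["h1", "h2", "h3", "h4"]
vm_b = ["h3", "h4", "h5"]
similarity = brute_force_similarity(vm_a, vm_b)  # 2 shared / 5 total
print(similarity)
```

Every block of both inputs is visited to build the two sets, which is the cost the passage above attributes to this method.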
- Other techniques utilize a Broder equation to attempt to offset the brute force method, by comparing only a subset of sorted data blocks of the files being compared. This technique avoids some of the issues of the brute force method by limiting the determination of the intersection of data blocks of the files to a predetermined number (k) of data blocks and eliminating the determination of the union of all data blocks of each file being compared.
- One limitation with the Broder technique of computing file similarity is its inability to efficiently re-compute the similarity of two files previously compared. Re-computing files' similarity is appropriate after data blocks on one or both of the files change. A change can occur after an existing block is removed or modified, or a new block is created in the file. Previous techniques lacked the ability to incrementally adjust the previously computed data similarity without re-computing the Broder equation, which involves at least resorting all of the data blocks of the files and introduces the problems associated with the brute force method.
- Another problem with the Broder technique is that it introduces variance in the accuracy of the similarity comparison. Under the Broder technique, accuracy of the similarity comparison is a function of the number (k) of sorted data blocks utilized in the similarity comparison. The lower the number (k), the less accurate the similarity comparison will be. The higher number (k) of sorted data blocks, the greater the accuracy of the similarity comparison will be. However, increasing the number (k) results in the original problem of the “brute force” method where the computational complexity and time commitment exceeded the usefulness of computing the similarity.
- Therefore, the problems of computational complexity, high time commitment, and poor accuracy when incrementally determining a similarity comparison of large files have thus far not been addressed, and they hinder current efforts to efficiently utilize data storage devices to manage and organize electronic information.
- Introduced herein are methods and apparatus for efficiently determining a degree of similarity between two or more datasets. In one embodiment, the similarity is determined based on comparing a subset of sorted frequency-weighted blocks from one dataset to a subset of sorted frequency-weighted blocks from another dataset. In one embodiment, data blocks of a dataset are used to compute unique, frequency-weighted hash values. The frequency-weight of a particular hash value is based on a summation of other hash values of the dataset equaling the particular hash value. These frequency-weighted hash values can be compared to frequency-weighted hash values of another dataset to determine a degree of similarity of the two datasets. In another embodiment, upon a change of a block in a subset of the dataset, the similarity value is re-determined without resorting or hashing the blocks of a dataset other than the blocks of the subset, resulting in an increased performance of the similarity comparison. In yet another embodiment, blocks of a dataset are excluded based on a block-filtering rule to increase the accuracy of the similarity comparison.
- The solution presented herein overcomes the time-consuming computation of performing a baseline similarity comparison when re-determining (updating) a similarity between two datasets by incrementally updating only a portion of the total number of blocks of a dataset. The technique introduced herein also overcomes the problem of poor accuracy of the similarity comparison result by filtering undesirable data blocks from the comparison using block-filtering rules and by using block-frequencies to increase the accuracy of the similarity comparison.
- One or more embodiments of the present invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.
- FIG. 1 a illustrates a network storage environment in which the present invention can be implemented.
- FIG. 1 b illustrates a virtual machine represented as a configuration file and a data image file.
- FIG. 2 is a high-level block diagram showing an example of the hardware architecture of a computer that can perform a similarity comparison.
- FIG. 3 is a low-level block diagram showing example modules of a processor to implement various functions of the present invention.
- FIG. 4 a illustrates an example of various steps of a similarity comparison of data blocks from two different files.
- FIG. 4 b illustrates an example similarity determination based on frequency-weighted data blocks from two different files.
- FIG. 4 c illustrates an example of re-determining the similarity value based on creating a new data block, updating an existing block, or removing a data block within a subset of sorted frequency-weighted data blocks of a file.
- FIG. 5 is a flow diagram illustrating a process for identifying a least similar virtual machine based on sorted frequency-weighted data blocks of multiple virtual machines and migrating the virtual machine to a server.
- References in this specification to “an embodiment”, “one embodiment”, or the like, mean that the particular feature, structure or characteristic being described is included in at least one embodiment of the present invention. Occurrences of such phrases in this specification do not necessarily all refer to the same embodiment.
-
FIG. 1 a shows a network configuration in which the techniques introduced here can be implemented. It is noted that the network environment described here is for illustration of one type of a configuration in which the techniques can be implemented, and that other network storage configurations and schemes can be used for implementing the techniques introduced herein. -
FIG. 1 a shows a network data storage environment 100, which includes a server system 118 and a data warehouse 102 containing a first data store 104 and an optional second data store 110. Each data store contains files 108/114 that are accessible, over a switching fabric 116, to the server 118. A file contains data that may be stored at the block level at a data store 104/110. A block is a sequence of bytes or bits, having a nominal length (a block size). Data thus structured are said to be blocked. Blocked data are typically read a whole block at a time. - The switching
fabric 116 connects together the server 118 and the data stores 104/110. The server 118 is connected, via a network 120, to a client 122. The first and/or second data stores 104/110 can optionally be located via the network 120, as illustrated by data store 124. - The
environment 100 can be utilized to perform aspects of the invention. For example, in one embodiment, the environment 100 is used to identify a least similar file (or dataset) of the files 108 to free available space on the first data store 104. In this regard, the least similar file (or dataset) is identified on the first data store, because removing that least similar file will provide the most free space on the first data store due to data de-duplication or other storage techniques. In order to save storage space, the least similar file is migrated to a second data store 110 having files most similar to the least similar file. - The
server 118 may be, for example, a standard computing system such as a personal computer (PC) or server-class computer, equipped with an operating system. Alternatively, theserver 118 can be one of the FAS family of storage server products available from NetApp®, Inc of Sunnyvale, Calif. Theserver 118 may perform various functions and management operations on thefiles data stores - The switching
fabric 116 connects the server 118 to the data stores 104/110 of the data warehouse 102. The switching fabric can utilize any connection method known in the art, such as Fibre Channel, iSCSI, PCI Express, HyperTransport, or QuickPath. Alternatively, the switching fabric 116 can be a computer bus. -
Data warehouse 102 is an aggregation of data stores. A data store, such as thefirst data store 104, stores files 108. In one embodiment, a data store can be a logical storage device that provides an area of usable storage capacity on one or more physical disk drives components. A logical storage device can contain one or more non-volatile mass storage devices or portions thereof. In another embodiment, a data store, such as thefirst data store 104, can be storage provided from a storage system, such as those available from NetApp, Inc of Sunnyvale, Calif. Thedata stores client 122 andserver 118, some or all of the storage space of each respective storage system. For example, each of the non-volatilemass storage devices server 118 andclient 122 can communicate with thedata stores data stores server 118 and/orclient 122. -
Files server 118 and/or client 122. Each file of the files 108 can include any data capable of electronic storage including, for example, text, binary data, database entries, configurations, system information, graphics, disk images, and/or virtual disk images. The number of files 108 varies with the storage capacity of the data store. - The
server 118 can optionally connect, via the computer network 120, to the client 122 and data store 124 to allow for remote management of files. Network 120 can be, for example, a local area network (LAN), wide area network (WAN), or a global area network, such as the Internet, and can make use of any conventional or non-conventional network technologies. - The
client 122 may be a standard computing device, such as a personal computer, laptop computer, smart phone or other computing system capable of connecting to the network 120. The client may perform various functions and management operations, such as the similarity comparisons and data migrations described within this application. - It is noted that, within the network
data storage environment 100, any other suitable numbers of servers, clients, files, networks, and/or data stores may be employed. -
FIG. 1 b illustrates a virtualization environment and provides context for the technique and system introduced here. The virtualization environment may be embodied in aphysical host system 130, such asserver 118, for example. However, it is noted that a separate server or multiple servers can implement the virtualization environment. A guestvirtual machine 132 operates logically on top of ahypervisor 134 within aphysical host system 130.Hypervisor 134 is a software layer that typically provides the virtualization, i.e., virtualization of physical processors, memory and peripheral devices. In certain embodiments, thehypervisor 134 may operate logically on top of ahost operating system 136; in others, it may operate directly (logically) on top of the host hardware. Thehost operating system 136 can be a conventional operating system, such as Windows, UNIX or Linux. Thephysical host system 130 can be a conventional personal computer (PC), server-class computer, or potentially even a handheld device. Thephysical host system 130 includes various computer hardware, including a set of storage devices (not shown). Alternatively, one or more of thestorage devices 104 and/or 110 may be external to thephysical host system 130. - The virtualization environment can be, for example, a virtualization environment provided by VMWare® or Xen®, for example. The virtualization environment represents the
virtual machine 132 in the form of two types of files, aconfiguration file 138 and at least onedata image file 140. Although only oneconfiguration file 138 and only onedata image file 140 are shown, note that a virtual machine may be represented by two or more configuration files and/or two or more data image files. Theconfiguration file 138 contains the configuration (settings) of thevirtual machine 132. Eachdata image file 140 contains data blocks contained within thevirtual machine 132 and itself includes numerous individual files, VF1, VF2, . . . , VFN. Thedata image file 140 is formatted according to the particular virtualization environment being used. Nonetheless, the technique and system introduced here enable adata image 140 to be compared at the data block level for a degree of similarity with another data image file (not shown). Additionally, one or more of the individual files VF1 of thedata image 140 may be compared at the data block level to compute a degree of similarity with another individual file VFN. - The technique and system introduced above can be used with virtual machines of various designs and functions. For example, a virtual machine can be a virtual storage server such as used in a network storage environment, or an independent functional module or portion of a virtual storage server. Accordingly, a virtual machine data image from a virtual machine snapshot backup can be a data image of a virtual storage server.
-
FIG. 2 is a diagram illustrating an example of theinternal architecture 200 of aserver client 122 that can implement one or more features of the invention. In the illustrated embodiment, the client/server architecture 200 is a computer system that includes aprocessor subsystem 202 that further includes one or more processors. The client/server architecture 200 further includes amemory 204, anetwork adapter 210, a storage adapter 211 (optional), afiltering module 212, ahashing module 214, amigration module 216, and asimilarity comparator module 218, each interconnected by aninterconnect 222 and powered by apower supply 220. - The client/
server architecture 200 can be embodied as a single- or multi-processor storage system executing theserver 118 orclient 122 that preferably implements a high-level module, such as a storage manager, to logically organize the information as a hierarchical structure of named directories, files 108 and 114 (including virtual machines) on thedata stores - The
memory 204 illustratively comprises storage locations that are addressable by theprocessors 202 andcomponents 210 through 222 for storing software program code and data structures associated with the present invention. Theprocessor 202 and components may, in turn, comprise processing elements and/or logic circuitry configured to execute the software code and manipulate the data structures. Theoperating system 206, portions of which are typically resident in memory and executed by the processor(s) 202, functionally organizes the client/server architecture 200 by (among other things) configuring the processor(s) 202 to invoke storage and file related operations in support of the present invention. It will be apparent to those skilled in the art that other processing and memory implementations, including various computer readable storage media, may be used for storing and executing program instructions pertaining to the technique introduced here. - The
network adapter 210 includes one or more ports to couple the client/server architecture 200 of the server 118 and/or client 122 over the network 120, such as a wide area network, a virtual private network implemented over a public network (Internet), or a shared local area network. Additionally, the network adapter 210, or a separate additional adapter, is further configured to connect, via the network 120, to the data store 124. The network adapter 210 thus can include the mechanical, electrical and signaling circuitry needed to connect the client/server architecture 200 to the network 120. Illustratively, the network 120 can be embodied as an Ethernet network or a Fibre Channel (FC) network. The server 118 and the client 122 can communicate via the network 120 by exchanging discrete frames or packets of data according to pre-defined protocols, such as TCP/IP. - The
storage adapter 211 cooperates with theoperating system 206 to access information requested by theserver 118. The information may be stored on any type of attached array of writable storage media, such as magnetic disk or tape, optical disk (e.g., CD-ROM or DVD), flash memory, solid-state disk (SSD), electronic random access memory (RAM), micro-electro mechanical and/or any other similar media adapted to store information, including data and parity information. However, as illustratively described herein, the information is stored on non-volatile mass storage device within adata store - The
operating system 212 facilitates the server's and the client's access to data stored within thedata stores operating system 206 implements a write-anywhere file system that cooperates with one or more virtualization modules to “virtualize” the storage space provided by thedata stores operating system 206 is a version of the Data ONTAP® operating system available from NetApp, Inc. implementing the Write Anywhere File Layout (WAFL®) file system. However, other storage operating systems are capable of being enhanced or created for use in accordance with the principles described herein. - The
filtering module 212 contains logic to filter data blocks from the comparison by thesimilarity comparator module 218. Thefiltering module 212 can selectively filter certain types of data blocks through the use of block-filtering rules. A block-filtering rule contains programmable logic, alterable by an end-user, to selectively allow or disallow certain data blocks for comparison, based on the data represented by the data block. For example, data blocks representing free space on a virtual machine image file can be filtered based on a free-space filtering rule to provide increased efficiency in computing a similarity comparison between files. Similarly, data blocks representing portions of an operating system page-file can be automatically filtered (omitted) from a similarity analysis to increase the accuracy of a similarity comparison of virtual machine image files. Alternatively, a block-filtering rule can selectively allow a certain type of data block for similarity comparison. - The
hashing module 214 generates a hash value for each of the data blocks to be compared during the similarity comparison, described below. The hashing module 214 determines a hash value of a data block based on a hashing algorithm. Creating a hash value for a data block simplifies the subsequent similarity comparison by converting a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes. Hash functions are mostly used to speed up table lookup or data comparison tasks, such as detecting duplicated or similar records in a large file. Various mathematical functions can be utilized by the hashing module 214, such as MD2, MD4, MD5, CRC, SHA, SHA256, or other mathematical algorithms capable of implementing a hashing function. Alternatively, the present invention may operate without the use of hashing algorithms by, for example, simply comparing the layout of bits of one data block to the layout of bits of another data block. - The
migration module 216 is configured to initiate data migrations between thedata stores migration module 216 is configured to initiate a data migration of afile 108 fromdata store 104 todata store - The
similarity comparator module 218 is configured to generate a similarity value that expresses the degree of similarity between files. In one embodiment, thesimilarity comparator module 218 can be aprocessor 202, programmed by theoperating system 206 or other software stored inmemory 204. Alternatively, thesimilarity comparator module 218 can be special-purpose hardwired circuitry. -
FIG. 3 illustrates the inter-operation of modules, operating at least in part in the processor(s) 202, to migrate a file based on the file's similarity to other files. The similarity comparator module 218 receives files 108 for comparison from, for example, data store 1. Optionally, as described above, files can be filtered based on a block-filtering rule of the filtering module 212 to allow/disallow certain files from comparison. Before receipt by the similarity comparator 218, hash values are generated from the files' data blocks by a hashing algorithm of the hashing module 214. FIG. 4 a illustrates a file 130 containing data blocks 402 that are used to generate a series of hash values 404. Each individual hash value of the series of hash values 404 is then passed to the similarity comparator 218 where the hash values are sorted. The similarity comparator 218 utilizes a sorting algorithm to create a list of unique hash values 406 from the hash values 404, and preferably lexicographically sorts each unique hash value, as illustrated at 410. The term “lexicographical sort” refers to the ordering used in creating a dictionary. To lexicographically sort two hash values, the first characters in each hash value are compared. If the characters are identical, then the second characters in each hash value are compared. If the second characters are identical, the third, fourth, and remaining characters are compared until two non-identical characters are encountered. When this occurs, the hash value with the character having the smaller value is placed first in the lexicographical ordering. For example, if hash values “B78Q64” and “B78MT3” are compared, the determination of lexicographical order is based on the fourth characters, “Q” and “M”. This is because each hash value contains the initial three characters “B78.” Since “M” has an ASCII value that is less than “Q”, the hash value “B78MT3” would be placed before hash value “B78Q64” in the lexicographical order.
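The hashing and lexicographical sorting steps described above can be sketched as follows. The function name `sorted_unique_hashes`, the choice of SHA-256 (one of the algorithms the text lists as options), and the sample blocks are assumptions for illustration. Python compares strings character by character using code point values, which matches the ASCII comparison in the “B78Q64”/“B78MT3” example.

```python
import hashlib

def sorted_unique_hashes(blocks):
    """Hash each block, drop duplicate hash values, and sort lexicographically."""
    digests = {hashlib.sha256(block).hexdigest() for block in blocks}
    return sorted(digests)

# The two hash values from the text: "M" sorts before "Q" at the fourth character.
example = sorted(["B78Q64", "B78MT3"])
print(example)

# Hypothetical blocks; b"kernel" occurs twice but yields one unique hash value.
blocks = [b"kernel", b"libc", b"kernel", b"page"]
unique_sorted = sorted_unique_hashes(blocks)
print(len(unique_sorted), "unique hash values")
```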
The process of creating hash values and sorting the data blocks is referred to in this description as creating a baseline. Creating a baseline is additionally performed, as described above, forfile 2. Generating the baseline is costly in terms of time and processing power because, as explained above, every block of each file to compare must be individually hashed and sorted to create the sorted, unique hash values 410. - In one embodiment, to increase the accuracy of the similarity comparison, weights are applied to the data blocks (or hash values of the data blocks). The weight can be a number or sum of numbers associated with one or more of the data blocks (or hash values) to effect a degree of accuracy of the similarity. By non-limiting example, the weight can be an average distribution of a particular block in the data store. In a particular embodiment a series of block-
frequency numbers 408 is utilized in the similarity comparison to increase the accuracy of the similarity comparison. The series of block-frequency numbers 408 is generated by thesimilarity comparator module 218. A block-frequency number 409 represents the number of occurrences of a unique data block (optionally represented as a hash value 405) within the data blocks 402 of a file. For example, data block 15 (405) may be repeated 80 times within the data blocks 402 of file 1 (130). Thevalue 80, therefore, is recorded as a block-frequency number 409 associated with data block 15 (405). This step can be repeated for all of the unique data blocks 406 to create the series of block-frequency numbers 408. - As illustrated in
FIG. 4 b, once the baseline steps are performed forfile 1 andfile 2, for example, thesimilarity comparator module 218 selects afirst portion unique hash values first portion Equation 1, selecting a high value for k yields higher accuracy in the similarity comparison and selecting a lower value for k yields lower accuracy in the similarity comparison. It should be noted that as greater values are chosen for k, the greater is the time-commitment and performance cost on theprocessor 202 performing the similarity comparison. Therefore, there is a cost associated with choosing high k values to increase accuracy of the similarity comparison. - The
similarity comparator 218 determines the degree of similarity of the files, represented as apercentage 444 and based onEquation 1, by matching identical hash values 432 from the selected portion of file 1 (412) and from the selected portion of file 2 (428). For example,FIG. 4 b illustrates that the hash values 11, 12, and 15 (432) are common to each selected portion of file 1 (412) and file 2 (428), where the value of ‘k’ is five. For each of these hash values, asummation 436 of the hash-values' corresponding block-frequency values is determined, based on the hash value having a lesser block frequency number of the matching pair. For example, matching hash values 11, 12, and 15 have lesserblock frequency numbers numerator 436 is divided by adenominator 438, as shown inEquation 1. Thedenominator 438 is preferably a summation of the number of data blocks (optionally represented as hash values) within the selectedportions FIG. 4 b illustrates a denominator of 377 (442) which is the summation of the five selected block-frequency numbers 412 offile 1, which is larger than the summation of the five selected block-frequency numbers 428 offile 2. A person having ordinary skill in the art will understand that other values for thenumerator 436 and/or thedenominator 438 can be selected based on the desired accuracy of the similarity comparison. - After the
similarity comparator module 218 performs the similarity comparison, other similarity comparisons can be performed on files 108 and files 114 to determine the most suitable data store, such as a logical storage device, to which to migrate file 1. Based on the operations of data de-duplication, available space on a data store is optimized by storing together files having the most similarity; therefore, it may be advantageous to identify a least similar file (or dataset) of a data store 104 by performing similarity comparisons on all or a portion of files (or datasets) located at a data store 104. The file (or dataset) having the lowest similarity value of all or a portion of the files (or datasets) at a data store 104 is identified as the least similar file (or dataset). Migration module 216 can migrate (304) files (or datasets) having the lowest similarity value to a separate data store 110 having files 114 (or datasets) more similar to the least similar files. This optimizes each data store by maximizing the available space at the data store after de-duplication. - Once the degree of similarity between files has been determined, one or more of the files may change based on, for example, data being added to and/or removed from one of the files. This alteration of a file changes its existing data block structure and thus its similarity to other files.
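The frequency-weighted comparison described for FIG. 4 b can be sketched roughly as follows, under stated assumptions. The numerator sums, for each hash value common to both selected portions, the lesser of the two block-frequency numbers; the denominator is the larger of the two sums of block-frequency numbers in the selected portions. The function name, the dictionary representation, the sample frequencies, and the choice of the first k entries in sorted order as the selected portion are all hypothetical.

```python
def weighted_similarity(freq_a, freq_b, k):
    """freq_a and freq_b map each unique hash value to its block-frequency number."""
    # Assumption: the selected portion is the first k sorted unique hash values.
    portion_a = sorted(freq_a)[:k]
    portion_b = sorted(freq_b)[:k]
    common = set(portion_a) & set(portion_b)
    # Numerator: the lesser block-frequency number of each matching pair.
    numerator = sum(min(freq_a[h], freq_b[h]) for h in common)
    # Denominator: the larger total of block-frequency numbers in either portion.
    denominator = max(sum(freq_a[h] for h in portion_a),
                      sum(freq_b[h] for h in portion_b))
    return numerator / denominator if denominator else 0.0

# Hypothetical frequencies; hash values "11", "12" and "15" are common to both.
file_1 = {"11": 20, "12": 5, "13": 7, "15": 80, "17": 3}
file_2 = {"11": 10, "12": 9, "14": 2, "15": 60, "16": 4}
score = weighted_similarity(file_1, file_2, k=5)  # (10 + 5 + 60) / 115
print(f"{score:.1%}")
```

Running the same function over every pair of files on a data store and taking the file with the lowest scores is one way to pick the least similar file for migration.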
FIG. 4 c illustrates a process of updating the similarity of a file by updating the sorted,unique hash values 410 andblock frequency numbers 411 without having to re-determine the entire baseline determination previously described. An alteration to a file affects the sorted,unique hash values 410 only if a data block within the selectedportion 412 is affected. If the alteration does not affect the selectedportion 412, the values used for the similarity comparison remain unchanged from the previously computed similarity comparison. For example, if a new block added to file 1 has a hash value less than the lowest hash value of the selectedportion 412, the addition of that block does not affect thenumerator 436 ofEquation 1. This avoids the need to re-determine the similarity value. - Alternatively, when a modification to a file's data blocks affects a data block within the selected
portion 412, asecond portion 414 of the sorted, unique hash values and block frequency numbers is selected to increase the accuracy of the similarity comparison after the file is modified. Similar to the first selectedportion 412, the number of data blocks (or hash values) chosen for thesecond portion 414 is based on the desired accuracy of the similarity comparison. The higher the number chosen, the greater is the accuracy but the greater is the performance cost in generating the comparison. Preferably, the size of thesecond portion 414 will be a single multiple of the k-value selected in the first selectedportion 412. - When a
new block 452 is created that has a hash value which, when sorted, is within the selected portion 412, data blocks having smaller hash values are each decremented in placement relative to the new block 452, such that a data block is pushed from the first selected portion 412 into the highest lexicographically sorted position of the second selected portion 414. Decrementing lesser hash values may have the effect of removing the lowest hash value 450 from the sorted, unique hash values 410, as shown by the element 458. Similarly, if a data block 460 within the selected portion 412 is deleted from a file and the data block had a block frequency number of one, all hash values sorted lower than the deleted data block are incremented in placement to take the space of the deleted block. This may have the effect of creating a null value entry 462 for an unused data block in the second portion 414.
- If the modification to the file adds a
data block 454 already represented in the selected portion 412, the block frequency number associated with that data block is incremented. Similarly, if the modification removes a data block 456 already represented by multiple occurrences within the selected portion 412, the block frequency number associated with that data block is decremented.
- After all modifications to the file have been made, the similarity comparison between the modified file and another file is determined based on the updated sorted, unique hash values and
Equation 1, without re-sorting or rehashing the entire list of weighted hash values 410, which reduces the time and processing required to perform and incrementally update the similarity comparison.
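The incremental update described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `IncrementalSketch`, the parameters `k` and `spare` (the sizes of the first and second portions), and the choice to treat the `k` highest hash values as the selected portion are all assumptions made for illustration. Equation 1 is not reproduced in this section, so `weighted_similarity` uses a frequency-weighted Jaccard ratio as a stand-in, which may differ from the actual equation.

```python
import bisect


class IncrementalSketch:
    """Sorted, unique hash values with block frequency numbers (hypothetical helper)."""

    def __init__(self, k, spare):
        self.k = k                  # size of the first selected portion (assumed >= 1)
        self.capacity = k + spare   # first portion plus second (buffer) portion
        self.hashes = []            # unique hash values, ascending order
        self.freq = {}              # hash value -> block frequency number

    def selected(self):
        """Return the k highest tracked hash values with their frequencies."""
        return {h: self.freq[h] for h in self.hashes[-self.k:]}

    def add_block(self, h):
        if h in self.freq:          # block already represented: bump its frequency
            self.freq[h] += 1
            return
        if len(self.hashes) >= self.capacity:
            if h < self.hashes[0]:  # sorts below everything tracked: no effect
                return
            evicted = self.hashes.pop(0)   # lowest hash value falls off the list
            del self.freq[evicted]
        bisect.insort(self.hashes, h)      # place the new hash in sorted position
        self.freq[h] = 1

    def remove_block(self, h):
        if h not in self.freq:      # block was never tracked: nothing to update
            return
        self.freq[h] -= 1
        if self.freq[h] == 0:       # last occurrence: lower entries move up,
            del self.freq[h]        # leaving a vacancy in the second portion
            self.hashes.remove(h)


def weighted_similarity(a, b):
    """Assumed stand-in for Equation 1: sum of min frequencies over sum of max."""
    keys = set(a) | set(b)
    num = sum(min(a.get(h, 0), b.get(h, 0)) for h in keys)
    den = sum(max(a.get(h, 0), b.get(h, 0)) for h in keys)
    return num / den if den else 0.0
```

In this sketch, adding a block whose hash sorts below every tracked value leaves `selected()` unchanged, so a previously computed similarity value can be reused. Deleting the last occurrence of a block in the first portion promotes an entry from the second portion without re-sorting or rehashing the file.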
FIG. 5 is a flow chart illustrating an example of the process of selecting a file to migrate from a first logical storage device to a second logical storage device. For the purpose of illustration, the steps of FIG. 5 discuss migrating the data representing a virtual machine from one storage device to another; however, any data source can be utilized. In step 502, the data blocks of each of two virtual machines are identified for use in computing a first degree of similarity between the two virtual machines. Step 504 applies a block filter rule to remove any undesirable blocks from the data blocks of the virtual machines, which increases the accuracy of the similarity comparison. In one embodiment, a block filter rule is a predefined (but alterable) set of one or more criteria to exclude (or include) a data block from the similarity comparison, based on one or more characteristics of the data block. One block filter rule, for example, can exclude free space on the virtual machines from the comparison. Another block filter rule, for example, can exclude a page-file of the virtual machine. In another embodiment, the block filter rule is user-defined such that a user of the system can identify a type of data block to exclude from (or include in) the comparison analysis. In step 506, the non-excluded data blocks are used to generate hash values that are lexicographically sorted in step 508. Step 510 includes determining a first similarity value of a first virtual machine to another virtual machine. The first similarity value associated with the first virtual machine is then compared, in step 512, to a second similarity value between the first virtual machine and a virtual machine on a separate, second storage device.
If it is determined in step 513 that the second similarity value is greater than the first similarity value (or exceeds a predetermined threshold value), the first virtual machine is migrated to the second storage device in step 514, so that a greater amount of space can be reclaimed from the first storage device. In this regard, the least similar virtual machine is identified on the first storage device, because removing that least similar virtual machine will provide the most free space on the first storage device due to data de-duplication or other storage techniques. To save additional storage space on other storage devices, the least similar virtual machine is migrated to a storage server having virtual machines most similar to the least similar virtual machine.
- The techniques introduced above can be implemented by programmable circuitry programmed or configured by software and/or firmware, or entirely by special-purpose circuitry, or in a combination of such forms. Such special-purpose circuitry (if any) can be in the form of, for example, one or more application-specific integrated circuits (ASICs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), etc.
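As one possible software form, the FIG. 5 flow described above (steps 502 to 514) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed method. The function names, the use of SHA-256 as the block hash, the all-zero test for free space, the selection of the `k` highest hash values, and the frequency-weighted Jaccard ratio standing in for Equation 1 are all hypothetical choices that the text does not specify.

```python
import hashlib
from collections import Counter


def is_free_space(block: bytes) -> bool:
    """Assumed block filter rule: treat an all-zero block as free space."""
    return not any(block)


def build_sketch(blocks, k, exclude=is_free_space):
    """Steps 504-508: filter blocks, hash them, sort, and keep k unique hashes."""
    counts = Counter(
        hashlib.sha256(b).hexdigest() for b in blocks if not exclude(b)
    )
    top = sorted(counts, reverse=True)[:k]   # lexicographic sort, highest first
    return {h: counts[h] for h in top}       # hash value -> block frequency


def weighted_similarity(a, b):
    """Assumed stand-in for Equation 1: sum of min frequencies over sum of max."""
    keys = set(a) | set(b)
    num = sum(min(a.get(h, 0), b.get(h, 0)) for h in keys)
    den = sum(max(a.get(h, 0), b.get(h, 0)) for h in keys)
    return num / den if den else 0.0


def should_migrate(first_similarity, second_similarity, threshold=None):
    """Step 513: migrate if the remote similarity is higher or exceeds a threshold."""
    if threshold is not None and second_similarity > threshold:
        return True
    return second_similarity > first_similarity


# Toy data: vm_a and vm_b share the first storage device, vm_c is on the second.
vm_a = [b"os", b"os", b"app1", bytes(4)]   # bytes(4) is a zero-filled (free) block
vm_b = [b"os", b"app2"]
vm_c = [b"os", b"os", b"app1"]

first = weighted_similarity(build_sketch(vm_a, 8), build_sketch(vm_b, 8))
second = weighted_similarity(build_sketch(vm_a, 8), build_sketch(vm_c, 8))
migrate = should_migrate(first, second)    # step 514 would run when this is True
```

Here `vm_a` shares more weighted blocks with `vm_c` than with `vm_b`, so the comparison favors moving it to the second storage device. A production system would read fixed-size blocks from disk images rather than the toy byte strings used here.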
- Software or firmware for implementing the techniques introduced here may be stored on a machine-readable storage medium and may be executed by one or more general-purpose or special-purpose programmable microprocessors. A “machine-readable medium”, as the term is used herein, includes any mechanism that can store information in a form accessible by a machine (a machine may be, for example, a computer, network device, cellular phone, personal digital assistant (PDA), manufacturing tool, any device with one or more processors, etc.). For example, a machine-accessible medium includes recordable/non-recordable media (e.g., read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; etc.), etc.
- The term “logic”, as used herein, can include, for example, special-purpose hardwired circuitry, software and/or firmware in conjunction with programmable circuitry, or a combination thereof.
- Although the present invention has been described with reference to specific exemplary embodiments, it will be recognized that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense.
Claims (35)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/972,266 US8364716B2 (en) | 2010-12-17 | 2010-12-17 | Methods and apparatus for incrementally computing similarity of data sources |
PCT/US2011/065893 WO2012083305A1 (en) | 2010-12-17 | 2011-12-19 | Methods and systems to incrementally compute similarity of data sources |
EP11848750.3A EP2652649A4 (en) | 2010-12-17 | 2011-12-19 | Methods and systems to incrementally compute similarity of data sources |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/972,266 US8364716B2 (en) | 2010-12-17 | 2010-12-17 | Methods and apparatus for incrementally computing similarity of data sources |
Publications (2)
Publication Number | Publication Date |
---|---|
US20120158709A1 true US20120158709A1 (en) | 2012-06-21 |
US8364716B2 US8364716B2 (en) | 2013-01-29 |
Family
ID=46235752
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/972,266 Active 2031-03-18 US8364716B2 (en) | 2010-12-17 | 2010-12-17 | Methods and apparatus for incrementally computing similarity of data sources |
Country Status (3)
Country | Link |
---|---|
US (1) | US8364716B2 (en) |
EP (1) | EP2652649A4 (en) |
WO (1) | WO2012083305A1 (en) |
- 2010-12-17: US application US12/972,266, patent US8364716B2, status: Active
- 2011-12-19: EP application EP11848750.3A, publication EP2652649A4, status: Not active (Withdrawn)
- 2011-12-19: WO application PCT/US2011/065893, publication WO2012083305A1, status: Active (Application Filing)
Also Published As
Publication number | Publication date |
---|---|
EP2652649A4 (en) | 2015-10-07 |
US8364716B2 (en) | 2013-01-29 |
EP2652649A1 (en) | 2013-10-23 |
WO2012083305A1 (en) | 2012-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: NETAPP, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GAONKAR, SHRAVAN;DIXIT, SAGAR;SIGNING DATES FROM 20101115 TO 20101210;REEL/FRAME:025523/0980 |
|
FEPP | Fee payment procedure |
Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FPAY | Fee payment |
Year of fee payment: 4 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |