CN101566928B - Virtual disk drive system and method - Google Patents


Info

Publication number
CN101566928B
CN101566928B, CN2009100047280A, CN200910004728A
Authority
CN
China
Prior art keywords
data
disk
time point
page
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2009100047280A
Other languages
Chinese (zh)
Other versions
CN101566928A (en)
Inventor
P. E. Soran
J. P. Guider
L. E. Aszmann
M. J. Klemm
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DELL International Ltd
Original Assignee
Compellent Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Compellent Technologies Inc filed Critical Compellent Technologies Inc
Priority claimed from US 10/918,329 (US7613945B2)
Publication of CN101566928A
Application granted
Publication of CN101566928B
Legal status: Active (current)
Anticipated expiration


Abstract

A disk drive system and method capable of dynamically allocating data is provided. The disk drive system may include a RAID subsystem having a pool of storage, for example a page pool of storage that maintains a free list of RAIDs, or a matrix of disk storage blocks that maintains a null list of RAIDs, and a disk manager having at least one disk storage system controller. The RAID subsystem and disk manager dynamically allocate data across the pool of storage and a plurality of disk drives based on RAID-to-disk mapping. The RAID subsystem and disk manager determine whether additional disk drives are required, and a notification is sent if additional disk drives are needed. Dynamic data allocation and data progression allow a user to acquire a disk drive later, at the time it is actually needed. Dynamic data allocation also enables efficient data storage of snapshots/point-in-time copies of the virtual volume pool of storage, instant data replay and instant data fusion for data backup, recovery, etc., remote data storage, and data progression.

Description

Virtual disk drive system and method
This application is a divisional application of the patent application entitled "Virtual disk drive system and method", with a filing date of August 13, 2004, international application number PCT/US2004/026499, and Chinese national application number 200480026308.8.
Technical field
The present invention relates generally to disk drive systems and methods, and more particularly to disk drive systems designed with capabilities such as dynamic data allocation and disk drive virtualization.
Background technology
Existing disk drive systems have been designed in such a way that a virtual volume of data storage space is statically associated with physical disks of specific size and location. These disk drive systems need to know and monitor/control the exact location and size of the virtual volume of data storage space in order to store data. In addition, such systems often require a larger data storage space, so that more RAID devices are added. However, these additional RAID devices are typically expensive and are not needed until the extra data storage space is actually required.
Figure 13A shows an existing disk drive system for storing, reading/writing and/or recovering data, which includes a virtual volume of data storage space statically associated with physical disks of specific size and location. The disk drive system statically allocates data based on the particular location and size of the virtual volume of the data storage space. As a result, freed data storage space goes unused, while extra and sometimes expensive data storage devices, e.g. RAID devices, are acquired in advance for storing, reading/writing and/or recovering data in the system. These extra data storage spaces may not be needed and/or used until much later.
Thus, there is a need for an improved disk drive system and method. There is also a need for an efficient, dynamic data allocation and disk drive space and time management system and method.
Summary of the invention
The present invention provides an improved disk drive system and method capable of dynamically allocating data. The disk drive system may include a RAID subsystem having a matrix of disk storage blocks and a disk manager having at least one disk storage system controller. The RAID subsystem and the disk manager dynamically allocate data across the matrix of disk storage blocks and a plurality of disk drives based on RAID-to-disk mapping. The RAID subsystem and the disk manager determine whether additional disk drives are required, and a notification is sent if additional disk drives are needed. Dynamic data allocation allows a user to acquire a disk drive later, at the time it is needed. Dynamic data allocation also enables efficient data storage of snapshots/point-in-time copies of the virtual volume matrix or pool of disk storage blocks, instant data replay and instant data fusion for data backup, recovery, etc., remote data storage, and data progression. Because less expensive disk drives can be purchased later, data progression also allows the purchase of those disk drives to be deferred.
In one embodiment, a matrix or pool of virtual volumes or disk storage blocks is provided for association with physical disks. The matrix or pool of virtual volumes or disk storage blocks is dynamically monitored/controlled by a plurality of disk storage system controllers. In one embodiment, the size of each virtual volume can be a default or user-defined, and the location of each virtual volume defaults to null. Before data is allocated, the virtual volume is empty. Data can be allocated in any grid of the matrix or pool (for example, once data is allocated in a grid, it becomes a "dot" in that grid). Once the data is deleted, the virtual volume is available again and is designated as "null". Therefore, extra and sometimes expensive disk storage devices, e.g. RAID devices, can be acquired later on an as-needed basis.
In one embodiment, the disk manager may manage a plurality of disk storage system controllers, and a plurality of redundant disk storage system controllers can be implemented to cover the failure of an operating disk storage system controller.
In one embodiment, the RAID subsystem includes a combination of at least one of the RAID types, such as RAID-0, RAID-1, RAID-5 and RAID-10. It is appreciated that other RAID types, such as RAID-3, RAID-4, RAID-6 and RAID-7, etc., can be used in alternative RAID subsystems.
The present invention also provides a dynamic data allocation method which includes the steps of: providing a default size of logical blocks or disk storage blocks such that the disk space of the RAID subsystem forms a matrix of disk storage blocks; writing data and allocating the data in the matrix of disk storage blocks; determining the occupancy rate of the disk space of the RAID subsystem based on the historical occupancy rate of the disk space of the RAID subsystem; determining whether additional disk drives are required; and sending a notification to the RAID subsystem if additional disk drives are needed. In one embodiment, the notification is sent via e-mail.
One advantage of the disk drive system of the present invention is that the RAID subsystem is capable of employing RAID techniques across a virtual number of disks. The remaining storage space is freely available. By monitoring the storage space and determining the occupancy rate of the storage space of the RAID subsystem, a user does not have to acquire a large number of drives that are expensive but not useful at the time of purchase. Thus, adding drives when they are actually needed to satisfy the increasing demand for storage space significantly reduces the overall cost of the disk drives. Meanwhile, the efficiency of disk use is substantially improved.
Another advantage of the present invention is that the disk storage system controller is universal to any computer file system, not limited to a specific computer file system.
The present invention also provides a method of data instant replay. In one embodiment, the data instant replay method includes the steps of: providing a default size of logical blocks or disk storage blocks such that the disk space of the RAID subsystem forms a page pool of storage or a matrix of disk storage blocks; automatically generating snapshots of volumes of the page pool of storage, or snapshots of the matrix of disk storage blocks, at predetermined time intervals; and storing address indexes of the snapshots or deltas of the page pool of storage or the matrix of disk storage blocks, such that the snapshots or deltas can be instantly located via the stored address indexes.
The data instant replay method automatically generates snapshots of the RAID subsystem at user-defined time intervals, user-configured dynamic time stamps (for example, every few minutes or hours, etc.), or at times directed by the server. In case of a system failure or a virus attack, these time-stamped virtual snapshots allow data instant replay and data instant recovery in a matter of a few minutes or hours, etc. The technique is also referred to as instant replay fusion, i.e. the data shortly before the crash or attack is fused in time, and the snapshots stored before the crash or attack can be instantly used for future operation.
In one embodiment, the snapshots can be stored at a local RAID subsystem or at a remote RAID subsystem so that, if the main system crashes due to, for example, a terrorist attack, the integrity of the data is not affected, and the data can be instantly recovered.
Another advantage of the data instant replay method is that the snapshots can be used for testing while the system remains in operation. Real-time data can be used for real-time testing.
The present invention also provides a system of data instant replay which includes a RAID subsystem and a disk manager having at least one disk storage system controller. In one embodiment, the RAID subsystem and the disk manager automatically allocate data across the disk space of a plurality of drives based on RAID-to-disk mapping, wherein the disk space of the RAID subsystem forms a matrix of disk storage blocks. The disk storage system controller automatically generates snapshots of the matrix of disk storage blocks at predetermined time intervals and stores address indexes of the snapshots or deltas of the matrix of disk storage blocks, such that the snapshots or deltas of the matrix of disk storage blocks can be instantly located via the stored address indexes.
In one embodiment, the disk storage system controller monitors the frequency of data use from the snapshots of the matrix of disk storage blocks and applies aging rules such that data that is used or accessed less frequently is moved to a less expensive RAID subsystem. Similarly, when data stored in a less expensive RAID subsystem begins to be used more frequently, the controller moves that data to a more expensive RAID subsystem. The user can therefore choose the desired combination of RAID subsystems to meet his or her own storage needs. The cost of the disk drive system can thereby be significantly reduced and dynamically controlled by the user.
These and other features and advantages of the present invention will be apparent to those skilled in the art from the following detailed description, which shows and describes illustrative embodiments of the present invention, including the best mode contemplated for carrying out the present invention. As will be realized, the invention is capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not restrictive.
Description of drawings
Fig. 1 shows one embodiment of a disk drive system in a computer environment according to the principles of the present invention.
Fig. 2 shows one embodiment of dynamic data allocation with a page pool of storage of the RAID subsystem for the disk drives, according to the principles of the present invention.
Fig. 2A shows conventional data allocation in a RAID subsystem of a disk drive system.
Fig. 2B shows data allocation in a RAID subsystem of a disk drive system according to the principles of the present invention.
Fig. 2C shows a dynamic data allocation method according to the principles of the present invention.
Figs. 3A and 3B are schematic views of snapshots of disk storage blocks of a RAID subsystem at a plurality of time intervals, according to the principles of the present invention.
Fig. 4 is a schematic view of a data instant fusion function using snapshots of disk storage blocks of a RAID subsystem, according to the principles of the present invention.
Fig. 5 is a schematic view of local-remote data replication and instant replay using snapshots of disk storage blocks of a RAID subsystem, according to the principles of the present invention.
Fig. 6 is a schematic view of a snapshot volume that uses the same RAID interface to perform I/O and concatenates a plurality of RAID devices, according to the principles of the present invention.
Fig. 7 shows one embodiment of a snapshot structure according to the principles of the present invention.
Fig. 8 shows one embodiment of a PITC life cycle according to the principles of the present invention.
Fig. 9 shows one embodiment of a PITC table structure with a multi-level index, according to the principles of the present invention.
Fig. 10 shows one embodiment of recovery of a PITC table according to the principles of the present invention.
Fig. 11 shows one embodiment of a write process with an owned-page sequence and a non-owned-page sequence, according to the principles of the present invention.
Fig. 12 shows exemplary snapshot operations according to the principles of the present invention.
Fig. 13A shows an existing disk drive system containing a virtual data storage space statically associated with physical disks of specific size and location for statically allocating data.
Fig. 13B shows the volume-logical block mapping in the existing disk drive system of Fig. 13A.
Fig. 14A shows one embodiment of a disk drive system containing a matrix of disk storage block virtual volumes for dynamically allocating data in the system, according to the principles of the present invention.
Fig. 14B shows one embodiment of dynamic data allocation in the matrix of disk storage block virtual volumes shown in Fig. 14A.
Fig. 14C shows a schematic view of volume-to-RAID page remapping of one embodiment of a virtual volume page pool of storage, according to the principles of the present invention.
Fig. 15 shows an example of three disk drives mapped to a plurality of disk storage blocks of a RAID subsystem, according to the principles of the present invention.
Fig. 16 shows an example of remapping of the disk drive storage blocks after a disk drive is added to the three disk drives shown in Fig. 15.
Fig. 17 shows one embodiment of accessible data pages in a data progression operation according to the principles of the present invention.
Fig. 18 shows a flow diagram of one embodiment of a data progression operation according to the principles of the present invention.
Fig. 19 shows one embodiment of a compressed page layout according to the principles of the present invention.
Fig. 20 shows one embodiment of data progression in a high-end disk drive system according to the principles of the present invention.
Fig. 21 shows one embodiment of external data flow in the subsystem according to the principles of the present invention.
Fig. 22 shows one embodiment of internal data flow in the subsystem.
Fig. 23 shows one embodiment in which each subsystem independently maintains coherency.
Fig. 24 shows one embodiment of mixed RAID waterfall data progression according to the principles of the present invention.
Fig. 25 shows one embodiment of multiple free lists of a page pool of storage, according to the principles of the present invention.
Fig. 26 shows one embodiment of a database example according to the principles of the present invention.
Fig. 27 shows one embodiment of an MRI image example according to the principles of the present invention.
Detailed description
The present invention provides an improved disk drive system and method capable of dynamically allocating data. The disk drive system may include a RAID subsystem having a page pool of storage that maintains a free list of RAIDs, or a matrix of disk storage blocks, and a disk manager having at least one disk storage system controller. The RAID subsystem and the disk manager dynamically allocate data across the page pool of storage or the matrix of disk storage blocks and a plurality of disk drives based on RAID-to-disk mapping. The RAID subsystem and the disk manager determine whether additional disk drives are required, and a notification is sent if additional disk drives are needed. Dynamic data allocation allows a user to acquire a disk drive later, at the time it is needed. Dynamic data allocation also enables efficient data storage of snapshots/point-in-time copies of the virtual volume matrix or pool of disk storage blocks, instant data replay and instant data fusion for data backup, recovery, etc., remote data storage, and data progression. Because less expensive disk drives can be purchased later, data progression also allows the purchase of those disk drives to be deferred.
Fig. 1 shows one embodiment of a disk drive system 100 in a computer environment 102 according to the principles of the present invention. As shown in Fig. 1, the disk drive system 100 includes a RAID subsystem 104 and a disk manager 106 having at least one disk storage system controller (Figure 16). The RAID subsystem 104 and the disk manager 106 dynamically allocate data across the disk space of a plurality of disk drives 108 based on RAID-to-disk mapping. In addition, the RAID subsystem 104 and the disk manager 106 are able to determine whether additional disk drives are required based on the data allocation across the disk space. If additional disk drives are needed, a notification is sent to the user so that additional disk space can be added if desired.
According to the principles of the present invention, one embodiment of the disk drive system 100 having dynamic data allocation (alternatively referred to as "disk drive virtualization") is illustrated in Fig. 2, and another embodiment of the system is shown in Figures 14A and 14B. As shown in Fig. 2, the disk storage system 110 includes a page pool of storage 112, i.e. a pool of data storage including a list of data storage space that is free to store data. The page pool 112 maintains a free list of RAID devices 114 and manages read/write assignments based on user requests. Disk storage volumes 116 requested by a user are sent to the page pool 112 to obtain storage space. Each volume can request the same or different classes of storage devices with the same or different RAID levels (for example, RAID 10, RAID 5, RAID 0, etc.).
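As a rough illustration of the page-pool idea described above, the following Python sketch shows a pool that hands out fixed-size pages from a free list only when a volume actually writes, and returns them when data is deleted. The class and method names are illustrative assumptions, not the patented implementation.

# Minimal sketch of a storage page pool with a simple free list (illustrative names).
class PagePool:
    def __init__(self, page_count):
        self.free_list = list(range(page_count))   # indices of unused pages

    def allocate_page(self):
        if not self.free_list:
            raise RuntimeError("pool exhausted - add disk drives / RAID devices")
        return self.free_list.pop()

    def release_page(self, page_id):
        self.free_list.append(page_id)             # page becomes "null"/free again


class Volume:
    """A virtual volume whose pages are assigned only on first write."""
    def __init__(self, pool):
        self.pool = pool
        self.pages = {}                            # volume page index -> pool page id

    def write(self, vol_page):
        if vol_page not in self.pages:             # allocate on demand, not up front
            self.pages[vol_page] = self.pool.allocate_page()
        return self.pages[vol_page]
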
Another embodiment of the dynamic data allocation of the present invention is shown in Figures 14A and 14B, in which, according to the principles of the present invention, a disk storage system 1400 contains a plurality of disk storage system controllers 1402 and a matrix of disk storage blocks 1404 controlled by the plurality of disk storage system controllers 1402 for dynamically allocating data in the system. A matrix of virtual volumes or blocks 1404 is provided for association with physical disks. The matrix of virtual volumes or blocks 1404 is dynamically monitored/controlled by the plurality of disk storage system controllers 1402. In one embodiment, the size of each virtual volume 1404 can be predefined, for example 2 Megabytes, and the location of each virtual volume 1404 defaults to null. Before data is allocated, each of the virtual volumes 1404 is null. Data can be allocated in any grid of the matrix or pool (for example, once data is allocated in a grid, it becomes a "dot" in that grid). Once the data is deleted, the virtual volume 1404 is available again and is designated as "null". Therefore, extra and sometimes expensive disk storage devices, e.g. RAID devices, can be acquired later on an as-needed basis.
Accordingly, the RAID subsystem is capable of employing RAID techniques across a virtual number of disks. The remaining storage space is freely available. By monitoring the storage space and determining the occupancy rate of the storage space of the RAID subsystem, a user does not have to acquire a large number of drives that are expensive but not useful at the time of purchase. Thus, adding drives when they are actually needed to satisfy the increasing demand for storage space significantly reduces the overall cost of the disk drives. Meanwhile, the efficiency of disk use is substantially improved.
Also, the dynamic data allocation of the disk drive system of the present invention allows efficient data storage of snapshots/point-in-time copies of the virtual volume page pool of storage or the virtual volume matrix of disk storage blocks, instant data replay and instant data fusion for data recovery and remote data storage, and data progression.
The above features and advantages, realized by the dynamic data allocation system and method in the disk drive system 100, are discussed in detail below.
Dynamic data allocation
Fig. 2A shows conventional data allocation in a RAID subsystem of a disk drive system, where freed data storage space is captive and cannot be allocated for data storage.
Fig. 2B shows data allocation in a RAID subsystem of a disk drive system according to the principles of the present invention, where the freed data storage available for data storage is mixed to form a page pool, for example a single page pool in one embodiment of the present invention.
Fig. 2C shows a dynamic data allocation method 200 according to the principles of the present invention. The dynamic data allocation method 200 includes a step 202 of defining a default size of logical blocks or disk storage blocks such that the disk space of the RAID subsystem forms a matrix of disk storage blocks, and a step 204 of writing data and allocating the data in disk storage blocks of the matrix that are designated as "null". The method also includes a step 206 of determining the occupancy rate of the disk space of the RAID subsystem based on the historical occupancy rate of the disk space of the RAID subsystem, and a step 208 of determining whether additional disk drives are required and, if so, sending a notification to the RAID subsystem. In one embodiment, the notification is sent via e-mail. In addition, the size of the disk storage blocks can be set as a default or changed by the user.
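A hedged sketch of the monitoring-and-notification steps 206 and 208 of method 200 might look like the following; the threshold, the history window, and the SMTP addresses are assumptions made for illustration, not details taken from the patent.

# Sketch: decide from historical occupancy whether more disk drives are needed,
# and send an e-mail notification (SMTP details are placeholders).
import smtplib
from email.message import EmailMessage

def check_capacity(occupancy_history, threshold=0.85):
    """occupancy_history: fractions of pool space used, oldest first."""
    if not occupancy_history:
        return False
    recent = occupancy_history[-3:]                 # look at the most recent samples
    return sum(recent) / len(recent) > threshold

def notify_admin(smtp_host, admin_addr):
    msg = EmailMessage()
    msg["Subject"] = "RAID subsystem: additional disk drives required"
    msg["From"] = "storage-controller@example.com"
    msg["To"] = admin_addr
    msg.set_content("Projected pool occupancy exceeds the configured threshold.")
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
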
In one embodiment, dynamic data allocation, also referred to as "virtualization" or "disk space virtualization", efficiently handles a large number of read and write requests per second. The architecture may require the interrupt handlers to call the cache subsystem directly. Because dynamic data allocation does not queue requests, it may not optimize requests, but it can have a large number of requests outstanding at a time.
Dynamic data allocation also maintains data integrity and protects the content of the data against any controller failure. To this end, dynamic data allocation writes state information to the RAID devices for reliable storage.
Dynamic data allocation also maintains the order of read and write requests, and completes read or write requests in the exact order in which the requests were received. Dynamic data allocation allows maximum system availability and supports remote replication of data to a different geographic location.
In addition, dynamic data allocation provides the ability to recover from data errors. Through snapshots, the user can view the state of a disk in the past.
Dynamic data allocation manages RAID devices and provides a storage abstraction to create and expand large devices.
Dynamic data allocation presents virtual disk devices to the servers; these devices are called volumes. To a server, a volume behaves the same as a disk drive. It may return different information for a serial number, but a volume essentially works like a disk drive. A volume provides a storage abstraction over multiple RAID devices in order to create a larger, dynamic volume device. A volume comprises a plurality of RAID devices, allowing efficient use of disk space.
Figure 13B shows an existing volume-logical block mapping. Figure 14C shows the volume-to-RAID page remapping of one embodiment of the virtual volume page pool of storage according to the principles of the present invention. Each volume is divided into a set of pages, for example 1, 2, 3, etc., and each RAID device is divided into a set of pages. In one embodiment, the volume page size may be the same as the RAID page size. Thus, one example of the volume-to-RAID page mapping of the present invention is that page #1 of the volume is mapped to RAID page #1 of RAID device 2.
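The volume-to-RAID page remapping of Fig. 14C can be pictured with a small lookup sketch like the one below; the 2 MB page size, the device names, and the example map are illustrative assumptions only.

# Sketch: translate a volume LBA into (RAID device, RAID LBA) through a per-volume page map.
PAGE_SIZE_SECTORS = 4096          # 2 MB pages of 512-byte sectors (assumption)

# volume page index -> (RAID device id, RAID page index); illustrative values
page_map = {0: ("RAID-1", 4), 1: ("RAID-2", 1), 2: ("RAID-5", 3)}

def remap(volume_lba):
    vol_page, offset = divmod(volume_lba, PAGE_SIZE_SECTORS)
    device, raid_page = page_map[vol_page]
    raid_lba = raid_page * PAGE_SIZE_SECTORS + offset
    return device, raid_lba

print(remap(5000))                # volume page 1 -> ('RAID-2', 5000)
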
Dynamic data allocation maintains the data integrity of the volumes. Data written to a volume is acknowledged to the server. Data integrity covers various controller configurations, including stand-alone controllers and redundancy through controller failover. Controller failures include power failures, power cycles, software exceptions, and hard resets. Dynamic data allocation generally does not handle disk drive failures, which are covered by RAID.
Dynamic data allocation provides the controller with the highest level of data abstraction. It accepts requests from the front end and ultimately uses the RAID devices to write the data to disk.
Dynamic data allocation comprises a number of internal subsystems:
Caching - provides smooth read and write operations by giving fast response times to the server and by bundling writes to the volume before passing them to the data plug-in.
Configuration - includes methods for creating, deleting, retrieving and modifying data allocation objects. Provides components used as a tool box for building higher-level system applications.
Data plug-in - distributes volume read and write requests to the various subsystems, depending on the volume configuration.
RAID interface - provides the RAID device abstraction to the user and to the other dynamic data allocation subsystems in order to create larger volumes.
Copy/mirror/swap - copies volume data to local and remote volumes. In one embodiment, it may copy only the blocks written by the server.
Snapshot - provides incremental volume recovery of data. It enables the immediate creation of view volumes (ViewVolume) of past volume states.
Proxy volume - implements request communication to a remote destination volume in support of remote replication.
Billing - charges the user for the storage, activity, performance, and data recovery requests of the data that is allocated.
Dynamic data allocation also records any errors and significant configuration changes in a log.
Figure 21 shows one embodiment of the external data flow in the subsystem. External requests come from the front end. The requests include retrieving volume information, reads and writes. All requests contain a volume ID. Volume information is handled by the volume configuration subsystem. Read and write requests include an LBA. Write requests also include the data.
Depending on the volume configuration, dynamic data allocation passes requests to a number of external layers. Remote replication passes requests to the front end, the destination being a remote destination volume. The RAID interface passes requests to RAID. Copy/mirror/swap passes requests back to dynamic data allocation, directed to the destination volume.
Figure 22 shows one embodiment of the internal data flow in the subsystem. The internal data flow starts with the cache. The cache may place a write request in the cache or pass the request directly to the data plug-in. The cache supports direct DMA from the front-end HBA devices. Requests can be completed quickly, with a response returned to the server. Below the cache, the data plug-in manager is the center of request flow. For each volume, it calls the subsystem objects registered for each request.
Dynamic data allocation subsystems that affect data integrity may require controller coherency support. As shown in Figure 23, each subsystem maintains coherency independently. Coherency updates avoid copying blocks across the coherency link. Cache coherency may require data to be copied to the peer controller.
The disk storage system controller
Figure 14A shows, according to the principles of the present invention, a disk storage system 1400 containing a plurality of disk storage system controllers 1402 and a matrix of disk storage blocks or virtual volumes 1404 controlled by the plurality of disk storage system controllers 1402 for dynamically allocating data in the system. Figure 14B shows one embodiment of dynamic data allocation in the matrix of disk storage blocks or virtual volumes 1404.
In one operation, the disk storage system 1400 automatically generates snapshots of the matrix of disk storage blocks or virtual volumes 1404 at predetermined time intervals and stores the address indexes of the snapshots or deltas of the matrix of disk storage blocks or virtual volumes 1404, such that the snapshots or deltas of the matrix of disk storage blocks or virtual volumes 1404 can be instantly located via the stored address indexes.
In another operation, the disk storage system controllers 1402 monitor the frequency of data use from the snapshots of the matrix of disk storage blocks 1404 and apply aging rules such that data that is used or accessed less frequently is moved to a less expensive RAID subsystem. Similarly, when data stored in a less expensive RAID subsystem begins to be used more frequently, the controller moves that data to a more expensive RAID subsystem. The user can therefore choose the desired combination of RAID subsystems to meet his or her own storage needs. The cost of the disk drive system can thereby be significantly reduced and dynamically controlled by the user.
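The aging behaviour described above can be sketched as a simple rule that moves pages between storage classes based on recent access counts; the tier names and thresholds below are assumptions made for the example, not values from the patent.

# Sketch: move pages between an expensive and a less expensive RAID class
# according to how often they have been accessed recently.
def choose_tier(accesses_last_week, hot_threshold=100, cold_threshold=5):
    if accesses_last_week >= hot_threshold:
        return "RAID-10 on fast disks"        # costly, high-performance storage
    if accesses_last_week <= cold_threshold:
        return "RAID-5 on inexpensive disks"  # cheaper, lower-performance storage
    return None                               # leave the page where it is

def progress(pages):
    """pages: dict of page id -> (current tier, recent access count)."""
    moves = {}
    for page_id, (tier, hits) in pages.items():
        target = choose_tier(hits)
        if target and target != tier:
            moves[page_id] = target           # schedule a background copy/move
    return moves
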
RAID-to-disk mapping
The RAID subsystem and the disk manager dynamically allocate data across the disk space of a plurality of disk drives based on RAID-to-disk mapping. In one embodiment, the RAID subsystem and the disk manager determine whether additional disk drives are required, and a notification is sent if additional disk drives are needed.
Figure 15 shows an example of three disk drives 108 (Fig. 1) mapped to a plurality of disk storage blocks 1502-1512 in a RAID-5 subsystem 1500, according to the principles of the present invention.
Figure 16 shows an example of remapping 1600 of the disk drive storage blocks after a disk drive 1602 is added to the three disk drives shown in Figure 15.
Disk manager
As shown in Fig. 1, the disk manager 106 generally manages disks and disk arrays, including grouping/resource pooling, abstraction of disk attributes, formatting, addition/subtraction of disks, and tracking of disk service times and error rates. The disk manager 106 does not distinguish between the differences of various disk models and presents a generic storage device to the RAID component. The disk manager 106 also provides grouping capability, which facilitates the construction of RAID groups with specific characteristics such as 10,000 RPM disks, etc.
In one embodiment of the invention, the disk manager 106 has at least three layers: abstraction, configuration, and I/O optimization. The disk manager 106 presents "disks" to the upper layers; these can be, for example, locally or remotely attached physical disk drives, or remotely attached disk systems.
The common underlying characteristic is that any of these devices can be the target of I/O operations. The abstraction service provides a uniform data path interface for the upper layers, particularly the RAID subsystem, and provides a generic mechanism for the administrator to manage target devices.
The disk manager 106 of the present invention also provides grouping capability to simplify administration and configuration. Disks can be named and placed into groups, which can also be named. Grouping is a powerful feature that simplifies tasks such as migrating volumes from one group of disks to another, dedicating a group of disks to a specific function, or designating a group of disks as spares.
The disk manager also interfaces with devices, such as a SCSI device subsystem, that are responsible for detecting the presence of external devices. The SCSI device subsystem is able, at least for fibre channel/SCSI type devices, to determine a subset of devices that are block-type target devices. It is these devices that the disk manager manages and abstracts.
In addition, the disk manager is responsible for responding to flow control from the SCSI device layer. The disk manager has queuing capabilities, which provides an opportunity to aggregate I/O requests as a method of optimizing the throughput of the disk drive system.
Furthermore, the disk manager of the present invention manages a plurality of disk storage system controllers. Also, a plurality of redundant disk storage system controllers can be implemented to cover the failure of an operating disk storage system controller. The redundant disk storage system controllers are also managed by the disk manager.
Relationship of the disk manager to other subsystems
The disk manager interacts with several other subsystems. The RAID subsystem is the major client of the services provided by the disk manager for data path activities. The RAID subsystem uses the disk manager as the exclusive path to the disks for I/O. The RAID system also listens for events from the disk manager to determine the presence and operational status of disks. The RAID subsystem also works with the disk manager to allocate extents for the construction of RAID devices. Management control listens for disk events to learn of the presence of disks and of changes to their operational status. In one embodiment of the invention, the RAID subsystem 104 may include a combination of at least one RAID type, such as RAID-0, RAID-1, RAID-5 and RAID-10. It is appreciated that other RAID types, such as RAID-3, RAID-4, RAID-6 and RAID-7, etc., can be used in alternative RAID subsystems.
In one embodiment of the invention, the disk manager utilizes the configuration access service to store persistent configuration and to present transient read-only information, such as statistics, to the presentation layer. The disk manager registers handlers with the configuration access service to access these parameters.
The disk manager also utilizes the services of the SCSI device layer to learn of the existence and operational status of block devices, and owns the I/O path to these block devices. The disk manager queries the SCSI device subsystem for devices as a supporting method of uniquely identifying disks.
Data instant replay and data instant fusion
The present invention also provides methods of data instant replay and data instant fusion. Figures 3A and 3B are schematic views of snapshots of disk storage blocks of a RAID subsystem at a plurality of time intervals, according to the principles of the present invention. Fig. 3C shows a data instant replay method 300, which includes a step 302 of defining a default size of logical blocks or disk storage blocks such that the disk space of the RAID subsystem forms a page pool of storage or a matrix of disk storage blocks; a step 304 of automatically generating snapshots of volumes of the page pool, or snapshots of the matrix of disk storage blocks, at predetermined time intervals; and a step 306 of storing address indexes of the snapshots or deltas of the page pool of storage or the matrix of disk storage blocks, such that the snapshots or deltas of the matrix of disk storage blocks can be instantly located via the stored address indexes.
As shown in Fig. 3B, snapshots of the page pool of storage or of the matrix of disk storage blocks are automatically generated at each predetermined time interval, for example every 5 minutes, such as at T1 (12:00PM), T2 (12:05PM), T3 (12:10PM) and T4 (12:15PM). The snapshots or deltas of the page pool of storage or of the matrix of disk storage blocks are stored in the page pool of storage or the matrix of disk storage blocks, such that the snapshots or deltas can be instantly located via the stored address indexes.
Thus, the data instant replay method automatically generates snapshots of the RAID subsystem at user-defined time intervals, user-configured dynamic time stamps (for example, every few minutes or hours, etc.), or at times directed by the server. In case of a system failure or a virus attack, these time-stamped virtual snapshots allow data instant replay and data instant recovery in a matter of a few minutes or hours, etc. The technique is also referred to as instant replay fusion, i.e. the data shortly before the crash or attack is fused in time, and the snapshots stored before the crash or attack can be instantly used for future operation.
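To make the timing concrete, a minimal scheduler for the automatic point-in-time copies could look like the sketch below; the five-minute interval, the storage of a whole index copy, and the function names mirror the T1-T4 example above but are assumptions for illustration.

# Sketch: take a snapshot of the page-pool index at a fixed interval and keep
# a timestamped address index so any point in time can be located instantly.
import time

snapshots = []                                 # list of (timestamp, page index copy)

def take_snapshot(page_map):
    snapshots.append((time.time(), dict(page_map)))   # store only the index/deltas

def replay(at_time):
    """Return the most recent index at or before the requested time."""
    candidates = [s for s in snapshots if s[0] <= at_time]
    return candidates[-1][1] if candidates else None

def run(page_map, interval_seconds=300, iterations=4):
    for _ in range(iterations):                # e.g. T1..T4, five minutes apart
        take_snapshot(page_map)
        time.sleep(interval_seconds)
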
Fig. 4 shows a schematic view of a data instant fusion function 400 using multiple snapshots of disk storage blocks of a RAID subsystem, according to the principles of the present invention. At T3, a parallel chain of snapshots T3'-T5' is generated, whereby the fused and/or recovered data T3' can be used in place of the data fused at T4. Similarly, multiple parallel chains of snapshots T3'', T4''' can be generated to replace the data fused at T4'-T5' and at T4''-T5''. In an alternative embodiment, the snapshots at T4, T4'-T5', T5'' can still be stored in the page pool or the matrix.
The snapshots can be stored at a local RAID subsystem or at a remote RAID subsystem so that, if the main system crashes due to, for example, a terrorist attack, the integrity of the data is not affected, and the data can be instantly recovered. Fig. 5 shows a schematic view of local-remote data replication and an instant recovery function 500 using snapshots of disk storage blocks of a RAID subsystem, according to the principles of the present invention.
Remote replication performs the service of copying volume data to a remote system. It attempts to keep the local and remote volumes as closely synchronized as possible. In one embodiment, the data on the remote volume may not be a perfect copy of the data on the local volume. Network connectivity and performance may cause the remote volume to be out of synchronization with the local volume.
Another feature of the data instant replay and data instant fusion methods is that the snapshots can be used for testing while the system still maintains its operation. Real-time data can be used for real-time testing.
Snapshot and point-in-time copy (PITC)
According to the principles of the present invention, one example of data instant replay utilizes snapshots of disk storage blocks of the RAID subsystem. A snapshot records write operations to a volume so that a view volume can be created to see the contents of the volume in the past. A snapshot therefore also supports data recovery by creating views of a previous point-in-time copy (PITC) of the volume.
The snapshot core implements the creation, coalescing, management and I/O operations of snapshots. Snapshot monitors writes to a volume and provides point-in-time copies (PITCs) that are accessed through view volumes. It adds a logical block address (LBA) remapping layer to the data path within the virtualization layer. This is another layer of virtual LBA mapping within the I/O path. A PITC does not copy all volume information; it only modifies the remapping tables it uses.
Snapshot tracks changes to the volume data and provides the ability to view the volume data from a previous point in time. Snapshot performs this function by maintaining a list of delta writes for each PITC.
Snapshot provides multiple methods for creating a PITC profile, including application-initiated and time-initiated. Snapshot provides the application with the ability to create a PITC. The application controls creation through an API on the server, and the creation request is passed to the snapshot API. Likewise, snapshot provides the ability to create a time profile.
Snapshot does not implement a journaling system and does not recover all writes to the volume. Snapshot only saves the last write to an individual address within a PITC window. Snapshot allows the user to create PITCs covering user-defined short-term periods, such as a few minutes or hours, etc. To handle failures, snapshot writes all of its information to disk. Snapshot maintains the volume data page pointers containing the delta writes. Because the table provides the mapping to the volume data, and the volume data is inaccessible without it, the table data must handle controller failure conditions.
The view volume function provides access to a PITC. The view volume function may attach to any PITC within a volume, other than the active PITC. Attaching to a PITC is a relatively quick operation. Uses of the view volume function include testing, training, backup, and recovery. The view volume function allows write operations without modifying the underlying PITC on which it is based.
In one embodiment, snapshot is designed to optimize performance and ease of use at the expense of disk space:
Snapshot provides fast response to user requests. User requests include I/O operations, creating a PITC, and creating/deleting view volumes. To this end, snapshot uses more disk space than the minimum needed to store table information. For I/O, snapshot summarizes the current state of a volume in a single table, so that all read and write requests can be satisfied by a single table. Snapshot reduces the impact on normal I/O operations as much as possible. Second, for view volume operations, snapshot uses the same table mechanism as the main volume data path.
Snapshot minimizes the amount of data copied. To this end, snapshot maintains a table of pointers for each PITC. Snapshot copies and moves pointers, but it does not move the data on the volume.
Snapshot uses fixed-size data pages to manage the volume. Tracking individual sectors would require a large amount of memory for even a single reasonably sized volume. Because the data pages are larger than a sector, some pages may contain a certain percentage of information duplicated directly from another page.
Snapshot uses data space on the volume to store the data page tables. The lookup tables are regenerated after a controller failure. The lookup tables allocate pages and further sub-divide them.
In one embodiment, snapshot handles controller failures by requiring that volumes using snapshot operate on a single controller. This embodiment requires no coherency. All changes to the volume are recorded on disk, or to reliable cache, for recovery by a replacement controller. In one embodiment, recovery from a controller failure requires reading the snapshot information from disk.
Snapshot uses the virtual RAID interface to access storage. Snapshot can use multiple RAID devices as a single data space.
Snapshot supports 'n' PITCs per volume and 'm' views per volume. The limits on 'n' and 'm' are a function of disk space and controller memory.
Volume and volume allocation/layout
Snapshot adds an LBA remapping layer to the volume. The remapping uses the I/O request LBA and a lookup table to convert the address into a data page. As shown in Figure 6, a volume presented using snapshot behaves the same as a volume without snapshot. It has a linear LBA space and handles I/O requests. Snapshot uses the RAID interface to perform I/O, and a plurality of RAID devices are included in the volume. In one embodiment, the size of the RAID devices of a snapshot volume is not the size of the volume presented. The RAID devices allow snapshot to expand the space for data pages within the volume.
A new volume with snapshot enabled from the start only needs to include space for new data pages. Snapshot does not create a page list for placement in the bottom PITC; in this case, the bottom PITC is empty. At allocation time, all PITC pages are on the free list. By creating a volume with snapshot enabled at the beginning, it can allocate less physical space than the volume presents. Snapshot then tracks writes to the volume. In one embodiment of the invention, the NULL volume is not copied and/or stored in the page pool or matrix, thereby improving the use efficiency of the storage space.
In one embodiment, for both allocation schemes, the PITC places a virtual NULL volume at the bottom of the list. Reads to the NULL volume return blocks of zeros. The NULL volume handles sectors that have not previously been written by the server. Writes to the NULL volume cannot occur. Volumes use the NULL volume for reads of unwritten sectors.
The number of free pages depends on the size of the volume, the number of PITCs, and the expected rate of data changes. The system determines the number of pages to allocate for a given volume. The number of data pages can expand over time. Expansion can support faster-than-expected data changes, more PITCs, or a larger volume. New pages are added to the free list. Adding pages to the free list can occur automatically.
Snapshot uses data pages to manage the volume space. Each data page may contain several megabytes of data. Operating systems in use tend to write multiple sectors within the same region of a volume. Memory requirements also dictate that snapshot use pages to manage the volume: maintaining a single 32-bit pointer for each sector of a 1-terabyte volume would require 8 gigabytes of RAM. Different volumes can have different page sizes.
Fig. 7 shows one embodiment of the snapshot structure. Snapshot adds a number of objects to the volume structure. The objects include the PITCs, a pointer to the active PITC, the data page free list, child view volumes, and a PITC coalesce object.
The active PITC (AP) pointer is maintained by the volume. The AP handles the mapping of read and write requests to the volume. The AP contains a summary of the current location of all the data within the volume.
The data page free list tracks the available pages on the volume.
Optional child view volumes provide access to the volume's PITCs. The view volumes contain their own AP to record writes to the PITC while not modifying the underlying data. A volume can support multiple child view volumes.
The snapshot coalesce object temporarily links two PITCs for the purpose of removing the previous PITC. Coalescing PITCs involves moving data page ownership and freeing data pages.
A PITC contains the table and the data pages for the pages written while the PITC was active. The PITC contains a freeze time stamp, at which point the PITC stops accepting write requests. The PITC also contains a time-to-live value that determines when the PITC will be coalesced.
Also, by summarizing the data page pointers for the entire volume at the moment the PITC is taken, snapshot provides predictable read and write performance. Other solutions may require reading and checking multiple PITCs to find the most recent pointer. Such solutions require table caching algorithms but have poor worst-case performance.
The snapshot summary of the present invention also reduces the worst-case memory usage of the tables. It may require loading an entire table into memory, but it may only require loading a single table.
The summary includes the pages owned by the current PITC, and may include pages from all previous PITCs. To determine which pages the PITC may write, it tracks page ownership for each data page. It also tracks ownership for coalesce operations. To this end, the data page pointer includes a page index.
Fig. 8 shows one embodiment of the PITC life cycle. Each PITC goes through the following states before being committed as read-only (a simple state-machine sketch follows the list below):
1. Create table - the table is created when the PITC is created.
2. Commit to disk - this allocates the storage for the PITC on disk. By writing the table at this point, it guarantees that the space required to store the table information is allocated before the PITC is taken. At the same time, the PITC object is also committed to disk.
3. Accept I/O - the PITC becomes the active PITC (AP) and now handles read and write requests for the volume. This is the only state in which the table accepts write requests. The PITC generates an event indicating that it is now active.
4. Commit table to disk - the PITC is no longer the AP and no longer accepts additional pages. A new AP has taken over. At this point, the table will never change again unless it is removed during a coalesce operation; it is read-only. At this point, the PITC generates an event indicating that it has been frozen and committed. Any service can listen for this event.
5. Release table memory - the memory required by the table is freed. This step also clears the log, since all changes have been written to disk.
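One way to picture this life cycle is as a small state machine; the enum below is only a sketch of the five states listed above, not the controller's actual data structures.

# Sketch: the PITC life-cycle states as a simple state machine.
from enum import Enum, auto

class PitcState(Enum):
    CREATE_TABLE    = auto()   # table created in memory
    COMMIT_TO_DISK  = auto()   # table space reserved, PITC object written to disk
    ACCEPT_IO       = auto()   # this PITC is the active PITC (AP)
    TABLE_COMMITTED = auto()   # frozen, read-only, a new AP has taken over
    RELEASE_MEMORY  = auto()   # table memory freed, log cleared

_NEXT = {
    PitcState.CREATE_TABLE:    PitcState.COMMIT_TO_DISK,
    PitcState.COMMIT_TO_DISK:  PitcState.ACCEPT_IO,
    PitcState.ACCEPT_IO:       PitcState.TABLE_COMMITTED,
    PitcState.TABLE_COMMITTED: PitcState.RELEASE_MEMORY,
}

def advance(state):
    return _NEXT.get(state)    # None once the PITC is fully retired
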
The top-level PITC of a volume or view volume is called the active PITC (AP). The AP satisfies all read and write requests to the volume. The AP is the only PITC on the volume that may accept write requests. The AP contains a summary of the data page pointers for the entire volume.
For a coalesce operation, the AP can be the destination, but not the source. As the destination, the AP increases the number of pages it owns, but it does not change the view of the data.
When a volume is expanded, the AP grows with the volume immediately. The new pages point to the NULL volume. Non-AP PITCs do not need to be modified for volume expansion.
Each PITC maintains a table that maps an incoming LBA to a data page pointer into the underlying volume. The table contains pointers to data pages. The table addresses more logical space than the physical disk space it currently presents. Fig. 9 shows one embodiment of the table structure with a multi-level index. The structure decodes a volume LBA into a data page pointer. As shown in Figure 9, each level decodes progressively lower bits of the address. This table structure allows fast lookup and provides the ability to expand the volume. For fast lookup, the multi-level index structure keeps the table shallow, with a number of entries at each level. The index performs an array lookup at each level. To support volume expansion, the multi-level index structure allows additional layers to be added. In all cases, a volume expansion is an expansion of the LBA count presented to the upper level, not an expansion of the actual amount of storage space allocated for the volume.
The multi-level index contains a summary of the remapping of the data pages of the entire volume. Each PITC contains a complete remapping list of the volume at the point in time the PITC was committed.
Each layer of the multi-level index structure uses a different entry type. The different entry types support the requirements of reading information from disk and of storing information in memory. The bottom-level entries may contain only data page pointers. The top-level and mid-level entries contain two arrays: one for the LBAs of the next-level table entries, and another of memory pointers to the next level.
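As an illustration of the multi-level decode, the sketch below splits an LBA into per-level indices and resolves them through a two-level table; the number of levels and the bit widths are assumptions made only for the example.

# Sketch: decode a volume LBA into (top index, leaf index, offset) for a
# two-level page table; field widths are illustrative only.
PAGE_BITS = 12                   # 4096 sectors per data page (assumption)
LEAF_BITS = 10                   # 1024 leaf entries per top-level entry

def decode(lba):
    offset = lba & ((1 << PAGE_BITS) - 1)
    page   = lba >> PAGE_BITS
    leaf   = page & ((1 << LEAF_BITS) - 1)
    top    = page >> LEAF_BITS
    return top, leaf, offset

def lookup(table, lba):
    """table[top][leaf] holds a data-page pointer, or None for the NULL volume."""
    top, leaf, offset = decode(lba)
    page_ptr = table[top][leaf]
    if page_ptr is None:
        return None                                # unwritten: served by the NULL volume
    return page_ptr * (1 << PAGE_BITS) + offset    # LBA on the underlying RAID
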
When the presented size of the volume is expanded, the size of the tables of previous PITCs does not need to increase, and those tables do not need to be modified. Because the tables are read-only, the information within them cannot change; the expansion process modifies the table by adding NULL page pointers at the end. Snapshot does not directly present the tables from previous PITCs to the user.
I/O operations require the table to map an LBA to a data page pointer. The I/O then multiplies the data page pointer by the data page size to obtain the LBA on the underlying RAID. In one embodiment, the data page size is a power of 2.
The table provides APIs to remap an LBA, to add a page, and to coalesce tables.
Snapshot uses data pages to store the PITC objects and the LBA mapping tables. The tables access the RAID interface directly for I/O to their table entries. The tables minimize modification when they are read from or written to a RAID device. Without modification, the table information can be read or written directly into the table entry structures. This reduces the copying required for I/O.
Snapshot may use a change log on disk to prevent creating hot spots. A hot spot is a reused location that tracks updates to the volume. The change log records updates to the PITC table and to the free list of the volume. During recovery, snapshot uses the change log to re-create the AP and the free list in memory.
Figure 10 shows one embodiment of table recovery, illustrating the relationship between the AP in memory, the AP on disk, and the change log. It also shows the same relationship for the free list. The in-memory AP table can be rebuilt from the AP and the log on disk. After any controller failure, the in-memory AP is rebuilt by reading the AP from disk and applying the change log to it.
Depending on the system configuration, the change log uses different physical resources. For a multi-controller system, the change log relies on battery-backed cache memory for storage. Using cache memory allows snapshot to reduce the number of table writes to disk while still maintaining data integrity. The change log is copied to a backup controller for recovery. For a single-controller system, the change log writes all information to disk. This has the side effect of creating a hot spot on the disk at the log location. It allows multiple changes to be written to a single device block.
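The recovery relationship of Fig. 10 can be pictured roughly as replaying a journal over the copies of the table and free list last written to disk; the record format below is an assumption made purely for illustration.

# Sketch: rebuild the in-memory AP table and free list after a controller
# failure by replaying the change log over the on-disk copies.
def recover(table_on_disk, free_list_on_disk, change_log):
    table = dict(table_on_disk)
    free_list = list(free_list_on_disk)
    for op, vol_page, page_ptr in change_log:     # records assumed to be (op, page, ptr)
        if op == "remap":
            table[vol_page] = page_ptr            # PITC table update
        elif op == "alloc":
            free_list.remove(page_ptr)            # page taken from the free list
        elif op == "free":
            free_list.append(page_ptr)            # page returned to the free list
    return table, free_list
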
Periodically, snapshot writes the PITC table and the free list to disk, thereby creating a checkpoint in the log and clearing the checkpointed entries. The period depends on the number of changes made to the PITC table. Coalesce operations do not use the change log.
Snapshot data page or leaf I/O can require request within the data page border, to be fit to.If snapshot runs into the I/O request of crossing over page boundary, then it splits and should ask.It will ask to pass to request handler then downwards.Write and read supposes that partly I/O is fit within page boundary.AP provides LBA to remap to satisfy the I/O request.
The AP satisfies all write requests. The snapshot supports two different write sequences, one for owned pages and one for non-owned pages. The different write sequences allow pages to be added to the table. Figure 11 shows one embodiment of the write process for owned-page and non-owned-page writes.
For an owned-page write, the process comprises the following:
1) find the table mapping; and
2) write the owned page: remap the LBA and write the data to the RAID interface.
A previously written page is handled as a simple write request: the snapshot writes the data to the page, overwriting its current contents. Only data pages owned by the AP are written; pages owned by other PITCs are read-only.
For a non-owned-page write, the process comprises the following:
1) find the table mapping;
2) read the previous page: read the existing data page so that the data being written and the data read together form a complete page; this is the start of the copy-on-write process;
3) combine the data: place the data page that was read and the write request payload into a single contiguous block;
4) allocate from the free list: obtain a new data page pointer from the free list;
5) write the combined data to the new data page;
6) commit the new page information to the log;
7) update the table: change the LBA remapping in the table to reflect the new data page pointer; this PITC now owns the data page.
Adding a page may require blocking read and write requests until the page has been added to the table. By writing table updates to disk and by keeping multiple cached copies of the log, the snapshot maintains controller coherency.
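A minimal sketch of the two write sequences follows. The objects raid, free_list, and log are hypothetical stand-ins for the RAID interface, page pool free list, and change log described above; the page size is an illustrative assumption.

    # Sketch of the owned-page and non-owned-page (copy-on-write) write sequences.
    # raid, free_list and log are hypothetical stand-ins for the RAID interface,
    # page pool free list and change log described in the text.

    PAGE_SIZE = 2 * 1024 * 1024        # bytes per data page (illustrative, a power of 2)
    NULL_PTR = 0xFFFFFFFF              # never-written page

    def write(ap, page_index, offset, payload, raid, free_list, log):
        ptr = ap.pages[page_index]
        if ptr in ap.owned:
            raid.write(ptr, offset, payload)              # owned page: remap and overwrite
            return
        # Non-owned page: start of the copy-on-write sequence.
        old = bytearray(raid.read_page(ptr)) if ptr != NULL_PTR else bytearray(PAGE_SIZE)
        old[offset:offset + len(payload)] = payload       # combine read data and write payload
        new_ptr = free_list.allocate()                    # new data page from the free list
        raid.write_page(new_ptr, old)                     # write the combined data
        log.commit(page_index, new_ptr)                   # commit new page info to the change log
        ap.pages[page_index] = new_ptr                    # update the table mapping
        ap.owned.add(new_ptr)                             # this PITC now owns the page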
For read requests, the AP satisfies all reads. Using the AP table, the LBA of the read request is remapped to the LBA of the data page, and the request is passed to the RAID interface using the remapped LBA. A volume can satisfy read requests for data pages that have never been written to the volume; such pages are marked with the NULL address (all ones) in the PITC table. Requests to that address are satisfied by the NULL volume, which returns a constant data pattern. Read requests that cross a page boundary may be satisfied by pages owned by different PITCs.
The snapshot uses the NULL volume to satisfy read requests for data pages that have never been written. It returns all zeros for every sector read. It has no RAID device and no allocated space. A block of all zeros is expected to be kept in memory to satisfy read requests to the NULL volume. All volumes share the NULL volume to satisfy such read requests.
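The read path just described can be sketched as follows; the in-memory zero buffer standing in for the NULL volume and the raid.read helper are illustrative assumptions.

    # Sketch of the AP read path: remap, or satisfy from the shared NULL volume.

    PAGE_SIZE = 2 * 1024 * 1024       # bytes per data page (illustrative)
    NULL_PTR = 0xFFFFFFFF
    ZERO_PAGE = bytes(PAGE_SIZE)      # the in-memory all-zero block shared by all volumes

    def read(ap, page_index, offset, length, raid):
        ptr = ap.pages[page_index]
        if ptr == NULL_PTR:
            # Never-written page: the NULL volume returns zeros, no RAID I/O needed.
            return ZERO_PAGE[offset:offset + length]
        return raid.read(ptr, offset, length)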
In one embodiment, the coalescing process removes a PITC, and some of the pages it owns, from a volume. Removing a PITC creates more free space for tracking new differences. Coalescing compares the differences of two adjacent tables and keeps only the newer difference for each page. Coalescing occurs periodically or manually, according to the user's configuration.
The process involves two PITCs, a source and a destination. In one embodiment, the rules for eligible objects are as follows:
1) the source must precede the destination: the source must have been created before the destination;
2) the destination cannot simultaneously be a source;
3) the source cannot be referenced by multiple PITCs; multiple references occur when a view volume is created from a PITC;
4) the destination may support multiple references;
5) the AP may be a destination but never a source.
The coalescing process writes all changes to disk and requires no coherency. If a controller fails, the volume recovers the PITC information from disk and restarts the coalescing process.
The process marks the two PITCs for coalescing and comprises the following steps:
1) change the source state to coalesce-source: the state is committed to disk for recovery from a controller failure; at this point the source can no longer be accessed, since its data pages may be invalid; data pages may be returned to the free list, or their ownership may be transferred to the destination;
2) change the destination state to coalesce-destination: the state is committed to disk for recovery from a controller failure;
3) load and compare the tables: this step moves the data page pointers; freed data pages are added to the free list immediately;
4) change the destination state back to normal: the process is complete;
5) adjust the list: change the previous PITC's next pointer to point to the destination, which effectively removes the source from the list;
6) release the source: return any data pages used for its control information to the free list.
The above process supports combining two PITCs. Those skilled in the art will appreciate that coalescing may be designed to remove multiple PITCs and to handle multiple sources in a single pass.
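A minimal sketch of the table comparison in step 3 follows, under the assumption that each PITC holds a list of page pointers and a set of owned pages; the state handling of the other steps is omitted, and the names are hypothetical.

    # Sketch of coalescing a source PITC into its (newer) destination PITC.
    # Keeps the newer difference for each volume page, frees the older one.

    NULL_PTR = 0xFFFFFFFF

    def coalesce(source, destination, free_list):
        for i, dst_ptr in enumerate(destination.pages):
            src_ptr = source.pages[i]
            if src_ptr == NULL_PTR or src_ptr not in source.owned:
                continue
            if dst_ptr != NULL_PTR and dst_ptr in destination.owned:
                # Destination already has a newer page: release the older source page.
                free_list.release(src_ptr)
            else:
                # Destination has no page of its own: ownership transfers to it.
                destination.pages[i] = src_ptr
                destination.owned.add(src_ptr)
        source.owned.clear()   # the source is then unlinked and its control pages released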
As shown in Figure 2, the page pool maintains a free list of data pages for use by all volumes associated with the page pool. The free list manager commits the free list to permanent storage using data pages from the page pool. Free list updates come from more than one source: the write process allocates pages, the control page manager allocates pages, and the coalescing process returns pages.
The free list maintains a threshold that triggers automatic expansion. The trigger uses the page pool expansion method to add pages to the page pool. Automatic expansion may be governed by volume policy: more important volumes are allowed to expand, while less important volumes are forced to coalesce.
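A minimal sketch of a free list whose allocation path triggers automatic page pool expansion at a threshold; the watermark value and the expansion callback are illustrative assumptions.

    # Sketch of a page pool free list that expands itself when it drops below a threshold.

    class FreeList:
        def __init__(self, pages, low_watermark, expand_pool):
            self.pages = list(pages)            # available data page pointers
            self.low_watermark = low_watermark  # threshold that triggers expansion
            self.expand_pool = expand_pool      # page pool expansion method (callback)

        def allocate(self):
            if len(self.pages) <= self.low_watermark:
                self.pages.extend(self.expand_pool())   # add pages to the page pool
            return self.pages.pop()

        def release(self, page_ptr):
            self.pages.append(page_ptr)         # pages returned by coalescing, etc.

For example, a list created as FreeList(range(1024), 64, expand) would call the expansion method whenever 64 or fewer pages remain, mirroring the threshold-driven expansion described above.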
A view volume provides access to a previous point in time and supports normal volume I/O operations. Since a PITC tracks the differences relative to the PITCs before it, a view volume allows the user to access the information contained in a PITC. The view volume branches from the PITC. View volumes support recovery, test, backup operations, and the like. Because a view volume requires no copy of the data, its creation is nearly instantaneous. A view volume may require its own AP to support writes to the view volume.
The view taken from the volume's current state can be copied from the volume's current AP. Using its own AP, the view volume allows write operations to the view volume without modifying the underlying data. The OS may require a file system or file rebuild to use the data. The view volume allocates space from the parent volume for its AP and for written data pages. The view volume has no RAID devices of its own associated with it. Deleting the view volume releases the space back to the parent volume.
Figure 12 shows an exemplary snapshot operation in which the snapshot tables migrate a volume. Figure 12 shows a volume with 10 pages. Each state includes the list used to satisfy read requests to the volume. Shaded blocks indicate owned data page pointers.
The transition from the left of the figure (the initial state) to the middle illustrates writes to pages 3 and 8. The write to page 3 requires changing PITC I (the AP). PITC I follows the new-page write process to add page 3 to its table: the PITC reads the unchanged information from page J and stores the page on drive page B. All future writes to page 3 within this PITC can now be handled without moving the page. The write to page 8 demonstrates the second case for writing to a page: because PITC I already contains page 8, PITC I overwrites that portion of the data in page 8, which in this case resides on drive page C.
The transition from the middle to the right of the figure (the final state) illustrates the coalescing of PITC II and PITC III. Snapshot coalescing removes the older pages while preserving all of the changes in the two PITCs. Both PITCs contain page 3 and page 8. The process keeps the newer pages from PITC II and releases the pages from PITC III, returning pages A and D to the free list.
The snapshot allocates data pages from the page pool to store the free list and PITC table information. The control page allocator sub-allocates data pages to match the sizes required by these objects.
The volume contains a page pointer to the top of the control page information; from this page, all of the other information can be read.
The snapshot tracks the number of pages used over an interval of time. This allows the snapshot to predict when the user needs to add more physical disk space to the system, preventing the snapshot from running out of space.
Data progression
In one embodiment of the invention, data progression (DP) is used to gradually move data to storage space of the appropriate cost. The present invention allows the user to add drives only when they are actually needed, which significantly reduces the overall cost of the disk drives.
Data progression moves non-recently-accessed data and historical snapshot data to less expensive storage. For non-recently-accessed data, it gradually reduces the cost of storage for any page that has not been accessed recently; it does not move the data immediately to the least expensive storage. For historical snapshot data, it moves read-only pages to more efficient storage space, such as RAID 5, and, if a page is no longer accessed by the volume, it moves that page to the least expensive storage.
Other advantages of data progression in the present invention include maintaining fast I/O access to currently accessed data and reducing the need to purchase fast but expensive disk drives.
In operation, data progression determines the cost of storage using the cost of the physical media and the efficiency of the RAID devices used for data protection. Data progression also determines storage efficiency and moves data accordingly. For example, data progression may convert RAID 10 to RAID 5 devices in order to use physical disk space more efficiently.
Data progression defines accessible data as data that can currently be read or written by a server. It uses accessibility to determine the storage class a page should use. If a page belongs to a historical PITC, it is read-only. If the server has not updated the page in the most recent PITC, the page is still accessible.
Figure 17 shows one embodiment of the accessible data pages in a data progression operation. The accessible data pages fall into the following categories:
Recently accessed, accessible: the active pages most used by the volume.
Not recently accessed, accessible: read-write pages that have not been used recently.
Historical, accessible: read-only pages that can still be read by the volume; applies to snapshot volumes.
Historical, non-accessible: read-only data pages not currently accessed by the volume; applies to snapshot volumes. The snapshot maintains these pages for recovery purposes, and they are generally placed in the least expensive storage possible.
Figure 17 shows a snapshot volume with three PITCs, each with different owned pages. A dynamic capacity volume would be represented by PITC C alone. All of its pages are accessible and read-write, and the pages may have different access times.
The following table lists various storage devices in order of increasing efficiency, or decreasing monetary cost. The list is also roughly ordered from faster to slower write I/O access. Data progression computes efficiency as the logical protected space divided by the total physical space of the RAID device.
Table 1: RAID types
Type | Sub-type | Storage efficiency | Write I/O count | Usage
RAID 10 | - | 50% | 2 | Primary read-write accessible storage with relatively good write performance
RAID 5 | 3-drive | 66.6% | 4 (2 reads, 2 writes) | Minimal efficiency gain over RAID 10 while incurring the RAID 5 write penalty
RAID 5 | 5-drive | 80% | 4 (2 reads, 2 writes) | Excellent candidate for read-only historical information; good candidate for writable pages that are not recently accessed
RAID 5 | 9-drive | 88.8% | 4 (2 reads, 2 writes) | Excellent candidate for read-only historical information
RAID 5 | 17-drive | 94.1% | 4 (2 reads, 2 writes) | Diminishing efficiency gain while doubling the fault domain of the RAID device
As the number of drives in a stripe increases, RAID 5 efficiency increases; as the number of disks in the stripe increases, the fault domain also increases. Increasing the number of drives in the stripe also increases the minimum number of disks required to create the RAID device. In one embodiment, because of the increased fault domain size and the limited efficiency gain, data progression does not use RAID 5 stripe sizes larger than 9 drives. Data progression uses RAID 5 stripe sizes that are an integer multiple of the snapshot page size. This allows data progression to perform full-stripe writes when moving pages to RAID 5, making the moves more efficient. For data progression purposes, all RAID 5 configurations have the same write I/O characteristics. For example, RAID 5 on 2.5-inch FC disks may not use the performance of those disks effectively. To prevent such combinations, data progression needs the ability to prevent certain RAID types from running on certain disk types. Data progression configuration can also block the system from using RAID 10 or RAID 5 space.
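The efficiency figures in Table 1 follow directly from the stripe width; a small sketch of the calculation is shown below.

    # Storage efficiency of a RAID 5 stripe: logical protected space / total physical space.

    def raid5_efficiency(drives_in_stripe: int) -> float:
        # One drive's worth of every stripe holds parity; the rest is protected data.
        return (drives_in_stripe - 1) / drives_in_stripe

    for n in (3, 5, 9, 17):
        print(n, f"{raid5_efficiency(n):.1%}")   # 66.7%, 80.0%, 88.9%, 94.1% (Table 1, to rounding)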
The disk types are shown in the following table:
Table 2: Disk types
Type | Speed | Cost | Notes
2.5-inch FC | Excellent | High | Very expensive
FC 15K RPM | Good | High | Expensive
FC 10K RPM | Good | Reasonable | -
SATA | Fair | Low | Cheaper but less reliable
Data progression includes the ability to automatically classify disk drives relative to the other drives within the system. The system examines a disk to determine its performance relative to the other disks in the system. Faster disks are classified into a higher value class and slower disks into a lower value class. As disks are added to the system, the system automatically rebalances the value classes of the disks. This approach handles both systems that never change and systems that change frequently as new disks are added. Automatic classification may place multiple disk types in the same value class; if the drives are determined to be close enough in value, they may be given the same value.
In one embodiment, the system contains the following drives:
High: 10K FC drives
Low: SATA drives
With the addition of 15K FC drives, data progression automatically reclassifies the disks, demoting the 10K FC drives. This produces the following classification:
High: 15K FC drives
Medium: 10K FC drives
Low: SATA drives
In another embodiment, the system may have the following drive types:
High: 25K FC drives
Low: 15K FC drives
Here the same 15K FC drives that form the high value class in the previous embodiment are classified into the low value class, while the 25K FC drives are classified into the high value class.
If SATA drives are added to this system, data progression automatically reclassifies the disks. This produces the following classification:
High: 25K FC drives
Medium: 15K FC drives
Low: SATA drives
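A minimal sketch of the automatic reclassification described above: drives are ranked by a measured relative performance score and re-binned into value classes whenever a drive is added. The score values and the three-class split are illustrative assumptions.

    # Sketch of automatic disk classification: rank drives by relative performance,
    # then re-bin them into value classes whenever the set of drives changes.

    def classify(drives):
        """drives: dict of name -> relative performance score (higher is faster)."""
        distinct = sorted(set(drives.values()), reverse=True)

        def value_class(rank, total):
            if rank == 0:
                return "high"
            if rank == total - 1:
                return "low"
            return "medium"

        tier = {score: value_class(i, len(distinct)) for i, score in enumerate(distinct)}
        return {name: tier[score] for name, score in drives.items()}

    system = {"10K FC": 100, "SATA": 40}
    print(classify(system))        # 10K FC -> high, SATA -> low
    system["15K FC"] = 150         # adding faster drives demotes the 10K FC class
    print(classify(system))        # 15K FC -> high, 10K FC -> medium, SATA -> low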
Data progression may include waterfall progression. Typically, waterfall progression moves data to a less expensive resource only when the current resource is completely used. Waterfall progression effectively maximizes the use of the most expensive system resources, and it also minimizes the cost of the system: adding cheap disks to the lowest pool creates a larger pool at the bottom.
Typical waterfall progression uses RAID 10 space and then the next RAID space, such as RAID 5. This forces the waterfall to proceed directly to RAID 10 on the next class of disk. Alternatively, data progression may include mixed RAID waterfall progression as shown in Figure 24. This alternative progression method addresses the problem of maximizing disk space and performance, and it allows storage to be converted to a more efficient form within the same disk class. It also supports the requirement that RAID 10 and RAID 5 share the total resources of a disk class, which may require configuring a fixed percentage of the disk space in a class that a RAID level may use. This alternative data progression method thereby maximizes the use of expensive storage while allowing the space to be shared with another RAID class.
The mixed RAID waterfall method also moves pages to less expensive storage only when storage is limited. A threshold, such as a percentage of total disk space, limits the amount of storage of a given RAID type. This maximizes the use of the expensive storage in the system. As storage approaches its limit, data progression automatically moves pages to lower-cost storage. Data progression also provides a buffer for write peaks.
It will be appreciated that the above waterfall methods may also move pages immediately to the least expensive storage, since in some situations it may be necessary to move historical and non-accessible pages to less expensive storage in a timely manner. Historical pages can also be moved to less expensive storage immediately.
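A sketch of the mixed RAID waterfall threshold check follows: each storage class has a configured share of disk space, and when it approaches that share, pages are demoted to the next cheaper class. The 90% trigger, the 80% low watermark, and the class ordering are illustrative assumptions.

    # Sketch of mixed RAID waterfall progression: demote pages to the next cheaper
    # storage class once the current class approaches its configured share of disk space.

    CLASS_ORDER = ["RAID10-FC", "RAID5-FC", "RAID10-SATA", "RAID5-SATA"]  # most to least expensive

    def demote_if_needed(usage, caps, pick_pages):
        """usage/caps: dict of class -> pages used and page limit; pick_pages(cls, n)
        returns up to n of that class's least recently accessed pages to move."""
        moves = []
        for upper, lower in zip(CLASS_ORDER, CLASS_ORDER[1:]):
            limit = caps[upper]
            if usage[upper] >= 0.9 * limit:                  # approaching the threshold
                excess = usage[upper] - int(0.8 * limit)     # demote back below a low watermark
                for page in pick_pages(upper, excess):
                    moves.append((page, upper, lower))
                    usage[upper] -= 1
                    usage[lower] += 1
        return moves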
Figure 18 shows a flow diagram of the data progression process 1800. Data progression continuously checks the access pattern and storage cost of each page in the system to determine whether there is data to move. Data progression can also determine whether storage has reached its maximum allocation.
The data progression process determines whether a page is accessible by any volume. The process checks each historical volume and its attached PITCs to determine whether the page is referenced. If the page is actively in use, it is eligible for promotion or for slow demotion. If the page is not accessible by any volume, it can be moved to the least expensive storage available. Data progression also computes the time remaining before a PITC expires: if the snapshot schedule indicates that the PITC is about to expire, no pages are progressed. If the page pool is operating in aggressive mode, pages may be progressed.
Data progression's recent-access detection needs to prevent bursts of activity from promoting a page. Data progression tracks write accesses separately from reads, which allows it to keep data on accessible RAID 5 devices for operations, such as virus scans or reports, that only read data. If storage becomes scarce, data progression changes the qualification for recently accessed. This allows data progression to demote pages more aggressively, and it also helps fill the system from the bottom up when storage is scarce.
When system resources become scarce, data progression moves data pages more aggressively. In all of these cases, more disks or a configuration change are still eventually required. Data progression lengthens the time the system can continue to operate in a constrained state; it attempts to keep the system running as long as possible, up to the point where all of its storage classes are out of space.
In the case where RAID 10 space is scarce and total free disk space is also scarce, data progression can reclaim RAID 10 disk space and move the data into more efficient RAID 5. At the cost of write performance, this increases the overall capacity of the system, but more disks are still required eventually. If a particular storage class is completely used, data progression allows allocation of non-preferred pages to keep the system running. For example, if a volume is configured to use RAID 10-FC for its accessible information, it may allocate pages from RAID 5-FC or RAID 10-SATA until more RAID 10-FC space becomes available.
Data progression also supports compression to increase the perceived capacity of the system. Compression may be used only for historical pages that are not accessed, or as storage for recovery information. Compression appears as another class of storage near the bottom of the storage cost scale.
As shown in Figure 25, the page pool essentially comprises a free list and device information. The page pool needs to support multiple free lists, an enhanced page allocation scheme, and the classification of free lists. The page pool maintains a separate free list for each class of storage. The allocation scheme allows a page to be allocated from one of multiple pools while specifying the minimum and maximum classes allowed. The classification of a free list comes from its device configuration. Each free list provides its own counters for statistics gathering and display. Each free list also provides RAID device efficiency information for the compilation of the storage efficiency state.
In one embodiment, the device list may need to track an additional attribute, the storage class cost. The combination determines the class of storage. This occurs when the user wants the configured classes to have more or less granularity.
Figure 26 shows one embodiment of a high-performance database in which all accessible data, even if not accessed recently, resides only on 2.5-inch FC drives. Non-accessible historical data is moved to RAID 5 Fibre Channel.
Figure 27 shows one embodiment of an MRI image volume, where the accessible storage for this dynamic volume is SATA RAID 10 and RAID 5. If an image has not been accessed recently, it is moved to RAID 5; new writes initially go to RAID 10. Figure 19 shows one embodiment of a compressed page layout. Data progression implements compression by sub-allocating fixed-size data pages. The sub-allocation information tracks the free portions and the allocated portions of the page. Data progression cannot predict the efficiency of compression and handles variable-size pages in its sub-allocation.
Compressed pages can significantly affect CPU performance. For write access, a compressed page would need to be fully extracted and recompressed. Therefore, actively accessed pages are not compressed and are returned to their uncompressed state. Writing to compressed pages may be necessary only when storage is extremely limited.
The PITC remapping table points to the sub-allocation information and is marked to indicate that the page is compressed. Accessing a compressed page may require a higher I/O count than accessing an uncompressed page: a read of the sub-allocation information retrieves the location of the actual data, and the compressed data can then be read from disk and decompressed on the processor.
Data progression may require compression to support partial decompression of a page, allowing data progression to decompress only a small portion of a page for a read access. The read-ahead characteristics of the cache can help postpone compression, and a single decompression can service multiple server I/Os. Data progression tags pages that are poor candidates for compression so that it does not repeatedly attempt to compress them.
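A minimal sketch of reading through a compressed page's sub-allocation, as described above; the zlib codec, the sub-allocation record layout, and the raid helpers are illustrative assumptions rather than the patent's actual structures.

    # Sketch of a read through a compressed page: the PITC remapping entry points at
    # sub-allocation info, which locates the compressed data; only the needed slice is returned.

    import zlib

    def read_compressed(entry, offset, length, raid):
        """entry: (suballoc_ptr, compressed flag) taken from the PITC remapping table."""
        suballoc_ptr, compressed = entry
        info = raid.read_suballocation(suballoc_ptr)      # extra I/O: where the real data lives
        raw = raid.read(info.location, info.stored_size)  # read the stored bytes from disk
        data = zlib.decompress(raw) if compressed else raw
        return data[offset:offset + length]               # serve just the requested portion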
Figure 20 shows one embodiment of data progression in an advanced disk drive system according to the principles of the present invention. Data progression changes the external behavior and data path operation of the volume. Data progression may require modifications to the page pool. The page pool essentially comprises a free list and device information, and it needs to support multiple free lists, an enhanced page allocation scheme, and the classification of free lists. The page pool maintains a separate free list for each class of storage. The allocation scheme allows a page to be allocated from one of multiple pools while specifying the minimum and maximum classes allowed. The classification of the free lists can come from the device configuration. Each free list provides its own counters for statistics gathering and display, and each free list also provides RAID device efficiency information for the compilation of storage efficiency statistics.
The PITC identifies candidates for movement and blocks I/O to an accessible page while that page is being moved. Data progression continuously checks the PITCs for candidates. The accessibility of pages changes constantly because of server I/O, new snapshot page updates, and the creation and deletion of view volumes. Data progression also continuously checks for volume configuration changes and summarizes the current counts of page classes. This allows data progression to evaluate the summary and determine whether there may be pages to move.
Each PITC presents counters for the number of pages it uses in each class of storage. Data progression uses this information to identify the PITCs that become good candidates for page movement when a threshold is reached.
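A small sketch of how the per-class page counters could flag a PITC as a move candidate once a threshold is crossed; the threshold values and the counter layout are assumptions made for illustration.

    # Sketch of identifying move candidates from each PITC's per-storage-class page counters.

    def move_candidates(pitcs, thresholds):
        """pitcs: objects with .name and .pages_by_class (dict class -> page count).
        thresholds: dict class -> page count above which a PITC is worth scanning."""
        candidates = []
        for pitc in pitcs:
            for storage_class, count in pitc.pages_by_class.items():
                if count >= thresholds.get(storage_class, float("inf")):
                    candidates.append((pitc.name, storage_class, count))
        return candidates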
RAID allocates devices from a group of disks based on disk cost. RAID also provides an API to retrieve the efficiency of a device or of a potential device, and it needs to return information about the number of I/Os required for a write operation. Data progression may also require a RAID NULL in order to use a third-party RAID controller as part of data progression. A RAID NULL consumes an entire disk and serves only as a pass-through layer.
The disk manager may also automatically determine and store the disk classification. Automatically determining the disk classification may require changes to the SCSI initiator.
From the above description and the accompanying drawings, those of ordinary skill in the art will understand that the particular embodiments shown and described are for purposes of illustration only and are not intended to limit the scope of the present invention. Those of ordinary skill in the art will recognize that the present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. References to the details of particular embodiments are not intended to limit the scope of the invention.

Claims (20)

1. A data fusion method, comprising the steps of:
generating disk space for at least one virtual volume from a plurality of disk storage devices;
allocating the disk space of the at least one virtual volume;
writing data to the allocated disk space;
automatically generating a point-in-time copy of each of the at least one virtual volume at a predetermined time interval, each point-in-time copy comprising a table of pointers to pages of data, the pointers mapping the contents of the corresponding virtual volume to its contents at the time point at which the point-in-time copy was generated; and
storing an address index of each point-in-time copy, so that each point-in-time copy of the at least one virtual volume can be located instantly through the stored address index.
2. The method of claim 1, wherein each point-in-time copy comprises a record of write operations to the associated virtual volume.
3. The method of claim 2, wherein each point-in-time copy comprises only the changes to the data in the associated virtual volume that have occurred since the previous point-in-time copy obtained in the previous time interval.
4. The method of claim 1, wherein the step of automatically generating a point-in-time copy of each of the at least one virtual volume at a predetermined time interval comprises automatically generating a point-in-time copy of each of the at least one virtual volume at a user-defined time interval.
5. The method of claim 4, wherein the time interval is in the range of every few minutes to every few hours.
6. The method of claim 1, wherein the point-in-time copies are stored at a local RAID subsystem.
7. The method of claim 1, wherein the point-in-time copies are stored at a remote RAID subsystem, so that if the main system crashes, the integrity of the data is unaffected and the data can be recovered instantly.
8. The method of claim 1, further comprising generating a point-in-time copy of the at least one virtual volume based on a user request.
9. The method of claim 1, further comprising generating at least one parallel chain of point-in-time copies of the at least one virtual volume.
10. The method of claim 9, wherein the at least one parallel chain of point-in-time copies of the at least one virtual volume is generated based on a user request.
11. The method of claim 1, further comprising coalescing two or more point-in-time copies of the associated virtual volume.
12. A data fusion method, comprising the steps of:
defining a default size of disk space of a data storage subsystem, the disk space of the default size forming a storage pool;
automatically generating a point-in-time copy of the storage pool at a predetermined time interval, the point-in-time copy comprising a table of pointers to pages of data, the pointers mapping the contents of the corresponding virtual volume to its contents at the time point at which the point-in-time copy was generated; and
storing an address index of the point-in-time copy, so that the point-in-time copy of the storage pool can be located instantly through the stored address index.
13. The method of claim 12, wherein the point-in-time copy comprises a record of write operations to the storage pool.
14. The method of claim 12, wherein the step of automatically generating a point-in-time copy of the storage pool at a predetermined time interval comprises automatically generating a point-in-time copy of the storage pool at a user-defined time interval.
15. The method of claim 12, wherein the point-in-time copy is stored at a remote RAID subsystem, so that if the main system crashes, the integrity of the data is unaffected and the data can be recovered instantly.
16. The method of claim 12, further comprising generating a point-in-time copy of the storage pool based on a user request.
17. The method of claim 12, further comprising generating at least one parallel chain of point-in-time copies of the storage pool.
18. The method of claim 17, wherein the at least one parallel chain of point-in-time copies of the storage pool is generated based on a user request.
19. A disk drive system capable of data fusion, comprising:
a data storage subsystem having disk space for at least one virtual volume from a plurality of disk storage devices of the data storage subsystem; and
a disk manager having at least one disk storage system controller, the system controller automatically generating a point-in-time copy of each of the at least one virtual volume at a predetermined time interval, and storing in the associated virtual volume an address index of each point-in-time copy, so that the point-in-time copies of that virtual volume can be located instantly through the stored address index;
wherein each point-in-time copy comprises a table of pointers to pages of data, the pointers mapping the contents of the corresponding virtual volume to its contents at the time point at which the point-in-time copy was generated.
20. A disk drive system capable of data fusion, comprising:
abstraction means for generating at least one abstraction of disk space from a plurality of disk storage devices;
allocation means for allocating the disk space of the at least one abstraction;
write means for writing data to the allocated disk space;
fusion means for automatically generating a point-in-time copy of each of the at least one abstraction at a predetermined time interval, each point-in-time copy comprising a table of pointers to pages of data, the pointers mapping the contents of the corresponding virtual volume to the time point at which the point-in-time copy was generated; and
storage means for storing an address index of each point-in-time copy, so that each point-in-time copy can be located instantly through the stored address index.
CN2009100047280A 2003-08-14 2004-08-13 Virtual disk drive system and method Active CN101566928B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US49520403P 2003-08-14 2003-08-14
US60/495,204 2003-08-14
US10/918,329 US7613945B2 (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
US10/918,329 2004-08-13

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CNB2004800263088A Division CN100478865C (en) 2003-08-14 2004-08-13 Virtual disk drive system and method

Publications (2)

Publication Number Publication Date
CN101566928A CN101566928A (en) 2009-10-28
CN101566928B true CN101566928B (en) 2012-06-27

Family

ID=37078433

Family Applications (4)

Application Number Title Priority Date Filing Date
CNB2004800263088A Active CN100478865C (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
CN 200910004729 Active CN101566929B (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
CN2009100047280A Active CN101566928B (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
CN 200910004737 Active CN101566930B (en) 2003-08-14 2004-08-13 Virtual disk drive system and method

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CNB2004800263088A Active CN100478865C (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
CN 200910004729 Active CN101566929B (en) 2003-08-14 2004-08-13 Virtual disk drive system and method

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN 200910004737 Active CN101566930B (en) 2003-08-14 2004-08-13 Virtual disk drive system and method

Country Status (1)

Country Link
CN (4) CN100478865C (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8601035B2 (en) * 2007-06-22 2013-12-03 Compellent Technologies Data storage space recovery system and method
JP4375435B2 (en) * 2007-05-23 2009-12-02 株式会社日立製作所 Hierarchical storage system for predictive data migration
JP5080201B2 (en) * 2007-10-22 2012-11-21 京セラドキュメントソリューションズ株式会社 Information processing apparatus and device driver provided therein
JP2011530746A (en) * 2008-08-07 2011-12-22 コンペレント・テクノロジーズ System and method for transmitting data between different RAID data storage formats for current data and playback data
CN102246135B (en) * 2008-11-07 2015-04-22 戴尔康佩伦特公司 Thin import for a data storage system
TWI432959B (en) 2009-01-23 2014-04-01 Infortrend Technology Inc Storage subsystem and storage system architecture performing storage virtualization and method thereof
KR101552753B1 (en) * 2009-01-29 2015-09-11 엘에스아이 코포레이션 Allocate-on-write snapshot mechanism to provide dynamic storage tiering on-line data placement for volumes
US8108646B2 (en) * 2009-01-30 2012-01-31 Hitachi Ltd. Storage system and storage control method that compress and store data elements
US9646039B2 (en) * 2013-01-10 2017-05-09 Pure Storage, Inc. Snapshots in a storage system
CN104424052A (en) * 2013-09-11 2015-03-18 杭州信核数据科技有限公司 Automatic redundant distributed storage system and method
CN107402838A (en) * 2016-05-18 2017-11-28 深圳市深信服电子科技有限公司 A kind of backup method and storage system based on write buffer
CN107832168B (en) * 2017-10-13 2020-10-16 记忆科技(深圳)有限公司 Solid state disk data protection method
CN107766004A (en) * 2017-11-02 2018-03-06 郑州云海信息技术有限公司 A kind of mapping relations implementation method, system and computer equipment
CN113535069A (en) * 2020-04-22 2021-10-22 联想企业解决方案(新加坡)有限公司 Data storage system, computing equipment and construction method of data storage system
CN114048157A (en) * 2021-11-16 2022-02-15 安徽芯纪元科技有限公司 Internal bus address remapping device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6269431B1 (en) * 1998-08-13 2001-07-31 Emc Corporation Virtual storage and block level direct access of secondary storage for recovery of backup data
CN1373402A (en) * 2001-02-28 2002-10-09 廖瑞民 Hard disk data preserving and restoring device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5491593A (en) * 1993-09-10 1996-02-13 International Business Machines Corporation Disk drive spindle synchronization apparatus and method
JPH0944381A (en) * 1995-07-31 1997-02-14 Toshiba Corp Method and device for data storage
DE10085321T1 (en) * 1999-12-22 2002-12-05 Seagate Technology Llc Buffer management system for managing data transfer to and from a buffer in a disk drive

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6269431B1 (en) * 1998-08-13 2001-07-31 Emc Corporation Virtual storage and block level direct access of secondary storage for recovery of backup data
CN1373402A (en) * 2001-02-28 2002-10-09 廖瑞民 Hard disk data preserving and restoring device

Also Published As

Publication number Publication date
CN101566929B (en) 2013-10-16
CN101566930A (en) 2009-10-28
CN100478865C (en) 2009-04-15
CN101566929A (en) 2009-10-28
CN101566928A (en) 2009-10-28
CN101566930B (en) 2013-10-16
CN1849577A (en) 2006-10-18

Similar Documents

Publication Publication Date Title
CN101566931B (en) Virtual disk drive system and method
CN101501623B (en) Filesystem-aware block storage system, apparatus, and method
US20120124285A1 (en) Virtual disk drive system and method with cloud-based storage media
CN101566928B (en) Virtual disk drive system and method
CN104699420A (en) A dynamically expandable and contractible fault-tolerant storage system permitting variously sized storage devices and a method
EP2385457A2 (en) Virtual disk drive system and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20160506

Address after: American Texas

Patentee after: DELL International Ltd

Address before: American Minnesota

Patentee before: Compellent Technologies