A method for rebuilding a redundant array of independent disks
Technical field
The present application relates to the technical field of computer storage, in particular to Redundant Array of Independent Disks (RAID) technology, and more specifically to a method for rebuilding a RAID.
Background technology
RAID is a technology that combines multiple independent disks in various ways into a disk group (a logical disk), providing higher storage performance than a single disk together with redundant data protection. The principle of RAID is to store data and the corresponding parity information across the disks that make up the RAID system, with the parity information stored on a different disk from the data it protects. When the data on one disk of the RAID system is damaged, the remaining data and the corresponding parity information are used to recover the damaged data. As a foundation and key component of network storage systems, RAID is known for its speed, large capacity, and high reliability. Since RAID technology appeared, it has been in wide demand in fields such as industry, the military, and education, and research on RAID has remained an active topic in industry.
The different ways of combining disks into a disk array are called RAID levels. Common RAID levels include RAID0, RAID1, RAID5, and RAID6. Different RAID levels provide different data protection schemes.
Take a RAID5 made up of 4 disks as an example: it tolerates the failure of only one disk. Once a disk has failed, the RAID5 no longer provides redundant data protection, so the failed disk must be replaced as soon as possible. After the failed disk is replaced, the disk controller computes the missing contents from the data and parity on the healthy disks and writes the results to the new disk. This process is called rebuilding the RAID.
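The rebuild calculation described above can be illustrated with a minimal sketch. It assumes byte-wise XOR parity and a simple rotating parity placement; the function names (`xor_blocks`, `make_raid5`, `rebuild_disk`) and the layout rule are illustrative assumptions and are not taken from any particular controller.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def make_raid5(stripes, n_disks=4):
    """Place each stripe's data chunks plus one parity chunk on n_disks.

    The parity position rotates from stripe to stripe (an assumed layout).
    Returns one list of chunks per disk.
    """
    disks = [[] for _ in range(n_disks)]
    for stripe_no, chunks in enumerate(stripes):
        if len(chunks) != n_disks - 1:
            raise ValueError("each stripe needs n_disks - 1 data chunks")
        parity_disk = (n_disks - 1 - stripe_no) % n_disks
        parity = xor_blocks(chunks)
        remaining = iter(chunks)
        for d in range(n_disks):
            disks[d].append(parity if d == parity_disk else next(remaining))
    return disks


def rebuild_disk(disks, failed):
    """Recompute every chunk of the failed disk from the surviving disks."""
    survivors = [disk for i, disk in enumerate(disks) if i != failed]
    n_stripes = len(survivors[0])
    return [xor_blocks([disk[s] for disk in survivors]) for s in range(n_stripes)]


# Example: lose disk 2 of a 4-disk array and recompute its contents.
stripes = [[b"AAAA", b"BBBB", b"CCCC"], [b"DDDD", b"EEEE", b"FFFF"]]
disks = make_raid5(stripes)
replacement = rebuild_disk(disks, failed=2)
print(replacement == disks[2])
```

Because the XOR of all chunks in a stripe is zero, the XOR of the surviving chunks equals the missing chunk, whether that chunk held data or parity. The sketch also shows why a full rebuild is costly: every stripe of every surviving disk has to be read.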
The purpose of rebuilding is to restore the RAID's redundant data protection. When a disk in a RAID fails, disk array vendors generally use hot spare technology to rebuild the RAID automatically. Put simply, when the RAID system is created, one disk is designated as its hot spare; when a member disk of the RAID system fails, the hot spare automatically replaces the failed disk and triggers a RAID rebuild. As the name suggests, a "hot" spare does not require read/write traffic on the RAID system to be interrupted when it replaces the failed disk; that is, the RAID system can still be read and written while it is being rebuilt.
In the prior art, when an input/output (IO) request from the upper layer cannot be answered by a member disk of the RAID system, that member disk is generally considered to have failed, and the RAID system automatically starts the rebuild process. Rebuilding a RAID system is expensive and takes a long time, and it degrades the performance of normal data IO. Moreover, if another disk fails during the rebuild, the RAID system generally collapses outright, so the RAID system is very fragile while it rebuilds. Rebuild operations should therefore be avoided wherever possible.
Summary of the invention
The present application provides a method for rebuilding a RAID that reduces, as far as possible, the probability of having to perform a RAID rebuild.
The method for rebuilding a RAID provided by the embodiments of the present application comprises:
A. when the controller of the RAID system finds that a first disk in the RAID system cannot respond to IO operations, it switches off the power supply of the first disk alone and starts a timer of a predetermined duration;
B. while the timer is running, the RAID system carries out read and write operations normally and records the stripe numbers of all stripes on which a write operation occurs during this period;
C. when the timer expires, the power supply of the first disk is switched on and the first disk is powered up;
D. after the first disk is powered up, a read/write test is performed on the first disk;
E. it is judged whether the first disk reads and writes normally; if so, step F is executed; otherwise, step G is executed;
F. according to the recorded stripe numbers of all stripes written while the first disk was powered off, the data in the corresponding stripes of the first disk is recovered, and the procedure ends once recovery is complete;
G. the first disk is marked as a bad disk, a second disk serving as the hot spare replaces the first disk, a calculation is performed on the data and parity of the other disks in the RAID system, and the results are written to the second disk.
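Steps A to G can be sketched as a single control function. This is a minimal illustration under stated assumptions, not the controller's actual implementation: `SimulatedDisk`, `handle_unresponsive_disk`, and the four callables passed in (`writes_until_timeout`, `read_write_test`, `resync_stripes`, `rebuild_onto_spare`) are hypothetical names, and the timer is modelled as an iterator that yields the stripe number of each write and ends when the predetermined duration has elapsed.

```python
from dataclasses import dataclass


@dataclass
class SimulatedDisk:
    """Hypothetical stand-in for a member disk in a slot with switchable power."""
    recovers_after_power_cycle: bool
    powered: bool = True
    marked_bad: bool = False

    def power_off(self):
        self.powered = False

    def power_on(self):
        self.powered = True


def handle_unresponsive_disk(disk, writes_until_timeout, read_write_test,
                             resync_stripes, rebuild_onto_spare):
    """Power-cycle an unresponsive disk and rebuild only if it stays bad."""
    disk.power_off()                          # step A: cut power, timer starts
    dirty = set()
    for stripe_no in writes_until_timeout():  # step B: array keeps serving IO
        dirty.add(stripe_no)                  # remember every written stripe
    disk.power_on()                           # step C: timer expired
    if read_write_test(disk):                 # steps D and E
        resync_stripes(disk, sorted(dirty))   # step F: repair dirty stripes only
        return "resynced"
    disk.marked_bad = True                    # step G: fall back to full rebuild
    rebuild_onto_spare(disk)
    return "rebuilt"


# Example run with a disk that recovers after the power cycle.
calls = []
outcome = handle_unresponsive_disk(
    SimulatedDisk(recovers_after_power_cycle=True),
    writes_until_timeout=lambda: iter([7, 3, 7]),
    read_write_test=lambda d: d.powered and d.recovers_after_power_cycle,
    resync_stripes=lambda d, stripes: calls.append(("resync", stripes)),
    rebuild_onto_spare=lambda d: calls.append(("rebuild",)),
)
print(outcome, calls)
```

Keeping the written stripe numbers in a set means that repeated writes to the same stripe cost one recovery operation in step F rather than several.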
Preferably, the read/write test comprises:
D1. check whether the first disk is online and has been loaded into the operating system by the driver; if it is not online, the first disk is a bad disk; if it is online, continue with step D2;
D2. send the SCSI command "TEST UNIT READY" to the disk to check whether the disk is ready for reading and writing; if it cannot be read and written, the disk is a bad disk; if it can, execute step D3;
D3. write the RAID metadata for the first disk, as recorded in the operating system, to the location on the disk where that metadata belongs; if the write fails, the first disk is judged to be a bad disk; if the write succeeds, continue with step D4;
D4. read the RAID metadata of the first disk; if the read succeeds, the first disk is confirmed to be a good disk; if the read fails, the first disk is judged to be a bad disk.
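The D1 to D4 sequence is a chain of checks that stops at the first failure. The sketch below shows that structure only; `run_read_write_test`, `FakeDisk`, and `checks_for` are hypothetical names, the D2 check merely stands in for issuing TEST UNIT READY, and comparing the metadata read back in D4 with what was written in D3 is an added assumption (the text only requires the read to succeed).

```python
def run_read_write_test(checks):
    """Run (label, check) pairs in order and stop at the first failure.

    Returns (True, None) for a good disk, or (False, label) naming the
    step at which the disk was judged bad.
    """
    for label, check in checks:
        try:
            passed = bool(check())
        except OSError:
            passed = False
        if not passed:
            return False, label
    return True, None


class FakeDisk:
    """Hypothetical in-memory disk used only to exercise the sequence."""

    def __init__(self, online=True, ready=True, writable=True):
        self.online = online
        self.ready = ready
        self.writable = writable
        self.metadata = None

    def write_metadata(self, blob):
        if not self.writable:
            raise OSError("metadata write failed")
        self.metadata = blob
        return True

    def read_metadata(self):
        if self.metadata is None:
            raise OSError("metadata read failed")
        return self.metadata


def checks_for(disk, metadata):
    """Build the D1..D4 checks for one disk."""
    return [
        ("D1", lambda: disk.online),                    # online and driver loaded
        ("D2", lambda: disk.ready),                     # stands in for TEST UNIT READY
        ("D3", lambda: disk.write_metadata(metadata)),  # write RAID metadata
        # Assumption: also compare the data read back with what was written.
        ("D4", lambda: disk.read_metadata() == metadata),
    ]


print(run_read_write_test(checks_for(FakeDisk(), b"raid-meta")))
print(run_read_write_test(checks_for(FakeDisk(ready=False), b"raid-meta")))
```

Ordering the checks from cheapest to most invasive means a disk that never comes back online is rejected before any write is attempted on it.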
As can be seen from the above technical solution, when a disk of the RAID system cannot respond to IO operations, it is first powered off. While it is powered off, the application layer is allowed to read and write the RAID normally, and the stripe numbers of all stripes written during this period are recorded. The disk is then powered on and tested to see whether it can read and write normally. If it can, the data in the corresponding stripes of the disk is recovered according to the recorded stripe numbers; otherwise, the disk is marked as a bad disk and the conventional rebuild process is started. In most cases this approach returns the disk of the RAID system to normal operation without a rebuild.
Description of drawings
Fig. 1 is a flowchart of a method for rebuilding a RAID provided by an embodiment of the present application.
Embodiment
In most cases, the fact that an upper-layer IO request cannot be answered by a member disk of the RAID system does not mean that the member disk is really damaged. According to statistics from the disk manufacturer Seagate, when a disk cannot respond to IO requests, 95% of cases are caused by software errors such as firmware or verification errors, and in these cases a simple repair operation leaves the disk still usable; only in 5% of cases is the disk really damaged. Therefore, starting the rebuild process for the RAID system when the disk is not really damaged greatly increases the operation and maintenance cost of the RAID system.
The present application provides a method for rebuilding a RAID. Its basic idea is as follows: each disk slot interface in the RAID system is provided with a circuit that powers the disk in that slot on and off independently. When a disk of the RAID system cannot respond to IO operations, it is first powered off. While it is powered off, the application layer is allowed to read and write the RAID normally, and the stripe numbers of all stripes written during this period are recorded. The disk is then powered on and tested to see whether it can read and write normally. If it can, the data in the corresponding stripes of the disk is recovered according to the recorded stripe numbers; otherwise, the disk is marked as a bad disk and the conventional rebuild process is started.
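Recording which stripes were written during the power-off period could look like the following minimal sketch. It assumes that writes arrive as a byte offset and length on the logical disk and that each stripe holds a fixed number of data bytes; `DirtyStripeLog` and its parameters are hypothetical names, and the 64 KiB chunk size in the example is arbitrary.

```python
class DirtyStripeLog:
    """Remembers the stripe numbers touched by writes while a disk is off."""

    def __init__(self, stripe_data_bytes):
        if stripe_data_bytes <= 0:
            raise ValueError("stripe_data_bytes must be positive")
        self.stripe_data_bytes = stripe_data_bytes
        self._dirty = set()

    def record_write(self, offset, length):
        """Mark every stripe overlapped by the byte range [offset, offset + length)."""
        if offset < 0 or length <= 0:
            raise ValueError("offset must be >= 0 and length must be > 0")
        first = offset // self.stripe_data_bytes
        last = (offset + length - 1) // self.stripe_data_bytes
        self._dirty.update(range(first, last + 1))

    def stripe_numbers(self):
        """Return the recorded stripe numbers in ascending order."""
        return sorted(self._dirty)


# Example: 3 data chunks of 64 KiB per stripe (4-disk RAID5, assumed sizes).
STRIPE = 3 * 64 * 1024
log = DirtyStripeLog(STRIPE)
log.record_write(0, 4096)            # falls inside stripe 0
log.record_write(STRIPE - 1, 2)      # straddles stripes 0 and 1
log.record_write(5 * STRIPE, STRIPE) # exactly covers stripe 5
print(log.stripe_numbers())
```

The log grows only with the number of distinct stripes written during the power-off period, which is what keeps the later recovery small compared with a full rebuild.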
To make the technical principles, features, and technical effects of the technical solution of the present application clearer, the technical solution is described in detail below with reference to a specific embodiment.
The flow of the method for rebuilding a RAID provided by an embodiment of the present application is shown in Fig. 1 and comprises the following steps:
Step 101: when the controller of the RAID system finds that a disk in the RAID system cannot respond to IO operations, it switches off the power supply of that disk alone, so that the disk is powered off, and starts a timer of a predetermined duration. This disk is referred to below as the first disk.
Step 102: while the timer is running (that is, while the first disk is powered off), the RAID system carries out read and write operations normally and records the stripe numbers of all stripes on which a write operation occurs during this period.
Step 103: when the timer expires, the power supply of the first disk is switched on and the first disk is powered up.
Step 104: after the first disk is powered up, a read/write test is performed on the first disk.
In this embodiment of the present application, the read/write test performs the following operations:
D1. check whether the first disk is online and has been loaded into the operating system by the driver; if it is not online, the first disk is a bad disk; if it is online, continue with step D2;
D2. send the SCSI command "TEST UNIT READY" to the disk to check whether the disk is ready for reading and writing; if it cannot be read and written, the disk is a bad disk; if it can, execute step D3;
D3. write the RAID metadata for the first disk, as recorded in the operating system, to the location on the disk where that metadata belongs; if the write fails, the first disk is judged to be a bad disk; if the write succeeds, continue with step D4;
D4. read the RAID metadata of the first disk; if the read succeeds, the first disk is confirmed to be a good disk; if the read fails, the first disk is judged to be a bad disk.
Step 105: judge whether the first disk reads and writes normally; if so, execute step 106; otherwise, execute step 107.
Step 106: according to the recorded stripe numbers of all stripes written while the first disk was powered off, recover the data in the corresponding stripes of the first disk; the procedure ends once recovery is complete.
Step 107: mark the first disk as a bad disk, replace the first disk with a second disk serving as the hot spare, perform a calculation on the data and parity of the other disks in the RAID system, and write the results to the second disk.
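The difference between Step 106 and Step 107 is the number of stripes that have to be recomputed. The sketch below illustrates the Step 106 path under simplifying assumptions: parity is byte-wise XOR, parity is kept on a fixed disk rather than rotated, and writes made while the first disk was off updated the parity so that each stripe still XORs to zero. The names `xor_blocks` and `resync_stripes` are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def resync_stripes(disks, returned, dirty_stripes):
    """Recompute only the listed stripes of the disk that came back.

    disks is a list of per-disk chunk lists; the chunks of the returned
    disk in dirty_stripes are stale and are overwritten in place.
    Returns the number of stripes repaired.
    """
    others = [disk for i, disk in enumerate(disks) if i != returned]
    for s in dirty_stripes:
        disks[returned][s] = xor_blocks([disk[s] for disk in others])
    return len(dirty_stripes)


# Four stripes on four disks; disk 3 holds parity in this simplified layout.
data = [[bytes([16 * d + s] * 4) for s in range(4)] for d in range(3)]
disks = data + [[xor_blocks([data[d][s] for d in range(3)]) for s in range(4)]]

# Disk 1 is powered off. A write to stripe 2 changes the chunk meant for
# disk 1; only the parity on disk 3 can be updated, so disk 1 stays stale.
new_chunk = b"NEW!"
disks[3][2] = xor_blocks([disks[0][2], new_chunk, disks[2][2]])

repaired = resync_stripes(disks, returned=1, dirty_stripes=[2])
print(repaired, disks[1][2] == new_chunk)
```

Here one stripe out of four is recomputed, whereas the Step 107 path would recompute all four onto the hot spare. On a real array the ratio depends on how much data is written during the timer period.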
The above is only a preferred embodiment of the present application and is not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the technical solution of the present application shall fall within the scope of protection of the present application.