WO1999023562A1 - Automatic backup based on disk drive condition

Info

Publication number: WO1999023562A1
Application number: PCT/US1998/023152
Authority: WO
Inventors: Mahmoud Assaf
Original assignee: Gateway, Inc.
Priority date: 1997-11-03
Filing date: 1998-10-30
Publication date: 1999-05-14
Also published as: CA2307212A1; AU1294099A; JP2001522089A

Abstract

Backup of a personal computer is automatically initiated in response to disk drive performance monitoring software which predicts impending failure or performance degradation and provides messages of such to a tape backup program. The tape backup program initiates a tape backup of data on the disk drive in response to information provided by the disk drive performance monitoring software, based on user-defined states of performance or other conditions which indicate an impending or possible failure. The tape backup program augments messages normally provided by the self-monitoring software by indicating that the disk drive is being backed up at a particular time, and also indicates the status of the backup and its completion.

Description

AUTOMATIC BACKUP BASED ON DISK DRIVE CONDITION

Field of the Invention

The present invention relates generally to computer systems, and in particular to automated backup of disk drive data based on the condition of the disk drive.

Background of the Invention

Hard disk drives are complex electro-mechanical devices which can suffer performance degradation or failure due to a single event or a combination of events. Some hard disk drive failures happen quickly and without advance warning. Such unpredictable failures can be caused by static electricity, handling damage, or thermal-related solder problems. Other hard disk drive failures result from the gradual degradation of the drive's ability to perform. Hard disk drive failures result in lost data and lost time to a user trying to recover the lost data.

One way to protect against data loss associated with hard disk drive failure is to use Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.). The failures that result from the degradation of performance are the type of failures that S.M.A.R.T. is designed to predict. S.M.A.R.T.-capable devices monitor a variety of information internal to the device to assess reliability and predict an impending device failure. For example, a S.M.A.R.T.-capable drive might monitor the fly height of the head above the magnetic media. If the head starts to fly too high or too low, it is likely that the drive could fail. Other drives may monitor different conditions, such as soft error rates, which are errors that occur sporadically and may not appear on successive attempts to read data. The monitoring techniques employed by S.M.A.R.T.-capable drives vary from one manufacturer to another.

When the S.M.A.R.T.-capable drive predicts an impending failure, the drive's S.M.A.R.T. capability makes information available through an interface to the disk drive. The information may be presented to a user via drivers and supporting applications. The information reaches an application that can display a warning message to a user. The user is responsible for reacting to the warning message as desired. Thus, present devices require the user, after a warning is given, to back up vital data and replace suspect devices prior to data loss or unscheduled down time.

However, a problem results if the user is not able to respond by backing up the data before the failure occurs. One such situation arises on workstations connected to a network if the user does not have the authority or the ability to back up the data and replace the drive. Failure of the hard disk drive results in lost data, lost time and in many cases lost money. Further problems may be caused when computers are constantly left running, such as overnight, when a user is not normally monitoring the computer. Several times during normal working hours, the user may also be away from a running computer. There is a need for addressing disk drive problems when the user is not available. There is a further need for enhancing system reliability when a user is not attending the system.

Summary of the Invention

Backup of data on a personal computer is automatically initiated in response to selected information provided by disk drive performance monitoring. In one embodiment, performance monitoring capabilities in a disk drive provide information on potential impending failure or performance degradation. The information is provided to an application such as a tape backup program. The tape backup program initiates a tape backup of data on the disk drive. The tape backup is initiated when the information is representative of predefined or user-defined states of performance or other conditions which indicate an impending or possible failure. The predefined states are defined to allow a normal backup prior to a predicted failure of the disk drive, and to ensure that the disk drive has sufficient performance to allow optimal data transfer rates during such a backup.

In one embodiment, the tape backup program augments information normally provided by the self-monitoring functions by indicating that the disk drive is being backed up at a particular time, and also indicates the status of the backup and its completion. If the user is not at the computer system, the tape backup program automatically begins the backup after first ensuring that suitable media, such as a tape, is in position in the tape drive. If not, it prompts the user to insert a tape. The tape backup program allows a user to continue working while data is backed up in real time to any writable media, such as tape, diskette or zip drive, until the potentially failing disk drive can be repaired. The backup program also allows a user to leave a system unattended, with some assurance that potential disk drive failures are likely to be detected and data backed up without user intervention.

In still further embodiments, other forms of nonvolatile storage devices are used as a backup device, such as another disk drive, or a writable CD ROM. In one variation, the disk drive is backed up via a network connection to a server or other device having suitable storage capabilities.

Brief Description of the Drawings

Figure 1 is a block diagram of a computer system employing the present invention.

Figure 2 is a block diagram of functional modules used in one embodiment of the present invention.

Figure 3 is a flowchart depicting steps followed by the functional modules in Figure 2 to detect a potential failure condition and initiate a backup of the data in the potentially failing device.

Figure 4 is a flowchart depicting steps followed to determine if a backup is required based on prior backup history.

Description of the Embodiments

In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that structural changes may be made without departing from the scope of the present invention. Therefore, the following detailed description is not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.

A block diagram of a computer system 100 in Figure 1 will be described with respect to the present invention. Further details of software modules implementing the invention will be described with reference to Figure 2, and flowcharts depicting details of the process implemented by the modules and computer system will be described in Figures 3 and 4.

Computer system 100 in one embodiment is a typical personal computer and comprises a processor 110 coupled to a memory 112 and system controller 114. The system controller is also coupled to the processor 110 and both the processor 110 and system controller 114 can access data in memory 112. The system controller 114 is also coupled to a host bus 116. Host bus 116 is also coupled to a plurality of peripheral devices comprising a disk drive 118, a tape drive 120, PCI device interface 122, a graphics controller 124 which is further coupled to a display device 126, and a keyboard/mouse controller 128 which in turn is coupled to a keyboard 130. All of these elements operate together in a well known manner, with software residing in memory 112 such as RAM, BIOS, DRAM or other memory being executed in processor 110. System controller 114 provides an interface to the peripheral devices, allowing data transfers between the peripheral devices and to and from memory 112 without data having to first be routed through processor 110.

Some of the programs that processor 110 executes include an operating system, application programs, peripheral device drivers and other modules or programs. Figure 2 is a block diagram in which the blocks represent program modules and devices. It shows the blocks involved in detecting potential failures in disk drive 118 and permitting backup of data on disk drive 118 onto tape drive 120. Predictive failure analysis functionality is provided on many disk drives that are available on the market today from disk drive vendors including IBM Corporation, Western Digital Corporation, Seagate and Quantum, to name a few. One industry standard for predictive failure analysis functionality is referred to as Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.), as indicated in block form at 210.

Information regarding the operational characteristics of the disk drive 118 is provided at registers which are then polled by BIOS/Driver 212 and provided to an application agent 214. Application agent 214 provides messages to a user regarding the status of the disk drive 118 and initiates a tape backup of data on the disk drive if it is determined that a failure of the disk drive is likely to occur within a set time. Application agent 214 first ensures that proper media 216 is available for use by the tape drive 120, and if not, will prompt a user to insert suitable media such as a tape. Application agent 214 then invokes operating system services 220 to start a backup program 222, which can be the same program as normally used to back up the disk drive 118. Backup program 222 initiates the backup, and data from the disk drive is transferred to the tape as represented by a bus 218, such as a PCI bus. It should be noted that backup program 222 can be used to cause backup to any suitable storage device, whether local or remote via a network. Application agent 214 serves as a router between the BIOS/Driver 212 and the operating system.

Analysis block 210 monitors a range of attributes and sends attribute and threshold information to application agent 214 via registers. In normal operation, analysis block 210 then decides if an alert is warranted, and sends that message to the system, along with the attribute and threshold information. The attribute and threshold level implementation varies with each disk drive vendor, and is based on historical failure analysis of data collected from information stored in disk drives that have failed. Attribute individualism is important because drive architectures vary from model to model. Attributes and thresholds that detect failure for one model may not be functional for another model.

Predictable failures are characterized by degradation of an attribute over time, before the disk drive fails. This creates a situation where attributes can be monitored, making predictive failure analysis possible. Many mechanical failures are typically considered predictable, such as the degradation of head flying height, which would indicate a potential head crash. Certain electronic failures may show degradation before failing, but more commonly, mechanical problems are gradual and predictable.

Though attributes are drive-specific, a variety of typical characteristics can be identified: head flying height, data throughput performance, spin-up time, re-allocated sector count, seek error rate, seek time performance, spin retry count, and drive calibration retry count, to name a few. Others may be used in various disk drives depending upon the design and historical failure information.

There are currently two S.M.A.R.T. specifications which are being implemented in disk drives. S.M.A.R.T. emerged for the ATA/IDE environment when SFF-8035 was placed in the public domain. SCSI drives incorporate a different industry standard specification, as defined in the ANSI-SCSI Informational Exception Control (IEC) document X3T10/94-190.

The S.M.A.R.T. system technology of attributes and thresholds is similar in ATA/IDE and SCSI environments, but the reporting of information differs. In an ATA/IDE environment, software on the host interprets the alarm signal from the drive generated by the "report status" command of S.M.A.R.T. Application agent 214 polls the drive on a regular basis to check the status of this command, and if it signals imminent failure, sends an alarm to the end user or system administrator. Application agent 214 evaluates the attributes and alarms reported, in addition to the "report status" command from the S.M.A.R.T. analysis block 210.

Generally speaking, SCSI drives with reliability prediction capability only communicate a reliability condition as either good or failing. In a SCSI environment, the failure decision occurs at the disk drive as represented at analysis block 210, which notifies the user and initiates tape backup. The SCSI specification provides for a sense bit to be flagged if the disk drive determines that a reliability issue exists.

APIs are provided to set ATA registers in ATA/IDE disk drives supporting S.M.A.R.T. via BIOS/Driver 212, which is a BIOS or driver capable of sending S.M.A.R.T. commands to and receiving S.M.A.R.T. data from the ATA interface registers. Application agent 214, such as a backup program, is provided on top of the BIOS or driver to allow a user to control the S.M.A.R.T. device and monitor the status of that device. Some subcommands and their respective codes include ENABLE/DISABLE ATTRIBUTE AUTOSAVE - code D2h, ENABLE S.M.A.R.T. OPERATIONS - code D8h, DISABLE S.M.A.R.T. OPERATIONS - code D9h, and RETURN S.M.A.R.T. STATUS - code DAh. The RETURN S.M.A.R.T. STATUS subcommand is used to retrieve status information from one or more ATA registers.
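The subcommand codes listed above can be collected into a small lookup for illustration. The following is a minimal sketch and not part of the disclosed embodiment. The names `SmartSubcommand` and `interpret_return_status` are hypothetical. Code D9h is shown as DISABLE S.M.A.R.T. OPERATIONS, as in the ATA command set. The register signature values (4Fh/C2h for a drive that has not crossed a threshold, F4h/2Ch for a threshold-exceeded condition) are an assumption drawn from the ATA specification rather than from this description.

```python
from enum import IntEnum


class SmartSubcommand(IntEnum):
    """S.M.A.R.T. subcommand codes named in the description (hypothetical wrapper)."""
    ENABLE_DISABLE_ATTRIBUTE_AUTOSAVE = 0xD2
    ENABLE_OPERATIONS = 0xD8
    DISABLE_OPERATIONS = 0xD9
    RETURN_STATUS = 0xDA


# Assumption: signature values per the ATA specification, not this description.
STATUS_OK = (0x4F, 0xC2)
STATUS_THRESHOLD_EXCEEDED = (0xF4, 0x2C)


def interpret_return_status(cyl_low: int, cyl_high: int) -> bool:
    """Return True if the two status registers signal a predicted failure."""
    signature = (cyl_low, cyl_high)
    if signature == STATUS_THRESHOLD_EXCEEDED:
        return True
    if signature == STATUS_OK:
        return False
    # Anything else is not a valid RETURN STATUS signature.
    raise ValueError(f"unexpected status signature: {cyl_low:#04x}, {cyl_high:#04x}")
```

An agent built along these lines would issue `SmartSubcommand.RETURN_STATUS` through the BIOS or driver and pass the two returned register values to `interpret_return_status`.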

In Figure 3, steps taken to monitor the status of the disk drive 118 and respond are detailed. These steps may be implemented entirely in a device driver, BIOS or an application program, or spread therebetween. Most implementations will provide for status polling in a driver or BIOS, with other steps implemented in an application program written in any number of high level languages such as C++. At 310, the drive registers or bit is polled. A polling interval can be user defined or preset. A shorter interval will provide a better chance of recovering if a failure is quick to develop, but it should be recognized that there are some modes of failure that are currently not predictable. The interval should be selected to ensure that significant system resources are not consumed by the polling and the further processing activity associated with each poll. At 312, the register value or values, which comprise information regarding the status of the disk drive and attributes such as those listed previously, are received and compared with predefined or user-defined values. In one embodiment, only the status of the disk drive is received, which in the case of SCSI devices is a single bit indicating a potential failure condition. If a potential failure condition is either received or deduced from the attributes at 320, messages indicating that such a failure condition is imminent are provided to the user or a system administrator at 322. If no failure condition is detected, control is returned to polling at 310.
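The polling and comparison steps 310 through 322 might be sketched as follows. This is an illustrative outline only. The function names, the attribute names, and the rule that an attribute is failing when its value is at or below its threshold are assumptions introduced for the example; the description leaves the comparison rule to the drive vendor and the user. The `max_polls` parameter exists only so that the sketch can terminate.

```python
import time
from typing import Callable, Dict, List, Optional


def failing_attributes(values: Dict[str, int], thresholds: Dict[str, int]) -> List[str]:
    """Step 312: compare reported attribute values with predefined values.

    Assumption: an attribute at or below its threshold indicates a potential failure.
    """
    return sorted(name for name, limit in thresholds.items()
                  if name in values and values[name] <= limit)


def poll_until_failure(read_drive: Callable[[], Dict[str, int]],
                       thresholds: Dict[str, int],
                       notify: Callable[[str], None],
                       interval_s: float = 60.0,
                       max_polls: Optional[int] = None) -> List[str]:
    """Poll the drive (310) until a potential failure is deduced (320)."""
    polls = 0
    while max_polls is None or polls < max_polls:
        failing = failing_attributes(read_drive(), thresholds)
        if failing:
            # Step 322: warn the user or system administrator.
            notify("Potential disk failure: " + ", ".join(failing))
            return failing
        polls += 1
        time.sleep(interval_s)  # user-defined or preset polling interval
    return []
```

Passing the reader and the notifier in as callables keeps the sketch independent of whether polling lives in a driver, the BIOS or an application program.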

Following detection of a potential failure condition, tape backup is attempted starting at 324, where the tape drive is checked for suitable media such as a tape cartridge. If no media is detected, the user is prompted to insert such media at 328 and a wait state is entered at 330 until such media is detected as present. Following the detection of media at 324, a normal tape backup operation is begun at 336. Such operations are well known in the art and in the past have been user initiated or periodically performed during normal operation. Status of the backup operation via messaging facilities is provided to the user as indicated at 338 either before or during the tape backup operation. When the tape backup is completed at 344, an indication of the completion is provided to the user prior to end 346.
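The media check and backup sequence of steps 324 through 344 can be outlined in a few lines. The sketch below is a hedged illustration, not the disclosed program: `BackupDevice`, `backup_on_failure` and the message strings are hypothetical, and the actual backup is delegated to whatever backup program is normally used.

```python
from typing import Callable, Protocol


class BackupDevice(Protocol):
    """Hypothetical interface to the tape drive and its backup program."""

    def has_media(self) -> bool: ...

    def run_backup(self) -> None: ...


def backup_on_failure(device: BackupDevice,
                      prompt: Callable[[str], None],
                      report: Callable[[str], None],
                      wait_for_media: Callable[[], None]) -> None:
    """Run the backup sequence once a potential failure has been detected."""
    if not device.has_media():              # 324: check for suitable media
        prompt("Insert a tape cartridge")   # 328: prompt the user
        while not device.has_media():       # 330: wait until media is present
            wait_for_media()
    report("Backup started")                # 338: status message
    device.run_backup()                     # 336: normal tape backup operation
    report("Backup complete")               # 344: completion message
```

The `wait_for_media` callable stands in for the wait state; in practice it might sleep for a short period or block on a drive event.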

Users interact with application agent 214 via functions provided in the flowchart of Figure 4. At block 410, the user is provided an interface via command, graphical user interface, menu-driven interface, voice or other constructs to enable or disable the automatic tape backup feature. At 412, the user is permitted to edit the backup criteria via a similar interface. This allows a user to attempt to ensure that the data throughput of the disk drive is still sufficient to provide data fast enough to keep the tape drive operating in a streaming mode. If the data transfer rate is too slow, the tape device may only be able to write one block at a time; if the second block is not immediately available, the device must stop and rewind after the first block and then try to resynchronize the tape to write the next block of data. Buffering techniques can be useful in ensuring that the tape drive operates in a streaming mode, but may not suffice if the performance of the disk drive has deteriorated too far.
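One way to express the streaming criterion as an editable backup criterion is a simple rate comparison. The sketch below is only an assumption about how such a check could look: the function name `can_stream`, the units, and the 20 percent default headroom are hypothetical and do not appear in the description.

```python
def can_stream(disk_rate_kb_s: int, tape_rate_kb_s: int, margin_percent: int = 20) -> bool:
    """Return True if the disk can feed the tape drive fast enough to keep it streaming.

    The margin is hypothetical headroom above the tape drive's streaming rate.
    Integer arithmetic avoids rounding surprises at the boundary.
    """
    required_x100 = tape_rate_kb_s * (100 + margin_percent)
    return disk_rate_kb_s * 100 >= required_x100
```

A user-defined threshold on the data throughput attribute could then be chosen so that the backup starts while `can_stream` still holds for the installed tape drive.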

The enable/disable and editing criteria interfaces may be combined into a single screen, which may also be combined with normal control of disk drive functions, such as via a control panel as is commonly used in personal computer operating environments or operating systems. At 418, previous backup information which has been stored is interrogated and if the drive has been recently backed up as determined at 422, the backup feature is disabled for a selected period of time. Following this time, which is user definable but defaulted to approximately 24 hours, the backup feature is enabled at 430. The user may also set values at 412 to indicate that the backup feature should not be automatically enabled. If the disk drive has been recently backed up at 422, control is returned at 432. The functions provided by blocks 418, 422 and 430 may also be performed on a periodic basis, which again can be user definable at 412.
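The history check of blocks 418 through 430 might be sketched as follows. This is a minimal illustration under stated assumptions: the names `backup_enabled`, `DEFAULT_HOLD_OFF` and `auto_enable` are hypothetical, and treating a drive with no stored backup history as eligible for backup is an assumption the description does not address.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default disable period of approximately 24 hours, user definable at 412.
DEFAULT_HOLD_OFF = timedelta(hours=24)


def backup_enabled(last_backup: Optional[datetime],
                   now: datetime,
                   hold_off: timedelta = DEFAULT_HOLD_OFF,
                   auto_enable: bool = True) -> bool:
    """Decide whether the automatic backup feature is currently enabled."""
    if last_backup is None:
        # Assumption: with no stored history (418) the feature stays enabled.
        return True
    if now - last_backup < hold_off:
        # 422: recently backed up, so the feature is disabled for the period.
        return False
    # 430: re-enable after the period, unless the user opted out at 412.
    return auto_enable
```

For example, a drive backed up two hours ago would not trigger another automatic backup under the default period, while one backed up two days ago would.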

CONCLUSION

A system for providing automatic backup of disk drive data upon detection of potential future failure of the disk drive has been described. It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. While the invention makes use of the predictive failure analysis capabilities described in S.M.A.R.T., other predictive failure analysis capabilities may also be used to provide an automated backup function. Such capabilities can also easily be integrated into other types of devices which store data and whose potential failure can be predicted, such as CD ROM devices and other devices which may not yet even be invented. Further, while tape drives have been specified in the embodiments described as the backup device, other devices may also be used, such as semiconductor memory devices, or even other disk drives on the same computer system or on a server or other networked computer or storage facility. Many of the functions provided by BIOS or the application can be provided by software, hardware or firmware as is well known to those skilled in the art, and the location of the provider of the functions is also a matter of well-known design choice. Further, the present invention could be incorporated with other computer systems, such as portable computers, servers, midrange computers or other computers.

Claims

What is claimed is:

1. A back up memory system comprising: a poller that polls a disk drive device which provides information regarding the status of the device; a comparator that compares the information with predetermined values; and a backup initiator that initiates a backup of data stored on the disk drive device.

2. The memory system of claim 1 and further comprising device registers which provide the information.

3. The memory system of claim 1 and further comprising a tape drive device coupled to the disk drive device.

4. The memory system of claim 3 and further comprising a messaging system that prompts a user of the memory system to provide suitable media for the tape drive device prior to initiating a backup of data stored on the disk drive device.

5. A computer system comprising: a disk drive device having integrated performance monitoring and status reporting capability; a tape drive device coupled to the disk drive device; a polling module that polls the disk drive to determine the current status of the disk drive device; and a tape drive module that initiates backup of data on the disk drive onto suitable media in the tape drive device based on the status of the disk drive device.

6. The computer system of claim 5 and further comprising a messaging system that prompts a user of the memory system to provide suitable media for the tape drive device prior to initiating a backup of data stored on the disk drive device.

7. The computer system of claim 5 wherein the disk drive device comprises a register, and wherein the polling module polls the register to determine the current status of the disk drive device.

8. The computer system of claim 5 wherein the tape drive module comprises a software application program.

9. The computer system of claim 5 and further comprising an interface module that provides a computer system user the ability to enable and disable initiation of backup of data.

10. A computer readable media comprising a computer program that when executed by a suitably configured computer system causes the computer system to perform the steps comprising: polling a disk drive device which provides information regarding the status of the device; comparing the information with selected values; and initiating a backup of data stored on the disk drive device.

11. The computer readable media of claim 10, wherein the computer program causes the computer to further perform the steps comprising: providing a user interface to edit the selected values.

12. The computer readable media of claim 11, wherein the computer program causes the computer to further perform the steps comprising: providing a user interface to permit a user to enable and disable initiating backup of data stored on the disk drive device regardless of the comparison of the information to the selected values.

13. The computer readable media of claim 10, wherein the computer program causes the computer to further perform the steps comprising: checking a backup device for suitable media prior to initiating the backup of data stored on the disk drive device.

14. The computer readable media of claim 10, wherein the computer program causes the computer to further perform the steps comprising: providing a plurality of messages regarding status of the backup and disk drive.

15. A computer system comprising: a processor coupled to a memory; a system controller coupled to the processor and to the memory; a system bus coupled to the system controller; a display coupled to the system bus; a disk drive device coupled to the system bus and having integrated performance monitoring and status reporting capability; a tape drive device coupled to the system bus; a polling module that polls the disk drive to determine the current status of the disk drive device; and a tape drive module that initiates backup of data on the disk drive onto suitable media in the tape drive device based on the status of the disk drive device.

16. The computer system of claim 15 and further comprising a messaging system that prompts a user of the memory system to provide suitable media for the tape drive device prior to initiating a backup of data stored on the disk drive device.

17. The computer system of claim 16 wherein the messaging system further notifies a user of the status of the backup of data.

18. The computer system of claim 15 wherein the tape drive module stores a history of backup activity and disables initiation of backup of data if a previous backup has been performed within a certain period of time.

19. The computer system of claim 15 wherein the disk drive device comprises a register, and wherein the polling module polls the register to determine the current status of the disk drive device.

20. The computer system of claim 15 and further comprising an interface module that provides a computer system user the ability to enable and disable initiation of backup of data.