WO2000060463A1 - Background synchronization for fault-tolerant systems - Google Patents

Background synchronization for fault-tolerant systems

Info

Publication number
WO2000060463A1
Also published as WO 00/60463 A1; PCT application PCT/US2000/008940 (US0008940W)
Authority
WO
WIPO (PCT)
Prior art keywords
memory, active, data, inactive, regions
Prior art date
Application number
PCT/US2000/008940
Other languages
French (fr)
Inventor
Thomas D. Bissett
Paul A. Leveille
Erik Muench
Christopher C. Lord
Original Assignee
Marathon Technologies Corporation
Priority date
Filing date
Publication date
Application filed by Marathon Technologies Corporation
Priority to EP00921672A (published as EP1169676A1)
Priority to AU41959/00A (published as AU4195900A)
Priority to CA002369932A (published as CA2369932A1)
Publication of WO2000060463A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/16: Error detection or correction of the data by redundancy in hardware
    • G06F 11/1658: Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit
    • G06F 11/14: Error detection or correction of the data by redundancy in operation
    • G06F 11/1402: Saving, restoring, recovering or retrying
    • G06F 11/1446: Point-in-time backing up or restoration of persistent data
    • G06F 11/1458: Management of the backup or restore process
    • G06F 11/1461: Backup scheduling policy
    • G06F 11/1466: Management of the backup or restore process to make the backup process non-disruptive
    • G06F 11/20: Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F 11/2097: Maintaining the standby controller/processing unit updated
    • G06F 11/2053: Where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F 11/2056: Redundant persistent mass storage by mirroring
    • G06F 11/2071: Mirroring using a plurality of controllers

Definitions

  • A heuristic approach may be used to provide background synchronization without requiring a custom memory controller.
  • The heuristic approach attempts to perform the foreground part of the synchronization in less than the permitted maximum synchronization time (e.g., six seconds). The background task is not limited in time or in the number of memory sweeps performed.
  • Referring to Fig. 1, the heuristic approach operates according to a procedure 100. Initially, the chances that the synchronization will complete successfully are evaluated (step 105). As discussed in more detail below, this evaluation may involve a comparison of the rate at which memory is copied to the rate at which memory is modified.
  • If there is no chance of success (step 110), the evaluation is repeated (step 105), with the hope that system conditions will have changed in a way that permits synchronization to complete successfully. For example, the evaluation may be repeated upon expiration of a fixed delay period. If there is some chance of success, a memory copy list is created (step 115). Next, all blocks of memory in the memory copy list are copied to the synching processor using a background process (step 120).
  • A new memory copy list then is created (step 125). This new list identifies all memory blocks that have been modified since the last memory copy list was created. From the new memory copy list, the time required to perform a foreground synchronization process is estimated (step 130).
  • If the estimated time for foreground synchronization is less than the permitted maximum (e.g., six seconds) (step 135), then the memory copy is completed using foreground synchronization (step 140). After the memory copy is completed, the processor context is copied to the synching processor in the foreground to complete the synchronization procedure (step 140).
  • If the estimated time for foreground synchronization is greater than the permitted maximum (step 135), and the background memory copy has not been attempted a maximum number of times (step 150), all blocks of memory in the new memory copy list are copied to the synching processor using the background process (step 120), and the procedure then proceeds as described above.
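  • The loop structure of procedure 100 can be pictured with a small, self-contained C simulation. All rates, sizes, and the pass cap below are invented for illustration; the patent does not prescribe an implementation language or specific values:

      #include <stdio.h>

      int main(void) {
          double memory_mb  = 1024.0;  /* total memory to synchronize (assumed) */
          double copy_mbps  = 16.0;    /* background copy bandwidth (assumed)   */
          double dirty_mbps = 4.0;     /* rate at which memory is modified      */
          double fg_limit_s = 6.0;     /* permitted foreground window           */
          int    max_passes = 16;      /* cap on background sweeps (assumed)    */

          double to_copy = memory_mb;  /* step 115: first list covers all memory */
          for (int pass = 1; pass <= max_passes; pass++) {
              double pass_s = to_copy / copy_mbps;    /* step 120: one sweep    */
              to_copy = dirty_mbps * pass_s;          /* step 125: dirtied data */
              double fg_s = to_copy / copy_mbps;      /* step 130: estimate     */
              printf("pass %2d: foreground estimate %.2f s\n", pass, fg_s);
              if (fg_s < fg_limit_s) {                /* step 135               */
                  printf("finish in foreground (step 140)\n");
                  return 0;
              }
          }
          printf("gave up after %d passes (step 150)\n", max_passes);
          return 1;
      }

  • With these example numbers, the foreground estimate falls from 16 s after the first pass to 4 s after the second, so the simulated synchronization finishes in the foreground on pass two.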
  • A memory block structure is used to track the areas of memory that have been modified and to create the memory copy lists.
  • The memory block structure may be provided in several different ways. The most basic way is a page table structure. This structure is provided by processors currently available from Intel Corporation. Each page table entry ("PTE") includes a dirty bit that is set when a memory location of the page is modified. The dirty bit is set by hardware, and is not altered by the operating system.
  • The PTEs can be used to track which pages of memory are modified while the background memory copy is performed.
  • The PTEs provide a very detailed list of modified pages. A one-gigabyte memory comprises more than 260,000 4-KB pages, each with a page table entry in every address space that maps it, so the number of entries to scan can approach one million. This list may be too detailed to use directly for synchronization purposes.
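  • One way to keep such a list manageable, sketched below in C, is to coalesce runs of adjacent dirty pages into extents before building the copy list. The extent structure and function names are hypothetical, not taken from the patent:

      #include <stdio.h>

      typedef struct { size_t first_page, n_pages; } extent_t;

      /* Turn a sorted list of dirty page numbers into contiguous extents. */
      size_t coalesce(const size_t *dirty, size_t n, extent_t *out) {
          size_t m = 0;
          for (size_t i = 0; i < n; i++) {
              if (m && dirty[i] == out[m - 1].first_page + out[m - 1].n_pages)
                  out[m - 1].n_pages++;               /* extend current extent */
              else
                  out[m++] = (extent_t){dirty[i], 1}; /* start a new extent    */
          }
          return m;
      }

      int main(void) {
          size_t dirty[] = {4, 5, 6, 9, 10, 42};      /* sorted dirty page list */
          extent_t ext[6];
          size_t m = coalesce(dirty, 6, ext);
          printf("%zu extents from 6 dirty pages\n", m);   /* prints 3 */
          return 0;
      }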
  • An alternate approach is to add a tracking mechanism to the memory control section of the motherboard chip set.
  • A section of memory (part of system memory, on-chip memory, or dedicated external memory) can be allocated to a modified block structure. Whenever a block of memory is written, a corresponding flag in the modified block structure is updated. Between passes of the background copy process, a snapshot of the modified block structure is made to control the next pass of the background copy process.
  • Each background copy pass requires scanning the structure and setting up transfers (i.e., creating the memory copy list). The transfer time for each block is directly related to the block size, while the scan time is inversely related to the block size.
  • The block modification rate is monitored to determine whether a background synchronization procedure will ever finish. For example, if two blocks are modified during the time required to copy one block, then the background synchronization will never finish. A simple rate test captures this condition, as in the sketch below.
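  • A minimal version of that test, with invented rates, might look like this in C:

      #include <stdbool.h>
      #include <stdio.h>

      /* Background synchronization can converge only if blocks are copied
       * faster than they are dirtied. */
      bool can_converge(double copy_blocks_per_s, double dirty_blocks_per_s) {
          return dirty_blocks_per_s < copy_blocks_per_s;
      }

      int main(void) {
          /* Two blocks dirtied per block copied: never finishes. */
          printf("copy 4/s, dirty 8/s -> %s\n",
                 can_converge(4, 8) ? "ok" : "never finishes");
          printf("copy 8/s, dirty 4/s -> %s\n",
                 can_converge(8, 4) ? "ok" : "never finishes");
          return 0;
      }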
  • To create the first memory copy list (step 115), a list of all blocks of memory is created. To create each subsequent memory copy list (step 125), a list of all modified blocks of memory is created. The PTEs or the modified block structure are cleared after each memory copy list is created.
  • The procedure is not guaranteed to complete. If the block modification rate is too high, there will always be more modified memory blocks to copy than can be copied in the permitted foreground copy time. Several techniques may be used in this situation. First, the bandwidth that the background copy task is allowed to consume may be increased; a rate-limited sweep is sketched below. Second, any running applications may be restricted down to their minimum working memory sets. Third, rudimentary data compression may be performed on the memory images before transfer. Particular implementations may use one or more of these techniques.
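  • The first mitigation, controlling the bandwidth the background copy consumes, can be sketched as a chunked sweep with idle gaps. The chunk size, the copy_chunk stand-in, and the budget are assumptions for illustration; raising the budget is what "increasing the bandwidth" amounts to here:

      #include <stdio.h>
      #include <time.h>    /* POSIX nanosleep */

      static void copy_chunk(size_t offset, size_t len) {
          (void)offset; (void)len;            /* stand-in for a MIC DMA transfer */
      }

      static void background_sweep(size_t total_bytes, double budget_mbps) {
          const size_t CHUNK = 256 * 1024;    /* 256 KB per burst (assumed) */
          double secs = (double)CHUNK / (budget_mbps * 1e6);
          struct timespec gap = { .tv_sec  = (time_t)secs,
                                  .tv_nsec = (long)((secs - (time_t)secs) * 1e9) };
          for (size_t off = 0; off < total_bytes; off += CHUNK) {
              size_t len = total_bytes - off < CHUNK ? total_bytes - off : CHUNK;
              copy_chunk(off, len);
              nanosleep(&gap, NULL);          /* idle gap yields bandwidth to applications */
          }
      }

      int main(void) {
          background_sweep(16u * 1024 * 1024, 16.0);   /* 16 MB at a 16 MB/s budget */
          printf("sweep complete\n");
          return 0;
      }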
  • As another alternative, the synchronization process can allocate large chunks of system memory to itself. This memory then can be zeroed in both processors, so that it does not need to be copied.
  • This synchronization allocation restricts the number of memory pages available for applications to modify.
  • The synchronization process starts with a small set of pages and expands the number of pages until the operating system prevents the process from taking any more.
  • The synchronization process runs as a driver under the operating system, which means that the process is allowed to lock down pages of memory so that they are never sent out to disk storage. Accordingly, any pages given to the synchronization process represent pages of physical memory that do not have to be copied.
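  • A user-space analogue of this "balloon" growth, with malloc/free standing in for the driver-level page locking the text describes (an assumption for illustration only), is:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define PAGE 4096

      int main(void) {
          static void *balloon[1 << 16];
          size_t held = 0, limit = 1 << 14;   /* stop at 64 MB for the demo */
          while (held < limit) {
              void *p = malloc(PAGE);
              if (!p) break;                  /* allocator refuses: balloon is full */
              memset(p, 0, PAGE);             /* zeroed: nothing left to copy       */
              balloon[held++] = p;
          }
          printf("ballooned %zu pages (%zu MB excluded from the copy)\n",
                 held, held * PAGE / (1024 * 1024));
          while (held) free(balloon[--held]); /* deflate after synchronization */
          return 0;
      }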
  • In this way, the physical memory available to running applications is compressed until the applications are close to their minimum working sets.
  • In addition, the memory data to be copied may be compressed before a copy.
  • A benefit of such compression is that the processor-to-memory bandwidth is much greater than the memory-to-I/O bandwidth. As a result, a simple compression scheme can reduce the total copy time significantly.
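  • As an illustration of how cheap such a scheme can be, the following C sketch run-length encodes runs of zero words before they would be pushed across the slower I/O link. The encoding format is invented here, not taken from the patent:

      #include <stdint.h>
      #include <stdio.h>

      /* Encode: 0 <count> for a run of zero words, 1 <word> for a literal.
       * Returns the number of uint32 slots written to `out`. */
      size_t zero_rle(const uint32_t *in, size_t n, uint32_t *out) {
          size_t o = 0;
          for (size_t i = 0; i < n; ) {
              if (in[i] == 0) {
                  uint32_t run = 0;
                  while (i < n && in[i] == 0) { run++; i++; }
                  out[o++] = 0; out[o++] = run;
              } else {
                  out[o++] = 1; out[o++] = in[i++];
              }
          }
          return o;
      }

      int main(void) {
          uint32_t page[1024] = {0};           /* mostly-zero "page" */
          page[10] = 0xdeadbeef; page[900] = 42;
          uint32_t enc[2048];                  /* worst case: 2x expansion */
          size_t n = zero_rle(page, 1024, enc);
          printf("1024 words -> %zu words (%.1f%%)\n", n, 100.0 * n / 1024);
          return 0;
      }

  • On a mostly-zero page the encoded size is around one percent of the original, so the extra processor work is easily recovered on the slower link.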
  • The procedure 100 may be implemented by a fault tolerant system 200, such as the Endurance® 4000 system, which is available from Marathon Technologies Corporation of Boxboro, Massachusetts.
  • The system operates multiple instances of a compute element in instruction stream lockstep, which means that the multiple instances of a compute element perform the same sequence of instructions in the same order.
  • By contrast, fully phase-locked operation, which also may be referred to as clock lockstep operation, occurs when multiple instances of a compute element perform the same sequence of instructions in the same order, with each instruction being performed in the same clock cycle by each instance of the compute element.
  • The instances of a compute element in the system 200 operate in instruction stream lockstep. The time needed to execute the instruction stream varies due to the uncontrolled past history of each compute element. For example, caches, table look-ahead buffers, branch prediction logic, speculative execution logic, and execution pipelines of the compute elements can have different initial values, which, even though the instruction streams being executed are the same, result in varying execution times.
  • Clock lockstep operation may be achieved by using a common oscillator to provide clocks to all instances of the compute element. However, a common oscillator may be unsuited for fault tolerant operation because it includes a single component, the common oscillator, the failure of which will cause failure of the entire system.
  • Emulated clock lockstep operation avoids this single point of failure and is achieved using the techniques described below. Emulated clock lockstep operation offers the considerable additional benefit of permitting the different instances of a compute element to be separated by distances of up to a kilometer or more.
  • In general, all computer systems perform two basic operations: (1) manipulating and transforming data, and (2) moving data to and from mass storage, networks, and other I/O devices.
  • The system 200 divides these functions, both logically and physically, between two separate processors. To this end, each half of the system 200 includes a compute element ("CE") 205 and an I/O processor ("IOP") 210.
  • The compute element 205 processes user application and operating system software, and implements the procedure 100.
  • I/O requests generated by the compute element 205 are redirected to the I/O processor 210. This redirection is implemented at the device driver level.
  • The I/O processor 210 provides I/O resources, including I/O processing, data storage, and network connectivity. The I/O processor 210 also controls synchronization of the compute elements.
  • The system 200 is fault tolerant in that it continues to operate transparently to its users in the presence of any single hardware failure.
  • The system 200 emulates a traditional computing environment by partitioning the environment into two components. The compute element 205 handles all computing tasks for the operating system and any applications, while the I/O processor 210 handles all I/O devices. Thus, the I/O processor 210 handles all of the asynchronous activities associated with a computer, while the compute element 205 handles all of the synchronous computing activities.
  • Referring to Fig. 2, the system 200 includes at least two compute elements 205 and at least two I/O processors 210. The two compute elements 205 operate in lockstep while the two I/O processors 210 are loosely coupled. The I/O processors 210 feed both compute elements 205 the exact same data at a controlled place in the instruction streams of the compute elements. The I/O processors 210 verify that the compute elements 205 generate the same I/O operations and produce the same output data at the same time. The I/O processors 210 also cross check each other for proper completion of requested I/O activity.
  • The system 200 uses a software-based approach in a configuration based on inexpensive, industry standard processors. For example, the compute elements 205 and I/O processors 210 may be implemented using Pentium Pro processors available from Intel Corporation. The system may run unmodified, industry-standard operating system software, such as the Windows NT operating system available from Microsoft Corporation, as well as industry-standard applications software.
  • Each compute element 205 includes a processor 215, memory 220, and an interface card 225 (also referred to as a Marathon interface card, or MIC). The interface card 225 includes drivers for communicating with two I/O processors simultaneously, as well as comparison and test logic that assures that results received from the two I/O processors are identical. The interface card 225 of each compute element 205 is connected by high speed links 230, such as fiber optic links, to the interface cards 225 of the two I/O processors 210. The interface cards 225 may be implemented as PCI-based adapters.
  • Each I/O processor 210 includes a processor 215, memory 220, an interface card 225, and I/O adapters 235 for connection to I/O devices such as a hard drive 240 and a network 245. The interface card 225 of each I/O processor 210 is connected by high speed links 230 to the interface cards 225 of the two compute elements 205. In addition, a high speed link 250, such as a private Ethernet link, is provided between the two I/O processors 210.
  • All I/O task requests from the compute elements 205 are redirected to the I/O processors 210 for handling. Each I/O processor 210 runs specialized software that handles all of the fault handling, disk mirroring, system management, and resynchronization tasks required by the system 200.
  • Because each I/O processor 210 runs a multitasking operating system, such as Windows NT, the I/O processor 210 also may run other, non-fault tolerant applications.
  • For example, a compute element may run Windows NT Server as an operating system while, depending on the way that the I/O processor is to be used, an I/O processor may run either Windows NT Server or Windows NT Workstation as an operating system.
  • The two compute elements 205 run lockstep control software, also referred to as quantum synchronization software, and execute the operating system and the applications in emulated clock lockstep. Disk mirroring takes place by duplicating writes on the disks 240 associated with each I/O processor 210. If one of the compute elements 205 should fail, the other compute element 205 keeps the system running with a pause of only a few milliseconds to remove the failed compute element 205 from the configuration. The failed compute element 205 then can be physically removed, repaired, reconnected, and turned on. The repaired compute element then is brought back automatically into the configuration by transferring the state of the running compute element to the repaired compute element over the high speed links 230 and resynchronizing using, for example, the procedure 100. The states of the operating system and applications are maintained through the few seconds it takes to resynchronize the two compute elements 205 so as to minimize any impact on system users.
  • If an I/O processor 210 fails, the other I/O processor 210 continues to keep the system running. The failed I/O processor then can be physically removed, repaired, and activated. Since the I/O processors are not running in lockstep, the repaired system may go through a full operating system reboot, and then may be resynchronized. After being resynchronized, the repaired I/O processor automatically rejoins the configuration and the mirrored disks are re-mirrored in background mode over the private connection 250 between the I/O processors 210. A failure of one of the mirrored disks is handled through the same process.
  • The connections to the network 245 also are fully redundant. Network connections from each I/O processor 210 are booted with the same address. Only one network connection is allowed to transmit messages, while both are allowed to receive messages. In this way, each network connection monitors the other through the private Ethernet 250. Should either network connection fail, the I/O processors 210 will detect the failure and the remaining connection will carry the load. The I/O processors 210 notify the system manager in the event of a failure so that a repair can be initiated. While Fig. 2 shows both connections on a single network segment, this is not a requirement. Each I/O processor's network connection may be on a different segment of the same network. The system also accommodates multiple networks, each with its own redundant connections.
  • It is preferred that the connection between the tuples be optical fiber or a connection having compatible speed. With such a connection, the tuples may be spaced by distances of a kilometer or more. Since the compute elements are synchronized over this distance, the failure of a component or a site will be transparent to the users.
  • Fig. 3 provides a summarized view of the system 200 of Fig. 2. The system includes redundant compute elements 205 ("CEs") and I/O processors 210 ("IOPs"). Each CE 205 is responsible for all computing and may be implemented using an industry standard motherboard. Each IOP 210 is responsible for access to I/O devices, and for system control.
  • The IOPs 210 run asynchronously of each other and verify that the CEs 205 are performing the same operations in the same order. The IOPs 210 also track each other's I/O completion to ensure that no I/O is lost.
  • The CEs 205 generate the same outputs in the exact same sequence, and run in emulated clock lockstep, even though the CE clocks are asynchronous to each other. To this end, the CEs 205 are initialized to the same state and are fed consistent inputs at exactly the same time.
  • The CEs 205 are periodically realigned using a self-generated interrupt that is related to the occurrence of a quantum of clock cycles (e.g., 100,000 clock cycles) and is referred to as a quantum interrupt ("QI"). All inputs to the CEs 205 are delivered at either an output window or after the completion of an instruction quantum. Both of these points are guaranteed to occur at the same point in the instruction streams of the CEs 205.
  • The approach employed by the system 200 is described in U.S. Patent Nos. 5,600,784 and 5,615,403, both of which are incorporated by reference.
  • From time to time, a CE must be synchronized back into the system 200 following removal of the CE. The CE may have been removed for any number of reasons: a transient failure, a hard failure and repair, or even a scheduled removal.
  • A foreground synchronization procedure may be used to transfer the static state of a suspended CE to a synchronizing CE. A large part of this procedure involves the transfer of CE main memory.
  • The CE operating system is suspended for the duration of the foreground synchronization procedure. This suspension is visible to users, since applications and network communications are temporarily stopped. The CE operating system resumes operation after the synchronization procedure has completed. The temporary suspension, however, may cause network session timeouts or exceed a user's requirements for application dead time. Network connections are able to survive a full foreground synchronization on systems that adhere to the 128 MB guideline for CE memory capacity. At 16 MB/s, foreground synchronization of 128 MB is typically completed in approximately eight seconds. Users with more than this amount of memory are advised to disable automatic CE synchronization and, as necessary, to initiate the synchronization procedure at a convenient time of day or night. Although this may be a viable work-around for some users, it is unacceptable to others.
  • One of the major benefits of the system 200 is its hands-off operation. Components are automatically removed, joined (IOPs), mirrored, and synchronized as necessary to maintain a high level of availability. This benefit cannot be fully realized by users that need to run with larger CE memory sizes, yet have connectivity or real-time-like constraints. These users will need to disable automatic CE synchronization and attempt to find a safe time to manually initiate a resynchronization.
  • Permitted memory sizes may be increased by increasing the speed of the CE-to-CE interconnect (i.e., the MIC). However, as shown in Table 1, modest speed improvements alone are not likely to reduce the foreground synchronization time to acceptable levels. More aggressive interconnect speeds are possible, but only at much higher cost or by imposing distance restrictions. The MB/s rates are used here for illustrative purposes only.
  • The ideal CE synchronization time is based on the worst-case session timeout period for a protocol, such as TCP/IP, used by the system. In general, such a protocol will sustain connections over longer periods of silence, but the exact time tolerated is determined and adjusted dynamically by the protocol stacks. Tighter limits may be imposed by users with real-time or substantially-real-time application requirements.
  • As used here, background synchronization refers to the process of transferring portions of a running CE's memory context to a synchronizing CE, without suspension of the operating system. The CE operating system and applications are unaware of this controlled-rate transfer, although the transfer does consume some portion of the MIC's available bandwidth. The CE operating system continues to run applications and service network clients, with some tolerable level of degraded performance.
  • When the background transfer is as complete as it can be, the CE operating system is suspended and foreground synchronization is performed to transfer areas modified during the background synchronization.
  • It is possible that the CE operating system workload profile can outpace the background memory transfer such that the ensuing foreground synchronization will not complete within the desired target time interval. In that case, software can pre-determine the foreground synchronization time and choose to abort the synchronization process if user-selected limits are exceeded. This allows automatic CE synchronization features to remain enabled, ensuring that network and application timeout limits will not be exceeded.
  • The goals of background synchronization are to allow all users to run the system with automatic CE synchronization enabled, and to ensure that foreground synchronization will never exceed a time limit established by the user.
  • Referring to Fig. 4, one approach to a software-only implementation of background synchronization employs the Pentium® architecture's "dirty" indicators. These indicators are provided for each page table entry and can be used to track processor modifications to memory.
  • The procedure 400 includes the same steps as the procedure 100 of Fig. 1, and adds two additional steps. First, prior to performing a background copy of all pages to the target CE (i.e., after step 115 and before step 120), the various page-table-maintained indicators are set or cleared (step 405). Second, after the background copy, these indicators are rechecked to determine which pages were modified during the copy (step 410) and, therefore, must be copied again during a subsequent background copy or during a foreground completion phase. A sketch of these two steps follows.
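  • In C-like form, with the page tables modeled as a plain array (real code would need kernel-level access to the live page tables, and clearing dirty bits would also require TLB maintenance), steps 405 and 410 might look like:

      #include <stdint.h>
      #include <stdio.h>

      #define PTE_DIRTY 0x40u   /* bit 6 of an x86 page table entry */

      void clear_dirty(uint32_t *ptes, size_t n) {              /* step 405 */
          for (size_t i = 0; i < n; i++) ptes[i] &= ~PTE_DIRTY;
      }

      size_t list_dirty(const uint32_t *ptes, size_t n, size_t *out) { /* step 410 */
          size_t c = 0;
          for (size_t i = 0; i < n; i++)
              if (ptes[i] & PTE_DIRTY) out[c++] = i;
          return c;
      }

      int main(void) {
          uint32_t ptes[4] = {0x47, 0x07, 0x47, 0x07};
          size_t dirty[4];
          clear_dirty(ptes, 4);
          /* ... background copy runs here; hardware sets bits as pages change ... */
          ptes[2] |= PTE_DIRTY;               /* simulate a write to page 2 */
          size_t n = list_dirty(ptes, 4, dirty);
          printf("pages to recopy: %zu (first = %zu)\n", n, dirty[0]);
          return 0;
      }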
  • Advantages of this approach are that it works on existing Pentium-based systems, it is not tied to hardware chip schedules, it permits page-level granularity of memory modifications, and it should provide convergence on an acceptable foreground synchronization time under most loads.
  • Potential drawbacks of this approach are that its implementation is operating system-specific, and that page tables track only processor-originated memory modifications, and do not track adapter direct memory access ("DMA"). This approach also is sensitive to operating system process context switching (e.g., page directory and PTE management), and may require access to the operating system source code to understand context-switch issues. Finally, this approach raises the risk that unforeseen operating system behaviors, and future operating system changes, may cause problems.
  • Referring to Fig. 5, a balloon zeroing procedure 500 provides a simpler approach to a software-only implementation. Instead of performing a background copy of all memory, this implementation begins by having the target CE clear all of its physical memory (step 505). The master CE then drops into a foreground synchronization mode and transfers only pages that contain non-zero data (step 510). This approach capitalizes on the observation that zero-filled pages are quite common in most virtual-memory operating systems. In addition, cooperative threads can be used to forcibly zero out a large portion of memory just prior to attempting the foreground synchronization. This pre-zeroing effort is referred to as balloon zeroing.
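  • The step-510 scan reduces to a per-page zero test, sketched here with a toy in-memory array standing in for CE physical memory:

      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      #define PAGE 4096

      static int page_is_zero(const uint8_t *p) {
          static const uint8_t zero[PAGE];     /* all-zero reference page */
          return memcmp(p, zero, PAGE) == 0;
      }

      int main(void) {
          static uint8_t mem[64 * PAGE];       /* toy 256 KB "memory" */
          mem[5 * PAGE + 1] = 0xff;            /* one non-zero page   */
          size_t sent = 0;
          for (size_t i = 0; i < 64; i++)
              if (!page_is_zero(mem + i * PAGE))
                  sent++;                      /* would transfer page i */
          printf("transferred %zu of 64 pages\n", sent);
          return 0;
      }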
  • Balloon zeroing may be combined with other approaches. For example, the procedure 400 can be modified so that the target CE clears all of its memory prior to initiation of the background copy. In this case, only non-zero memory of the master CE would be copied during the first iteration of the background copy.
  • With this combination, a simple design and implementation that increases the chances for convergence to an acceptable foreground wrap-up is possible.
  • Referring to Fig. 6, in a hardware-assisted approach, a small but effective bitmap of block modifications can be maintained by the memory controller chip. The granularity of this bitmap is far less than what is obtainable with PTE tracking, but is adequate for the purpose of background synchronization.
  • The procedure 600 differs from the procedure 100 shown in Fig. 1 only in that in step 605, which replaces step 115, the original memory copy list is stored using the memory controller chip, and in step 610, which replaces step 125, the updated memory copy list is created using the memory controller chip.
  • The memory controller provides a service whereby modifications made to areas of physical memory are flagged in a bitmap structure, preferably register-based. Software chooses the time to clear this bitmap and to enable the logging of memory writes to it. Software also requires the ability to select the resolution of the bitmap, expressed as the number of kilobytes (a power of two) represented by each bit.
  • During synchronization, this bitmap is used to accumulate block-level modifications to main memory on the sourcing CE. The size of the bitmap determines the tracking granularity, or resolution, of each bit. For example, a 1024-bit bitmap covers a 4 GB range with a resolution of 0.1%, or 4 MB per bit. The advantage is that automatic synchronization can be enabled at all times, allowing periodic attempts at CE synchronization, each of which has a high probability of success.
  • The minimum recommended size of a hardware-maintained bitmap is such that a 0.1% resolution is achievable. Table 2 lists the recommended bitmap sizes for various maximum configurations.
  • A bitmap of only 1024 bits is sufficient to support background synchronization on configurations with up to 4 GB of memory. This results in a worst-case resolution of 4 MB, or 0.1% of the total memory, per bitmap bit. Allowing the resolution of the bitmap to be software settable allows smaller memory configurations to be tracked with comparable resolution, as the arithmetic below illustrates.
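  • The resolution arithmetic is simple enough to show directly; this minimal C calculation uses the 1024-bit example from the text:

      #include <stdio.h>

      int main(void) {
          unsigned long long mem_mb[] = {1024, 4096};   /* 1 GB and 4 GB */
          unsigned bits = 1024;                         /* 1024-bit bitmap */
          for (int i = 0; i < 2; i++) {
              double mb_per_bit = (double)mem_mb[i] / bits;
              printf("%llu MB / %u bits = %.1f MB per bit (%.2f%% of memory)\n",
                     mem_mb[i], bits, mb_per_bit, 100.0 / bits);
          }
          return 0;
      }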
  • No harm results if the bitmap size implemented is larger than the appropriate minimum recommended size from Table 2. For instance, a bitmap of 2048 bits on a 4 GB (max) chipset requires a 0.5 MB/bit resolution to use all bitmap bits if only 1 GB is actually present.
  • This approach may be implemented in a system that provides a register-based bitmap of (at least) the recommended size (Table 2), or a memory-based bitmap using a software-specified base address and range. The latter approach is preferred because it gives software control over resolution. Internally, a bitmap can be managed as a set of 32-bit registers, or as an internal or external RAM array that is private to the memory controller chip.
  • The system also allows the bitmap to be cleared by software, preferably using "longword writes" (i.e., overwriting the bitmap with words containing all zeroes).
  • The system also allows bitmap logging to be disabled (the default) and enabled by software. Disabling and enabling does not alter the contents of the bitmap. When enabled, all processor- or MIC-originated writes to memory are tracked in the bitmap. Memory scrubbing operations performed by the memory controller itself are not tracked.
  • The system also allows the block size (resolution) of the bitmap to be software settable. Resolutions finer than 1 MB per bit are not essential, but are certainly desirable. If physical memory is used for the bitmap, software sets the base physical address of the map, along with the resolution (and, therefore, the size).
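  • The software-visible behavior described above (clear, enable/disable, settable resolution, write logging) can be modeled as follows; the register layout and all names are invented for illustration, not taken from any real chipset:

      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      #define BITMAP_WORDS 32                  /* 32 x 32 bits = 1024 bits */

      typedef struct {
          uint32_t bits[BITMAP_WORDS];
          uint64_t bytes_per_bit;              /* software-settable resolution */
          int      enabled;                    /* logging disabled by default  */
      } mod_bitmap_t;

      void bitmap_clear(mod_bitmap_t *b) {     /* "longword writes" of zero */
          memset(b->bits, 0, sizeof b->bits);
      }

      void bitmap_log_write(mod_bitmap_t *b, uint64_t phys_addr) {
          if (!b->enabled) return;             /* disabling leaves contents intact */
          uint64_t bit = phys_addr / b->bytes_per_bit;
          b->bits[bit / 32] |= 1u << (bit % 32);
      }

      int main(void) {
          mod_bitmap_t b = { .bytes_per_bit = 4ull << 20, .enabled = 1 }; /* 4 MB/bit */
          bitmap_clear(&b);
          bitmap_log_write(&b, 123ull << 20);  /* write at physical 123 MB */
          printf("bit %llu set: %d\n", 123ull / 4,
                 !!(b.bits[(123 / 4) / 32] & (1u << ((123 / 4) % 32))));
          return 0;
      }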

Abstract

An inactive memory is synchronized with an active memory in a fault-tolerant computer system that includes an active processor. Data is copied from the active memory to the inactive memory using a background process that permits the active processor to perform normal operations while the copying is proceeding. Regions of the active memory in which changes are made are tracked while the copying is proceeding, and, after copying is complete, a determination is made as to whether data from the regions of the active memory in which changes were made can be copied to the inactive memory within a predetermined time period using a foreground process that prevents the active processor from performing normal operations. If the data can be copied to the inactive memory within the predetermined time period using the foreground process, the data is copied to the inactive memory using the foreground process. If the data cannot be copied to the inactive memory within the predetermined time period using the foreground process, the copying, tracking, and determining are repeated for the regions of the active memory in which changes were made.

Description

BACKGROUND SYNCHRONIZATION FOR FAULT-TOLERANT SYSTEMS
TECHNICAL FIELD
The invention relates to restoring synchronized execution by processors in fault resilient/fault tolerant computer systems.
BACKGROUND Computer systems that are capable of surviving hardware failures or other faults generally fall into three categories: fault resilient, fault tolerant, and disaster tolerant.
Fault resilient computer systems can continue to function, often in a reduced capacity, in the presence of hardware failures. These systems operate in either an availability mode or an integrity mode, but not both. A system is "available" when a hardware failure does not cause unacceptable delays in user access, which means that a system operating in an availability mode is configured to remain online, if possible, when faced with a hardware error. A system has data integrity when a hardware failure causes no data loss or corruption, which means that a system operating in an integrity mode is configured to avoid data loss or corruption, even if the system must go offline to do so. Fault tolerant systems stress both availability and integrity. A fault tolerant system remains available and retains data integrity when faced with a single hardware failure, and, under some circumstances, when faced with multiple hardware failures.
Disaster tolerant systems go beyond fault tolerant systems. In general, disaster tolerant systems require that loss of a computing site due to a natural or man-made disaster will not interrupt system availability or corrupt or lose data.
All three cases require an alternative component that continues to function in the presence of the failure of a component. Thus, redundancy of components is a fundamental prerequisite for a disaster tolerant, fault tolerant or fault resilient system that recovers from or masks failures. Redundancy can be provided through passive redundancy or active redundancy, each of which has different consequences.
A passively redundant system, such as a checkpoint-restart system, provides access to alternative components that are not associated with the current task and must be either activated or modified in some way to account for a failed component. The consequent transition may cause a significant interruption of service. Subsequent system performance also may be degraded. Examples of passively redundant systems include stand-by servers and clustered systems. The mechanism for handling a failure in a passively redundant system is to "fail-over," or switch control, to an alternative server. The current state of the failed application may be lost, and the application may need to be restarted in the other system. The fail-over and restart processes may cause some interruption or delay in service to the users. Despite any such delay, passively redundant systems such as stand-by servers and clusters provide "high availability" but do not deliver the continuous processing usually associated with "fault tolerance." An actively redundant system, such as a replication system, provides an alternative processor that concurrently processes the same task and, in the presence of a failure, provides continuous service. The mechanism for handling failures is to compute through a failure on the remaining processor. Because at least two processors are looking at and manipulating the same data at the same time, the failure of any single component should be invisible both to the application and to the user.
The goal of a fault tolerant system is to produce correct results in a repeatable fashion. Repeatability ensures that operations may be resumed after a fault is detected. In a checkpoint-restart system, this entails rolling back to a previous checkpoint and replaying the inputs again from a journal file. In a replication system, repeatability results from simultaneous operation on multiple instances of a computer.
Processes performed when a fault occurs in an actively redundant system may include fault detection, fault isolation, fault recovery, repair, and system restoration (including synchronization). For example, when application instructions are executed in the same order on all copies of a processor, and a fault occurs in one copy of the processor: 1. The fault is detected.
2. The fault is identified as coming from a particular copy of the processor and effects of the fault are constrained so as not to adversely affect the system.
3. The system recovers from the fault and continues with no side effects to the application, but with a reduced level of fault tolerance. 4. The faulty processor is repaired or replaced. 5. The repaired processor synchronizes itself with the remaining processors to restore the system's normal level of fault tolerance.
The synchronization process may be performed as a foreground process or a background process. Foreground synchronization takes complete control of the processors and dedicates them to copying all memory contents and processor context information to the synching processor. Background synchronization copies the memory contents to the synching processor as a background process (i.e., while application programs continue to run), and then switches to a short foreground synchronization process to copy processor context information. Foreground synchronization consumes 100% of the processors' operating cycles for the duration of the memory copy. This locks out any application programs being run by the processors from all external devices, which may be a problem if the memory copy takes too long. For example, with some network protocols, a network connection that is not serviced at least once every six seconds can be dropped. This places a restriction on the maximum memory size that can be supported without making special provisions in the application program for the fault tolerant state of the system, which is undesirable.
The maximum memory size that can be supported for a particular maximum synchronization time can be increased by increasing the I/O bandwidth in a direct, linear relationship with the increase in memory size. In recent years, desired growth in memory size has outpaced the I/O bandwidth of computers, making it difficult to synchronize desired memory sizes with foreground synchronization.
Background synchronization renders the linear relationship between I/O bandwidth and memory size unnecessary by allowing the memory copy to occur while the application is still running. The timing constraint associated with background synchronization is that the duration of the foreground process at the end of the bulk memory copy must be less than the permitted maximum (e.g., six seconds).
One prior approach to background synchronization was to sweep all memory with direct processor reads and writes or with direct memory access ("DMA"). Every location touched was transferred to the synching processor. This sweep was done while application processes were running. Any location that an application process modified also needed to be transferred. This was achieved using a custom memory controller that, every time that a memory write occurred, automatically transferred the address/data pair associated with the memory write to the synching processor. This approach guaranteed that, at the end of the memory sweep, all memory had been transferred. The final foreground task consisted of sending the processor context to the synching processor. The background sweep could be stretched out over seconds to hours depending on the memory size and the system I/O bandwidth that the operator was willing to dedicate to the synchronization process. The final foreground part of the synchronization occurred in less than one second.
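The forwarding behavior of that custom controller can be pictured with a short C sketch in which every store is mirrored to the synchronizing processor as an address/data pair. The helper names and the printf stand-in for the hardware link are illustrative only:

      #include <stdint.h>
      #include <stdio.h>

      typedef struct { uint64_t addr; uint64_t data; } wr_pair_t;

      static void forward_to_sync_target(wr_pair_t p) {   /* stand-in for the HW link */
          printf("forward addr=0x%llx data=0x%llx\n",
                 (unsigned long long)p.addr, (unsigned long long)p.data);
      }

      /* All stores go through this helper, so the "controller" sees every write. */
      static void mirrored_store(uint64_t *mem, uint64_t addr, uint64_t data) {
          mem[addr / sizeof *mem] = data;                  /* local write      */
          forward_to_sync_target((wr_pair_t){addr, data}); /* automatic mirror */
      }

      int main(void) {
          static uint64_t mem[1024];
          mirrored_store(mem, 64, 0xabcdefULL);
          return 0;
      }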
SUMMARY In one general aspect, the invention features synchronizing an inactive memory with an active memory in a fault-tolerant computer system that includes an active processor. Data is copied from the active memory to the inactive memory using a background process that permits the active processor to perform normal operations while the copying is proceeding. Regions of the active memory in which changes are made are tracked while the copying is proceeding, and, after copying is complete, a determination is made as to whether data from the regions of the active memory in which changes were made can be copied to the inactive memory within a predetermined time period using a foreground process that prevents the active processor from performing normal operations. If the data can be copied to the inactive memory within the predetermined time period using the foreground process, the data is copied to the inactive memory using the foreground process. If the data cannot be copied to the inactive memory within the predetermined time period using the foreground process, the copying, tracking, and determining are repeated for the regions of the active memory in which changes were made.
Implementations may include one or more of the following features. For example, an evaluation may be made as to whether the synchronizing is likely to be successful prior to copying data from the active memory to the inactive memory using the background process.
Such an evaluation may be accomplished in one of several ways. For example, evaluating whether the synchronizing is likely to be successful may include comparing a rate at which data in the active memory are modified to a rate at which data can be transferred from the active memory to the inactive memory using the background process. When the result of the evaluating indicates that the synchronizing is not likely to be successful, efforts may be made to mitigate the problem. Mitigation may include, for example, increasing an amount of bandwidth allocated to the background process prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made. Mitigation also may include restricting an amount of working memory for one or more running applications to a minimum amount that still permits the one or more running applications to run prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made. In addition, a data compression operation may be performed on the data from the regions of active memory in which changes were made prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made.
Active memory may be associated with the active processor. The active processor may include a compute element and an I/O processor, where the compute element implements the copying, tracking and determining. A memory copy list identifying portions of the active memory for which data are to be copied to the inactive memory using the background process may be created and used in copying data using the background process. Tracking regions of the active memory may include creating a new memory copy list.
When the computer system includes an inactive processor associated with the inactive memory, and the active processor is associated with the active memory, the context of the active processor may be copied to the inactive processor.
Tracking regions of the active memory may include using a page table structure including pages of memory and corresponding page table entries, with each page table entry including an indicator bit that is set when a memory location of the corresponding page of memory is modified. In another example, the active processor includes a memory control section and tracking regions of the active memory includes using a memory block structure allocated by the memory control section, including updating a flag corresponding to a block of memory whenever the block of memory is modified.
When the data from the regions of the active memory in which changes were made cannot be copied to the inactive memory within the predetermined time period using the foreground process, the method may include, for example, increasing an amount of bandwidth allocated to the background process prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made. As an alternative, or in addition, an amount of working memory for one or more running applications may be restricted to a minimum amount that still permits the one or more running applications to run prior to repeating the copying, tracking, and determining. Also, a data compression operation may be performed on the data from the regions of active memory in which changes were made prior to repeating the copying, tracking, and determining. These techniques also may be used prior to the initial copying of data from the active memory to the inactive memory using the background process. A number of pages of active memory may be allocated to the synchronization process prior to copying data from the active memory to the inactive memory using the background process. Prior to the allocation, the inactive memory is cleared, and only unallocated pages are copied.
Another implementation may include clearing the inactive memory and determining which regions of the active memory contain nonzero data. In this implementation, copying data from the active memory to the inactive memory using the background process includes copying data only from the regions of the active memory that contain nonzero data. The active processor may include a memory control section, and determining which regions of the active memory contain nonzero data then may include using the memory control section to store a list of which regions of active memory contain nonzero data.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
DESCRIPTION OF DRAWINGS

Fig. 1 is a flow chart that illustrates a procedure for providing background synchronization for a fault tolerant system.
Figs. 2 and 3 are block diagrams of a fault tolerant system that emulates clock lockstep operation.

Fig. 4 is a flow chart that illustrates a procedure for providing background synchronization for a fault tolerant system using a software-only, page table tracking implementation.
Fig. 5 is a flow chart that illustrates a procedure for providing background synchronization for a fault tolerant system using a software-only, balloon zeroing implementation.
Fig. 6 is a flow chart that illustrates a procedure for providing background synchronization for a fault tolerant system using a hardware-assisted memory controller tracking method.
Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION
A heuristic approach may be used to provide background synchronization without requiring a custom memory controller. The heuristic approach attempts to perform the foreground part of the synchronization in less than the permitted maximum synchronization time (e.g., six seconds). The background task is not limited in time or in the number of memory sweeps performed.
Referring to Fig. 1, the heuristic approach operates according to a procedure 100. First, the chances that the synchronization will complete successfully are evaluated (step 105). As discussed in more detail below, this evaluation may involve a comparison of the rate at which memory is copied to the rate at which memory is modified. If there is no chance of success (step 110), the evaluation is repeated (step 105), with the hope that system conditions will have changed in a way that permits synchronization to complete successfully. The evaluation may be repeated upon expiration of a fixed delay period. If there is some chance of success, a memory copy list is created (step 115). Next, all blocks of memory in the memory copy list are copied to the synching processor using a background process (step 120).
A new memory copy list then is created (step 125). This new list identifies all memory blocks that have been modified since the last memory copy list was created. From the new memory copy list, the time required to perform a foreground synchronization process is estimated (step 130).
If the estimated time for foreground synchronization is less than the permitted maximum (e.g., six seconds) (step 135), then the memory copy is completed using foreground synchronization (step 140). After the memory copy is completed, the processor context is copied to the synching processor in the foreground to complete the synchronization procedure (step 145).
If the estimated time for foreground synchronization is greater than the permitted maximum (step 135), and background memory copy has not been attempted a maximum number of times (step 150), all blocks of memory in the memory copy list are copied to the synching processor using a background process (step 120). The procedure then proceeds as described above.
If the estimated time for foreground synchronization is greater than the permitted maximum (step 135), and background memory copy has been attempted a maximum number of times (step 150), this attempt at synchronization has failed. Upon failure of synchronization, the chances that the synchronization will complete successfully are again evaluated (step 105).

A memory block structure is used to track the areas of memory that have been modified and to create the memory copy lists. The memory block structure may be provided in several different ways. The most basic way is a page table structure. This structure is provided by processors currently available from Intel Corporation. Each page table entry ("PTE") includes a dirty bit that is set when a memory location of the page is modified. The dirty bit is set by hardware, and is not altered by the operating system. The PTEs can be used to track which pages of memory are modified while the background memory copy is performed. The PTEs provide a very detailed list of modified pages. For example, for a one gigabyte memory, there are up to one million page table entries. This list may be too detailed to use for synchronization purposes.
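For illustration, the control flow of procedure 100 can be modeled as a small self-contained simulation. The block counts, copy rates, and workload model below are illustrative assumptions rather than parameters of the described system; they are chosen so that the foreground estimate converges after a few background passes.

/* Toy, self-contained simulation of procedure 100 (Fig. 1). All sizes,
 * rates, and the workload model are illustrative assumptions. */
#include <stdio.h>
#include <stdbool.h>

#define BLOCKS         1024     /* modeled memory blocks */
#define COPY_PER_TICK  16       /* blocks copied per background time slice */
#define DIRTY_PER_TICK 8        /* blocks the workload modifies per slice */
#define FG_PER_SECOND  16       /* blocks the foreground copy moves per second */
#define MAX_FG_SECONDS 6.0      /* permitted foreground time (step 135) */
#define MAX_ATTEMPTS   8        /* background retry limit (step 150) */

static bool dirty[BLOCKS];
static unsigned seed = 1;

static void workload_tick(void)            /* applications dirtying memory */
{
    for (int i = 0; i < DIRTY_PER_TICK; i++) {
        seed = seed * 1103515245u + 12345u; /* tiny LCG */
        dirty[seed % BLOCKS] = true;
    }
}

int main(void)
{
    for (int b = 0; b < BLOCKS; b++)        /* step 115: list everything */
        dirty[b] = true;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
        int listed = 0;                     /* snapshot and clear the list */
        for (int b = 0; b < BLOCKS; b++) { listed += dirty[b]; dirty[b] = false; }
        for (int done = 0; done < listed; done += COPY_PER_TICK)
            workload_tick();                /* step 120: copying vs. dirtying */
        int modified = 0;                   /* step 125: build new copy list */
        for (int b = 0; b < BLOCKS; b++) modified += dirty[b];
        double fg = (double)modified / FG_PER_SECOND;   /* step 130 */
        printf("pass %d: %d modified blocks, foreground ~%.1f s\n",
               attempt, modified, fg);
        if (fg <= MAX_FG_SECONDS)           /* step 135 */
            return 0;                       /* steps 140/145 would follow */
    }
    return 1;                               /* synchronization failed */
}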
An alternate approach is to add a tracking mechanism to the memory control section of the motherboard chip set. A section of memory (part of system memory, on-chip memory, or dedicated external memory) can be allocated to a modified block structure. Whenever a block of memory is written, a corresponding flag in the modified block structure is updated. Between passes of the background copy process, a snapshot of the modified block structure is made to control the next pass of the background copy process.
There is a tradeoff between the resolution of the modified block structure and the transfer time for a block. Each background copy pass requires scanning the structure and setting up transfers (i.e., creating the memory copy list). The transfer time for each block is directly related to the block size, while the scan time is inversely related to the block size.

To evaluate whether the synchronization will complete successfully (step 105), the block modification rate is monitored to see if a background synchronization procedure will ever finish. For example, if two blocks get modified during the time required to copy one block, then the background synchronization will never finish.
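A minimal sketch of that feasibility test, assuming the two rates are obtained by monitoring as described above:

/* Sketch of the step-105 feasibility test: background synchronization can
 * only converge when blocks are copied faster than the workload modifies
 * them. Both rates are assumed to be measured elsewhere. */
#include <stdbool.h>

static bool background_sync_can_finish(double blocks_modified_per_sec,
                                       double blocks_copied_per_sec)
{
    /* e.g. two blocks modified per block copied: ratio 2.0, never done */
    return blocks_modified_per_sec < blocks_copied_per_sec;
}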
To create the first memory copy list (step 115), a list of all blocks of memory is created.
To create subsequent memory copy lists (step 125), a list of all modified blocks of memory is created. In either case, the PTEs or the modified block structure are cleared after the memory copy list is created.

The procedure is not guaranteed to complete. If the block modification rate is too high, there will always be more modified memory blocks to copy than can be copied in the permitted foreground copy time. Several refinements can be made to improve the chances of success. In a first example, the bandwidth that the background copy task is allowed to consume is increased. In a second example, any running applications are restricted down to their minimum working memory sets. In a third example, rudimentary data compression is performed on the memory images before transfer. Particular implementations may use one or more of these techniques.
The more bandwidth that is allowed to the background synchronization copy, the fewer blocks of memory the applications will have time to modify during the background copy. At 100% background allocation, synchronization will always succeed. However, this is likely to violate the maximum synchronization time requirement, since background synchronization with a 100% allocation is really foreground synchronization.
The synchronization process can allocate large chunks of system memory to itself. This memory then can be zeroed in both processors, so that it does not need to be copied. In addition, synchronization allocation restricts the number of memory pages available for applications to modify. To implement synchronization allocation, the synchronization process starts with a small set of pages and expands the number of pages until the operating system prevents the process from taking any more. The synchronization process runs as a driver under the operating system, which means that the process is allowed to lock down pages of memory so that they are never sent out to disk storage. Accordingly, any pages given to the synchronization process represent pages of physical memory that do not have to be copied. The physical memory available to applications that are running is compressed until the applications are close to their minimum working sets.
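A minimal sketch of this balloon allocation, assuming a hypothetical os_alloc_locked_page routine standing in for the operating-system-specific locked allocation:

/* Sketch of the allocation strategy described above: a driver-level
 * synchronization process locks down pages until the operating system
 * refuses more; locked, zeroed pages never need to be copied.
 * os_alloc_locked_page is a hypothetical stub, not a real API. */
#include <stddef.h>

static void *os_alloc_locked_page(void) { return NULL; /* OS-specific */ }

static size_t inflate_sync_balloon(void **pages, size_t max_pages)
{
    size_t n = 0;
    while (n < max_pages) {
        void *p = os_alloc_locked_page();
        if (p == NULL)
            break;              /* OS will not give up any more memory */
        pages[n++] = p;         /* locked page: excluded from the copy */
    }
    return n;                   /* pages that need no transfer */
}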
The memory data to be copied may be compressed before a copy. A benefit of such a compression is that the processor-to-memory bandwidth is much greater than the memory-to-I/O bandwidth. As a result, a simple compression scheme can reduce the total copy time significantly.
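For illustration, one rudimentary scheme of the kind contemplated here encodes runs of zero words compactly; the encoding format below is an assumption, not the format used by the described system.

/* Illustrative zero-run compression for memory images. Encoding: a zero
 * word acts as a marker and is followed by the run length; nonzero words
 * pass through as literals, so the two cases cannot be confused.
 * out must hold up to 2*n words in the worst case. */
#include <stdint.h>
#include <stddef.h>

static size_t compress_zero_runs(const uint64_t *in, size_t n, uint64_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        if (in[i] == 0) {
            size_t run = 0;
            while (i + run < n && in[i + run] == 0) run++;
            out[o++] = 0;                 /* zero-run marker */
            out[o++] = (uint64_t)run;     /* run length */
            i += run;
        } else {
            out[o++] = in[i++];           /* literal nonzero word */
        }
    }
    return o;                             /* words written */
}

A decoder simply emits the indicated number of zero words whenever it reads a zero marker, and copies any nonzero word through unchanged.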
Referring to Fig. 2, the procedure 100 may be implemented by a fault tolerant system 200, such as the Endurance® 4000 system, which is available from Marathon Technologies Corporation of Boxboro, Massachusetts. The system operates multiple instances of a compute element in instruction stream lockstep, which means that the multiple instances of a compute element perform the same sequence of instructions in the same order. By contrast, fully phase-locked operation, which also may be referred to as clock lockstep operation, occurs when multiple instances of a compute element perform the same sequence of instructions in the same order, with each instruction being performed in the same clock cycle by each instance of the compute element.
As noted, the instances of a compute element in the system 200 operate in instruction stream lockstep. The time needed to execute the instruction stream varies due to the uncontrolled past history of each compute element. For example, caches, translation look-aside buffers, branch prediction logic, speculative execution logic, and execution pipelines of the compute elements can have different initial values, which, even though the instruction streams being executed are the same, result in varying execution times.
Clock lockstep operation may be achieved by using a common oscillator to provide clocks to all instances of the compute element. However, such an implementation may be unsuited for fault tolerant operation because it includes a single component, the common oscillator, the failure of which will cause failure of the entire system. Emulated clock lockstep operation avoids the single point of failure and is achieved using the techniques described below. Emulated clock lockstep operation offers the considerable additional benefit of permitting the different instances of a compute element to be separated by distances of up to a kilometer or more.

In general, all computer systems perform two basic operations: (1) manipulating and transforming data, and (2) moving the data to and from mass storage, networks, and other I/O devices. The system 200 divides these functions, both logically and physically, between two separate processors. For this purpose, each half of the system 200, called a tuple, includes a compute element ("CE") 205 and an I/O processor ("IOP") 210. The compute element 205 processes user application and operating system software. Thus, the compute element 205 implements the procedure 100. I/O requests generated by the compute element 205 are redirected to the I/O processor 210. This redirection is implemented at the device driver level. The I/O processor 210 provides I/O resources, including I/O processing, data storage, and network connectivity. The I/O processor 210 also controls synchronization of the compute elements.
The system 200 is fault tolerant in that it continues to operate transparently to its users in the presence of any single hardware failure. The system 200 emulates a traditional computing environment by partitioning the environment into two components. The compute element 205 handles all computing tasks for the operating system and any applications. The I/O processor 210 handles all I/O devices. Thus, the I/O processor 210 handles all of the asynchronous activities associated with a computer, while the compute element 205 handles all of the synchronous computing activities.
To provide the necessary redundancy for fault tolerance, the system 200 includes at least two compute elements 205 and at least two I/O processors 210. The two compute elements 205 operate in lockstep while the two I/O processors 210 are loosely coupled. The I/O processors 210 feed both compute elements 205 the exact same data at a controlled place in the instruction streams of the compute elements. The I/O processors 210 verify that the compute elements 205 generate the same I/O operations and produce the same output data at the same time. The I/O processors 210 also cross check each other for proper completion of requested I/O activity.

The system 200 uses a software-based approach in a configuration based on inexpensive, industry standard processors. For example, the compute elements 205 and I/O processors 210 may be implemented using Pentium Pro processors available from Intel Corporation. The system may run unmodified, industry-standard operating system software, such as the Windows NT operating system available from Microsoft Corporation, as well as industry-standard applications software. This permits a fault tolerant system to be configured by combining off-the-shelf, Intel Pentium Pro-based servers from a variety of manufacturers, which results in a fault tolerant or disaster tolerant system with low acquisition and life cycle costs.

Each compute element 205 includes a processor 215, memory 220, and an interface card 225 (also referred to as a Marathon interface card, or MIC). The interface card 225 includes drivers for communicating with two I/O processors simultaneously, as well as comparison and test logic that assures results received from the two I/O processors are identical. In the fault tolerant system 200, the interface card 225 of each compute element 205 is connected by high speed links 230, such as fiber optic links, to the interface cards 225 of the two I/O processors 210. The interface cards 225 may be implemented as PCI-based adapters.
Each I/O processor 210 includes a processor 215, memory 220, an interface card 225, and I/O adapters 235 for connection to I/O devices such as a hard drive 240 and a network 245. As noted above, the interface card 225 of each I/O processor 210 is connected by high speed links 230 to the interface cards 225 of the two compute elements 205. In addition, a high speed link 250, such as a private Ethernet link, is provided between the two I/O processors 210.
All I/O task requests from the compute elements 205 are redirected to the I/O processors 210 for handling. The I/O processor 210 runs specialized software that handles all of the fault handling, disk mirroring, system management, and resynchronization tasks required by the system 200. By using a multitasking operating system, such as Windows NT, the I/O processor 210 may run other, non-fault tolerant applications. In general, a compute element may run Windows NT Server as an operating system while, depending on the way that the I/O processor is to be used, an I/O processor may run either Windows NT Server or Windows NT Workstation as an operating system.

The two compute elements 205 run lockstep control software, also referred to as quantum synchronization software, and execute the operating system and the applications in emulated clock lockstep. Disk mirroring takes place by duplicating writes on the disks 240 associated with each I/O processor 210. If one of the compute elements 205 should fail, the other compute element 205 keeps the system running with a pause of only a few milliseconds to remove the failed compute element 205 from the configuration. The failed compute element 205 then can be physically removed, repaired, reconnected, and turned on. The repaired compute element then is brought back automatically into the configuration by transferring the state of the running compute element to the repaired compute element over the high speed links 230 and resynchronizing using, for example, the procedure 100. The states of the operating system and applications are maintained through the few seconds it takes to resynchronize the two compute elements 205 so as to minimize any impact on system users.
If an I/O processor 210 fails, the other I/O processor 210 continues to keep the system running. The failed I/O processor then can be physically removed, repaired and activated. Since the I/O processors are not running in lockstep, the repaired system may go through a full operating system reboot, and then may be resynchronized. After being resynchronized, the repaired I/O processor automatically rejoins the configuration and the mirrored disks are re-mirrored in background mode over the private connection 250 between the I/O processors 210. A failure of one of the mirrored disks is handled through the same process.
The connections to the network 245 also are fully redundant. Network connections from each I/O processor 210 are booted with the same address. Only one network connection is allowed to transmit messages, while both are allowed to receive messages. In this way, each network connection monitors the other through the private Ethernet 250. Should either network connection fail, the I/O processors 210 will detect the failure and the remaining connection will carry the load. The I/O processors 210 notify the system manager in the event of a failure so that a repair can be initiated. While Fig. 2 shows both connections on a single network segment, this is not a requirement. Each I/O processor's network connection may be on a different segment of the same network. The system also accommodates multiple networks, each with its own redundant connections.

The extension of the system to disaster tolerance requires only that the connection between the tuples be optical fiber or a connection having compatible speed. With such connections, the tuples may be spaced by distances of a kilometer or more. Since the compute elements are synchronized over this distance, the failure of a component or a site will be transparent to the users.
Fig. 3 provides a summarized view of the system 200 of Fig. 2. The system includes redundant compute elements 205 ("CEs") and I/O processors 210 ("IOPs"). Each CE 205 is responsible for all computing and may be implemented using an industry standard motherboard. Each IOP 210 is responsible for access to I/O devices, and for system control. The IOPs 210 run asynchronously of each other and verify that the CEs 205 are performing the same operations in the same order. The IOPs 210 also track each other's I/O completion to ensure that no I/O is lost. The CEs 205 generate the same outputs in the exact same sequence, and run in emulated clock lockstep, even though the CE clocks are asynchronous to each other. The CEs 205 are initialized to the same state and are fed consistent inputs at exactly the same time. The CEs 205 are periodically realigned using a self-generated interrupt that is related to the occurrence of a quantum of clock cycles (e.g., 100,000 clock cycles) and is referred to as a quantum interrupt ("QI"). All inputs to the CEs 205 are delivered at either an output window or after the completion of an instruction quantum. Both of these points are guaranteed to occur at the same point in the instruction streams of the CEs 205. The approach employed by the system 200 is described in U.S. Patent Nos. 5,600,784 and 5,615,403, both of which are incorporated by reference.
Foreground Synchronization
A CE must be synchronized back into the system 200 following removal of the CE. The CE may have been removed for any number of reasons: a transient failure, a hard failure and repair, or even a scheduled removal. To rejoin the system, a foreground synchronization procedure may be used to transfer the static state of a suspended CE to a synchronizing CE. A large part of this procedure involves the transfer of CE main memory.
The CE operating system is suspended for the duration of the foreground synchronization procedure. This suspension is visible to users, since applications and network communications are temporarily stopped. The CE operating system resumes operation after the synchronization procedure has completed. The temporary suspension, however, may cause network session timeouts or exceed a user's requirements for application dead time. Network connections are able to survive a full foreground synchronization on systems that adhere to the 128 MB guideline for CE memory capacity. At 16 MB/s, foreground synchronization of 128 MB is typically completed in approximately eight seconds. Users with more than this amount of memory are advised to disable automatic CE synchronization and, as necessary, to initiate the synchronization procedure at a convenient time of day or night. Although this may be a viable work-around for some users, it is unacceptable to others.
Some users wish to impose a rigid limit on the amount of time an application can be suspended, for any reason. Foreground synchronization will typically exceed this limit, requiring the user to disable automatic CE synchronization. One of the major benefits of the system 200 is its hands-off operation. Components are automatically removed, joined (IOPs), mirrored, and synchronized as necessary to maintain a high level of availability. This benefit cannot be fully realized by users that need to run with larger CE memory sizes, yet have connectivity or real-time-like constraints. These users will need to disable automatic CE synchronization and attempt to find a safe time to manually initiate a resynchronization.
Permitted memory sizes may be increased by increasing the speed of the CE-to-CE interconnect (i.e., the MIC). However, as shown below in Table 1, modest speed improvements alone are not likely to reduce the foreground synchronization time to acceptable levels. More aggressive interconnect speeds are possible, but only at much higher cost or by imposing distance restrictions. The foreground synchronization rates of 50 MB/s and 100 MB/s are used here for illustrative purposes only.
Table 1 - Foreground Synchronization Times

The ideal CE synchronization time is based on the worst-case session timeout period for a protocol, such as TCP/IP, used by the system. In general, such a protocol will sustain connections over longer periods of silence, but the exact time tolerated is determined and adjusted dynamically by the protocol stacks. Tighter limits may be imposed by users with real-time or substantially-real-time application requirements.
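The image of Table 1 is not reproduced in this text, but its contents follow from a single division: foreground synchronization time is memory size divided by interconnect rate. The sketch below recomputes that quantity at the 16, 50, and 100 MB/s rates named above; the memory sizes are illustrative.

/* Recompute Table 1-style figures: time = memory size / link rate. */
#include <stdio.h>

int main(void)
{
    const double rates_mb_s[] = { 16.0, 50.0, 100.0 };
    const double sizes_mb[]   = { 128.0, 512.0, 1024.0, 4096.0 };

    for (int s = 0; s < 4; s++)
        for (int r = 0; r < 3; r++)
            printf("%6.0f MB @ %3.0f MB/s: %6.1f s\n",
                   sizes_mb[s], rates_mb_s[r],
                   sizes_mb[s] / rates_mb_s[r]);  /* 128 MB @ 16 MB/s = 8 s */
    return 0;
}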
Background Synchronization
As discussed above, background synchronization refers to the process of transferring portions of a running CE's memory context to a synchronizing CE, without suspension of the operating system. The CE operating system and applications are unaware of this controlled-rate transfer, although the transfer does consume some portion of the MIC's available bandwidth. As the transfer is taking place, the CE operating system continues to run applications and service network clients, with some tolerable level of degraded performance. After many seconds of background transfer, depending on CE memory size and other parameters, the CE operating system is suspended and foreground synchronization is performed to transfer areas modified during the background synchronization.
The CE operating system workload profile can outpace the background memory transfer such that the ensuing foreground synchronization will not complete within the desired target time interval. However, software can pre-determine the foreground synchronization time and choose to abort the synchronization process if user-selected limits are exceeded. This allows automatic CE synchronization features to remain enabled, ensuring that network and application timeout limits will not be exceeded.
The goals of the background synchronization are to allow all users to run the system with automatic CE synchronization enabled, and to ensure that foreground synchronization will never exceed a time limit established by the user. With these goals in mind, it is clear that the overall time required to integrate a CE into the system is not particularly important. Some degradation of CE operating system performance is also permissible during synchronization, and this will likely also be settable by the user.
Software-Only, Page Table Tracking Implementation

Referring to Fig. 4, this approach to a software-only implementation of background synchronization employs the Pentium® architecture's 'dirty' indicators. These indicators are provided for each page table entry and can be used to track processor modifications to memory. The procedure 400 includes the same steps as the procedure 100 of Fig. 1, and adds two additional steps. First, prior to performing a background copy of all pages to the target CE (i.e., after step 115 and before step 120), various page-table-maintained indicators are set or cleared (step 405). Following the background copy (step 120), these indicators are rechecked to determine which pages were modified during the copy (step 410) and, therefore, must be copied again during a subsequent background copy or during a foreground completion phase.

Advantages of this approach are that it works on existing Pentium-based systems, it is not tied to hardware chip schedules, it permits page-level granularity of memory modifications, and it should provide convergence on an acceptable foreground synchronization time under most loads. Potential drawbacks of this approach are that its implementation is operating system-specific, and that page tables only track processor-originated memory modifications, and do not track adapter direct memory access ("DMA"). This approach also is sensitive to operating system process context switching (e.g., page directory and PTE management), and may require access to the operating system source code to understand context-switch issues. Finally, this approach raises the risk that unforeseen operating system behaviors, and future operating system changes, may cause problems.
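A minimal sketch of the dirty-bit scan at the heart of procedure 400, using a toy PTE array rather than the real paging structures (as noted above, a production implementation is operating-system specific):

/* Modeled dirty-bit scan for procedure 400. Bit 6 is the dirty bit in a
 * Pentium-architecture PTE, but this array is only a toy model. */
#include <stdint.h>
#include <stddef.h>

#define PTE_DIRTY 0x40u     /* bit 6 of a page table entry */

/* Collect indices of pages modified since the last pass and clear their
 * dirty bits so the next pass records only new modifications. */
static size_t collect_dirty_pages(uint32_t *ptes, size_t n_pages, size_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < n_pages; i++) {
        if (ptes[i] & PTE_DIRTY) {
            out[n++] = i;
            ptes[i] &= ~PTE_DIRTY;  /* re-arm hardware tracking */
        }
    }
    return n;
}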
Software-Only, Balloon Zeroing Implementation
Referring to Fig. 5, a balloon zeroing procedure 500 provides a simpler approach to a software-only implementation. Instead of performing a background copy of all memory, this implementation begins by having the target CE clear all of its physical memory (step 505). The master CE then drops into a foreground synchronization mode and transfers only pages that contain non-zero data (step 510). This approach capitalizes on the observation that zero-filled pages are quite common in most virtual-memory operating systems. In addition, cooperative threads can be used to forcibly zero out a large portion of memory just prior to attempting the foreground synchronization. This pre-zeroing effort is referred to as balloon zeroing.

Advantages of this approach are that it works on existing Pentium-based systems; it is not tied to hardware chip schedules; it is very simple to implement; and it permits page-level granularity of zero/non-zero data. Potential disadvantages of this approach are that the memory scan performed to determine whether foreground synchronization should be attempted may take considerable time, depending on memory size and processor speed; that it may cause undesirable application behavior if balloon zeroing is done too aggressively; and that convergence on an acceptable foreground synchronization time may not occur under many loads.
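A minimal sketch of the step-510 transfer decision, assuming a 4 KB page size and a hypothetical send_page transfer hook:

/* Sketch of the balloon-zeroing transfer decision: after the target CE
 * clears its memory, the master CE sends only pages containing nonzero
 * data. Page size and transfer hook are illustrative assumptions. */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_WORDS (4096 / sizeof(uint64_t))

static void send_page(size_t page_no, const uint64_t *page)  /* hypothetical */
{
    (void)page_no; (void)page;
}

static bool page_is_zero(const uint64_t *page)
{
    for (size_t i = 0; i < PAGE_WORDS; i++)
        if (page[i] != 0)
            return false;
    return true;
}

static void sync_nonzero_pages(const uint64_t *mem, size_t n_pages)
{
    for (size_t p = 0; p < n_pages; p++)
        if (!page_is_zero(mem + p * PAGE_WORDS))
            send_page(p, mem + p * PAGE_WORDS);
}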
Balloon zeroing may be combined with other approaches. For example, the procedure 400 can be modified so that the target CE clears all of its memory prior to initiation of the background copy. In this case, only non-zero memory of the master CE would be copied during the first iteration of the background copy.
Hardware-Assisted Memory Controller Tracking Method
Referring to Fig. 6, with a small amount of hardware assistance from the memory controller chip, a simple design and implementation that increases the chances for convergence to an acceptable foreground wrap-up is possible. Rather than using the page table structure and internal knowledge of the operating system's use of PTEs and other mechanisms, a small but effective bitmap of block modifications can be maintained by the memory controller chip. The granularity of this bitmap is far coarser than what is obtainable with PTE tracking, but is adequate for the purpose of background synchronization. The procedure 600 differs from the procedure 100 shown in Fig. 1 only in that in step 605, which replaces step 115, the original memory copy list is stored using the memory controller chip, and in step 610, which replaces step 125, the updated memory copy list is created using the memory controller chip.
Advantages of this approach are that it is operating system independent, it is insensitive to operating system uses of page table structures, it is insensitive to operating system methods of process context switches, it is insensitive to future changes to the operating system memory management and kernel, it provides compact and efficient scanning of bitmap results, the bitmap tracks all memory writes, including those originating from the processor and from the MIC, and convergence on an acceptable foreground synchronization time should be very high. This approach may be combined with the balloon zeroing method to reduce background copy traffic. Potential drawbacks of this approach are that it does not work with existing systems, it is dependent on the motherboard and chip design cycle, its features are not supported on all future chipsets, and that the granularity of memory modifications is limited by bitmap size.

Hardware Approach to Background Synchronization
With the aid of specific features designed into the memory controller, a simple and effective background synchronization mechanism may be implemented. The memory controller provides a service whereby modifications made to areas of physical memory are flagged in a bitmap structure, preferably register-based. Software chooses the time to clear this bitmap and to enable the logging of memory writes to it. Software also requires the ability to select the resolution of the bitmap, expressed as a power-of-two number of kilobytes represented by each bit. During the background transfer of CE memory, this bitmap is used to accumulate block-level modifications to main memory on the sourcing CE.
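A software model of this tracking service may clarify it; the register layout, accessor names, and 4 MB/bit resolution below are illustrative assumptions, not the controller's actual design.

/* Software model of the controller-maintained modified-block bitmap.
 * On every tracked write, set the bit covering the written address;
 * between passes, snapshot the bitmap and clear it. */
#include <stdint.h>

#define BITMAP_BITS 1024u

static uint32_t bitmap[BITMAP_BITS / 32];  /* models the register bitmap */
static unsigned log2_bytes_per_bit = 22;   /* 2^22 = 4 MB per bit */
static int      logging_enabled;           /* disabled by default */

static void bitmap_note_write(uint64_t phys_addr)  /* on each tracked write */
{
    if (!logging_enabled)
        return;
    uint32_t bit = (uint32_t)(phys_addr >> log2_bytes_per_bit) % BITMAP_BITS;
    bitmap[bit / 32] |= UINT32_C(1) << (bit % 32);
}

/* Snapshot the bitmap to drive the next background copy pass, then clear
 * it so only new modifications accumulate. */
static void bitmap_snapshot_and_clear(uint32_t out[BITMAP_BITS / 32])
{
    for (unsigned w = 0; w < BITMAP_BITS / 32; w++) {
        out[w] = bitmap[w];
        bitmap[w] = 0;
    }
}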
The size of the bitmap determines the tracking granularity, or resolution, of each bit. For example, a 1024-bit bitmap covers a 4 GB range with a resolution of 0.1%, or 4 MB per bit. The advantage is that automatic synchronization can be enabled at all times, allowing periodic attempts at CE synchronization, each of which has a high probability of success.
The minimum recommended size of a hardware-maintained bitmap is such that a 0.1% resolution is achievable. Table 2 lists the recommended bitmap sizes for various maximum configurations.
Table 2 - Recommended Bitmap Sizes
As indicated in Table 2, a bitmap of only 1024 bits is sufficient to support background synchronization on configurations with up to 4 GB of memory. This results in a worst-case resolution of 4 MB, or 0.1% of the total memory, per bitmap bit. Allowing the resolution of the bitmap to be software settable allows smaller memory configurations to be tracked with comparable resolution.
Selectable resolutions finer than 1 MB/bit are necessary if the bitmap size implemented is larger than the appropriate minimum recommended size from Table 2. For instance, a bitmap of 2048 bits on a 4 GB (max) chipset requires a 0.5 MB/bit resolution to use all bitmap bits if only 1 GB is actually present.
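The sizing arithmetic reduces to a single division, sketched below; the printed figures match the examples in the text (4 GB/1024 bits = 4 MB/bit, and 1 GB/2048 bits = 0.5 MB/bit).

/* Bitmap resolution: installed memory divided by bitmap bits. */
#include <stdio.h>
#include <stdint.h>

static double mb_per_bit(uint64_t mem_bytes, uint64_t bitmap_bits)
{
    return (double)(mem_bytes / bitmap_bits) / (1024.0 * 1024.0);
}

int main(void)
{
    const uint64_t GB = 1024ULL * 1024 * 1024;
    printf("%.1f MB/bit\n", mb_per_bit(4 * GB, 1024));  /* prints 4.0 */
    printf("%.1f MB/bit\n", mb_per_bit(1 * GB, 2048));  /* prints 0.5 */
    return 0;
}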
This approach may be implemented in a system that provides a register-based bitmap of (at least) the recommended size (Table 2), or a memory-based bitmap using software-specified base address and range. This latter approach is preferred because it gives software control over resolution. Alternatively, a bitmap can be managed as a set of 32-bit registers, or as an internal or external RAM array which is private to the memory controller chip.
The system also allows the bitmap to be cleared by software, preferably using "longword writes" (i.e., overwriting the bitmap with words containing all zeroes).
The system also allows bitmap logging to be disabled (default) and enabled by software. Disabling and enabling does not alter the contents of the bitmap. When enabled, all processor or MIC originated writes to memory are tracked in the bitmap. Memory scrubbing operations performed by the memory controller itself are not tracked.
The system also allows the block size (resolution) of the bitmap to be software settable. Resolutions finer than 1 MB per bit are not essential, but are certainly desirable. If physical memory is used for the bitmap, software sets the base physical address of the map, along with the resolution (and, therefore, size).
Software guarantees that L1/L2 caches will be swept prior to evaluating the contents of the memory controller's bitmap. Software also guarantees that writes (clears) to the bitmap registers will never occur while bitmap tracking is enabled, ensuring that no dual-access hazards need to be arbitrated by the memory controller. Other software operations affecting a possible implementation, such as changing the base address of an external bitmap, occur only when the logging feature is disabled.
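A minimal sketch of that ordering discipline, with hypothetical stand-ins for the platform operations:

/* Sketch of the end-of-pass discipline described above. Each routine is
 * a hypothetical placeholder for a platform operation, not a real API. */
static void sweep_l1_l2_caches(void)    { /* e.g. write-back/invalidate */ }
static void logging_enable(int on)      { (void)on; /* chipset register */ }
static void read_and_clear_bitmap(void) { /* then zero via longword writes */ }

static void end_of_pass_snapshot(void)
{
    sweep_l1_l2_caches();    /* flush cached writes so they are logged */
    logging_enable(0);       /* stop tracking: no dual-access hazards */
    read_and_clear_bitmap(); /* clears occur only while logging is off */
    logging_enable(1);       /* re-arm for the next background pass */
}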
A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:
1. A method of synchronizing an inactive memory with an active memory in a fault-tolerant computer system that includes an active processor, the method comprising:
copying data from the active memory to the inactive memory using a background process that permits the active processor to perform normal operations while the copying is proceeding;
while the copying is proceeding, tracking regions of the active memory in which changes are made;
after copying is complete, determining whether data from the regions of the active memory in which changes were made can be copied to the inactive memory within a predetermined time period using a foreground process that prevents the active processor from performing normal operations;
if the data from the regions of the active memory in which changes were made can be copied to the inactive memory within the predetermined time period using the foreground process, copying the data from the regions of the active memory in which changes were made to the inactive memory using the foreground process; and
if the data from the regions of the active memory in which changes were made cannot be copied to the inactive memory within the predetermined time period using the foreground process, repeating the copying, tracking, and determining for the regions of the active memory in which changes were made.
2. The method of claim 1 further comprising, before copying data from the active memory to the inactive memory using the background process, evaluating whether the synchronizing is likely to be successful.
3. The method of claim 2 wherein evaluating whether the synchronizing is likely to be successful comprises comparing a rate at which data in the active memory are modified to a rate at which data can be transferred from the active memory to the inactive memory using the background process.
4. The method of claim 2 further comprising, when the result of the evaluating indicates that the synchronizing is not likely to be successful, increasing an amount of bandwidth allocated to the background process prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made.
5. The method of claim 2 further comprising, when the result of the evaluating indicates that the synchronizing is not likely to be successful, restricting an amount of working memory for one or more running applications to a minimum amount that still permits the one or more running applications to run prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made.
6. The method of claim 2 further comprising, when the result of the evaluating indicates that the synchronizing is not likely to be successful, performing a data compression operation on the data from the regions of active memory in which changes were made prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made.
7. The method of claim 1 wherein the active memory is associated with the active processor.
8. The method of claim 1 wherein the active processor comprises a compute element and an I/O processor and the compute element implements the copying, tracking and determining.
9. The method of claim 1 further comprising creating a memory copy list identifying portions of the active memory for which data are to be copied to the inactive memory using the background process, wherein copying data using the background process comprises using the memory copy list.
10. The method of claim 9 wherein tracking regions of the active memory comprises creating a new memory copy list.
11. The method of claim 1 wherein the fault-tolerant system further comprises an inactive processor associated with the inactive memory, the active processor is associated with the active memory, and the method further comprises copying a context of the active processor to the inactive processor.
12. The method of claim 1 wherein tracking regions of the active memory comprises using a page table structure comprising pages of memory and corresponding page table entries, with each page table entry including an indicator bit that is set when a memory location of the corresponding page of memory is modified.
13. The method of claim 1 wherein the active processor comprises a memory control section and tracking regions of the active memory comprises using a memory block structure allocated by the memory control section, including updating a flag corresponding to a block of memory whenever the block of memory is modified.
14. The method of claim 1 further comprising, when the data from the regions of the active memory in which changes were made cannot be copied to the inactive memory within the predetermined time period using the foreground process, increasing an amount of bandwidth allocated to the background process prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made.
15. The method of claim 1 further comprising, when the data from the regions of the active memory in which changes were made cannot be copied to the inactive memory within the predetermined time period using the foreground process, restricting an amount of working memory for one or more running applications to a minimum amount that still permits the one or more running applications to run prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made.
16. The method of claim 1 further comprising restricting an amount of working memory for one or more running applications to a minimum amount that still permits the one or more running applications to run prior to copying data from the active memory to the inactive memory using a background process.
17. The method of claim 1 further comprising, when the data from the regions of the active memory in which changes were made cannot be copied to the inactive memory within the predetermined time period using the foreground process, performing a data compression operation on the data from the regions of active memory in which changes were made prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made.
18. The method of claim 1 further comprising performing a data compression operation on the data from the regions of active memory in which changes were made prior to copying data from the active memory to the inactive memory using a background process.
19. The method of claim 1, wherein the method is implemented by a synchronization process, the method further comprising allocating a number of pages of active memory to the synchronization process prior to copying data from the active memory to the inactive memory using a background process.
20. The method of claim 19 further comprising clearing pages of the inactive memory corresponding to the allocated pages of the active memory.
21. The method of claim 19 further comprising clearing all of the inactive memory.
22. The method of claim 1 further comprising: clearing the inactive memory; and determining which regions of the active memory contain nonzero data, and wherein the copying data from the active memory to the inactive memory using the background process further comprises copying only data from the regions of the active memory that contain nonzero data.
23. Software for use in synchronizing an inactive memory with an active memory in a fault-tolerant computer system that includes an active processor, the software residing on a computer-readable medium and comprising instructions causing the fault-tolerant computer system to:
copy data from the active memory to the inactive memory using a background process that permits the active processor to perform normal operations while the copying is proceeding;
while the copying is proceeding, track regions of the active memory in which changes are made;
after copying is complete, determine whether data from the regions of the active memory in which changes were made can be copied to the inactive memory within a predetermined time period using a foreground process that prevents the active processor from performing normal operations;
if the data from the regions of the active memory in which changes were made can be copied to the inactive memory within the predetermined time period using the foreground process, copy the data from the regions of the active memory in which changes were made to the inactive memory using the foreground process; and
if the data from the regions of the active memory in which changes were made cannot be copied to the inactive memory within the predetermined time period using the foreground process, repeat the copying, tracking, and determining for the regions of the active memory in which changes were made.
24. The software of claim 23, further comprising instructions causing the fault-tolerant computer system to evaluate whether the synchronizing is likely to be successful, prior to copying data from the active memory to the inactive memory using a background process.
25. The software of claim 23 wherein the fault-tolerant computer system further comprises an inactive processor associated with the inactive memory, the active processor is associated with the active memory, and the software further comprises instructions for causing the system to copy a context of the active processor to the inactive processor.
26. The software of claim 23 wherein instructions causing the system to track regions of the active memory comprise instructions causing the system to use a page table structure including pages of memory and corresponding page table entries, with each page table entry including an indicator bit that is set when a memory location of the corresponding page of memory is modified.
27. The software of claim 23 wherein the active processor comprises a memory control section and instructions causing the system to track regions of the active memory comprise instructions causing the system to use a memory block structure allocated by the memory control section, including instructions causing the system to update a flag corresponding to a block of memory whenever the block of memory is modified.
28. The software of claim 23 further comprising, when the data from the regions of the active memory in which changes were made cannot be copied to the inactive memory within the predetermined time period using the foreground process, instructions causing the system to increase an amount of bandwidth allocated to the background process prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made.
29. The software of claim 23 further comprising instructions causing the system to restrict an amount of working memory for one or more running applications to a minimum amount that still permits the one or more running applications to run prior to copying data from the active memory to the inactive memory using a background process.
30. The software of claim 23 further comprising instructions causing the system to perform a data compression operation on the data from the regions of active memory in which changes were made prior to copying data from the active memory to the inactive memory using a background process.
31. The software of claim 23 further comprising instructions for causing the system to allocate a number of pages of active memory to a synchronization process prior to copying data from the active memory to the inactive memory using a background process.
32. The software of claim 23 further comprising instructions for causing the system to: clear the inactive memory; and determine which regions of the active memory contain nonzero data, and wherein the instructions for causing the system to copy data from the active memory to the inactive memory using the background process further comprise instructions for causing the system to copy only data from the regions of the active memory that contain nonzero data.
33. A fault-tolerant computer system comprising an active processor with associated active memory and an inactive processor with associated inactive memory, the system configured to synchronize the inactive memory with the active memory by:
copying data from the active memory to the inactive memory using a background process that permits the active processor to perform normal operations while the copying is proceeding;
while the copying is proceeding, tracking regions of the active memory in which changes are made;
after copying is complete, determining whether data from the regions of the active memory in which changes were made can be copied to the inactive memory within a predetermined time period using a foreground process that prevents the active processor from performing normal operations;
if the data from the regions of the active memory in which changes were made can be copied to the inactive memory within the predetermined time period using the foreground process, copying the data from the regions of the active memory in which changes were made to the inactive memory using the foreground process; and
if the data from the regions of the active memory in which changes were made cannot be copied to the inactive memory within the predetermined time period using the foreground process, repeating the copying, tracking, and determining for the regions of the active memory in which changes were made.
34. The fault-tolerant computer system of claim 33, the system being further configured to evaluate whether the synchronizing is likely to be successful before copying data from the active memory to the inactive memory using the background process.
35. The fault-tolerant computer system of claim 33, the system being further configured to copy a context of the active processor to the inactive processor.
36. The fault-tolerant computer system of claim 33 wherein tracking regions of the active memory comprises using a page table structure including pages of memory and corresponding page table entries, with each page table entry including an indicator bit that is set when a memory location of the corresponding page of memory is modified.
37. The fault-tolerant computer system of claim 33 wherein the active processor comprises a memory control section and tracking regions of the active memory comprises using a memory block structure allocated by the memory control section, including updating a flag corresponding to a block of memory whenever the block of memory is modified.
38. The fault-tolerant computer system of claim 33, the system being further configured to increase an amount of bandwidth allocated to the background process prior to repeating the copying, tracking, and determining for the regions of the active memory in which changes were made when the data from the regions of the active memory in which changes were made cannot be copied to the inactive memory within the predetermined time period using the foreground process.
39. The fault-tolerant computer system of claim 33, the system being further configured to restrict an amount of working memory for one or more running applications to a minimum amount that still permits the one or more running applications to run prior to copying data from the active memory to the inactive memory using a background process.
40. The fault-tolerant computer system of claim 33, the system being further configured to perform a data compression operation on the data from the regions of active memory in which changes were made prior to copying data from the active memory to the inactive memory using a background process.
41. The fault-tolerant computer system of claim 33, the system being further configured to allocate a number of pages of active memory to a synchronization process prior to copying data from the active memory to the inactive memory using a background process.
42. The fault-tolerant computer system of claim 33, the system being further configured to: clear the inactive memory; and determine which regions of the active memory contain nonzero data, and wherein copying data from the active memory to the inactive memory using the background process further comprises copying only data from the regions of the active memory that contain nonzero data.
PCT/US2000/008940 1999-04-05 2000-04-05 Background synchronization for fault-tolerant systems WO2000060463A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP00921672A EP1169676A1 (en) 1999-04-05 2000-04-05 Background synchronization for fault-tolerant systems
AU41959/00A AU4195900A (en) 1999-04-05 2000-04-05 Background synchronization for fault-tolerant systems
CA002369932A CA2369932A1 (en) 1999-04-05 2000-04-05 Background synchronization for fault-tolerant systems

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US12788199P 1999-04-05 1999-04-05
US60/127,881 1999-04-05

Publications (1)

Publication Number Publication Date
WO2000060463A1 true WO2000060463A1 (en) 2000-10-12

Family

ID=22432448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/008940 WO2000060463A1 (en) 1999-04-05 2000-04-05 Background synchronization for fault-tolerant systems

Country Status (4)

Country Link
EP (1) EP1169676A1 (en)
AU (1) AU4195900A (en)
CA (1) CA2369932A1 (en)
WO (1) WO2000060463A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5579220A (en) * 1993-07-28 1996-11-26 Siemens Aktiengesellschaft Method of updating a supplementary automation system
US5608865A (en) * 1995-03-14 1997-03-04 Network Integrity, Inc. Stand-in Computer file server providing fast recovery from computer file server failures

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1376361A1 (en) * 2001-03-26 2004-01-02 Duaxes Corporation Server duplexing method and duplexed server system
EP1376361A4 (en) * 2001-03-26 2005-11-16 Duaxes Corp Server duplexing method and duplexed server system
US7340637B2 (en) * 2001-03-26 2008-03-04 Duaxes Corporation Server duplexing method and duplexed server system
EP1380951A1 (en) * 2002-07-11 2004-01-14 Nec Corporation Fault tolerant information processing apparatus
US7418626B2 (en) 2002-07-11 2008-08-26 Nec Corporation Information processing apparatus
FR2881308A1 (en) * 2005-01-21 2006-07-28 Meiosys Soc Par Actions Simpli METHOD OF ACCELERATING THE TRANSMISSION OF JOURNALIZATION DATA IN A MULTI-COMPUTER ENVIRONMENT AND SYSTEM USING THE SAME
US9424125B2 (en) 2013-01-16 2016-08-23 Google Inc. Consistent, disk-backed arrays
US10067674B2 (en) 2013-01-16 2018-09-04 Google Llc Consistent, disk-backed arrays
CN111711964A (en) * 2020-04-30 2020-09-25 国家计算机网络与信息安全管理中心 System disaster tolerance capability test method
CN111711964B (en) * 2020-04-30 2024-02-02 国家计算机网络与信息安全管理中心 System disaster recovery capability test method

Also Published As

Publication number Publication date
AU4195900A (en) 2000-10-23
CA2369932A1 (en) 2000-10-12
EP1169676A1 (en) 2002-01-09

Similar Documents

Publication Publication Date Title
US6279119B1 (en) Fault resilient/fault tolerant computing
US10430224B2 (en) Methods and apparatus for providing hypervisor level data services for server virtualization
US5537533A (en) System and method for remote mirroring of digital data from a primary network server to a remote network server
US7779291B2 (en) Four site triangular asynchronous replication
EP0864126B1 (en) Remote checkpoint memory system and method for fault-tolerant computer system
Scales et al. The design of a practical system for fault-tolerant virtual machines
US5751939A (en) Main memory system and checkpointing protocol for fault-tolerant computer system using an exclusive-or memory
US8812907B1 (en) Fault tolerant computing systems using checkpoints
US9576040B1 (en) N-site asynchronous replication
US8069218B1 (en) System, method and computer program product for process migration with planned minimized down-time
US5745672A (en) Main memory system and checkpointing protocol for a fault-tolerant computer system using a read buffer
US9606881B1 (en) Method and system for rapid failback of a computer system in a disaster recovery environment
US11429466B2 (en) Operating system-based systems and method of achieving fault tolerance
US20150033068A1 (en) Failover to backup site in connection with triangular asynchronous replication
US20120017040A1 (en) Maintaining Data Consistency in Mirrored Cluster Storage Systems Using Bitmap Write-Intent Logging
US7752404B2 (en) Toggling between concurrent and cascaded triangular asynchronous replication
CN101038565A (en) System and method for data copying between management storage systems
WO2001013235A9 (en) Remote mirroring system, device, and method
US7734884B1 (en) Simultaneous concurrent and cascaded triangular asynchronous replication
KR20030066331A (en) Flexible remote data mirroring
US20070234105A1 (en) Failover to asynchronous backup site in connection with triangular asynchronous replication
US9645766B1 (en) Tape emulation alternate data path
US7680997B1 (en) Data recovery simulation
US8942073B1 (en) Maintaining tape emulation consistency
WO2000060463A1 (en) Background synchronization for fault-tolerant systems

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
ENP Entry into the national phase

Ref document number: 2369932

Country of ref document: CA

Ref country code: CA

Ref document number: 2369932

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 2000921672

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2000921672

Country of ref document: EP

WWR Wipo information: refused in national office

Ref document number: 2000921672

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2000921672

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP