US20110214006A1 - Automated learning of failure recovery policies - Google Patents

Automated learning of failure recovery policies Download PDF

Info

Publication number
US20110214006A1
Authority
US
United States
Prior art keywords
policy
repair
action
model
new policy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US12/713,195
Other versions
US8024611B1 (en)
Inventor
Christopher A. Meek
Guy Shani
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ServiceNow Inc
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp filed Critical Microsoft Corp
Priority to US12/713,195
Assigned to MICROSOFT CORPORATION reassignment MICROSOFT CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SHANI, GUY, MEEK, CHRISTOPHER A.
Publication of US20110214006A1
Application granted
Publication of US8024611B1
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC reassignment MICROSOFT TECHNOLOGY LICENSING, LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MICROSOFT CORPORATION
Assigned to SERVICENOW, INC. reassignment SERVICENOW, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MICROSOFT TECHNOLOGY LICENSING, LLC
Assigned to SERVICENOW, INC. reassignment SERVICENOW, INC. CORRECTIVE ASSIGNMENT TO CORRECT THE RECORDAL TO REMOVE INADVERTENTLY RECOREDED PROPERTIES SHOWN IN ATTACHED SHEET PREVIOUSLY RECORDED AT REEL: 047681 FRAME: 0916. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT. Assignors: MICROSOFT TECHNOLOGY LICENSING, LLC
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/079 - Root cause analysis, i.e. error or fault diagnosis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0706 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F 11/0709 - Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems


Abstract

Described is automated learning of failure recovery policies based upon existing information regarding previous policies and actions. A learning mechanism automatically constructs a new policy for controlling a recovery process, based upon collected observable interactions of an existing policy with the process. In one aspect, the learning mechanism builds a partially observable Markov decision process (POMDP) model, and computes the new policy based upon the learned model. The new policy may perform automatic fault recovery, e.g., on a machine in a datacenter corresponding to the controlled process.

Description

    BACKGROUND
  • Recovery from computer failures due to software or hardware problems is a significant part of managing large data centers. Sometimes the failures may be automatically fixed through actions such as rebooting or re-imaging of the computer.
  • In large environments, it is prohibitively expensive to have a technician decide on a repair action for each observed problem. As a result, the data centers often employ recovery systems that use some automatic repair policy or controller to choose appropriate repair actions. Typically, the repair policy or controller is manually defined and created by a human expert. More particularly, the expert creates policies that map the state of the system to recovery actions by specifying a set of rules or conditions under which an action is to be taken.
  • However, such policies/controllers are almost never optimal, in that even though they often fix an error, the error may actually be fixed in a faster (and thus correspondingly less expensive) way. As such, these policies may result in longer failure periods (system downtime) than needed.
  • SUMMARY
  • This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
  • Briefly, various aspects of the subject matter described herein are directed towards a technology by which a learning mechanism automatically constructs a new policy that automatically controls a process based upon collected observable interactions of an existing policy with the process. In one aspect, the learning mechanism builds a model of the process, including effects of possible actions, and computes the new policy based upon the learned model. The new policy may perform automatic fault recovery, e.g., on a machine in a datacenter corresponding to the controlled process.
  • In one implementation, the model comprises a partially observable Markov decision process (POMDP). An expectation maximization algorithm (e.g., an adapted Baum-Welch algorithm) learns the POMDP model. The policy is then computed using a point-based value iteration algorithm, such as one executed within a cost-based indefinite-horizon formalization.
  • Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
  • FIG. 1 is a block diagram representing example components/and environment for automated learning of failure recovery policies, and their application.
  • FIG. 2 is a flow diagram representing example steps taken with respect to automated learning of failure recovery policies.
  • FIG. 3 shows an illustrative example of a computing environment into which various aspects of the present invention may be incorporated.
  • DETAILED DESCRIPTION
  • Various aspects of the technology described herein are generally directed towards a learning mechanism that uses the logged experience of an existing, imperfect recovery policy (used interchangeably with “controller” herein) for automatically learning a new, improved policy. To this end, the technology uses previous state sequences of interactions between the existing recovery policy, containing the observations (e.g., error messages) that the existing policy obtained, and the actions that the existing recovery policy executed. In one implementation, from these sequences a hidden state model is learned, which provides the new, improved policy for failure recovery. For example, the model may comprise a partially observable Markov decision process (POMDP) for generating recovery policies that automatically improves existing recovery policies.
  • It should be understood that any of the examples herein are non-limiting. Indeed, while data center failure recovery is exemplified herein, many other real world applications, such as assembly lines, medical diagnosis systems, and failure detection and recovery systems, are also controlled by hand-made controllers, and thus may benefit from the technology described herein. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and automated recovery in general.
  • FIG. 1 shows a general example in which a process 102 (which can also be considered a system) is controlled by a repair policy 104. The illustrated process 102 is representative of any of a plurality of such processes/systems, such as in a large datacenter, where a domain is often on the order of hundreds or thousands of such machines. Note that in many such domains, the machines are identical and independent, e.g., computers in service farms, such as those answering independent queries to a search engine, may share the same configuration and execute independent programs. It is therefore unlikely, if not impossible, for errors to propagate from one machine to another.
  • As is known, these systems/processes may be automatically controlled by policies, in which a controlled system/process may be influenced by external actions. FIG. 1 shows the repair policy 104 deciding which repair actions 106 are to be taken to attempt to guide the process 102 towards a desired outcome.
  • Typically, the repair policy 104 (running on a computer system or the like) receives some observations about the state 108 of the process 102, such as from one or more watchdogs 109, comprising one or more other computers that probe the system 102, (or possibly from the process 102 itself). Given these observations, the repair policy 104 decides on what it determines is the most appropriate action. In other words, the policy 104 maps observations or states of the process 102 into appropriate actions. The observable interactions of the policy can be described as a sequence of observations and actions.
  • Note that the state 108 may include error messages that are received before, during or after a repair action is attempted, e.g., the repair policy may attempt a repair action based on one error message, and that repair action may result in another error message being generated, such as if the system soon fails again. More particularly, the repair policy 104 typically receives failure messages 108 about the process 102 via the one or more watchdog 109. Messages from watchdogs are often aggregated into a small set of notifications, such as “Software Error” or “Hardware Error.”
  • As described herein, a model of the environment may be created and used to create a revised policy that is approximately optimal given the model. In one implementation, a POMDP is used by the learning mechanism for modeling such a process. A POMDP captures the (hidden) states of the process, the stochastic transitions between states given actions, the utilities of states and the cost of actions, and the probabilities of observations given states and actions. POMDPs are further described by Michael L. Littman and Nishkam Ravi in “An instance-based state representation for network repair,” Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI), pages 287-292, 2004, and by R. D. Smallwood and E. J. Sondik in “The optimal control of partially observable Markov decision processes over a finite horizon,” Operations Research, 21:1071-1098, 1973.
  • However, specifying the parameters of such a model in a way that provides good results is often no easier than specifying a policy directly. Thus, the technology described herein is generally directed towards automatically learning such models. The learning process is based in part on evaluating parameters, such as the probability that a failure will be fixed given an action.
  • To this end, as represented in FIG. 1, logs 110 of the recovery interactions (collected in a known manner) for an existing imperfect policy 112 are used in POMDP model learning (by a suitable algorithm) 114. These logs 110 contain sequences of the actions that the policy 112 executed and the observations that it received as input, typically error messages 108 from the process 102 or information about the outcome of a repair action 106. These logs 110 may be used to process the input messages as noisy observations over the true failures of the system in order to learn the model parameters using standard Hidden Markov Model (HMM) learning techniques, such as expectation maximization (EM). One example of EM is the forward-backward (Baum-Welch) algorithm, described by Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss in “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” The Annals of Mathematical Statistics, 41(1):164-171, 1970.
  • Given the learned POMDP model 115/parameters, an (approximately) optimal policy 118 for the learned model 115 may be computed, as represented in FIG. 1 via block 116. More particularly, with the learned POMDP model parameters, a policy 118 that optimizes repair costs may be computed, and then applied to the system as needed thereafter.
  • In one implementation, an approximate policy 118 may be rapidly computed via a point-based value iteration algorithm, (such as described by Joelle Pineau, Geoffrey Gordon, and Sebastian Thrun, in “Point-based value iteration: An anytime algorithm for pomdps”; In International Joint Conference on Artificial Intelligence (IJCAI), pages 1025-1032, August 2003). As this policy 118 is learned from the interactions logged from the existing policy 112, it is not guaranteed to be optimal for that system, yet it typically improves upon the previous policy. The improvement may be verified before the new policy 118 is implemented, as described below.
  • Turning to additional details, repair actions may succeed or fail stochastically, and often provide an escalating behavior. Actions may be labeled using increasing levels, where problems fixed by an action at level i are also fixed by any action of level j>i. Probabilistically, this means that if j>i then pr(healthy|a_j, e) > pr(healthy|a_i, e) for any error e. Action costs are also typically escalating, where lower level actions that fix minor problems are relatively cheap, while higher level actions are more expensive. In many real world systems this escalation can be considered roughly exponential. For example, restarting a service takes five seconds, rebooting a machine takes approximately ten minutes, while re-imaging the machine takes about two hours.
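  • By way of a non-limiting illustration, the following sketch (in Python) encodes such an escalating action set; the action names, levels and costs are assumed values chosen only to mirror the example durations above, not data from any particular system.

```python
# Illustrative (assumed) escalating repair actions; costs are durations in minutes.
# Higher levels fix a superset of the problems fixed by lower levels, at higher cost.
REPAIR_ACTIONS = [
    # (level, name, cost_in_minutes)
    (0, "restart_service", 5.0 / 60.0),   # ~5 seconds
    (1, "reboot",          10.0),         # ~10 minutes
    (2, "reimage",         120.0),        # ~2 hours
]

def action_cost(level: int) -> float:
    """Return the (assumed) cost, in minutes, of the repair action at a level."""
    return REPAIR_ACTIONS[level][2]
```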
  • Another stochastic feature is inexact failure detection. A watchdog may report an error for a machine that is fully operational, or report a “healthy” status for a machine that experiences a failure.
  • In view of the escalating nature of actions and costs, one choice for a policy is an escalation policy. Such policies choose a starting entry level based on the first observation, and execute an action at that level. In many cases, due to the non-deterministic success of repair actions, each action is tried several times. After the controller decides that the action at the current level cannot fix the problem, the controller escalates to the next (more costly) action level, and so on.
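  • The following sketch illustrates one plausible encoding of such an escalation policy, assuming a table mapping each initial observation to an entry level and a per-level retry budget; the table contents, class name and helper names are hypothetical.

```python
from typing import Dict

# Hypothetical escalation-policy parameters: entry level per first observation,
# and how many times each level is retried before escalating.
ENTRY_LEVEL: Dict[str, int] = {"SoftwareError": 0, "HardwareError": 1}
MAX_RETRIES_PER_LEVEL: Dict[int, int] = {0: 3, 1: 2, 2: 1}
NUM_LEVELS = 3

class EscalationPolicy:
    """Chooses an entry level from the first observation, retries the action at
    the current level, and escalates when the retry budget is exhausted."""

    def __init__(self, first_observation: str):
        self.level = ENTRY_LEVEL.get(first_observation, 0)
        self.retries_left = MAX_RETRIES_PER_LEVEL[self.level]

    def next_action(self) -> int:
        """Return the action level to execute next; called each time the previous
        attempt is observed to have failed."""
        if self.retries_left == 0 and self.level + 1 < NUM_LEVELS:
            self.level += 1
            self.retries_left = MAX_RETRIES_PER_LEVEL[self.level]
        self.retries_left -= 1
        return self.level
```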
  • FIG. 2 summarizes some example steps of automated policy learning, such as to learn many of these variable features based on actual observations. For example, given an observation, the entry level for that observation and the number of retries of each action before an escalation occurs may be optimized, via the learning algorithm 114 that uses the logs 110 of the controller execution collected by system administrators. Steps 202 and 204 of FIG. 2 represent executing the existing policy and collecting the log, respectively, used in learning.
  • More particularly, the learning algorithm takes as input a log L of repair sessions, in which each repair session comprises a sequence l = o0, a1, o1, . . . , an, on, starting with an error notification, followed by a set of repair actions and observations until the problem is fixed. In some cases, sessions end with the machine declared as “dead,” but in practice a technician is called for these machines, repairing or replacing them. Therefore, it can be assumed that the sessions end successfully in the healthy state.
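  • A repair session of this form may be represented directly as an alternating observation/action record. The sketch below shows one plausible in-memory representation of such a log; the field names and the toy sessions are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RepairSession:
    """One logged repair session: o0, a1, o1, ..., an, on.

    observations[0] is the initial error notification; actions[i] is followed by
    observations[i + 1].  Sessions are assumed to end in the healthy state.
    """
    observations: List[str]
    actions: List[str]

# A toy log with two (made-up) sessions.
LOG: List[RepairSession] = [
    RepairSession(observations=["SoftwareError", "SoftwareError", "Healthy"],
                  actions=["restart_service", "reboot"]),
    RepairSession(observations=["HardwareError", "Healthy"],
                  actions=["reimage"]),
]
```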
  • To learn the policies from the system logs, various alternatives for computing the recovery policy may be used, as represented in FIG. 2 by step 206 (likely not an actual step, but a determination made beforehand for a domain). One alternative is to begin with a simple, model-free, history-based policy computation, as represented via step 208. Another is a method that learns the POMDP model parameters (step 210), and then uses the POMDP to compute a policy (step 212).
  • With respect to model-free learning (of Q-values) at step 208, an optimal policy can be expressed as a mapping from action-observation histories to actions. More particularly, histories are directly observable, allowing use of the standard Q function terminology, where Q(h,a) is the expected cost of executing action a with history h and continuing the session until it terminates. This approach is known as model-free, because, for example, the parameters of a POMDP are never learned; note that histories are directly observable, and do not require any assumption about the unobserved state space.
  • The system log L is used to compute Q:
  • Cost(l_i) = Σ_{j=i+1}^{|l|} C(a_j)  (1)
  • Q(h,a) = [ Σ_{l∈L} δ(h+a, l) · Cost(l_{|h|}) ] / [ Σ_{l∈L} δ(h+a, l) ]  (2)
  • where li is a suffix of l starting at action ai, C(a) is the cost of action a, h+a is the history h with the action a appended at its end, and δ(h,l)=1 if h is a prefix of l and 0 otherwise. The Q function is the average cost until repair of executing the action a in history h, under the policy that generated L. Learning a Q function is much faster than learning the POMDP parameters, requiring only a single pass over the training sequences in the system log.
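  • The following sketch illustrates this single-pass, model-free computation, implementing equations (1) and (2) under the assumption that each logged session is stored as a flat alternating sequence o0, a1, o1, . . . and that per-action costs are known; the names and cost values are illustrative.

```python
from typing import Dict, List, Tuple

# Assumed per-action costs in minutes (illustrative values).
ACTION_COST: Dict[str, float] = {"restart_service": 0.1, "reboot": 10.0, "reimage": 120.0}

# A session is a flat alternating sequence: o0, a1, o1, ..., an, on.
Session = Tuple[str, ...]

def remaining_cost(session: Session, num_actions_taken: int) -> float:
    """Cost(l_i): total cost of the session's actions from action i+1 onward."""
    actions = session[1::2]                       # actions sit at the odd positions
    return sum(ACTION_COST[a] for a in actions[num_actions_taken:])

def q_value(history: Session, action: str, log: List[Session]) -> float:
    """Q(h, a): average cost-to-repair of doing `action` after `history`,
    under the policy that generated the log (equation (2))."""
    prefix = history + (action,)
    num_actions_in_h = len(history) // 2          # history is o0, a1, o1, ..., ending with an observation
    matches = [l for l in log if l[:len(prefix)] == prefix]   # sessions with delta(h+a, l) = 1
    if not matches:
        return float("inf")                       # h + a never observed in the log
    return sum(remaining_cost(l, num_actions_in_h) for l in matches) / len(matches)
```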
  • Given the learned Q function, the following policy may be defined:
  • π_Q(h) = argmin_a Q(h, a)  (3)
  • Note that a problem of learning a direct mapping from history to actions is that such policies do not generalize; that is, if a history sequence was not observed in the logs, then the expected cost until the error is repaired cannot be evaluated. One approach that generalizes well is to use a finite history window of size k, discarding the observations and actions occurring more than k steps ago. For example, when k=1 the result is a completely reactive Q function, computing Q(o,a) using the last observation only.
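  • A finite window may be implemented by truncating each history before the Q lookup. The small sketch below assumes that a window of size k means the last observation plus the k-1 preceding action/observation pairs, so that k=1 keeps only the last observation; this reading of "size k" is an assumption.

```python
from typing import Tuple

Session = Tuple[str, ...]

def truncate_history(history: Session, k: int) -> Session:
    """Keep only the last k-1 (action, observation) steps plus the latest observation,
    so that previously unseen long histories can still be matched against the log."""
    # history alternates o0, a1, o1, ...; each retained step contributes two entries.
    return history[max(0, len(history) - (2 * k - 1)):]

# With k = 1 this yields a purely reactive mapping: Q(o, a) uses only the last observation.
assert truncate_history(("SwErr", "reboot", "SwErr", "reboot", "HwErr"), 1) == ("HwErr",)
```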
  • With respect to model-based policy learning as represented by step 210, while it is assumed that the behavior of a machine can be captured perfectly using a POMDP as described above, in practice the parameters of the POMDP are not known beforehand. The parameters that are known are the set of possible repair actions and the set of possible observations; however, even the number of possible errors is not initially known, let alone the probability of repair or observation.
  • Given the log of repair sessions, a learning algorithm learns the parameters of the POMDP. As described above, an adapted Baum-Welch algorithm may be used, comprising an EM algorithm developed for computing the parameters of Hidden Markov Models (HMMs). The Baum-Welch algorithm takes as input the number of states (the number of possible errors) and a set of training sequences. Then, using the forward-backward procedure, the parameters of the POMDP are computed, attempting to maximize the likelihood of the data (the observation sequences). After the POMDP parameters have been learned, the above-described “Perseus” solver algorithm is executed to compute a policy.
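  • The sketch below gives a simplified, illustrative version of such an EM procedure rather than the exact adapted algorithm: it treats the logged actions as exogenous inputs, conditions the transition and observation probabilities on the executed action, ignores the initial error notification and action costs, and uses random initialization; all names and the array layout are assumptions.

```python
import numpy as np
from typing import List, Tuple

# One training sequence: integer-coded (actions, observations), where observations[t]
# is the message received after executing actions[t].
Sequence = Tuple[List[int], List[int]]

def baum_welch_pomdp(sequences: List[Sequence], n_states: int, n_actions: int,
                     n_obs: int, n_iters: int = 50, seed: int = 0):
    """EM (Baum-Welch style) sketch for an action-conditioned hidden Markov model:
    estimates an initial state distribution pi, transitions T[a, s, s'] and
    observation probabilities O[a, s', o] that locally maximize data likelihood."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(n_states))
    T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # rows T[a, s, :] sum to 1
    O = rng.dirichlet(np.ones(n_obs), size=(n_actions, n_states))     # rows O[a, s', :] sum to 1
    eps = 1e-300

    for _ in range(n_iters):
        pi_acc = np.zeros(n_states)
        T_num = np.zeros((n_actions, n_states, n_states))
        T_den = np.zeros((n_actions, n_states))
        O_num = np.zeros((n_actions, n_states, n_obs))
        O_den = np.zeros((n_actions, n_states))

        for actions, obs in sequences:
            n = len(actions)
            # Forward pass (scaled for numerical stability).
            alpha = np.zeros((n + 1, n_states))
            alpha[0] = pi
            for t in range(1, n + 1):
                a, o = actions[t - 1], obs[t - 1]
                alpha[t] = (alpha[t - 1] @ T[a]) * O[a][:, o]
                alpha[t] /= alpha[t].sum() + eps
            # Backward pass (scaled).
            beta = np.ones((n + 1, n_states))
            for t in range(n, 0, -1):
                a, o = actions[t - 1], obs[t - 1]
                beta[t - 1] = T[a] @ (O[a][:, o] * beta[t])
                beta[t - 1] /= beta[t - 1].sum() + eps
            # E-step: accumulate expected state and transition counts.
            gamma = alpha * beta
            gamma /= gamma.sum(axis=1, keepdims=True) + eps
            pi_acc += gamma[0]
            for t in range(1, n + 1):
                a, o = actions[t - 1], obs[t - 1]
                xi = alpha[t - 1][:, None] * T[a] * (O[a][:, o] * beta[t])[None, :]
                xi /= xi.sum() + eps
                T_num[a] += xi
                T_den[a] += xi.sum(axis=1)
                O_num[a][:, o] += gamma[t]
                O_den[a] += gamma[t]

        # M-step: re-estimate the parameters from the expected counts.
        pi = pi_acc / (pi_acc.sum() + eps)
        T = T_num / (T_den[:, :, None] + eps)
        O = O_num / (O_den[:, :, None] + eps)
    return pi, T, O
```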
  • To model error recovery given the problem features above, a cost-based POMDP may be defined through a tuple <S, A, tr, C, Ω, O>, where S is a set of states. A factored representation may be adopted, with s=<e_0, . . . , e_n>, where e_i ∈ {0,1} indicates whether error i exists. That is, states are sets of failures, or errors of a machine, such as a software error or a hardware failure. Another state representing the healthy state may be added, that is, s_H=<0, . . . , 0>. In the tuple, A is a set of actions, such as rebooting a machine or re-imaging it; tr(s,a,s′) is a state transition function, specifying the probabilities of moving between states. The transition function is restricted such that tr(s,a,s′)>0 if and only if, for every i, s_i=0 implies s′_i=0. That is, an action may only fix an error, not generate new errors. C(s,a) is a cost function, assigning a cost to each state-action pair. Often, costs can be measured as the time (minutes) for executing the action. For example, a reboot may take ten minutes, while re-imaging takes two hours.
  • Ω represents a set of possible observations. For example, observations are messages from the watchdogs 109, such as a notification of a hard disk failure, or a service reporting an error, and notifications about the success or failure of an action. O(a,s′,o) is an observation function, assigning a probability pr(o|a,s′) to each observation o.
  • In a POMDP, the true state is not directly observable and thus a belief state b∈B is maintained, comprising a probability distribution over states, where b(s) is the probability that the system is at state s. Every repair session is assumed to start with an error observation, typically provided by one of the watchdogs 109, whereby b_0^o is defined as the prior distribution over states given an initial observation o. Also maintained is a probability distribution pr_0(o) over initial observations. While this probability distribution is not used in model learning, it is useful for evaluating the quality of policies, e.g., through trials.
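  • Given a learned model, the belief state is maintained by a standard Bayes update: after executing action a and receiving observation o, the new belief is b′(s′) ∝ O(a,s′,o) · Σ_s tr(s,a,s′) b(s). A minimal sketch, assuming the array layout of the learning example above, is:

```python
import numpy as np

def belief_update(b: np.ndarray, a: int, o: int,
                  T: np.ndarray, O: np.ndarray) -> np.ndarray:
    """Bayes update of the belief state after executing action a and observing o.

    b: current belief, shape (S,);  T[a, s, s'] transition probabilities;
    O[a, s', o] observation probabilities (same layout as the learning sketch).
    Returns b'(s') proportional to O(a, s', o) * sum_s T(s, a, s') * b(s).
    """
    b_next = (b @ T[a]) * O[a][:, o]
    total = b_next.sum()
    if total == 0.0:
        raise ValueError("observation o has zero probability under the model")
    return b_next / total
```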
  • It is convenient to define a policy for a POMDP as a mapping from belief states to actions π:B→A. The general goal is to find an optimal policy that brings the machine to the healthy state with the minimal cost. One method for computing a policy is through a value function, V, assigning a value to each belief state b. Such a value function can be expressed as a set of |S|-dimensional vectors known as α-vectors, i.e., V={α_1, . . . , α_n}. Then, α_b = argmin_{α∈V} α·b is the optimal α-vector for belief state b, and V(b)=b·α_b is the value that the value function V assigns to b, where α·b=Σ_i α_i b_i is the standard vector inner product. By associating an action a(α) with each vector, a policy π:B→A can be defined through π(b)=a(α_b).
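  • A minimal sketch of evaluating such a cost-based value function and extracting the corresponding action at a belief state (array layout and names assumed) is:

```python
import numpy as np
from typing import List, Tuple

AlphaVector = Tuple[np.ndarray, int]   # (vector over states, associated action a(alpha))

def value_and_action(b: np.ndarray, alpha_vectors: List[AlphaVector]) -> Tuple[float, int]:
    """Evaluate a cost-based value function V = {alpha_1, ..., alpha_n} at belief b.

    For costs, the optimal vector is the one minimizing the inner product
    alpha . b, and the greedy policy returns its associated action.
    """
    scores = [float(np.dot(alpha, b)) for alpha, _ in alpha_vectors]
    best = int(np.argmin(scores))                 # alpha_b = argmin over V of alpha . b
    return scores[best], alpha_vectors[best][1]   # (V(b), pi(b) = a(alpha_b))
```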
  • The value function may be updated by creating a single α-vector that is optimized for a specific belief state. Such methods, known as point-based value iteration, compute a value function over a finite set of belief states, resulting in a finite-size value function. A relatively fast point-based solver that incrementally updates a value function over a randomly chosen set of belief points is described by Matthijs T. J. Spaan and Nikos Vlassis in “Perseus: Randomized point-based value iteration for POMDPs,” Journal of Artificial Intelligence Research, 24:195-220 (2005). This solver ensures that at each iteration, the value for each belief state is improved, while maintaining a compact value function representation.
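  • The core operation of such point-based solvers is the backup at a single belief point. The sketch below shows this backup for a cost-based (minimizing) POMDP; the surrounding Perseus loop, which sweeps a randomly chosen belief set and keeps only improving vectors, is omitted, and the array layout is the same assumption as in the earlier sketches.

```python
import numpy as np
from typing import List, Tuple

AlphaVector = Tuple[np.ndarray, int]   # (vector over states, associated action)

def point_based_backup(b: np.ndarray, V: List[AlphaVector],
                       T: np.ndarray, O: np.ndarray, C: np.ndarray,
                       gamma: float = 1.0) -> AlphaVector:
    """One point-based backup at belief b for a cost-based POMDP (minimization).

    T[a, s, s'], O[a, s', o], C[s, a] as in the earlier sketches; V must be non-empty.
    Returns a new alpha-vector optimized for b, together with its action.
    """
    n_actions = T.shape[0]
    n_obs = O.shape[2]
    best_vec, best_action, best_value = None, None, np.inf
    for a in range(n_actions):
        alpha_a = C[:, a].astype(float)
        for o in range(n_obs):
            # g(s) = sum_{s'} T[a,s,s'] * O[a,s',o] * alpha(s'), for each alpha in V;
            # keep the candidate with the smallest expected cost at b.
            candidates = [T[a] @ (O[a][:, o] * alpha) for alpha, _ in V]
            g_best = min(candidates, key=lambda g: float(np.dot(b, g)))
            alpha_a = alpha_a + gamma * g_best
        value = float(np.dot(b, alpha_a))
        if value < best_value:
            best_vec, best_action, best_value = alpha_a, a, value
    return best_vec, best_action
```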
  • The indefinite horizon POMDP framework (described by Eric A. Hansen in “Indefinite-horizon pomdps with action-based termination,” AAAI, pages 1237-1242, 2007) is adopted as being appropriate for failure recovery. In this framework the POMDP has a single special action aT, available in any state, that terminates the repair session, such as by calling a technician, deterministically repairing the machine, but with a huge cost. Executing aT in sH incurs no cost. The indefinite-horizon formulation defines a lower bound on the value function using aT, and then executes any point-based algorithm, such as the above-described “Perseus” solver algorithm (step 212).
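  • The termination action aT yields an initial α-vector that bounds the value function, since calling a technician repairs the machine from any state at a fixed (large) cost and costs nothing in the healthy state; a small sketch of that initialization (with an assumed cost value) is:

```python
import numpy as np

def terminate_action_bound(cost_of_technician: float, healthy_state: int, n_states: int):
    """Initial value function for the indefinite-horizon formulation: a single
    alpha-vector for the terminate action a_T (e.g., calling a technician), which
    deterministically repairs the machine at a large cost and costs nothing in
    the healthy state.  The cost value passed in is an illustrative assumption."""
    alpha_T = np.full(n_states, cost_of_technician, dtype=float)
    alpha_T[healthy_state] = 0.0
    return [(alpha_T, -1)]   # -1 used here as an (assumed) index for a_T

# Example: V = terminate_action_bound(cost_of_technician=480.0, healthy_state=0, n_states=4)
```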
  • While training the model parameters, likelihood may be tested on a held out set of sequences that are not used in training, in order to ensure that the resulting model does not over-fit the data. Further, when employing automatic learning methods to create an improved policy, evidence for the quality of the learned models is provided to help system administrators make a decision whether to replace the existing policy with a new policy. Using an imperfect learner such as Baum-Welch does not guarantee that the resulting model indeed maximizes the likelihood of the observations given the policy, even for the same policy that was used to generate the training data. Also, the loss function used for learning the model ignores action costs, thus ignoring this aspect of the problem.
  • For these reasons, it is possible that the resulting model will describe the domain poorly. Thus, as represented by step 214 some evaluation outside of the domain (or performed on some subset of the domain machines) may be performed before actually implementing the new policy corresponding to the model. The average cost may be used to provide evidence for the validity of the model, including after being implemented. As represented via step 216, if the new policy does not pass the test/evaluation, then the existing policy is kept, (step 218), until some change such as more logged information results in an actually improved policy being created at some later time. If the new policy is tested/evaluated and determined to be improved, then the new policy replaces the existing policy (step 220); note that this policy then becomes the existing policy which itself may become improved at a later time.
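  • One way to produce such evidence before switching policies is to estimate the average cost-to-repair of each policy, either by replaying held-out log sessions or by simulating repair sessions under the learned model; the Monte-Carlo sketch below assumes the array layout of the earlier examples and a policy given as a function from belief states to actions.

```python
import numpy as np

def average_repair_cost(policy, T, O, C, pr0_obs, b0, healthy_state: int,
                        n_trials: int = 1000, max_steps: int = 50, seed: int = 0) -> float:
    """Monte-Carlo estimate of the expected cost-to-repair of a policy.

    policy(b) -> action;  T[a, s, s'], O[a, s', o], C[s, a] as before;
    pr0_obs[o] is the distribution over initial observations and b0[o] the
    corresponding prior belief over states (both assumed derived from the log).
    """
    rng = np.random.default_rng(seed)
    n_states = T.shape[1]
    total = 0.0
    for _ in range(n_trials):
        o = rng.choice(len(pr0_obs), p=pr0_obs)   # initial error notification
        b = b0[o].copy()
        s = rng.choice(n_states, p=b)             # sample a hidden start state
        cost = 0.0
        for _ in range(max_steps):
            if s == healthy_state:
                break
            a = policy(b)
            cost += C[s, a]
            s = rng.choice(n_states, p=T[a, s])   # hidden state transition
            o = rng.choice(O.shape[2], p=O[a, s]) # noisy observation of the new state
            b = (b @ T[a]) * O[a][:, o]           # belief update (as in the earlier sketch)
            b = b / (b.sum() + 1e-300)
        total += cost
    return total / n_trials
```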
  • There is thus described herein a passive policy learning approach that uses available information to improve an existing repair policy. In one implementation, the approach adopts the indefinite-horizon POMDP formalization, and uses the existing controller's logs to learn the unknown model parameters, using an EM algorithm (e.g., an adapted Baum-Welch algorithm). The computed policy for the learned model may then be used in a data center in place of the manually built controller.
  • Exemplary Operating Environment
  • FIG. 3 illustrates an example of a suitable computing and networking environment 300 on which the examples of FIGS. 1 and 2 may be implemented. The computing system environment 300 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 300 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 300.
  • The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
  • The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
  • With reference to FIG. 3, an exemplary system for implementing various aspects of the invention may include a general purpose computing device in the form of a computer 310. Components of the computer 310 may include, but are not limited to, a processing unit 320, a system memory 330, and a system bus 321 that couples various system components including the system memory to the processing unit 320. The system bus 321 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
  • The computer 310 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 310 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 310. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
  • The system memory 330 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 331 and random access memory (RAM) 332. A basic input/output system 333 (BIOS), containing the basic routines that help to transfer information between elements within computer 310, such as during start-up, is typically stored in ROM 331. RAM 332 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 320. By way of example, and not limitation, FIG. 3 illustrates operating system 334, application programs 335, other program modules 336 and program data 337.
  • The computer 310 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 3 illustrates a hard disk drive 341 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 351 that reads from or writes to a removable, nonvolatile magnetic disk 352, and an optical disk drive 355 that reads from or writes to a removable, nonvolatile optical disk 356 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 341 is typically connected to the system bus 321 through a non-removable memory interface such as interface 340, and magnetic disk drive 351 and optical disk drive 355 are typically connected to the system bus 321 by a removable memory interface, such as interface 350.
  • The drives and their associated computer storage media, described above and illustrated in FIG. 3, provide storage of computer-readable instructions, data structures, program modules and other data for the computer 310. In FIG. 3, for example, hard disk drive 341 is illustrated as storing operating system 344, application programs 345, other program modules 346 and program data 347. Note that these components can either be the same as or different from operating system 334, application programs 335, other program modules 336, and program data 337. Operating system 344, application programs 345, other program modules 346, and program data 347 are given different numbers herein to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 310 through input devices such as a tablet, or electronic digitizer, 364, a microphone 363, a keyboard 362 and pointing device 361, commonly referred to as a mouse, trackball or touch pad. Other input devices not shown in FIG. 3 may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 320 through a user input interface 360 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 391 or other type of display device is also connected to the system bus 321 via an interface, such as a video interface 390. The monitor 391 may also be integrated with a touch-screen panel or the like. Note that the monitor and/or touch screen panel can be physically coupled to a housing in which the computing device 310 is incorporated, such as in a tablet-type personal computer. In addition, computers such as the computing device 310 may also include other peripheral output devices such as speakers 395 and printer 396, which may be connected through an output peripheral interface 394 or the like.
  • The computer 310 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 380. The remote computer 380 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 310, although only a memory storage device 381 has been illustrated in FIG. 3. The logical connections depicted in FIG. 3 include one or more local area networks (LAN) 371 and one or more wide area networks (WAN) 373, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
  • When used in a LAN networking environment, the computer 310 is connected to the LAN 371 through a network interface or adapter 370. When used in a WAN networking environment, the computer 310 typically includes a modem 372 or other means for establishing communications over the WAN 373, such as the Internet. The modem 372, which may be internal or external, may be connected to the system bus 321 via the user input interface 360 or other appropriate mechanism. A wireless networking component such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 310, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 3 illustrates remote application programs 385 as residing on memory device 381. It may be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • An auxiliary subsystem 399 (e.g., for auxiliary display of content) may be connected via the user interface 360 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 399 may be connected to the modem 372 and/or network interface 370 to allow communication between these systems while the main processing unit 320 is in a low power state.
  • CONCLUSION
  • While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.

Claims (20)

1. A system comprising:
a model learning component configured to access collected observable interactions of an existing repair policy with a process to build a model of the process, the model mapping states of the process to repair actions of the existing repair policy;
a policy computation component configured to compute a new policy based upon the model, the new policy identifying a number of times to retry a first one of the repair actions when the process is in a first one of the states of the process; and
a controller configured to apply the new policy to the process and, in an instance when the first state is identified, retry the first repair action the number of times identified by the new policy before escalating the first repair action to a second one of the repair actions; and
one or more processing units configured to execute at least one of the model learning component, the policy computation component, or the controller.
2. The system of claim 1 further comprising a collection component configured to collect the observable interactions into a data structure for access by the model learning component.
3. (canceled)
4. The system of claim 3 wherein the new policy performs automatic fault recovery.
5. The system of claim 1, wherein the model comprises a partially observable Markov decision process.
6. The system of claim 5, wherein the model learning component uses an expectation maximization algorithm to learn the partially observable Markov decision process.
7. The system of claim 5, wherein the policy computation component computes the new policy using a point-based value iteration algorithm.
8. The system of claim 7, wherein the point-based value iteration algorithm is executed by a cost-based indefinite-horizon formalization.
9. The system of claim 1, wherein the process corresponds to a computing machine in a datacenter.
10-20. (canceled)
21. A method comprising:
accessing collected observable interactions of an existing repair policy with a process to build a model of the process, the model mapping states of the process to repair actions of the existing repair policy;
computing a new policy based upon the model, the new policy identifying a number of times to retry a first one of the repair actions when the process is in a first one of the states of the process; and
applying the new policy to the process and, in an instance when the first state is identified, retrying the first repair action the number of times identified by the new policy before escalating the first repair action to a second one of the repair actions.
22. The method according to claim 21, the first state reflecting an error message generated by the process.
23. The method according to claim 21, the observable interactions being accessed from a recovery log associated with the process.
24. The method according to claim 21, the model being computed based on a first action cost associated with the first repair action and a second action cost associated with the second repair action.
25. The method according to claim 24, the second action cost being higher than the first action cost.
26. One or more computer-readable storage devices comprising instructions which, when executed by one or more processing units, cause the one or more processing units to perform:
accessing collected observable interactions of an existing repair policy with a process to build a model of the process, the model mapping states of the process to repair actions of the existing repair policy;
computing a new policy based upon the model, the new policy identifying a number of times to retry a first one of the repair actions when the process is in a first one of the states of the process; and
applying the new policy to the process and, in an instance when the first state is identified, retrying the first repair action the number of times identified by the new policy before escalating the first repair action to a second one of the repair actions.
27. The one or more computer-readable storage devices according to claim 26, the first state reflecting an error message generated by the process.
28. The one or more computer-readable storage devices according to claim 26, the observable interactions being accessed from a recovery log associated with the process.
29. The one or more computer-readable storage devices according to claim 26, the model being computed based on a first action cost associated with the first repair action and a second action cost associated with the second repair action.
30. The one or more computer-readable storage devices according to claim 29, the second action cost being higher than the first action cost.
US12/713,195 2010-02-26 2010-02-26 Automated learning of failure recovery policies Active US8024611B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/713,195 US8024611B1 (en) 2010-02-26 2010-02-26 Automated learning of failure recovery policies

Publications (2)

Publication Number Publication Date
US20110214006A1 true US20110214006A1 (en) 2011-09-01
US8024611B1 US8024611B1 (en) 2011-09-20

Family

ID=44505950

Family Applications (1)

Application Number Title Priority Date Filing Date
US12/713,195 Active US8024611B1 (en) 2010-02-26 2010-02-26 Automated learning of failure recovery policies

Country Status (1)

Country Link
US (1) US8024611B1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130288222A1 (en) * 2012-04-27 2013-10-31 E. Webb Stacy Systems and methods to customize student instruction
US20140310564A1 (en) * 2013-04-15 2014-10-16 Centurylink Intellectual Property Llc Autonomous Service Management
EP2766809A4 (en) * 2011-10-10 2015-08-05 Hewlett Packard Development Co Methods and systems for identifying action for responding to anomaly in cloud computing system
US20150262231A1 (en) * 2014-03-14 2015-09-17 International Business Machines Corporation Generating apparatus, generation method, information processing method and program
US9448811B2 (en) 2011-11-23 2016-09-20 Freescale Semiconductor, Inc. Microprocessor device, and method of managing reset events therefor
US10255124B1 (en) * 2013-06-21 2019-04-09 Amazon Technologies, Inc. Determining abnormal conditions of host state from log files through Markov modeling
US10324779B1 (en) 2013-06-21 2019-06-18 Amazon Technologies, Inc. Using unsupervised learning to monitor changes in fleet behavior
US10438156B2 (en) 2013-03-13 2019-10-08 Aptima, Inc. Systems and methods to provide training guidance
WO2019240229A1 (en) * 2018-06-14 2019-12-19 日本電信電話株式会社 System state estimation device, system state estimation method, and program
US10552764B1 (en) * 2012-04-27 2020-02-04 Aptima, Inc. Machine learning system for a training model of an adaptive trainer
EP3605953A4 (en) * 2017-03-29 2020-02-26 KDDI Corporation Failure automatic recovery system, control device, procedure creation device, and computer-readable storage medium
US10613954B1 (en) * 2013-07-01 2020-04-07 Amazon Technologies, Inc. Testing framework for host computing devices
US10846606B2 (en) 2008-03-12 2020-11-24 Aptima, Inc. Probabilistic decision making system and methods of use
US20210149766A1 (en) * 2019-11-15 2021-05-20 Microsoft Technology Licensing, Llc Supervised reimaging of vulnerable computing devices with prioritization, auto healing, and pattern detection
US20210192297A1 (en) * 2019-12-19 2021-06-24 Raytheon Company Reinforcement learning system and method for generating a decision policy including failsafe
US11176473B2 (en) * 2017-01-06 2021-11-16 International Business Machines Corporation Partially observed Markov decision process model and its use
US20230023869A1 (en) * 2021-07-23 2023-01-26 Dell Products, L.P. System and method for providing intelligent assistance using a warranty bot

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9727441B2 (en) * 2011-08-12 2017-08-08 Microsoft Technology Licensing, Llc Generating dependency graphs for analyzing program behavior
US9836697B2 (en) * 2014-10-06 2017-12-05 International Business Machines Corporation Determining variable optimal policy for an MDP paradigm
US10839302B2 (en) 2015-11-24 2020-11-17 The Research Foundation For The State University Of New York Approximate value iteration with complex returns by bounding
JP6684243B2 (en) * 2017-03-30 2020-04-22 Kddi株式会社 Failure recovery procedure optimization system and failure recovery procedure optimization method
US10419274B2 (en) 2017-12-08 2019-09-17 At&T Intellectual Property I, L.P. System facilitating prediction, detection and mitigation of network or device issues in communication systems
US11372572B2 (en) 2020-05-08 2022-06-28 Kyndryl, Inc. Self-relocating data center based on predicted events

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5465321A (en) * 1993-04-07 1995-11-07 The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration Hidden markov models for fault detection in dynamic systems
US6009246A (en) * 1997-01-13 1999-12-28 International Business Machines Corporation Method and system for evaluating intrusive repair for plurality of devices
US6076083A (en) * 1995-08-20 2000-06-13 Baker; Michelle Diagnostic system utilizing a Bayesian network model having link weights updated experimentally
US6226627B1 (en) * 1998-04-17 2001-05-01 Fuji Xerox Co., Ltd. Method and system for constructing adaptive and resilient software
US20020019870A1 (en) * 2000-06-29 2002-02-14 International Business Machines Corporation Proactive on-line diagnostics in a manageable network
US6535865B1 (en) * 1999-07-14 2003-03-18 Hewlett Packard Company Automated diagnosis of printer systems using Bayesian networks
US20030093514A1 (en) * 2001-09-13 2003-05-15 Alfonso De Jesus Valdes Prioritizing bayes network alerts
US20050028033A1 (en) * 2003-07-31 2005-02-03 The Boeing Company Method, apparatus and computer program product for constructing a diagnostic network model
US20050160324A1 (en) * 2003-12-24 2005-07-21 The Boeing Company, A Delaware Corporation Automatic generation of baysian diagnostics from fault trees
US20050240796A1 (en) * 2004-04-02 2005-10-27 Dziong Zbigniew M Link-based recovery with demand granularity in mesh networks
US20070288795A1 (en) * 2003-09-17 2007-12-13 Leung Ying T Diagnosis of equipment failures using an integrated approach of case based reasoning and reliability analysis
US20080010483A1 (en) * 2006-05-11 2008-01-10 Fuji Xerox Co., Ltd. Computer readable medium storing an error recovery program, error recovery method, error recovery apparatus, and computer system
US20080059839A1 (en) * 2003-10-31 2008-03-06 Imclone Systems Incorporation Intelligent Integrated Diagnostics
US7451340B2 (en) * 2003-03-31 2008-11-11 Lucent Technologies Inc. Connection set-up extension for restoration path establishment in mesh networks
US20080300879A1 (en) * 2007-06-01 2008-12-04 Xerox Corporation Factorial hidden markov model with discrete observations
US7536595B1 (en) * 2005-10-19 2009-05-19 At&T Intellectual Property, Ii, L.P. Systems, devices, and methods for initiating recovery
US7539748B2 (en) * 2003-05-16 2009-05-26 Time Warner Cable, A Division Of Time Warner Entertainment Company, L.P. Data transfer application monitor and controller

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10846606B2 (en) 2008-03-12 2020-11-24 Aptima, Inc. Probabilistic decision making system and methods of use
EP2766809A4 (en) * 2011-10-10 2015-08-05 Hewlett Packard Development Co Methods and systems for identifying action for responding to anomaly in cloud computing system
US10599506B2 (en) 2011-10-10 2020-03-24 Hewlett Packard Enterprise Development Lp Methods and systems for identifying action for responding to anomaly in cloud computing system
US9448811B2 (en) 2011-11-23 2016-09-20 Freescale Semiconductor, Inc. Microprocessor device, and method of managing reset events therefor
US9733952B2 (en) 2011-11-23 2017-08-15 Nxp Usa, Inc. Microprocessor, and method of managing reset events therefor
US11188848B1 (en) 2012-04-27 2021-11-30 Aptima, Inc. Systems and methods for automated learning
US10552764B1 (en) * 2012-04-27 2020-02-04 Aptima, Inc. Machine learning system for a training model of an adaptive trainer
US20130288222A1 (en) * 2012-04-27 2013-10-31 E. Webb Stacy Systems and methods to customize student instruction
US10290221B2 (en) * 2012-04-27 2019-05-14 Aptima, Inc. Systems and methods to customize student instruction
US10438156B2 (en) 2013-03-13 2019-10-08 Aptima, Inc. Systems and methods to provide training guidance
US20140310564A1 (en) * 2013-04-15 2014-10-16 Centurylink Intellectual Property Llc Autonomous Service Management
US9292402B2 (en) * 2013-04-15 2016-03-22 Century Link Intellectual Property LLC Autonomous service management
US10324779B1 (en) 2013-06-21 2019-06-18 Amazon Technologies, Inc. Using unsupervised learning to monitor changes in fleet behavior
US10255124B1 (en) * 2013-06-21 2019-04-09 Amazon Technologies, Inc. Determining abnormal conditions of host state from log files through Markov modeling
US11263069B1 (en) 2013-06-21 2022-03-01 Amazon Technologies, Inc. Using unsupervised learning to monitor changes in fleet behavior
US10613954B1 (en) * 2013-07-01 2020-04-07 Amazon Technologies, Inc. Testing framework for host computing devices
US10929259B2 (en) 2013-07-01 2021-02-23 Amazon Technologies, Inc. Testing framework for host computing devices
US9858592B2 (en) * 2014-03-14 2018-01-02 International Business Machines Corporation Generating apparatus, generation method, information processing method and program
US9747616B2 (en) * 2014-03-14 2017-08-29 International Business Machines Corporation Generating apparatus, generation method, information processing method and program
US20150294354A1 (en) * 2014-03-14 2015-10-15 International Business Machines Corporation Generating apparatus, generation method, information processing method and program
US20150262231A1 (en) * 2014-03-14 2015-09-17 International Business Machines Corporation Generating apparatus, generation method, information processing method and program
US11176473B2 (en) * 2017-01-06 2021-11-16 International Business Machines Corporation Partially observed Markov decision process model and its use
US11080128B2 (en) 2017-03-29 2021-08-03 Kddi Corporation Automatic failure recovery system, control device, procedure creation device, and computer-readable storage medium
EP3605953A4 (en) * 2017-03-29 2020-02-26 KDDI Corporation Failure automatic recovery system, control device, procedure creation device, and computer-readable storage medium
WO2019240229A1 (en) * 2018-06-14 2019-12-19 日本電信電話株式会社 System state estimation device, system state estimation method, and program
JPWO2019240229A1 (en) * 2018-06-14 2021-06-10 日本電信電話株式会社 System state estimation device, system state estimation method, and program
JP6992896B2 (en) 2018-06-14 2022-01-13 日本電信電話株式会社 System state estimation device, system state estimation method, and program
US20210149766A1 (en) * 2019-11-15 2021-05-20 Microsoft Technology Licensing, Llc Supervised reimaging of vulnerable computing devices with prioritization, auto healing, and pattern detection
US20210192297A1 (en) * 2019-12-19 2021-06-24 Raytheon Company Reinforcement learning system and method for generating a decision policy including failsafe
US20230023869A1 (en) * 2021-07-23 2023-01-26 Dell Products, L.P. System and method for providing intelligent assistance using a warranty bot

Also Published As

Publication number Publication date
US8024611B1 (en) 2011-09-20

Similar Documents

Publication Publication Date Title
US8024611B1 (en) Automated learning of failure recovery policies
US7328376B2 (en) Error reporting to diagnostic engines based on their diagnostic capabilities
US8225142B2 (en) Method and system for tracepoint-based fault diagnosis and recovery
US10489232B1 (en) Data center diagnostic information
US7293202B2 (en) Isolating the evaluation of actual test results against expected test results from the test module that generates the actual test results
US10831579B2 (en) Error detecting device and error detecting method for detecting failure of hierarchical system, computer readable recording medium, and computer program product
JP5198154B2 (en) Fault monitoring system, device, monitoring apparatus, and fault monitoring method
US10635521B2 (en) Conversational problem determination based on bipartite graph
US10795793B1 (en) Method and system for simulating system failures using domain-specific language constructs
CN111680407A (en) Satellite health assessment method based on Gaussian mixture model
CN114168429A (en) Error reporting analysis method and device, computer equipment and storage medium
US20230376758A1 (en) Multi-modality root cause localization engine
CN116991615A (en) Cloud primary system fault self-healing method and device based on online learning
CN115700549A (en) Model training method, failure determination method, electronic device, and program product
Shani et al. Improving existing fault recovery policies
CN113010339A (en) Method and device for automatically processing fault in online transaction test
CN112733155B (en) Software forced safety protection method based on external environment model learning
Baysse et al. Hidden Markov Model for the detection of a degraded operating mode of optronic equipment
US20240061739A1 (en) Incremental causal discovery and root cause localization for online system fault diagnosis
Chandrasekaran et al. Test & Evaluation Best Practices for Machine Learning-Enabled Systems
Salfner et al. Architecting dependable systems with proactive fault management
CN117472474B (en) Configuration space debugging method, system, electronic equipment and storage medium
CN114844778B (en) Abnormality detection method and device for core network, electronic equipment and readable storage medium
CN116931843B (en) User online management system based on large language model
US11461007B2 (en) Method, device, and computer program product for determining failures and causes of storage devices in a storage system

Legal Events

Date Code Title Description
AS Assignment

Owner name: MICROSOFT CORPORATION, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MEEK, CHRISTOPHER A.;SHANI, GUY;SIGNING DATES FROM 20100224 TO 20100225;REEL/FRAME:023994/0663

STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034564/0001

Effective date: 20141014

FPAY Fee payment

Year of fee payment: 4

AS Assignment

Owner name: SERVICENOW, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT TECHNOLOGY LICENSING, LLC;REEL/FRAME:047681/0916

Effective date: 20181115

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8

AS Assignment

Owner name: SERVICENOW, INC., CALIFORNIA

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE RECORDAL TO REMOVE INADVERTENTLY RECOREDED PROPERTIES SHOWN IN ATTACHED SHEET PREVIOUSLY RECORDED AT REEL: 047681 FRAME: 0916. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNOR:MICROSOFT TECHNOLOGY LICENSING, LLC;REEL/FRAME:049797/0119

Effective date: 20181115

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 12