WO2015148328A1 - System and method for accelerating problem diagnosis in software/hardware deployments - Google Patents


Info

Publication number
WO2015148328A1
Authority
WO
WIPO (PCT)
Prior art keywords
investigation
symptoms
diagnosis
cases
diagnostic
Application number
PCT/US2015/021914
Other languages
French (fr)
Inventor
Rakesh Dhoopar
Julie Basu
Original Assignee
Diagknowlogy, Inc.
Application filed by Diagknowlogy, Inc. filed Critical Diagknowlogy, Inc.
Publication of WO2015148328A1 publication Critical patent/WO2015148328A1/en


Classifications

    • G06F11/0709: Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/079: Root cause analysis, i.e. error or fault diagnosis
    • G06Q10/101: Collaborative creation, e.g. joint development of products or services
    • G06Q50/205: Education administration or guidance
    • H04L41/0636: Management of faults, events, alarms or notifications using root cause analysis based on a decision tree analysis
    • H04L41/142: Network analysis or design using statistical or mathematical methods
    • H04L41/16: Arrangements for maintenance, administration or management of data switching networks using machine learning or artificial intelligence
    • H04L41/5009: Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF]

Definitions

  • the present invention relates to a system and method for problem troubleshooting in complex software/hardware deployments, such as Enterprise Applications, Cloud Applications, and Large Scale Application Services. More particularly, the present invention relates to a system and method for accelerating problem diagnosis in software/hardware deployments in enterprises/organizations and for efficiently identifying root causes and solutions for the diagnosed problems.
  • Modern software is generally designed to be modular and pluggable and this has transformed many of the software infrastructure pieces into "commodity" products.
  • enterprises often deploy their software using off-the-shelf software components that are de-facto standards, such as Linux servers, the Apache Web server, the Tomcat servlet engine, the MySQL database, Hadoop, etc. This implies that many software deployments today share common platform and infrastructure services. Different applications can run on top of these services, but overall there is a significant percentage of software that is common across enterprises. For such situations, leveraging "crowd-sourced" diagnostic knowledge from the community of users of common software stacks may be an effective troubleshooting approach that seems to be lacking in the current technologies.
  • Today's diagnosis process is akin to looking for a needle in a haystack.
  • the present invention focuses that search to "a box in the haystack" by generating a set of problem symptoms through automated analysis of machine data such as metrics and logs, and by guiding the user based on knowledge gathered from prior diagnostic efforts, while enabling effective team collaboration through sharing of interesting diagnostic artifacts found by users during problem investigation.
  • An objective of the present invention is to develop and provide a system and method for problem diagnosis in business critical Application deployments (such as in the Enterprise Data Center or in the Cloud), especially those having modern scale-out architectures.
  • Another objective of the present invention is to expedite problem diagnosis in the business application deployments by generating a set of problem symptoms through automated analysis of machine data such as metrics and logs and by using them for automated problem matching and knowledge sharing.
  • Yet another objective of the present invention is to expedite problem diagnosis by leveraging knowledge gathered from prior diagnostic efforts and by enabling effective team collaboration through sharing of interesting diagnostic artifacts found by users during problem investigation.
  • An additional objective of the present invention is to expedite problem diagnosis by providing systematic ways to store and share "crowd-sourced" diagnostic knowledge related to problem troubleshooting (such as problem symptoms and solutions) across a community of users and enterprises that have similar software/hardware deployments.
  • A further objective of the present invention is to expedite problem diagnosis by utilizing expert / domain knowledge collected from user input, such as in the creation of problem symptoms and clues, and also automatically learned from tracking user actions during problem analysis.
  • a system and a method are disclosed to accelerate problem troubleshooting in business critical application deployments (in the Enterprise Data Center or in the Cloud), especially those having modern scale-out architectures.
  • the system automates key steps in the problem diagnosis process that today are done mostly manually.
  • the present invention also introduces a systematic workflow for the problem investigation process, which today is mostly ad-hoc, non-repeatable, and difficult to share. Further, results deduced from investigation may also be shared and stored within a knowledge repository corresponding to their problems.
  • When a new problem occurs in an application deployment, the problem will typically manifest as abnormal metric values measured by application monitoring tools, and/or as unexpected log events in the application or infrastructure logs.
  • the system automatically analyzes operational machine data, such as these metrics and logs, to generate a compact set of problem symptoms.
  • the problem symptoms capture the essential characteristics of a problem, such as abnormal metrics and anomalous log events, and along with a diagnostic context (hereinafter referred to as the DiagContextTM) form the basis of a structured representation called a 'Problem Case' that is used for automated similarity matching among problems and for knowledge sharing among a community of users within and across enterprises.
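  • To illustrate this structured representation, the minimal sketch below (in Python, with hypothetical field names that are not taken from the specification) shows how per-tier symptoms and a DiagContext might be grouped into a problem case:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class TierSymptoms:
    """Per-tier problem symptoms: discretized metrics and log anomalies (illustrative only)."""
    abnormal_metrics: Dict[str, str]        # e.g. {"cpu_util": "high", "db_sessions": "high"}
    rare_events: List[str]                  # uncommon log events (outliers)
    unusual_patterns: List[str]             # log patterns with unusual frequency shifts

@dataclass
class ProblemCase:
    """Container for everything relevant to investigating one problem (hypothetical layout)."""
    case_id: str
    symptoms: Dict[str, TierSymptoms]       # keyed by tier name, e.g. "web", "db"
    diag_context: Dict[str, str]            # relevant stack components/versions, workloads
    root_cause: Optional[str] = None
    solutions: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    ratings: List[int] = field(default_factory=list)
```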
  • a user input may also be accepted to refine or augment problem symptoms, as and when deemed necessary by the user.
  • After generating problem symptoms and the DiagContext, automated fuzzy (i.e., approximate) symptom-matching algorithms may be employed on a knowledge repository of problem cases to quickly identify any similar problems seen earlier.
  • the system provides computed log anomalies, such as Rare Events and Unusual Patterns, as starting points for problem investigations.
  • Techniques such as classification, "diagnostic rules" (hereinafter referred to as the DiagRulesTM), and history of prior diagnostic actions are also used by the system to provide suggestions and guide users in their analysis, especially for problems that have not been seen before.
  • DiagWallTM shared 'diagnostic wall'
  • clues and other artifacts on the DiagWall may be visualized and filtered along various dimensions such as time, servers, tiers, etc., and may be organized into a causality graph to show the flow of occurrence of symptoms and events leading to the problem.
  • Clues can also be used to build alternative theories and scenarios for problem occurrence and root cause.
  • the system 100 may maintain and share investigation results in the shared container of diagnostic artifacts identified by multiple users for problem diagnosis.
  • the DiagWall is saved as part of a problem case for future reference and learning.
  • Another aspect of the system is automated knowledge capture - user actions are tracked to record what steps are taken for problem analysis, such as (but not limited to) which symptoms and anomalies are examined. This information is saved in a 'diagnostic decision tree', known as and hereinafter called the DiagTreeTM, which is an automated mechanism to represent the search space to be explored for problem analysis and to indicate information such as (but not limited to) which parts have been checked by the user, what still remains to be examined, where diagnostic artifacts such as clues were identified by the user during analysis, and the like.
  • DiagTreeTM a 'diagnostic decision tree'
  • the DiagTree is constructed based on the problem symptoms, the topology of the application, such as tiers and servers in each tier, and other information related to the problem and application that is relevant for investigating the problem, such as data in the problem alert raised by the application monitoring tools. Additionally, the DiagTree also tracks the decision process for diagnosis, and allows users to specify which paths of the tree were found to be useful or useless for root cause analysis. This knowledge is stored in the DiagTree as part of a problem case and used to track the current investigation and also to provide suggestions during future troubleshooting.
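  • As a rough illustration of such a diagnostic decision tree, the sketch below assumes simple node fields (explored, useful, clues); the actual DiagTree structure in the System 100 is not limited to this form:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DiagNode:
    """One node of a diagnostic decision tree, e.g. a tier, server, metric, or log anomaly."""
    label: str
    children: List["DiagNode"] = field(default_factory=list)
    explored: bool = False          # has the user examined this path?
    useful: Optional[bool] = None   # user verdict: productive or a dead end
    clues: List[str] = field(default_factory=list)

def unexplored_paths(node: DiagNode, prefix=()) -> List[tuple]:
    """Return root-to-leaf paths that have not yet been examined."""
    path = prefix + (node.label,)
    if not node.children:
        return [] if node.explored else [path]
    paths = []
    for child in node.children:
        paths.extend(unexplored_paths(child, path))
    return paths

# Example: a tree built from application topology and symptoms (values are made up)
root = DiagNode("Problem #111000", children=[
    DiagNode("web tier", children=[
        DiagNode("response_time metric", explored=True, useful=False),
        DiagNode("apache error log"),                       # not yet examined
    ]),
    DiagNode("db tier", children=[
        DiagNode("SQL sessions maxed out (rare event)", explored=True, useful=True,
                 clues=["session limit reached"]),
    ]),
])
print(unexplored_paths(root))   # [('Problem #111000', 'web tier', 'apache error log')]
```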
  • FIG. 1 depicts a block diagram of an exemplary system, such as a System 100, developed for accelerating problem diagnosis in business critical application deployments, in accordance with an embodiment of the present invention.
  • FIG. 2 depicts a block diagram of an exemplary deployment architecture showing two enterprises sharing troubleshooting information using a shared Solutions Cloud Service, in accordance with an embodiment of the present invention.
  • FIG. 3 shows a block diagram illustrating a logic flow in making problem symptoms and solutions effectively shareable in the System 100, in accordance with an embodiment of the present invention.
  • FIG. 4 shows a flowchart depicting a method implemented by an operator (user) of the System 100 for diagnosing a problem and investigating an effective solution for it, in accordance with an embodiment of the present invention.
  • FIG. 5 shows a flowchart depicting a method of exemplary steps for sharing troubleshooting information using a shared Solutions Cloud Service, in accordance with an embodiment of the present invention.
  • FIG. 6 depicts a block diagram of an exemplary Enterprise Application deployment, in accordance with an embodiment of the present invention.
  • FIG. 7 depicts the information contained in an exemplary Problem Case #111000, in accordance with an embodiment of the present invention.
  • FIG. 8 depicts an exemplary DiagTree for Problem #111000 showing the diagnostic search space, what has been explored, and where clues have been identified, in accordance with an embodiment of the present invention.
  • FIG. 9 depicts an exemplary DiagWall for Problem #111000 showing clues identified by multiple users, in accordance with an embodiment of the present invention.
  • FIG. 1 depicts a block diagram of an exemplary system, such as a System 100, developed for accelerating problem diagnosis in business critical application deployments, in accordance with a preferred embodiment of the present invention.
  • the present invention provides a System 100 and a method 400 and related technology to expedite problem troubleshooting and diagnosis in complex software/hardware deployments, such as (but not limited to) Enterprise and Cloud Applications. These application deployments are generally critical for business operations and there is often a lot of time pressure to diagnose a problem quickly. Examples of problems include performance problems such as (but not limited to) slow end-user response times, system overload resulting in CPU spikes, as well as anomalous behavior exhibited in other respects that manifest as abnormal metric values and/or log events.
  • a primary purpose of the System 100 is to automate problem investigation tasks that are mostly done manually today, thereby accelerating root cause analysis.
  • the System 100 generates compact problem symptoms by analyzing and summarizing large volumes of machine data, stores and reuses symptoms, contexts, solutions and other troubleshooting knowledge in structured representations called problem cases, uses current and former problem cases for automated similarity matching, and automatically learns from the actions taken by the user during problem analysis.
  • the System 100 also provides a systematic workflow and collaborative facilities for simultaneous problem investigation by multiple users, and also allows for sharing the investigation results within a service cloud for use in identifying solutions and root causes for future problem cases.
  • the approach developed in the present invention utilizes both metrics and logs for symptom generation, which yields a significantly more accurate representation of problem symptoms because metric values only capture the external manifestation of the behavior of an application while log events capture the internal behavior of the application. Additionally, unlike the prior art systems, the present invention provides a System 100 that accepts human input such as (but not limited to) user input for problem symptoms and clues, and also tracks and learns from user actions during problem analysis in order to incorporate expert knowledge and expedite problem diagnosis.
  • the System 100 incorporates a notion of tiers or groups/collections of servers that run similar software within an application. This is a realistic model of practical application deployments which are typically not homogeneous, that is, all servers may not be running the same software. In the System 100, problem symptoms are only compared amongst tiers that are alike, such as those running the same database software, thus increasing the accuracy of problem matching. This approach is unique to the System 100 of the present invention.
  • the System 100 gets input data from Application Monitoring and Management Tools 122, and automatically computes problem symptoms based on machine data such as metrics and logs; creates and saves problem cases for future comparison and optional sharing; enables efficient team collaboration via shared diagnostic artifacts during joint investigation; learns from user actions and input; and effectively reuses the captured knowledge to guide and assist the user in problem investigation, thereby expediting root cause diagnosis.
  • application management and monitoring tools, such as Splunk®, Elasticsearch, Sumo Logic, and New Relic™, may provide input to the system 100.
  • the System 100 comprises two main parts: a Troubleshooting Tool 102 and a Solutions Cloud Service 124, as described below and shown in FIG. 1.
  • Alternative embodiments of the present invention may involve other modules and components that are not shown in FIG. 1.
  • Using machine learning and big data pattern analysis, the Machine Learning & Data Analytics Engine 114 within the Troubleshooting Tool 102 automatically analyzes machine data, such as metrics and logs, to generate symptoms that represent the essential characteristics of a problem. Human input is also accepted in the creation of these symptoms, in order to utilize domain knowledge from experts. For each problem, the DiagContext Generator 115 computes a DiagContext representing the diagnostic context from application metadata relevant for diagnosis, such as (but not limited to) application component versions and workloads.
  • Problems are represented in the System 100 as 'cases' that contain the symptoms, the DiagContext, the root cause if known, any solutions, and annotations such as classifications, tags, user reviews and ratings, along with other diagnostic aids such as the DiagTree representing the diagnostic search tree for root cause analysis and the DiagWall representing the diagnostic wall for posting and sharing relevant diagnostic artifacts.
  • These problem cases are created and handled by the Problem Case Handler 112 and used for automated problem matching and sharing of diagnostic knowledge.
  • the Troubleshooting Tool also provides a Collaborative Structured Workflow (104) module for efficient problem investigation by multiple users; a module for Automated Problem Matching & Verification (106); and a module for User Feedback and Automated Knowledge Capture (108), which are described later. Further, investigation results may also be shared with the knowledge repository 128.
  • the System 100 includes a multi-tenant capable 'Solutions Cloud' service 124, with a shared Knowledge Repository 128 to store, search and reuse a catalog of known problem cases.
  • the Solutions Cloud Service 124 may be deployed in a private and/or shared cloud, and may be used to expedite the investigation of future problems across the same or similar application deployments.
  • problem cases in the Knowledge Repository 128 may be made 'shareable' across different organizations and enterprises by standardizing symptoms and contexts for comparison and anonymizing sensitive private data. This sharing may be controlled based on administrator-specified policies. Also, the results deduced from the investigation may also be stored in the knowledge repository (128).
  • the Solutions Cloud Service also contains the Problem Identification Engine (126) and Learning Engine (130), which are described later.
  • the Troubleshooting Tool 102 does bi-directional synchronization with the Cloud Service 124, in order to store all the relevant data in a Knowledge Repository 128 maintained at the Solutions Cloud Service 124. Further, the Troubleshooting Tool 102 is responsible for interfacing with the users via the User Interface 110 module and for coordinating access to other parts of the System 100.
  • the users of the Troubleshooting Tool 102 may include, but are not limited to, operators of enterprise data centers, database and system administrators of enterprise application deployments, code developers, customer support analysts, and the like.
  • FIG. 1 also shows Application Monitoring and Management Tools 122 that monitor and manage activities happening in software/hardware deployments, such as Enterprise and Cloud applications, to produce data for various metrics, to collect logs generated by the application and infrastructure modules, and also to generate various alerts and events. These data serve as input to the Troubleshooting Tool 102 within the System 100. Therefore, the Troubleshooting Tool 102 takes as input Metrics and Logs Data 116 as well as Alerts and Events Data 118 from the existing Application Monitoring and Management Tools 122. When a problem occurs in an application deployment, the problem typically manifests itself through the metrics and log events in the Metrics and Logs Data 116 as well as through Alerts and Events Data 118.
  • When an alert is received by the Troubleshooting Tool 102 indicating that a problem has occurred, the Machine Learning & Data Analytics Engine 114 is invoked by the Troubleshooting Tool 102 to analyze these metrics and logs to generate a compact set of problem symptoms.
  • the Machine Learning & Data Analytics Engine 114 may be continuously analyzing metrics and logs data to check for abnormal behavior and proactively compute problem symptoms.
  • the problem symptoms, along with a DiagContext computed from application metadata, represent the essential characteristics of a problem that are utilized by the System 100 to compare a new problem with known problems that have occurred earlier.
  • the Troubleshooting Tool 102 implements a Machine Learning and Data Analytics Engine 114 to analyze the input from Metrics & Log Data 116 and Alerts & Events Data 118. Further, the input from an Application Metadata repository 120, such as (but not limited to) a Configuration Management Database (CMDB), may also be received by the Troubleshooting Tool 102.
  • CMDB Configuration Management Database
  • the Troubleshooting Tool implements a DiagContext Generator 115 to compute a DiagContext using the problem symptoms and the input from Application Metadata repository 120.
  • All of these inputs, namely the metrics, logs, alerts, metadata, and the like, define a problem case that represents a problem in the System 100. All the problem cases are developed by the Problem Case Handler 112 module of the Troubleshooting Tool 102.
  • the problem cases with symptoms, solutions and other associated diagnostic data are stored in a Knowledge Repository 128 maintained within the Solutions Cloud Service 124, which is a service that may be deployed locally within an enterprise or remotely in a shared cloud.
  • Alternative embodiments of the Solutions Cloud Service 124 include, but are not limited to, a 'hybrid' deployment scenario where diagnostic knowledge resides in an enterprise's private knowledge repository and also in a shared knowledge repository in the cloud.
  • the System 100 further provides a Collaborative Structured Workflow 104 module within the Troubleshooting Tool 102 for systematic investigation and team collaboration during problem analysis.
  • the structured workflow involves automated problem identification by matching with known problems, examination and verification of matches by the users, and guided investigation in case the problem is unknown and needs to be analyzed in detail. Guided investigation includes, for example, isolating the fault to one or more application components using suggestions from the Collaborative Structured Workflow 104 module.
  • a diagnostic wall, referred to as the DiagWall, is a facility provided by the Collaborative Structured Workflow 104 for team collaboration during problem investigation by multiple users.
  • the DiagWall is a collaborative container mechanism for collecting and sharing interesting diagnostic artifacts such as user-selected and/or user-defined clues or symptoms that are significant for diagnosis, thus reducing duplicate efforts and increasing troubleshooting efficiency.
  • the Problem Cases are stored in the Knowledge Repository 128 of the Solutions Cloud Service 124.
  • the problem cases may later be reused by the Problem Identification Engine 126 to find similar problems that happened earlier, and by a Learning Engine 130 to automatically derive diagnostic rules known as DiagRules.
  • an Automated Knowledge Capture 108 module residing within the Troubleshooting Tool 102 automatically tracks user actions and records them in a DiagTree, such as which parts of the diagnostic search space have been explored by the user, whether a certain investigation path was useful or useless in identifying the root cause, and the like.
  • the Troubleshooting Tool 102 and the Cloud Service 124 remain in communication with each other, where the Troubleshooting Tool 102 continually updates the Knowledge Repository 128 of the Solutions Cloud Service 124 with new and updated problem cases. Further, the Solutions Cloud Service 124 provides matching problem cases that were stored earlier in the Knowledge Repository 128 to the Troubleshooting Tool 102.
  • a Problem Identification Engine (abbreviated as PIE) 126 residing at the Solutions Cloud Service 124 is invoked when matching problem cases are to be identified and sent to the Troubleshooting Tool 102. These matching problem cases are presented to the user via the User Interface 110 for verification and feedback.
  • PIE Problem Identification Engine
  • the monitoring and/or event management tools 641 will typically raise an alert indicating that the expected behavior or a Service Level Agreement (SLA) has been violated.
  • SLA Service Level Agreement
  • When an alert is received by the Troubleshooting Tool 102 for a new problem, the Machine Learning & Data Analytics Engine 114 generates the problem symptoms by automatically analyzing huge volumes of machine data from metrics and logs obtained from the Metrics & Log Data 116 using Statistical techniques, Machine Learning, and Big Data pattern analysis [3,4]. The result is a compact set of problem symptoms comprising two main parts:
  • a set of metrics that is pertinent for problem classification and diagnosis. Typically, numerous metrics are collected by the Application Monitoring and Management Tools 122 for an enterprise application deployment (shown in FIG. 6 and described later). Feature Selection techniques, such as Logistic Regression, are applied by the Machine Learning & Data Analytics Engine 114 on the set of collected metrics for dimension reduction and ease of reasoning. The set of metrics selected by these automated algorithms may be modified by the user to incorporate domain knowledge. Using Statistical methods and Machine Learning, values of pertinent metrics are discretized over a (user-customizable) time period and categorized as normal or abnormally high or low during each period. Discretizing continuous data makes it easier for the users to visually compare such data. For the purposes of categorization, normal behavior may be specified by the user as one or more windows of time where the application behaved as expected.
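  • A minimal sketch of such discretization is shown below, assuming a mean-plus-k-standard-deviations threshold over a user-specified normal window; the actual statistical methods used by the Machine Learning & Data Analytics Engine 114 are not restricted to this:

```python
import statistics

def discretize(values, baseline, k=3.0):
    """Categorize each metric sample as 'low', 'normal', or 'high'
    relative to a user-specified baseline (normal-behavior) window."""
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline) or 1e-9   # guard against a zero-variance baseline
    labels = []
    for v in values:
        if v > mean + k * std:
            labels.append("high")
        elif v < mean - k * std:
            labels.append("low")
        else:
            labels.append("normal")
    return labels

# Example: CPU utilization sampled every minute around the problem period (made-up values)
baseline = [22, 25, 24, 23, 26, 24]           # known-normal window
problem  = [24, 55, 81, 93, 90, 25]
print(discretize(problem, baseline))          # ['normal', 'high', 'high', 'high', 'high', 'normal']
```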
  • Rare Events are log events identified by Machine Learning & Data Analytics Engine 114 as uncommon occurrences or outliers based on pattern analysis of log data and computed event frequencies.
  • Unusual Patterns are groups of log events that have either decreased or increased in frequency around the time of problem occurrence as compared to normal operation. Clustering and pattern analysis methods are used to compute these and other log anomalies.
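  • The sketch below illustrates one simple way such Rare Events and Unusual Patterns could be derived from event counts, assuming log lines have already been grouped into patterns; the clustering and thresholds actually used by the engine are not specified here:

```python
from collections import Counter

def log_anomalies(baseline_events, problem_events, rare_max=1, ratio=3.0):
    """Flag Rare Events (patterns seen at most `rare_max` times overall) and
    Unusual Patterns (frequency changed by at least `ratio`x vs. the baseline window)."""
    base = Counter(baseline_events)
    prob = Counter(problem_events)
    rare = [e for e, c in (base + prob).items() if c <= rare_max]
    unusual = []
    for event in set(base) | set(prob):
        b, p = base.get(event, 0.0) or 0.5, prob.get(event, 0.0) or 0.5  # smooth zero counts
        if p / b >= ratio or b / p >= ratio:
            unusual.append(event)
    return rare, unusual

baseline = ["login ok"] * 50 + ["cache miss"] * 10
problem  = ["login ok"] * 400 + ["cache miss"] * 12 + ["SQL sessions maxed out"]
rare, unusual = log_anomalies(baseline, problem)
print(rare)      # ['SQL sessions maxed out']
print(unusual)   # ['login ok']  -- frequency jumped roughly 8x
```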
  • Enterprise Application deployments are typically not homogeneous but may have tiers or groups/collections of servers that run similar software, such as a database or a web server, as will be shown in FIG. 6 and described below.
  • problem symptoms are generated per tier and the Problem Identification Engine 126 compares symptoms on a per-tier basis.
  • the Machine Learning & Data Analytics Engine 114 computes an 'Abnormality Score' for each tier in the enterprise application. This numerical score indicates which tiers may have deviated the most from their normal behavior.
  • Abnormality Scores may be used to compute the DiagContext and also to indicate good starting points for problem isolation and investigation, as described later.
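  • As an illustration only, an Abnormality Score for a tier could be computed roughly as the share of abnormal metric samples plus weighted counts of log anomalies; the weights below are assumptions, not the patented formula:

```python
def abnormality_score(metric_labels, n_rare_events, n_unusual_patterns,
                      w_metrics=1.0, w_rare=0.5, w_unusual=0.25):
    """Score one tier: share of abnormal metric samples plus weighted log anomalies."""
    flat = [label for series in metric_labels.values() for label in series]
    abnormal_fraction = sum(label != "normal" for label in flat) / max(len(flat), 1)
    return (w_metrics * abnormal_fraction
            + w_rare * n_rare_events
            + w_unusual * n_unusual_patterns)

# Made-up per-tier inputs for a two-tier application
tiers = {
    "web": abnormality_score({"cpu": ["normal"] * 6}, 0, 0),
    "db":  abnormality_score({"cpu": ["normal", "high", "high", "high", "high", "normal"],
                              "sessions": ["high"] * 6}, 1, 1),
}
print(max(tiers, key=tiers.get))   # 'db' -- the most promising starting point for investigation
```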
  • a problem case is a container for all information that is relevant for investigating a specific problem.
  • Problem cases are created and developed by the Problem Case Handler 1 12 module within the Troubleshooting Tool 102. Initially, when an alert is raised indicating that a problem has occurred, the Troubleshooting Tool 102 computes problem symptoms for each tier in the application and stores them in the problem case along with the data from the problem alert. Diagnostic aids such as the DiagTree and the DiagWall (initially empty) are also created and stored with the problem case. Functions of the DiagTree and the DiagWall are described later.
  • DiagContext represents the occurrence context or environment in which the problem happened, e.g., in terms of relevant components in the software stack for the deployed application, workloads and the like.
  • the DiagContext for a problem is computed by the DiagContext Generator 115 module within the Troubleshooting Tool 102 at the time of problem occurrence from application metadata, such as (but not limited to) configuration information obtained from a Configuration Management Database (CMDB) for the specific application deployment, and relevant application component versions.
  • CMDB Configuration Management Database
  • Relevant application components for the DiagContext are identified using data from the problem alert such as which server is malfunctioning, and from problem symptoms such as which application tier, metrics and/or logs are exhibiting anomalous behavior, and the like.
  • the problem symptoms may indicate that the database tier in an application is behaving normally and hence the database version may not be relevant for troubleshooting and therefore for the DiagContext.
  • if the Abnormality Scores show that the web tier is behaving abnormally, then the versions of its software components are relevant for troubleshooting the problem, and hence also relevant for computing its DiagContext.
  • Other application metadata such as application workloads, may also be part of the computation of DiagContext by the Troubleshooting Tool 102.
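  • A small sketch of this computation, assuming component versions are stored per tier in the metadata and that tiers are selected by their Abnormality Scores, might look as follows:

```python
def compute_diag_context(metadata, abnormality_scores, threshold=0.5):
    """Build a DiagContext from application metadata, keeping only components
    of tiers whose Abnormality Score suggests involvement in the problem."""
    context = {"workload": metadata.get("workload", "unknown")}
    for tier, components in metadata.get("tiers", {}).items():
        if abnormality_scores.get(tier, 0.0) >= threshold:
            context[tier] = components
    return context

# Hypothetical metadata, e.g. as obtained from a CMDB
metadata = {
    "workload": "peak-hours OLTP",
    "tiers": {
        "web": {"apache": "2.4.7", "php": "5.5"},
        "db":  {"mysql": "5.0"},
    },
}
scores = {"web": 0.1, "db": 1.58}
print(compute_diag_context(metadata, scores))
# {'workload': 'peak-hours OLTP', 'db': {'mysql': '5.0'}}
```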
  • the DiagTree and DiagWall are updated to reflect user actions.
  • DiagRules may be created in the System 100 to help with the diagnosis of problems. These rules may represent domain knowledge gathered from users and other heuristic knowledge that the Learning Engine 130 automatically deduces in the course of operation. Such rules may be specified by the user, for example, to filter out noisy data from logs and metrics, and to ignore log events that are irrelevant for diagnosis of a problem. Heuristic DiagRules may be learned automatically by the Learning Engine 130, based on prior problem cases and history of user actions. One example of an automatically learned DiagRule is to identify which parts of the operational machine data (i.e., metrics and logs) are commonly examined by users for root cause diagnosis for a certain class of problems. Other examples of DiagRules that may be automatically deduced include event correlation through analysis of event patterns found in the metrics and log anomalies for problems, such as a certain event pattern A typically implying the occurrence of another event pattern B.
  • DiagRules may be used by the System 100 for various purposes, for example, during symptom generation and case matching, and to provide suggestions on investigation paths. DiagRules may be scoped in their effect, such as applying to a single problem only or to a broader set of problems. This allows the user to exercise control over which diagnostic rules are applied for a problem case.
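  • For illustration, scoped DiagRules could be represented as simple predicates tagged with a scope, as in the hypothetical sketch below; the actual rule language of the System 100 is not defined here:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DiagRule:
    """A scoped diagnostic rule, e.g. 'ignore this log event for this problem only'."""
    description: str
    scope: str                       # "problem:<case_id>" or "global"
    drop_event: Callable[[str], bool]

rules = [
    DiagRule("ignore routine health-check noise", "global",
             lambda e: "health check" in e.lower()),
    DiagRule("event irrelevant for case #111000", "problem:111000",
             lambda e: e.startswith("[WARN] cache miss")),
]

def filter_events(events, rules, case_id):
    """Apply global rules plus rules scoped to this problem case."""
    active = [r for r in rules if r.scope in ("global", f"problem:{case_id}")]
    return [e for e in events if not any(r.drop_event(e) for r in active)]

events = ["[INFO] health check OK", "[WARN] cache miss on key X",
          "[ERROR] SQL sessions maxed out at 1000"]
print(filter_events(events, rules, "111000"))   # only the ERROR line remains
```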
  • Problem Cases are stored in the Knowledge Repository 128 of the Solutions Cloud Service 124.
  • the problem cases may later be reused by the Problem Identification Engine 126 module to find similar problems that happened earlier, and by the Learning Engine 130 to automatically derive DiagRules.
  • DiagRules that apply to a particular problem are associated with that problem case only, for example, a rule to filter out a specific log event that is indicated by the user to be irrelevant for diagnosis of that particular problem.
  • DiagRules that apply to a broader set of problems are stored separately from any specific problem case, and may be checked for applicability to any problem at the time of creating a new problem case and during subsequent analysis.
  • the Knowledge Repository 128 may be initially seeded with diagnostic data from product manuals, troubleshooting guides, and known issues from testing environments.
  • the Knowledge Repository 128 grows over time through continued use of the System 100. This diagnostic knowledge may be optionally shared based on administrator-defined policies, as described later.
  • Problem Cases stored in the Knowledge Repository 128 may be augmented by users at a later point in time. For example, users can add new solutions, review and rate existing solutions, and tag and classify known problem cases. This user-specified information may be utilized by the System 100 in a variety of ways. For example, the only solution to a problem might be rated badly by reviewers, which will cause the Problem Identification Engine 126 to reduce the confidence level for a match with that problem case. Likewise, if a solution is rated very highly by users, then the confidence level of that match is increased by the Problem Identification Engine 126.
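  • A minimal sketch of such rating-driven confidence adjustment, with an assumed 1-5 rating scale and arbitrary weighting, is shown below:

```python
def adjust_confidence(base_confidence, solution_ratings, neutral=3.0, weight=0.05):
    """Nudge a match's confidence up or down based on user ratings (1-5 scale)
    of the stored problem case's solutions."""
    if not solution_ratings:
        return base_confidence
    avg = sum(solution_ratings) / len(solution_ratings)
    adjusted = base_confidence + weight * (avg - neutral)
    return min(max(adjusted, 0.0), 1.0)

print(adjust_confidence(0.80, [5, 5, 4]))   # highly rated solution -> ~0.88
print(adjust_confidence(0.80, [1, 2]))      # poorly rated solution -> ~0.73
```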
  • When a new problem case is created, the Automated Symptom Matching & Verification 106 module in the Troubleshooting Tool 102 sends it to the Problem Identification Engine 126 (PIE) in the Solutions Cloud Service 124 to determine if similar problem cases have been seen earlier and stored in the Knowledge Repository 128.
  • PIE Problem Identification Engine
  • This automated matching functionality may drastically reduce the amount of manual effort involved in identifying and solving problems that have already been encountered.
  • the Problem Identification Engine 126 uses fuzzy or approximate pattern matching algorithms on the Knowledge Repository 128 to generate a set of potentially matching problems and associated confidence levels that indicate the closeness of a match. First, the DiagContexts of the new and stored problem cases are compared.
  • If the DiagContexts are very dissimilar, such as when the new problem happens on a different software stack, then there is no match possible. If two DiagContexts match exactly or are similar, then problem symptoms such as pertinent metrics and log anomalies are compared for the two problems for each corresponding tier in the application. Any clues that were identified by the user and posted on the DiagWall for an earlier problem are also checked against the current case - if earlier clues are present in the current set of problem symptoms, then the confidence level of the match is increased. User-specified criteria and DiagRules are also employed in this matching step to reduce or expand the set of potential matches.
  • some of the criteria for approximate matching are built-in while others may be specified by the user as DiagRules.
  • the process of symptom matching is also approximate. Most often, symptoms of two problems will not match exactly.
  • the symptom-matching algorithm computes 'similarities' between symptoms of the new and earlier problem cases, and the confidence level of a match is provided to the user as a measure of how 'similar' the two problems are.
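  • The following sketch illustrates the overall matching flow under simplifying assumptions (a Jaccard-style set similarity per tier, a fixed DiagContext cutoff, and a flat clue bonus); the fuzzy algorithms actually employed by the Problem Identification Engine 126 are not disclosed at this level of detail:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def match_confidence(new_case, stored_case, clue_bonus=0.1):
    """Compare DiagContexts first; if compatible, compare per-tier symptoms
    and boost confidence when earlier clues reappear in the new symptoms."""
    ctx_sim = jaccard(new_case["context"].items(), stored_case["context"].items())
    if ctx_sim < 0.5:                      # very dissimilar stacks: no match possible
        return 0.0
    tier_scores = []
    for tier, new_sym in new_case["symptoms"].items():
        old_sym = stored_case["symptoms"].get(tier, [])
        tier_scores.append(jaccard(new_sym, old_sym))
    confidence = ctx_sim * (sum(tier_scores) / max(len(tier_scores), 1))
    if any(clue in sym for sym in new_case["symptoms"].values()
           for clue in stored_case.get("clues", [])):
        confidence = min(confidence + clue_bonus, 1.0)
    return round(confidence, 2)

# Hypothetical cases with already-standardized symptom strings
new = {"context": {"db": "MySQL 5.0", "web": "Apache 2.4"},
       "symptoms": {"db": ["sessions high", "SQL sessions maxed out"], "web": ["latency high"]}}
old = {"context": {"db": "MySQL 5.0", "web": "Apache 2.4"},
       "symptoms": {"db": ["sessions high", "SQL sessions maxed out"],
                    "web": ["latency high", "5xx errors"]},
       "clues": ["SQL sessions maxed out"]}
print(match_confidence(new, old))   # 0.85
```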
  • the User Feedback & Automated Knowledge Capture 108 module in the Troubleshooting Tool 102 automatically tracks user actions and records them in a DiagTree, such as which parts of the diagnostic search space and which symptoms have been explored by the user, etc.
  • the DiagTree may additionally indicate which paths are expected to be promising but remain unexplored, and where clues and other interesting diagnostic artifacts were found by the user, thus providing an overall picture of the investigation status.
  • the system 100 may share the investigation results by maintaining a shared container of diagnostic artifacts provided by multiple users for problem diagnosis.
  • the DiagTree for a problem is constructed based on the application topology and problem symptoms, and other information such as the problem alert, history of prior diagnostic actions, and DiagRules.
  • After a root cause / solution is found, users may be asked via a 'questionnaire' to indicate, e.g., which symptoms led to the root cause and which investigation paths in the DiagTree were productive. User experience and decisions, such as whether a certain investigation path was useful or useless in identifying the root cause, are saved in the DiagTree as part of the problem case, and may be used to automatically derive heuristic rules and guide investigations of future problems.
  • the DiagWall is a facility provided by the Collaborative Structured Workflow 104 module within the Troubleshooting Tool 102 for team collaboration during problem investigation.
  • multiple users will simultaneously investigate a problem in enterprise application deployments. Users often individually discover clues such as specific problem symptoms that are considered significant for root cause analysis.
  • One example of a clue is a log Rare Event anomaly such as a stack trace indicating database failure.
  • Another example is an Unusual Pattern anomaly showing an unusually high number of logins, indicating a possible hacker attack. In addition to clues, relevant diagnostic artifacts may also include user observations, annotations, documents, or communications such as emails and chats that are significant for solving the problem.
  • the DiagWall is a collaborative container for collecting and sharing such interesting diagnostic artifacts, thus reducing duplicate efforts and increasing troubleshooting efficiency.
  • the DiagWall is saved as part of a problem case for future reference and learning.
  • the User Interface 110 module in the Troubleshooting Tool 102 provides various ways to display, filter, and organize clues. For example, all clues from a certain user or within a certain time range may be filtered out from the view. Clues may be organized along various dimensions, such as by occurrence time, by user, by application tiers or servers where they occurred, etc. They may also be stacked together to form composite clues that are related to each other. Causality graphs may be constructed with clues to indicate the flow of events that caused the problem to occur and trace it down to the root cause. Clues may also be used to build alternative scenarios for the problem occurrence, e.g., to explore different hypotheses for the root cause.
  • the Knowledge Repository 128 may be shared across different enterprises in order to leverage community experience for problem troubleshooting. This involves steps such as 'anonymizing' the shared information to make it appropriate from a privacy point of view by removing all enterprise-specific information and sensitive data. For example, IP addresses, user names, and passwords must be removed from problem cases before sharing with a wider community of users. The steps required to share a problem case are described later in further detail.
  • FIG. 2 depicts a block diagram of an exemplary deployment architecture showing two enterprises, Enterprise A 200 & Enterprise B 202, sharing troubleshooting information through a common Solutions Cloud Service 222, in accordance with an embodiment of the present invention. Only the components relevant for sharing are shown in FIG. 2 and described here.
  • Enterprise A 200 has a Linux, Apache, MySQL, PHP (LAMP) Infrastructure stack 204 which is the same as the Linux, Apache, MySQL, PHP (LAMP) Infrastructure stack 214 of Enterprise B 202.
  • Enterprise Application A 206 and Enterprise Application B 212 are custom applications that run on the infrastructure stacks of the respective enterprises. Distinct local instances of the Troubleshooting Tool are run in the two enterprises - Troubleshooting Tool 208 in Enterprise A 200 and Troubleshooting Tool 210 in Enterprise B 202.
  • Problem cases for the common LAMP infrastructure may be uploaded by the two enterprises, A and B, to the Shared Multi-tenant Knowledge Repository 218 in the common Solutions Cloud Service 222 for private use or for diagnostic knowledge sharing.
  • This service supports multi-tenancy for data isolation and privacy - enterprises may not wish to share all or some of their problem cases, and those cases are kept private even though they reside in the Shared Multi-tenant Knowledge Repository 218. Administrators at the two enterprises may control the sharing of problem cases using policies.
  • the shared Problem Identification Engine 216 and Learning Engine 220 are also aware of the sharing policies.
  • the System 100 supports convenient 'scoped' sharing of problem cases via policies.
  • Enterprise A 200 may want to share its diagnostic knowledge only within its own organization, or with a specific set of its partners and collaborators, or with a broader community of users. All of these sharing options are supported by the Shared Multi-tenant Knowledge Repository 218.
  • enterprises may additionally have their own local (private) instances of the Solutions Cloud Service, e.g., to each store problem cases for their custom applications Enterprise Application A 206 and Enterprise Application B 212. These problem cases may be related to application-specific issues and may not be relevant for sharing across enterprises or with a general community of users. In such cases, problem matching is done using both the local cloud and the shared cloud, as needed.
  • Machine data such as metric values and log entries will not match exactly across different systems, so effective comparison of symptoms across systems requires computing higher-level and more uniform views of this data. For example, absolute values of metrics are normalized into abnormal ('high', 'low') and 'normal' ranges and summarized over a period of time around when the problem occurs. Logs are parsed, analyzed, and further characterized in terms of outliers / anomalies, patterns seen frequently, etc., and significant trends in the log patterns seen in the various tiers are also identified. In the pattern analysis, the text that is constant in a group of log events is consolidated in order to generate regular expression-based patterns.
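  • As a rough example of the regular-expression consolidation step, assuming the events in a group differ only in numeric fields:

```python
import re

def consolidate(events):
    """Collapse a group of similar log events into one regex-based pattern:
    constant text is kept, variable numeric tokens are generalized."""
    return re.sub(r"\d+", r"\\d+", re.escape(events[0]))

events = [
    "SQL sessions maxed out at 1000, refusing to accept more session requests",
    "SQL sessions maxed out at 1500, refusing to accept more session requests",
]
pattern = consolidate(events)
print(pattern)
print(all(re.fullmatch(pattern, e) for e in events))   # True
```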
  • a set of metrics and log events is selected by the System 100 as 'relevant' for the purpose of problem diagnosis (note that this step is done as part of the regular system functionality of computing a compact problem representation and is not done for sharing only). Additionally, the operator has the choice of manually identifying any metrics or logs as being relevant for troubleshooting.
  • This step involves mapping of the names given to the various units of machine data into a common terminology.
  • one system may call a metric "CPU Util" while another calls it "Utilization of CPU".
  • a common set of metric names is defined in the System 100 so that problem symptoms from different monitoring and management tools can be mapped to common terms for comparison.
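  • A minimal sketch of such name standardization, with a hypothetical alias table, could be:

```python
# Hypothetical alias table mapping tool-specific metric names to common terms.
METRIC_ALIASES = {
    "cpu util": "cpu_utilization",
    "utilization of cpu": "cpu_utilization",
    "db sessions": "database_sessions",
    "active db connections": "database_sessions",
}

def standardize(metric_name):
    """Map a monitoring tool's metric name onto the System's common terminology."""
    return METRIC_ALIASES.get(metric_name.strip().lower(), metric_name)

print(standardize("CPU Util"))             # cpu_utilization
print(standardize("Utilization of CPU"))   # cpu_utilization
```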
  • Log files of similar software/hardware systems may have the same names, and exceptions are handled case-by-case. For uniformity, any system-specific parts in the log paths are removed.
  • i. a set of symptoms - this set is based on patterns and anomalies in machine data such as metrics and logs that have been normalized and standardized (as described above), and
  • ii. a DiagContext - this is computed from application metadata (obtained from a metadata repository such as a Configuration Management Database) and represents the context or environment in which the problem symptoms occurred, e.g., in terms of the relevant components in the software stack and application workloads.
  • the DiagContext is essential for determining if problem occurrences in two different deployments have a 'shared stack' so that their symptoms can be compared.
  • the Application Metadata 308 is examined with respect to the problem symptoms to identify a subset of metadata that is significant for diagnosing the problem. For example, in an embodiment, if the database tier is not involved in the problem symptoms, then the database version may not be relevant for troubleshooting. Likewise, it may be identified that a problem is likely originating from the web tier, and hence the versions of its components as obtained from the Application Metadata 308 are significant for troubleshooting this problem. Application workloads may also come into play in the determination of DiagContexts. DiagContexts must also be standardized for comparison across different enterprises, such as by using standard component names and versions. The System 100 maintains a common components list for this purpose.
  • An enterprise might want to publish known problem cases for sharing with the community.
  • the symptoms, solutions, and other case data such as the DiagTree and DiagWall need to be anonymized in order to remove any proprietary and system-specific information such as site-specific URLs, hostnames or IP addresses, usernames/passwords, etc. Log patterns must be checked and any sensitive or private data should be blocked out.
  • the process of anonymization is partially automated, in that the System 100 anonymizes sensitive data and also flags any data that could potentially be sensitive.
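  • The automated part of this anonymization could, for example, scrub well-known sensitive patterns and flag borderline items for review, as in the sketch below; the patterns shown are assumptions, not the System's actual rules:

```python
import re

SCRUB_PATTERNS = [
    (re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b"), "<IP>"),                 # IPv4 addresses
    (re.compile(r"\b[\w.-]+\.(corp|internal|local)\b"), "<HOSTNAME>"),  # internal hostnames
    (re.compile(r"(password|passwd|pwd)\s*[=:]\s*\S+", re.I), r"\1=<REDACTED>"),
]
FLAG_PATTERNS = [re.compile(r"https?://\S+")]   # possibly site-specific URLs: flag, don't auto-drop

def anonymize(text):
    flags = []
    for pat, repl in SCRUB_PATTERNS:
        text = pat.sub(repl, text)
    for pat in FLAG_PATTERNS:
        flags.extend(pat.findall(text))
    return text, flags

case_note = ("db01.acme.internal at 10.0.3.17 rejected logins; "
             "see http://wiki.acme.com/runbook password=hunter2")
scrubbed, review = anonymize(case_note)
print(scrubbed)
print(review)   # URLs left for the operator to review before sharing
```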
  • the user can also inspect symptoms and solutions and/or provide rules (DiagRules) to ensure that no sensitive or proprietary data is exposed.
  • the User Interface module 110 provides convenient ways for the operator to edit symptoms and solutions and other case data before sharing.
  • an enterprise can upload the anonymized problem case to a shared knowledge repository 320 that resides in the cloud 318 for convenient access.
  • the cloud-based service 318 provides multi-tenancy support for data isolation and privacy. So, for enterprises that do not wish to share their diagnostic data, the problem cases are kept private even though they may reside in the cloud.
  • the system 100 supports convenient 'scoped' sharing of data via policies. For example, an enterprise A might want to share its diagnostic knowledge only within its own organization, or with a set of partners or collaborators, or with the general community of users. All of these sharing options are supported by the System 100. Administrators can define system-wide policies that control/manage the sharing of symptoms and solutions via the Solutions Cloud Service 318. These policies can automate and control which problem cases can be uploaded and/or downloaded from the cloud. In an alternative embodiment, problem cases downloaded from the Solutions Cloud Service 318 can also be stored (i.e., cached) in the local repository for efficient future evaluations and matching. Further, the problem solutions may also include reviews, ratings, tags, and classifications (314) provided by the users of the System 100.
  • the System 100 provides facilities to anonymize the solutions based on the policies for sharing (316).
  • fuzzy pattern-matching algorithms are employed by the Problem Identification Engine with Fuzzy Pattern Matching 322 module in the Solutions Cloud Service 318 for comparing problem symptoms and DiagContexts with earlier known problems (i.e., those present in the shared knowledge repository 320).
  • the result of these fuzzy algorithms is a set of potential matching problems along with confidence levels that denote the closeness and 'goodness' of each match.
  • the System 100 provides a systematic workflow for problem identification, isolation, and investigation by users. Problem identification involves finding one or more earlier (known) problems that are the same as or similar to the current problem. Problem isolation implies finding one or more faulty components that lead to the problem, while problem investigation involves general analysis by users to determine the root cause.
  • the workflow provided in System 100 automates several tasks involved in these steps, and guides the users in problem isolation and investigation by providing suggestions.
  • FIG. 4 shows a flowchart depicting the logic flow in the system 100 for diagnosing a problem and investigating the root cause for it, in accordance with an embodiment of the present invention.
  • Application Monitoring and Management Tools 122 will usually raise an alert.
  • An operator or a user may login to the System 100 to diagnose the cause of an alert.
  • Shown in Fig. 4 and described below is the sequence of steps that happens in the System 100 when an alert is raised.
  • At step 402, an alert is received by the Troubleshooting Tool 102 from the Application Monitoring and Management Tools 122. After an alert is received, the diagnostic workflow begins.
  • the Troubleshooting Tool 102 uses proprietary algorithms in the Machine Learning & Data Analytics Engine 114 to automatically analyze and aggregate numerous metrics and logs (and possibly other machine data) to generate a set of problem symptoms that efficiently capture the essential characteristics of a problem.
  • the Troubleshooting Tool 102 of the system 100 may fetch the metrics & logs data on the fly via adapters to monitoring tools and does not need to copy all the machine data, assuming they are not purged while problem investigation is in progress. In some situations, e.g., when logs are purged frequently, the data may have to be copied by the Troubleshooting Tool 102 in order to make sure the required information is available until the root cause has been identified.
  • the Troubleshooting Tool 102 uses the problem symptoms and the Application Metadata 120, such as (but not limited to) configuration information from a Configuration Management Database and application workload information, to generate the DiagContext for the problem.
  • the problem symptoms & DiagContext are used by the Problem Case Handler 112 to create a new Problem Case, which is a structured representation for the current problem in the system 100.
  • the Automated Problem Matching & Verification 106 module in the Troubleshooting Tool 102 sends the current Problem Case to the Problem Identification Engine 126 (PIE) residing in the Cloud Service 124 for identification, i.e., to check if this or similar problems have been seen earlier.
  • PIE Problem Identification Engine
  • the Problem Identification Engine 126 checks whether the same or similar problems have occurred earlier.
  • the Problem Identification Engine 126 accomplishes this task by automated fuzzy pattern matching on a Knowledge Repository 128 that stores previously seen problem cases.
  • the response from the Problem Identification Engine 126 is a set of potentially matching problems along with associated confidence levels (i.e., probabilities) that indicate how similar a suggested match is to the current problem.
  • Step 406 checks whether any potentially matching problems were found by the Problem Identification Engine 126 and returned to the Troubleshooting Tool 102.
  • the alert raised for the problem is expected to invoke a response from an operator (user), who may log in to the System 100 and select an alert to investigate.
  • If potentially matching problems are found at Step 406, the operator can examine and compare these matching problem cases in detail using the User Interface 110 module in the Troubleshooting Tool 102.
  • the Automated Problem Matching & Verification 106 and User Interface 110 modules support filtering and visual comparison of these potential matches with the current problem as well as with each other, so that an operator can compare and contrast the suggested matches.
  • As operators examine a potential match / solution, they can also provide their own review comments (e.g., did the solution work as expected? did it need any modifications? and the like), as well as give ratings for its success, attach tags, etc.
  • the operator can reject or accept a suggested match, which is checked at step 414. If the operator accepts a suggested match, then the workflow moves on to Step 416.
  • It is possible that the Problem Identification Engine 126 can find no matches, or the operator may not accept any of the suggested potential matches / solutions based on his/her analysis. In these cases, the operator can launch into a 'guided' isolation and investigation process using suggestions provided by the Troubleshooting Tool 102.
  • the Troubleshooting Tool 102 uses information such as (but not limited to) the symptoms exhibited by the current problem, classifications of this and earlier problems, as well as DiagRules, to assist the operator, e.g., by suggesting which tiers/modules are likely to be at fault. Based on these recommendations, the operator can choose to investigate specific tiers or components, and significantly narrow down the list of possibilities to check. For example, for multi-tier applications, the user may check the Abnormality Score computed by the Machine Learning & Data Analytics Engine 114 to start the analysis - exploring the symptoms of the tier with the highest abnormality score would be a reasonable starting point.
  • DiagWalls and DiagTrees of similar (but not exactly matching) problems may also be used to guide the user in the investigation, for example, by showing which application components were at fault in the earlier cases.
  • Other options include using problem classification techniques and automatically learned or user-specified DiagRules for troubleshooting. For example, if a new problem has the same classification as an earlier problem, based perhaps on its symptoms, then the earlier troubleshooting steps are suggested as promising investigation paths. DiagRules may be applied to filter out irrelevant symptoms, and thus reduce the amount of data to be examined by the operator.
  • the User Feedback & Automated Knowledge Capture 108 module in the Troubleshooting Tool 102 comes into play as the user examines various symptoms and anomalies. Investigation steps taken by the user are automatically recorded in the DiagTree for the problem in order to track the status and progress of the investigation. User input is also accepted for the DiagTree - for example, the user may indicate which investigation paths are dead-ends and considered useless for solving the problem, effectively pruning the DiagTree.
  • the Troubleshooting Tool 102 may send the new Problem Case augmented with data collected during the analysis to the Knowledge Repository 128 within the Cloud service 124 for future reference and learning.
  • data collected during the analysis include (but are not limited to) user feedback on why any suggested potential matches were rejected, a root cause for the current problem, a new solution determined by the operator, an updated DiagTree with investigation status, a DiagWall with clues etc.
  • the User Feedback & Automated Knowledge Capture 108 module in the Troubleshooting Tool 102 may also employ a 'diagnostic questionnaire' at the time a new root cause or solution is specified by the user, in order to gather relevant information from the user, e.g., which metrics or log anomalies best indicate the source of the problem, how to classify the problem, etc. This information may be used subsequently by Learning Engine 130 in the Cloud Service 124 to learn about the taxonomy of problems, and also to provide suggestions when an unknown problem is investigated.
  • Step 416 At this step, the operator has accepted one or more matches suggested by the Problem Identification Engine 126.
  • the Troubleshooting Tool 102 updates the Knowledge Repository 128 with user feedback such as review comments for matches and/or solutions that were accepted/rejected by the user, new tags/annotations, and ratings for the success of a match (e.g., how well the solution worked). Ratings for solutions are subsequently used by the Problem Identification Engine 126 to calculate confidence levels of suggested matches and their solutions. Rejections detract from the confidence level of a matching case in the future, while acceptances increase it. This feedback scheme effectively utilizes collective troubleshooting experience from the community of users.
  • step 402 a new alert is received by the Troubleshooting Tool 102, leading to step 404.
  • the Troubleshooting Tool 102 analyzes metrics and logs to generate the problem symptoms. For this example problem, suppose that the following log Rare Event anomaly is found in the database tier from log analysis: [ERROR] SQL sessions maxed out at 1000, refusing to accept more session requests.
  • the computed DiagContext includes the database version, say "MySQL version 5.0".
  • this new problem case is sent to the Problem Identification Engine 126 in the Solutions Cloud Service 124 for matching with known problems in the Knowledge Repository 128.
  • step 412 this potentially matching problem case and solution will be displayed to the user. After the user examines it, he/she may apply the earlier solution and accept it if it resolves the problem. This leads to step 416, where the user may provide feedback on how well the solution worked, if any other modifications were needed, etc., and the Knowledge Repository 128 is updated accordingly by the Troubleshooting Tool 102.
  • log files may be complex and problem symptoms may often include multiple anomalies, and DiagContexts may involve multiple application components and other factors such as workloads.
  • FIG. 5 shows a flowchart depicting a method of exemplary steps for sharing troubleshooting information using a shared Solutions Cloud Service 218, in accordance with an embodiment of the present invention.
  • sharing policies defined by the administrator of Enterprise A are checked by the Troubleshooting Tool 208 to see if the new case is shareable by others. If sharing is disallowed, then at step 508 this problem case is stored in the Shared Multi-tenant Knowledge Repository 218 as private to Enterprise A.
  • step 510 the problem symptoms and the DiagContext are standardized by the Troubleshooting Tool 208 for better comparability with other shared cases from different enterprises and organizations.
  • This step may involve mapping of the names given to various units of machine data, such as log file names and metric names, into a common set of terms. For example, any system-specific paths are removed from the log file names.
  • one monitoring system may call a metric "CPU Util" while another calls it "Utilization of CPU".
  • a common set of terms is defined in the system 100 to map such names from different monitoring tools and management systems.
  • step 512 all case data is anonymized by the Troubleshooting Tool 208 in order to remove any proprietary and system-specific information such as site-specific URLs, server names or IP addresses, usernames and passwords, etc. For example, any log events and patterns included in the problem symptoms are checked and all sensitive and private data are removed to create a shareable version of the Problem Case.
  • step 514 both the private and the shareable version of the Problem Case are saved in the Shared Multi-tenant Knowledge Repository 218.
  • FIG. 6 depicts a block diagram of an exemplary Enterprise Application deployment, in accordance with an embodiment of the present invention. It consists of six server machines named Server1 606, Server2 612, Server3 620, Server4 626, Server5 632 and Server6 639.
  • the server machines Server1 606 and Server2 612 are part of the Application Tier1 613 group and respectively run the MySQL Database program 603 on the Linux Operating System 604 and the MySQL Database program 609 on the Linux Operating System 610.
  • the server machines Server3 620, Server4 626 and Server5 632 are part of the Application Tier2 614 group and respectively run the Apache Web Server 617 on the Linux Operating System 618, the Apache Web Server 623 on the Linux Operating System 624, and the Apache Web Server 629 on the Linux Operating System 630.
  • the server machine Server6 639 forms the Application Tier3 633 and runs the Application Program 636 on the Linux Operating System 637.
  • Server1 606, Server2 612, Server3 620, Server4 626, Server5 632 and Server6 639 respectively run Other Program Modules 602, 608, 616, 622, 628, and 635, and have Program Data 601, 607, 615, 621, 627, and 634 as well as System Memory ROM & RAM 605, 611, 619, 625, 631, and 638.
  • All the server machines are connected to a Communication Network 640. Also connected to the Communication Network 640 are the Application Monitoring and Management Tools 641 for the application.
  • System 642 is a deployed instance of the diagnostic and troubleshooting System 100 described earlier and it communicates with the Application Monitoring and Management Tools 641 to get problem-related machine data, alerts and events, and may also communicate with the server machines over the Communication Network 640.
  • FIG. 7 depicts the information contained in an exemplary Problem Case #111000, in accordance with an embodiment of the present invention.
  • This problem happened in the example Enterprise Application Deployment 600.
  • Alert 702 for Problem #111000 has information from the alert raised for this problem.
  • Problem Symptoms 704 contains the symptoms from three tiers in the example Enterprise Application Deployment 600: Symptoms from Application Tierl 706, Symptoms from Application Tier2 708, and Symptoms from Application Tier3 710.
  • the DiagContext 712 holds the diagnostic context for this problem.
  • DiagRules 714 holds any user-specified or automatically learned diagnostic rules that are applicable for this problem.
  • the problem case also contains two solutions: Solution #1 716, having two user reviews (Review #1 718 and Review #2 720) and Ratings 722, and Solution #2 724 with no reviews or ratings.
  • User-specified annotations such as classification tags are stored in Tags 726 as part of the problem case.
  • the DiagWall 730 for the Problem Case 700 contains three clues Clue #1 732, Clue #2 734, and Clue #3 736.
  • the diagnostic tree related to this problem and its exploration status are stored in DiagTree 728, and the root cause, user comments, and any other relevant data are stored in Root Cause, User Comments & Other Data 738.
  • Figure 8 depicts an exemplary DiagTree for Problem #111000 800 with the full diagnostic search space for the example Enterprise Application Deployment 600 shown in Figure 6 and described earlier.
  • the DiagTree 800 shows the various automatically-generated problem symptoms that users may examine in different parts of this application, such as Symptoms for Application Tier2 806.
  • Application Tier2 consists of the three servers Server3 620, Server4 626 and Server5 632, and users may drill down into the problem symptoms of each server machine as needed.
  • DiagTree portions that have been traversed by users are automatically marked as "seen", indicated in Figure 8 with a √ (check) sign, and unexplored paths may guide future investigative actions.
  • all the application tiers have symptoms involving pertinent metrics, and log anomalies such as Rare Events and Unusual Patterns.
  • problems may have one or more tiers exhibiting fewer or no symptoms, in which case those paths and/or options are automatically pruned from their DiagTrees.
  • Figure 9 shows an exemplary DiagWall for Problem #111000 900 containing multiple clues from the users U, V and W. User U identified Clue #1 902 and Clue #2 904, user V posted Clue #3 908, and user W found Clue #4 912, Clue #5 914 and Clue #6 916.
  • the clues are shown grouped by user.
  • Alternative displays are possible, such as with clues organized by their time of occurrence.
  • the present invention develops and provides a system and a method to expedite problem diagnosis in critical business applications (such as in the Enterprise Data Center or in the Cloud).
  • the system generates a set of problem symptoms through automated analysis of machine data such as metrics and logs and by using them for automated problem matching and knowledge sharing. Further, the system also leverages knowledge gathered from prior diagnostic efforts and enables effective team collaboration through sharing of interesting diagnostic artifacts found by users during problem investigation.
  • the present invention also provides a systematic guided investigation for problems that have not been seen earlier, by providing suggestions to a user based on problem symptoms and problem cases in a knowledge repository. Further, the system also stores and learns from user feedback, tracks user actions automatically, and enables team collaboration through sharing of diagnostic artifacts found while investigating problems, thereby expediting root cause diagnosis.

Abstract

Disclosed are a system and a method to accelerate problem diagnosis in critical business applications. The system generates problem cases by analyzing a set of machine data, such as metrics and logs, and application metadata, such as component versions and workloads. The problem cases are stored in a knowledge repository that may be shared with other enterprises to leverage community experience.

Description

SYSTEM AND METHOD FOR ACCELERATING PROBLEM DIAGNOSIS IN
SOFTWARE/HARDWARE DEPLOYMENTS
CROSS RELATED REFERENCES
[001] This patent application claims priority from the US Provisional Patent Application Nos. 61/969,217, filed on March 23, 2014, and 61/986,889, filed on May 1, 2014, which are incorporated herein by reference.
FIELD OF INVENTION
[002] The present invention relates to a system and method for problem troubleshooting in complex software/hardware deployments, such as Enterprise Applications, Cloud Applications and Large Scale Application Services. More particularly, the present invention relates to a system and method for accelerating problem diagnosis in software/hardware deployments in enterprises/organizations and also for efficiently identifying root causes and solutions for the diagnosed problems.
BACKGROUND OF THE INVENTION
[003] Over the past decade, applications have moved from "within the Enterprise" to "Web Access" to "Mobile Access" and increasingly towards "API Access". To support the rapid growth of these workloads, enterprise applications have evolved to "scale-out" architectures that typically consist of many servers. Moreover, software development and deployment are done using agile methodologies, so behavior changes are frequent. However, troubleshooting problems in these dynamic, modern, scale-out applications is increasingly difficult because of a variety of reasons, as outlined below. [004] Monitoring Tools Limitations:
[005] Current monitoring tools are unable to handle the deluge of operational data scattered across many repositories.
[006] No structured, repeatable workflows are available to aid problem identification, isolation and investigation. Often, 10-30 people across different domains will concurrently work on each problem crisis.
[007] There are no efficient ways to collaborate on conclusions reached during problem analysis, or to share and reuse troubleshooting efforts of others within or outside the enterprise. Information exchange through email and chats is inadequate.
[008] Operational Challenges:
[009] Operators of enterprise application deployments are often overwhelmed with the huge volume and variety of operational machine data that is generated, which is a form of 'Big Data'.
[0010] Operators can't keep up with domain knowledge of rapidly evolving applications developed and deployed with agile methodologies. Problem diagnosis requires escalation to application developers, who must regularly perform operational troubleshooting, which is not an optimal use of enterprise resources.
[0011] Operators switch jobs, companies or roles, taking domain knowledge with them, which reduces the overall team expertise that may have taken years to build up.
[0012] Today, when a problem occurs in a certain application deployment, the expertise developed in troubleshooting and solving the problem generally remains in silos within the minds of specific experts. There are no systematic ways to share diagnostic knowledge related to these problems across the community of users and enterprises that have similar software deployments. Troubleshooting guides are provided by software vendors; however, they often lag in terms of currency of information and need to be read, understood, and internalized by an IT or applications operator before the tips can actually be applied. Many operators resort to text-based Google searches for help on how to resolve a problem, but irrelevant and confusing information needs to be filtered out manually from the search results.
[0013] Modern software is generally designed to be modular and pluggable and this has transformed many of the software infrastructure pieces into "commodity" products. As a result, enterprises often deploy their software using off-the-shelf software components that are de-facto standards, such as Linux servers, the Apache Web server, the Tomcat servlet engine, the MySQL database, Hadoop, etc. This implies that many software deployments today share common platform and infrastructure services. Different applications can run on top of these services, but overall there is a significant percentage of software that is common across enterprises. For such situations, leveraging "crowd-sourced" diagnostic knowledge from the community of users of common software stacks may be an effective troubleshooting approach that seems to be lacking in the current technologies.
[0014] Therefore, there exists a need for developing and providing a systematic approach for problem diagnosis in business critical environments, where software/hardware for applications such as enterprise or cloud applications are deployed. There further exists a need to provide a system to effectively address the aforementioned limitations and challenges and to enable operators and enterprises to rapidly resolve problems in a streamlined and efficient manner. There is also a need to leverage prior diagnostic knowledge in resolving current issues in the hardware/software deployments and to create and share problem symptoms and solutions systematically in order to expedite future troubleshooting.
[0015] Today's diagnosis process is akin to looking for a needle in a haystack. The present invention focuses that search to "a box in the haystack" by generating a set of problem symptoms through automated analysis of machine data such as metrics and logs, and by guiding the user based on knowledge gathered from prior diagnostic efforts, while enabling effective team collaboration through sharing of interesting diagnostic artifacts found by users during problem investigation.
SUMMARY OF THE INVENTION
[0016] An objective of the present invention is to develop and provide a system and method for problem diagnosis in business critical Application deployments (such as in the Enterprise Data Center or in the Cloud), especially those having modern scale-out architectures.
[0017] Another objective of the present invention is to expedite problem diagnosis in the business application deployments by generating a set of problem symptoms through automated analysis of machine data such as metrics and logs and by using them for automated problem matching and knowledge sharing.
[0018] Yet another objective of the present invention is to expedite problem diagnosis by leveraging knowledge gathered from prior diagnostic efforts and by enabling effective team collaboration through sharing of interesting diagnostic artifacts found by users during problem investigation. [0019] An additional objective of the present invention is to expedite problem diagnosis by providing systematic ways to store and share "crowd-sourced" diagnostic knowledge related to problem troubleshooting (such as problem symptoms and solutions) across community of users and enterprises that have similar software/hardware deployments.
[0020] A further objective of the present invention is to expedite problem diagnosis by utilizing expert / domain knowledge collected from user input such as in the creation of problem symptoms and clues, and also automatically learned from tracking user actions during problem analysis.
[0021] A system and a method are disclosed to accelerate problem troubleshooting in business critical application deployments (in the Enterprise Data Center or in the Cloud), especially those having modern scale-out architectures. The system automates key steps in the problem diagnosis process that today are done mostly manually. The present invention also introduces a systematic workflow for the problem investigation process, which today is mostly ad-hoc, non-repeatable, and difficult to share. Further, results deduced from investigation may also be shared and stored within a knowledge repository corresponding to their problems.
[0022] When a new problem occurs in an application deployment, the problem will typically manifest as abnormal metric values measured by application monitoring tools, and/or as unexpected log events in the application or infrastructure logs. The system automatically analyzes operational machine data, such as these metrics and logs, to generate a compact set of problem symptoms. The problem symptoms capture the essential characteristics of a problem, such as abnormal metrics and anomalous log events, and along with a diagnostic context (hereinafter referred to as the DiagContext™) form the basis of a structured representation called a 'Problem Case' that is used for automated similarity matching among problems and for knowledge sharing among a community of users within and across enterprises. In an embodiment, a user input may also be accepted to refine or augment problem symptoms, as and when deemed necessary by the user. After generating problem symptoms and the DiagContext, automated fuzzy (i.e., approximate) symptom-matching algorithms may be employed on a knowledge repository of problem cases to quickly identify any similar problems seen earlier. The system provides computed log anomalies, such as Rare Events and Unusual Patterns, as starting points for problem investigations. Techniques such as classification, "diagnostic rules" (hereinafter referred to as the DiagRules™), and history of prior diagnostic actions are also used by the system to provide suggestions and guide users in their analysis, especially for problems that have not been seen before.
[0023] During problem analysis, the users can identify and share 'clues' such as anomalous metrics and log anomalies that are considered relevant for diagnosis. Other artifacts such as email, chat, documents or other communication may also be identified as important for diagnosis. These relevant diagnostic artifacts are posted on a shared 'diagnostic wall' (hereinafter referred to as the DiagWall™), for supporting efficient collaboration and mutual sharing of work in progress during simultaneous problem investigation by multiple users. To facilitate root cause analysis, clues and other artifacts on the DiagWall may be visualized and filtered along various dimensions such as time, servers, tiers, etc., and may be organized into a causality graph to show the flow of occurrence of symptoms and events leading to the problem. Clues can also be used to build alternative theories and scenarios for problem occurrence and root cause. In this way, the system 100 may maintain and share investigation results in the shared container of diagnostic artifacts, identified by multiple users for problem diagnosis. The DiagWall is saved as part of a problem case for future reference and learning. These collaborative features reduce duplicate efforts from different users and increase the efficiency of current and future investigations.
[0024] Another aspect of the system is automated knowledge capture - user actions are tracked to record what steps are taken for problem analysis, such as (but not limited to) which symptoms and anomalies are examined. This information is saved in a 'diagnostic decision tree', known as and hereinafter called the DiagTree™, which is an automated mechanism to represent the search space to be explored for problem analysis and indicate information such as (but not limited to) which parts have been checked by the user, what still remains to be examined, where diagnostic artifacts such as clues were identified by the user during analysis, and the like. The DiagTree is constructed based on the problem symptoms, the topology of the application, such as tiers and servers in each tier, and other information related to the problem and application that is relevant for investigating the problem, such as data in the problem alert raised by the application monitoring tools. Additionally, the DiagTree also tracks the decision process for diagnosis, and allows users to specify which paths of the tree were found to be useful or useless for root cause analysis. This knowledge is stored in the DiagTree as part of a problem case and used to track the current investigation and also to provide suggestions during future troubleshooting.
[0025] This summary is provided to introduce, in a simplified form, concepts related to the present invention that are further described below in the Detailed Description. This summary is not intended to identify any key features of the present invention nor is it intended to be used as an aid in determining the scope of the present invention.
BRIEF DESCRIPTION OF THE DRAWINGS [0026] The preferred embodiment of the invention will hereinafter be described in conjunction with the appended drawings provided to illustrate and not to limit the scope of the invention, wherein like designations denote like elements and in which:
[0027] FIG. 1 depicts a block diagram of an exemplary system, such as a System 100, developed for accelerating problem diagnosis in business critical application deployments, in accordance with an embodiment of the present invention.
[0028] FIG. 2 depicts a block diagram of an exemplary deployment architecture showing two enterprises sharing troubleshooting information using a shared Solutions Cloud Service, in accordance with an embodiment of the present invention.
[0029] FIG. 3 shows a block diagram illustrating a logic flow in making problem symptoms and solutions effectively shareable in the System 100, in accordance with an embodiment of the present invention.
[0030] FIG. 4 shows a flowchart depicting a method implemented by an operator (user) of the System 100 for diagnosing a problem and investigating an effective solution for it, in accordance with an embodiment of the present invention.
[0031] FIG. 5 shows a flowchart depicting a method of exemplary steps for sharing troubleshooting information using a shared Solutions Cloud Service, in accordance with an embodiment of the present invention.
[0032] FIG. 6 depicts a block diagram of an exemplary Enterprise Application deployment, in accordance with an embodiment of the present invention. [0033] FIG. 7 depicts the information contained in an exemplary Problem Case #111000, in accordance with an embodiment of the present invention.
[0034] FIG. 8 depicts an exemplary DiagTree for Problem #111000 showing the diagnostic search space, what has been explored and where clues have been identified, in accordance with an embodiment of the present invention.
[0035] FIG. 9 depicts an exemplary DiagWall for Problem #111000 showing clues identified by multiple users, in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS
[0036] In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention. However, it will be obvious to a person skilled in the art that the embodiments of the invention may be practiced with or without these specific details. In other instances, well known methods, procedures and components have not been described in detail so as not to unnecessarily obscure aspects of the embodiments of the invention.
[0037] Furthermore, it will be clear that the invention is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the invention.
[0038] FIG. 1 depicts a block diagram of an exemplary system, such as a System 100, developed for accelerating problem diagnosis in business critical application deployments, in accordance with a preferred embodiment of the present invention. [0039] The present invention provides a System 100 and a method 400 and related technology to expedite problem troubleshooting and diagnosis in complex software/hardware deployments, such as (but not limited to) Enterprise and Cloud Applications. These application deployments are generally critical for business operations and there is often a lot of time pressure to diagnose a problem quickly. Examples of problems include performance problems such as (but not limited to) slow end-user response times, system overload resulting in CPU spikes, as well as anomalous behavior exhibited in other respects that manifest as abnormal metric values and/or log events. A primary purpose of the System 100 is to automate problem investigation tasks that are mostly done manually today, thereby accelerating root cause analysis. For example, the System 100 generates compact problem symptoms by analyzing and summarizing large volumes of machine data, stores and reuses symptoms, contexts, solutions and other troubleshooting knowledge in structured representations called problem cases, uses current and former problem cases for automated similarity matching, and automatically learns from the actions taken by the user during problem analysis. The System 100 also provides a systematic workflow and collaborative facilities for simultaneous problem investigation by multiple users, and also allows for sharing the investigation results within a service cloud for use in diagnosing solutions and root causes for further problem cases.
[0040] Unlike the solely metrics-based approaches used in prior art technologies, the approach developed in the present invention utilizes both metrics and logs for symptom generation, which yields a significantly more accurate representation of problem symptoms because metric values only capture the external manifestation of the behavior of an application while log events capture the internal behavior of the application. Additionally, unlike the prior art systems, the present invention provides a System 100 that accepts human input such as (but not limited to) user input for problem symptoms and clues, and also tracks and learns from user actions during problem analysis in order to incorporate expert knowledge and expedite problem diagnosis.
[0041] The System 100 incorporates a notion of tiers or groups/collections of servers that run similar software within an application. This is a realistic model of practical application deployments which are typically not homogeneous, that is, all servers may not be running the same software. In the System 100, problem symptoms are only compared amongst tiers that are alike, such as those running the same database software, thus increasing the accuracy of problem matching. This approach is unique to the System 100 of the present invention.
[0042] The System 100 gets input data from Application Monitoring and Management Tools 122, and automatically computes problem symptoms based on machine data such as metrics and logs; creates and saves problem cases for future comparison and optional sharing; enables efficient team collaboration via shared diagnostic artifacts during joint investigation; learns from user actions and input; and effectively reuses the captured knowledge to guide and assist the user in problem investigation, thereby expediting root cause diagnosis. It may be obvious to a person ordinarily skilled in the art that there exist a number of application management and monitoring tools in the prior art that may provide input to the system 100, such as Splunk®, Elasticsearch®, Sumo Logic®, and New Relic®.
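By way of illustration only, the following minimal Python sketch shows one possible shape for an adapter layer between the System 100 and such external monitoring tools; the class and method names are hypothetical assumptions of this sketch and are not part of the disclosure or of any vendor's API.

# Hypothetical sketch of a monitoring-tool adapter layer (illustrative only).
from abc import ABC, abstractmethod
from datetime import datetime
from typing import Dict, List


class MonitoringAdapter(ABC):
    """Fetches metrics and log events from an external monitoring tool."""

    @abstractmethod
    def fetch_metrics(self, start: datetime, end: datetime) -> Dict[str, List[float]]:
        """Return a mapping of metric name to sampled values for the time window."""

    @abstractmethod
    def fetch_log_events(self, start: datetime, end: datetime) -> List[str]:
        """Return raw log lines for the time window."""


class InMemoryAdapter(MonitoringAdapter):
    """Trivial adapter backed by pre-loaded data, useful for tests."""

    def __init__(self, metrics: Dict[str, List[float]], logs: List[str]):
        self._metrics, self._logs = metrics, logs

    def fetch_metrics(self, start: datetime, end: datetime) -> Dict[str, List[float]]:
        return self._metrics

    def fetch_log_events(self, start: datetime, end: datetime) -> List[str]:
        return self._logs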
[0043] According to a preferred embodiment of the present invention, the System 100 comprises two main parts: a Troubleshooting Tool 102 and a Solutions Cloud Service 124, as described below and shown in FIG. 1. Alternative embodiments of the present invention may involve other modules and components that are not shown in FIG. 1.
[0044] Troubleshooting Tool 102 [0045] Using machine learning and big data pattern analysis, the Machine Learning & Data Analytics Engine 114 within the Troubleshooting Tool 102 automatically analyzes machine data, such as metrics and logs, to generate symptoms that represent the essential characteristics of a problem. Human input is also accepted in the creation of these symptoms, in order to utilize domain knowledge from experts. For each problem, the DiagContext Generator 115 computes a DiagContext representing the diagnostic context from application metadata relevant for diagnosis, such as (but not limited to) application component versions and workloads. Problems are represented in the System 100 as 'cases' that contain the symptoms, the DiagContext, the root cause if known, any solutions, and annotations such as classifications, tags, user reviews and ratings, along with other diagnostic aids such as the DiagTree representing the diagnostic search tree for root cause analysis and the DiagWall representing the diagnostic wall for posting and sharing relevant diagnostic artifacts. These problem cases are created and handled by the Problem Case Handler 112 and used for automated problem matching and sharing of diagnostic knowledge. The Troubleshooting Tool also provides a Collaborative Structured Workflow (104) module for efficient problem investigation by multiple users; a module for Automated Problem Matching & Verification (106); and a module for User Feedback and Automated Knowledge Capture (108), which are described later. Further, investigation results may also be shared with the knowledge repository 128.
[0046] Solutions Cloud Service 124
[0047] The System 100 includes a multi-tenant capable 'Solutions Cloud' service 124, with a shared Knowledge Repository 128 to store, search and reuse a catalog of known problem cases. The Solutions Cloud Service 124 may be deployed in a private and/or shared cloud, and may be used to expedite the investigation of future problems across the same or similar application deployments. In order to leverage community knowledge, problem cases in the Knowledge Repository 128 may be made 'shareable' across different organizations and enterprises by standardizing symptoms and contexts for comparison and anonymizing sensitive private data. This sharing may be controlled based on administrator-specified policies. Also, the results deduced from the investigation may also be stored in the knowledge repository (128). The Solutions Cloud Service also contains the Problem Identification Engine (126) and Learning Engine (130), which are described later.
[0048] The Troubleshooting Tool 102 does bi-directional synchronization with the Cloud Service 124, in order to store all the relevant data in a Knowledge Repository 128 maintained at the Solutions Cloud Service 124. Further, the Troubleshooting Tool 102 is responsible for interfacing with the users via the User Interface 110 module and for coordinating access to other parts of the System 100. In an embodiment, the users of the Troubleshooting Tool 102 may include and are not limited to operators of enterprise data centers, database and system administrators of enterprise application deployments, code developers, customer support analysts, and the like.
[0049] FIG. 1 also shows Application Monitoring and Management Tools 122 that monitor and manage activities happening in software/hardware deployments, such as Enterprise and Cloud applications, to produce data for various metrics, to collect logs generated by the application and infrastructure modules, and also to generate various alerts and events. These data serve as input to the Troubleshooting Tool 102 within the System 100. Therefore, the Troubleshooting Tool 102 takes as input Metrics and Logs Data 116 as well as Alerts and Events Data 118 from the existing Application Monitoring and Management Tools 122. [0050] When a problem occurs in an application deployment, the problem typically manifests itself through the metrics and log events in the Metrics and Logs Data 116 as well as through Alerts and Events Data 118. In an embodiment, when an alert is received by the Troubleshooting Tool 102 indicating that a problem has occurred, the Machine Learning & Data Analytics Engine 114 is invoked by the Troubleshooting Tool 102 to analyze these metrics and logs to generate a compact set of problem symptoms. In alternative embodiments, the Machine Learning & Data Analytics Engine 114 may be continuously analyzing metrics and logs data to check for abnormal behavior and proactively compute problem symptoms. The problem symptoms along with a DiagContext computed from application metadata represent the essential characteristics of a problem that is utilized by the System 100 to compare a new problem with known problems that have occurred earlier. The Troubleshooting Tool 102 implements a Machine Learning and Data Analytics Engine 114 to analyze the input from Metrics & Log Data 116 and Alerts & Events Data 118. Further, the input from an Application Metadata repository 120, such as (but not limited to) a Configuration Management Database (CMDB), may also be received by the Troubleshooting Tool 102.
[0051] The Troubleshooting Tool implements a DiagContext Generator 115 to compute a DiagContext using the problem symptoms and the input from the Application Metadata repository 120. Collectively, all the inputs, namely the metrics, logs, alerts, metadata and the like, define a problem case that represents a problem in the System 100. All the problem cases are developed by the Problem Case Handler 112 module of the Troubleshooting Tool 102.
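As an informal illustration of the 'problem case' container described above, the following Python sketch groups the elements named in this disclosure (alert data, per-tier symptoms, DiagContext, DiagRules, DiagTree, DiagWall, solutions, tags, root cause); the concrete field layout is an assumption of the sketch, not a definition from the disclosure.

# Illustrative sketch of a Problem Case container (field layout is assumed).
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ProblemCase:
    case_id: str
    alert: Dict[str, Any]                                # data from the triggering alert
    symptoms: Dict[str, Dict[str, Any]]                  # per-tier symptoms (metrics and log anomalies)
    diag_context: Dict[str, Any]                         # e.g. {"database_version": "MySQL 5.0"}
    diag_rules: List[Dict[str, Any]] = field(default_factory=list)
    diag_tree: Optional[Any] = None                      # diagnostic search tree, filled in during analysis
    diag_wall: List[Dict[str, Any]] = field(default_factory=list)   # shared clues and artifacts
    solutions: List[Dict[str, Any]] = field(default_factory=list)   # each with reviews and ratings
    tags: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None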
[0052] The problem cases with symptoms, solutions and other associated diagnostic data are stored in a Knowledge Repository 128 maintained within the Solutions Cloud Service 124, which is a service that may be deployed locally within an enterprise or remotely in a shared cloud. Alternative embodiments of the Solutions Cloud Service 124 include and are not limited to a 'hybrid' deployment scenario where diagnostic knowledge resides in an enterprise's private knowledge repository and also in a shared knowledge repository in the cloud.
[0053] The System 100 further provides a Collaborative Structured Workflow 104 module within the Troubleshooting Tool 102 for systematic investigation and team collaboration during problem analysis. The structured workflow involves automated problem identification by matching with known problems, examination and verification of matches by the users, and guided investigation in case the problem is unknown and needs to be analyzed in detail. Guided investigation includes, for example, isolating the fault to one or more application components using suggestions from the Collaborative Structured Workflow 104 module. A diagnostic wall referred to as the DiagWall is a facility provided by the Collaborative Structured Workflow 104 for team collaboration during problem investigation by multiple users. The DiagWall is a collaborative container mechanism for collecting and sharing interesting diagnostic artifacts such as user-selected and/or user-defined clues or symptoms that are significant for diagnosis, thus reducing duplicate efforts and increasing troubleshooting efficiency.
[0054] As mentioned above, the Problem Cases are stored in the Knowledge Repository 128 of the Solutions Cloud Service 124. The problem cases may later be reused by the Problem Identification Engine 126 to find similar problems that happened earlier, and by a Learning Engine 130 to automatically derive diagnostic rules known as DiagRules.
[0055] Further, an Automated Knowledge Capture 108 module residing within the Troubleshooting Tool 102 automatically tracks user actions and records them in a DiagTree, such as which parts of the diagnostic search space have been explored by the user, whether a certain investigation path was useful or useless in identifying the root cause, and the like.
[0056] The Troubleshooting Tool 102 and the Cloud service 124 remain in communication with each other, where the Troubleshooting Tool 102 keeps on updating the Knowledge Repository 128 of the Solutions Cloud Service 124 with new and updated problem cases. Further, the Solutions Cloud Service 124 provides matching problem cases that were stored earlier in the Knowledge Repository 128, to the Troubleshooting Tool 102. A Problem Identification Engine (abbreviated as PIE) 126 residing at the Solutions Cloud Service 124 is invoked when matching problem cases are to be identified and sent to the Troubleshooting Tool 102. These matching problem cases are presented to the user via the User Interface 110 for verification and feedback.
[0057] Generating Problem Symptoms
[0058] When a problem happens in an enterprise application such as the one shown in FIG. 6, the monitoring and/or event management tools 641 will typically raise an alert indicating that the expected behavior or a Service Level Agreement (SLA) has been violated. These alerts are provided as data input to the Troubleshooting Tool 102 within the System 100, and users can select any alert displayed in the Troubleshooting Tool 102 to investigate the underlying problem.
[0059] When an alert is received by the Troubleshooting Tool 102 for a new problem, the Machine Learning & Data Analytics Engine 114 generates the problem symptoms by automatically analyzing huge volumes of machine data from metrics and logs obtained from the Metrics & Log Data 116 using Statistical techniques, Machine Learning, and Big Data pattern analysis [3,4]. The result is a compact set of problem symptoms comprising two main parts:
[0060] (1) A set of metrics that is pertinent for problem classification and diagnosis. [0061] Typically, numerous metrics are collected by the Application Monitoring and Management Tools 122 for an enterprise application deployment (shown in FIG. 6 and described later). Feature Selection techniques, such as Logistic Regression, are applied by the Machine Learning & Data Analytics Engine 114 on the set of collected metrics for dimension reduction and ease of reasoning. The set of metrics selected by these automated algorithms may be modified by the user to incorporate domain knowledge. Using Statistical methods and Machine Learning, values of pertinent metrics are discretized over a (user-customizable) time period and categorized as normal or abnormally high or low during each period. Discretizing continuous data makes it easier for the users to visually compare such data. For the purposes of categorization, normal behavior may be specified by the user as one or more windows of time where the application behaved as expected.
[0062] (2) A set of log anomalies such as Rare Events and Unusual Patterns.
[0063] Rare Events are log events identified by Machine Learning & Data Analytics Engine 114 as uncommon occurrences or outliers based on pattern analysis of log data and computed event frequencies. Unusual Patterns are groups of log events that have either decreased or increased in frequency around the time of problem occurrence as compared to normal operation. Clustering and pattern analysis methods are used to compute these and other log anomalies.
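For illustration, a simple frequency-based approximation of Rare Events and Unusual Patterns could look like the Python sketch below; the thresholds and counting scheme are assumptions of the sketch and merely stand in for the clustering and pattern analysis methods mentioned above.

# Illustrative frequency-based log anomaly detection (thresholds are assumed).
from collections import Counter
from typing import List, Tuple


def find_log_anomalies(baseline_events: List[str],
                       problem_events: List[str],
                       rare_threshold: int = 2,
                       change_ratio: float = 3.0) -> Tuple[List[str], List[str]]:
    baseline = Counter(baseline_events)
    current = Counter(problem_events)

    # Rare Events: events seen around the problem that were (almost) never seen before.
    rare = [event for event, count in current.items()
            if baseline.get(event, 0) < rare_threshold]

    # Unusual Patterns: events whose frequency changed sharply versus normal operation.
    unusual = []
    for event in set(baseline) | set(current):
        before = baseline.get(event, 0) or 1   # smooth zero counts to avoid division by zero
        after = current.get(event, 0) or 1
        if after / before >= change_ratio or before / after >= change_ratio:
            unusual.append(event)
    return rare, unusual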
[0064] Enterprise Application deployments are typically not homogeneous but may have tiers or groups/collections of servers that run similar software, such as a database or a web server, as will be shown in FIG. 6 and described below. For such multi-tier applications, problem symptoms are generated per tier and the Problem Identification Engine 126 compares symptoms on a per-tier basis. Using statistical and machine learning techniques, the Machine Learning & Data Analytics Engine 114 computes an 'Abnormality Score' for each tier in the enterprise application. This numerical score indicates which tiers may have deviated the most from their normal behavior. These Abnormality Scores may be used to compute the DiagContext and also to indicate good starting points for problem isolation and investigation, as described later.
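The disclosure does not give a formula for the Abnormality Score; as a hedged stand-in, the sketch below scores a tier by the mean absolute z-score of its latest metric values against a user-specified normal window.

# Illustrative per-tier Abnormality Score using mean absolute z-scores (formula assumed).
import statistics
from typing import Dict, List


def abnormality_score(tier_metrics: Dict[str, List[float]],
                      baseline: Dict[str, List[float]]) -> float:
    """Mean absolute z-score of each metric's latest value versus its normal window."""
    zscores = []
    for name, values in tier_metrics.items():
        history = baseline.get(name, [])
        if len(history) < 2 or not values:
            continue
        mu = statistics.mean(history)
        sigma = statistics.stdev(history) or 1.0   # guard against a flat baseline
        zscores.append(abs(values[-1] - mu) / sigma)
    return sum(zscores) / len(zscores) if zscores else 0.0


# Usage: score every tier and begin the investigation at the highest-scoring one.
# scores = {tier: abnormality_score(m, baseline[tier]) for tier, m in metrics_by_tier.items()}
# start_tier = max(scores, key=scores.get)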
[0065] Creating Problem Cases
[0066] A problem case is a container for all information that is relevant for investigating a specific problem. Problem cases are created and developed by the Problem Case Handler 112 module within the Troubleshooting Tool 102. Initially, when an alert is raised indicating that a problem has occurred, the Troubleshooting Tool 102 computes problem symptoms for each tier in the application and stores them in the problem case along with the data from the problem alert. Diagnostic aids such as the DiagTree and the DiagWall (initially empty) are also created and stored with the problem case. Functions of the DiagTree and the DiagWall are described later.
[0067] Also included in the problem case is a diagnostic context called DiagContext, which represents the occurrence context or environment in which the problem happened, e.g., in terms of relevant components in the software stack for the deployed application, workloads and the like. The DiagContext for a problem is computed by the DiagContext Generator 115 module within the Troubleshooting Tool 102 at the time of problem occurrence from application metadata, such as (but not limited to) configuration information obtained from a Configuration Management Database (CMDB) for the specific application deployment, and relevant application component versions. Relevant application components for the DiagContext are identified using data from the problem alert such as which server is malfunctioning, and from problem symptoms such as which application tier, metrics and/or logs are exhibiting anomalous behavior, and the like. For example, in an embodiment, the problem symptoms may indicate that the database tier in an application is behaving normally and hence the database version may not be relevant for troubleshooting and therefore for the DiagContext. As another example, if the Abnormality Scores show that the web tier is behaving abnormally then the versions of its software components are relevant for troubleshooting the problem, and hence also relevant for computing its DiagContext. Other application metadata, such as application workloads, may also be part of the computation of DiagContext by the Troubleshooting Tool 102.
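A minimal sketch of the DiagContext selection logic described above is given below, under the assumption that a simple threshold on the per-tier Abnormality Score decides which component versions are relevant; the threshold value and metadata shapes are illustrative only.

# Illustrative DiagContext computation (threshold and metadata shapes are assumed).
from typing import Dict


def compute_diag_context(abnormality_scores: Dict[str, float],
                         component_versions: Dict[str, Dict[str, str]],
                         workload: Dict[str, float],
                         threshold: float = 1.0) -> Dict[str, object]:
    """Keep version metadata only for tiers that appear abnormal, plus workload data."""
    context: Dict[str, object] = {"workload": workload}
    for tier, score in abnormality_scores.items():
        if score >= threshold:
            context[tier] = component_versions.get(tier, {})
    return context


# Example: if only the web tier is abnormal, only its software versions enter the context.
# ctx = compute_diag_context({"database": 0.2, "web": 3.1},
#                            {"database": {"mysql": "5.0"}, "web": {"apache": "2.4"}},
#                            {"requests_per_sec": 1200.0})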
[0068] As investigation progresses, additional information such as the root cause, user-specified tags for classification, solutions, user reviews and ratings may be recorded in the problem case, and the associated DiagTree and DiagWall are updated to reflect user actions.
[0069] Creating & Using DiagRules
[0070] Diagnostic rules known as DiagRules may be created in the System 100 to help with the diagnosis of problems. These rules may represent domain knowledge gathered from users and other heuristic knowledge that the Learning Engine 130 automatically deduces in the course of operation. Such rules may be specified by the user, for example, to filter out noisy data from logs and metrics, and to ignore log events that are irrelevant for diagnosis of a problem. Heuristic DiagRules may be learned automatically by the Learning Engine 130, based on prior problem cases and history of user actions. One example of an automatically learned DiagRule is to identify which parts of the operational machine data (i.e., metrics and logs) are commonly examined by users for root cause diagnosis for a certain class of problems. Other examples of DiagRules that may be automatically deduced include event correlation through analysis of event patterns found in the metrics and log anomalies for problems, such as a certain event pattern A typically implies the occurrence of another event pattern B.
[0071] DiagRules may be used by the System 100 for various purposes, for example, during symptom generation and case matching, and to provide suggestions on investigation paths. DiagRules may be scoped in their effect, such as applying to a single problem only or to a broader set of problems. This allows the user to exercise control over which diagnostic rules are applied for a problem case.
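One plausible, simplified representation of a scoped, user-specified filtering DiagRule is sketched below in Python; the rule schema and regex-based filtering are assumptions of the sketch rather than the disclosed rule format.

# Illustrative scoped DiagRule used to filter irrelevant log events (schema assumed).
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiagRule:
    pattern: str                           # regular expression matching log events to ignore
    scope_case_id: Optional[str] = None    # None means the rule applies to all problem cases

    def applies_to(self, case_id: str) -> bool:
        return self.scope_case_id is None or self.scope_case_id == case_id


def filter_symptom_events(events: List[str], rules: List[DiagRule], case_id: str) -> List[str]:
    """Drop log events matched by any DiagRule that is in scope for this problem case."""
    active = [rule for rule in rules if rule.applies_to(case_id)]
    return [event for event in events
            if not any(re.search(rule.pattern, event) for rule in active)]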
[0072] Storing Problem Cases in the Knowledge Repository 128
[0073] Problem Cases are stored in the Knowledge Repository 128 of the Solutions Cloud Service 124. The problem cases may later be reused by the Problem Identification Engine 126 module to find similar problems that happened earlier, and by the Learning Engine 130 to automatically derive DiagRules. DiagRules that apply to a particular problem are associated with that problem case only, for example, a rule to filter out a specific log event that is indicated by the user to be irrelevant for diagnosis of that particular problem. DiagRules that apply to a broader set of problems are stored separately from any specific problem case, and may be checked for applicability to any problem at the time of creating a new problem case and during subsequent analysis.
[0074] Manual browsing and searching of problem cases is supported on the Knowledge Repository 128. For example, the user can search for problem cases that are tagged with a certain string, e.g., 'web tier' or 'JDBC'. This searching facility allows users to manually check for known problem cases using their own criteria. [0075] The Knowledge Repository 128 may be initially seeded with diagnostic data from product manuals, troubleshooting guides, and known issues from testing environments. The Knowledge Repository 128 grows over time through continued use of the System 100. This diagnostic knowledge may be optionally shared based on administrator-defined policies, as described later.
[0076] Problem Cases stored in the Knowledge Repository 128 may be augmented by users at a later point in time. For example, users can add new solutions, review and rate existing solutions, and tag and classify known problem cases. This user-specified information may be utilized by the System 100 in a variety of ways. For example, the only solution to a problem might be rated badly by reviewers and it will cause the Problem Identification Engine 126 to reduce the confidence level for a match with that problem case. Likewise, if a solution is rated very highly by users, then the confidence level of that match is increased by the Problem Identification Engine 126.
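As an illustrative sketch only, the confidence adjustment from reviews, ratings, and accept/reject feedback could be modeled as below; the weights are arbitrary placeholders and are not values taken from the disclosure.

# Illustrative confidence adjustment from solution ratings and accept/reject feedback.
from typing import List


def adjust_confidence(base_confidence: float,
                      ratings: List[int],
                      accepts: int = 0,
                      rejects: int = 0) -> float:
    """Nudge a suggested match's confidence using community feedback (weights are assumed)."""
    score = base_confidence
    if ratings:
        average = sum(ratings) / len(ratings)
        score += 0.05 * (average - 3.0)    # above-average ratings (on a 1-5 scale) raise confidence
    score += 0.02 * accepts - 0.02 * rejects
    return max(0.0, min(1.0, score))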
[0077] Automated Problem Matching
[0078] After a case has been constructed for a new problem from its symptoms and DiagContext, the Automated Symptom Matching & Verification 106 module in the Troubleshooting Tool 102 sends it to the Problem Identification Engine 126 (PIE) in the Solutions Cloud Service 124 to determine if similar problem cases have been seen earlier and stored in the Knowledge Repository 128. This automated matching functionality may drastically reduce the amount of manual effort involved in identifying and solving problems that have already been encountered. [0079] The Problem Identification Engine 126 uses fuzzy or approximate pattern matching algorithms on the Knowledge Repository 128 to generate a set of potentially matching problems and associated confidence levels that indicate the closeness of a match. First, the DiagContexts of the new and stored problem cases are compared. If two DiagContexts are very dissimilar, such as when the new problem happens on a different software stack, then there is no match possible. If two DiagContexts match exactly or are similar, then problem symptoms such as pertinent metrics and log anomalies are compared for the two problems for each corresponding tier in the application. Any clues that were identified by the user and posted on the DiagWall for an earlier problem are also checked against the current case - if earlier clues are present in the current set of problem symptoms then the confidence level of the match is increased. User-specified criteria and DiagRules are also employed in this matching step to reduce or expand the set of potential matches.
[0080] For example, consider that a new problem case has its database version specified as "Oracle 11.1" in the DiagContext, while the earlier cases have database version as "Oracle 11.0". Since the database versions of the new and older problems do not differ in the main release number and only differ in the "dot" release number, the differences in the two software releases are likely to be minor. In this case, the Problem Identification Engine 126 may consider the older problems as relevant for matching purposes, but reduces the confidence level from that of an exact match. However, if the two database versions are different in the main release numbers, say 11.1 and 10.0, then the earlier problem cases are not considered for further matching checks because the database software is likely to be quite different for major releases. In an embodiment, some of the criteria for approximate matching are built-in while others may be specified by the user as DiagRules. [0081] Likewise, the process of symptom matching is also approximate. Most often, symptoms of two problems will not match exactly. The symptom-matching algorithm computes 'similarities' between symptoms of the new and earlier problem cases, and the confidence level of a match is provided to the user as a measure of how 'similar' the two problems are.
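The matching heuristics described above might be approximated, purely for illustration, by the following Python sketch, which treats a differing "dot" release as a partial context match and uses Jaccard overlap of symptom identifiers as a stand-in for the fuzzy symptom-matching algorithm; none of these specific formulas is taken from the disclosure.

# Illustrative approximate matching of DiagContexts and symptoms (heuristics assumed).
from typing import Dict, Set


def version_compatibility(v1: str, v2: str) -> float:
    """1.0 for identical versions, 0.8 if only the 'dot' release differs, 0.0 otherwise."""
    if v1 == v2:
        return 1.0
    return 0.8 if v1.split(".")[0] == v2.split(".")[0] else 0.0


def symptom_similarity(new: Set[str], old: Set[str]) -> float:
    """Jaccard overlap of symptom identifiers as a stand-in for fuzzy symptom matching."""
    if not new and not old:
        return 1.0
    return len(new & old) / len(new | old)


def match_confidence(new_ctx: Dict[str, str], old_ctx: Dict[str, str],
                     new_sym: Set[str], old_sym: Set[str]) -> float:
    """Combine context compatibility and symptom similarity into a single confidence level."""
    shared_keys = new_ctx.keys() & old_ctx.keys()
    compat = min((version_compatibility(new_ctx[k], old_ctx[k]) for k in shared_keys),
                 default=1.0)
    if compat == 0.0:
        return 0.0      # e.g. different major database releases: not considered a match
    return compat * symptom_similarity(new_sym, old_sym)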
[0082] Potentially matching problems found by the Problem Identification Engine 126 are returned to the Troubleshooting Tool 102 for display to the user who can review each suggested matching case, and also update it with his/her feedback. A match verification and comparison facility for problem cases is provided by the Automated Problem Matching & Verification Tool 106 and presented to the user via the User Interface 110 module in the Troubleshooting Tool 102 to aid in visually comparing and contrasting the current problem with suggested matches (i.e., the potentially matching problems). The user can examine the symptoms, solutions and other diagnostic data in these earlier problem cases and accept or reject any suggested match. Acceptances and rejections of matches and any other user feedback are collected by the User Feedback & Automated Knowledge Capture 108 module and sent to the Learning Engine 130 in the Solutions Cloud 124. This feedback information is used in the future to refine the output of the problem matching algorithm and to increase or decrease the confidence levels of matching cases.
[0083] User Feedback & Automated Knowledge Capture
[0084] The User Feedback & Automated Knowledge Capture 108 module in the Troubleshooting Tool 102 automatically tracks user actions and records them in a DiagTree, such as which parts of the diagnostic search space and which symptoms have been explored by the user, etc. The DiagTree may additionally indicate which paths are expected to be promising but remain unexplored, where clues and other interesting diagnostic artifacts were found by the user, thus providing an overall picture of the investigation status. In this way, the system 100 may share the investigation results by maintaining a shared container of diagnostic artifacts provided by multiple users for problem diagnosis. The DiagTree for a problem is constructed based on the application topology and problem symptoms, and other information such as the problem alert, history of prior diagnostic actions, and DiagRules.
[0085] As investigation proceeds, users may indicate that certain search paths are not useful and are to be eliminated from the analysis - using the User Interface module 110 in the Troubleshooting Tool 102 these paths may be marked in a DiagTree, along with the evidence for the decisions. This user feedback is managed by the User Feedback & Automated Knowledge Capture 108 module in the Troubleshooting Tool 102.
[0086] After a root cause / solution is found, users may be asked via a 'questionnaire' to indicate, e.g., which symptoms led to the root cause and which investigation paths in the DiagTree were productive. User experience and decisions, such as whether a certain investigation path was useful or useless in identifying the root cause, are saved in the DiagTree as part of the problem case, and may be used to automatically derive heuristic rules and guide investigations of future problems.
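For illustration, a DiagTree node that records the "seen" status and useful/useless verdicts described above might be structured as in the following sketch; the field names are assumptions of the sketch.

# Illustrative DiagTree node tracking exploration status (structure assumed).
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DiagTreeNode:
    label: str                                          # e.g. "Symptoms for Application Tier2"
    children: List["DiagTreeNode"] = field(default_factory=list)
    seen: bool = False                                  # set automatically when a user visits the node
    useful: Optional[bool] = None                       # user verdict: True, False (dead end), or unknown
    clues: List[str] = field(default_factory=list)      # clues identified at this point of the search

    def unexplored_paths(self) -> List["DiagTreeNode"]:
        """Return nodes not yet examined, suggesting where to investigate next."""
        pending = [] if self.seen else [self]
        for child in self.children:
            pending.extend(child.unexplored_paths())
        return pending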
[0087] Team Collaboration during Problem Investigation
[0088] The DiagWall is a facility provided by the Collaborative Structured Workflow 104 module within the Troubleshooting Tool 102 for team collaboration during problem investigation. Typically, multiple users will simultaneously investigate a problem in enterprise application deployments. Users often individually discover clues such as specific problem symptoms that are considered significant for root cause analysis. One example of a clue is a log Rare Event anomaly such as a stack trace indicating database failure. Another example is an Unusual Pattern anomaly showing an unusually high number of logins, indicating a possible hacker attack. In addition to clues, relevant diagnostic artifacts may also include user observations, annotations, documents, or communications such as emails and chats that are significant for solving the problem. The DiagWall is a collaborative container for collecting and sharing such interesting diagnostic artifacts, thus reducing duplicate efforts and increasing troubleshooting efficiency. The DiagWall is saved as part of a problem case for future reference and learning.
[0089] The User Interface 110 module in the Troubleshooting Tool 102 provides various ways to display, filter, and organize clues. For example, all clues from a certain user or within a certain time range may be filtered out from the view. Clues may be organized along various dimensions, such as by occurrence time, by user, by application tiers or servers where they occurred, etc. They may also be stacked together to form composite clues that are related to each other. Causality graphs may be constructed with clues to indicate the flow of events that caused the problem to occur and trace it down to the root cause. Clues may also be used to build alternative scenarios for the problem occurrence, e.g., to explore different hypotheses for the root cause.
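As an editorial illustration of the clue display and filtering options described above, the Python sketch below groups hypothetical DiagWall clues by user and filters them by a time window. The Clue fields and helper function names are assumptions made for this example and are not defined by the disclosure.

from dataclasses import dataclass
from datetime import datetime
from itertools import groupby
from typing import Dict, List

@dataclass
class Clue:
    user: str
    tier: str
    time: datetime
    kind: str          # e.g., "Rare Event" or "Unusual Pattern"
    note: str

def by_user(clues: List[Clue]) -> Dict[str, List[Clue]]:
    """Group clues by the user who posted them (one possible DiagWall view)."""
    ordered = sorted(clues, key=lambda c: c.user)
    return {u: list(g) for u, g in groupby(ordered, key=lambda c: c.user)}

def in_window(clues: List[Clue], start: datetime, end: datetime) -> List[Clue]:
    """Restrict the view to clues that occurred within a time range of interest."""
    return [c for c in clues if start <= c.time <= end]

wall = [
    Clue("U", "database", datetime(2015, 3, 23, 10, 5), "Rare Event", "stack trace indicating database failure"),
    Clue("W", "web", datetime(2015, 3, 23, 10, 7), "Unusual Pattern", "unusually high number of logins"),
]
print(list(by_user(wall).keys()))    # ['U', 'W']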
[0090] Sharing of Problem Cases using the Solutions Cloud Service 124
[0091] The Knowledge Repository 128 may be shared across different enterprises in order to leverage community experience for problem troubleshooting. This involves steps such as 'anonymizing' the shared information to make it appropriate from a privacy point of view by removing all enterprise-specific information and sensitive data. For example, IP addresses, user names and passwords must be removed from problem cases before sharing with a wider community of users. The steps required to share a problem case are described later in further detail.
[0092] FIG. 2 depicts a block diagram of an exemplary deployment architecture showing two enterprises, Enterprise A 200 & Enterprise B 202, sharing troubleshooting information through a common Solutions Cloud service 222, in accordance with an embodiment of the present invention. Only the components relevant for sharing are shown in FIG. 2 and described here.
[0093] Enterprise A 200 has a Linux, Apache, MySQL, PHP (LAMP) Infrastructure stack 204 which is the same as the Linux, Apache, MySQL, PHP (LAMP) Infrastructure stack 214 of Enterprise B 202. Enterprise Application A 206 and Enterprise Application B 212 are custom applications that run on the infrastructure stacks of the respective enterprises. Distinct local instances of the Troubleshooting Tool are run in the two enterprises - Troubleshooting Tool 208 in Enterprise A 200 and Troubleshooting Tool 210 in Enterprise B 202.
[0094] Problem cases for the common LAMP infrastructure may be uploaded by the two enterprises, A and B, to the Shared Multi-tenant Knowledge Repository 218 in the common Solutions Cloud Service 222 for private use or for diagnostic knowledge sharing. This service supports multi-tenancy for data isolation and privacy - enterprises may not wish to share all or some of their problem cases, and those cases are kept private even though they reside in the Shared Multi-tenant Knowledge Repository 218. Administrators at the two enterprises may control the sharing of problem cases using policies. The shared Problem Identification Engine 216 and Learning Engine 220 are also aware of the sharing policies.

[0095] The System 100 supports convenient 'scoped' sharing of problem cases via policies. For example, Enterprise A 200 may want to share its diagnostic knowledge only within its own organization, or with a specific set of its partners and collaborators, or with a broader community of users. All of these sharing options are supported by the Shared Multi-tenant Knowledge Repository 218.
[0096] In another example deployment scenario different from that shown in FIG. 2, enterprises may additionally have their own local (private) instances of the Solutions Cloud Service, e.g., to each store problem cases for their custom applications Enterprise Application A 206 and Enterprise Application B 212. These problem cases may be related to application-specific issues and may not be relevant for sharing across enterprises or with a general community of users. In such cases, problem matching is done using both the local cloud and the shared cloud, as needed.
[0097] Steps in Enabling the Sharing of Problem Cases:
[0098] Specific techniques are employed in the System 100 to create problem symptoms and solutions that can be effectively shared and reused across systems and enterprises. These are described below and shown in FIG. 3 (only the steps and components relevant for sharing are shown). The techniques include:
[0099] (a) Summarization and aggregation of machine data into pattern-based symptoms (302)
[00100] Machine data such as metric values and log entries will not match exactly across different systems. So, for effective comparison of symptoms across systems, we need to compute higher-level and more uniform views of this data. For example, absolute values of metrics are normalized into abnormal ('high', 'low') and 'normal' ranges and summarized over a period of time around when the problem occurs. Logs are parsed, analyzed, and further characterized in terms of outliers / anomalies, patterns seen frequently, etc., and significant trends in the log patterns seen in the various tiers are also identified. In the pattern analysis, the text that is constant in a group of log events is consolidated in order to generate regular expression-based patterns.
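To make the summarization step concrete, the Python sketch below normalizes raw metric samples into 'low' / 'normal' / 'high' ranges and consolidates the constant text of similar log events into a regular-expression pattern. The z-score band and the assumption that numeric tokens are the variable parts are illustrative choices, not the algorithms used by the System 100.

import re
from statistics import mean, stdev
from typing import List

def normalize_metric(values: List[float]) -> str:
    """Summarize raw samples around the problem window into 'low' / 'normal' / 'high'.
    A simple z-score band is assumed here purely for illustration."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return "normal"
    z = (values[-1] - m) / s
    return "high" if z > 2 else "low" if z < -2 else "normal"

def log_pattern(similar_events: List[str]) -> str:
    """Consolidate a group of similar log events into a regex-based pattern by
    replacing the variable tokens (numbers, in this simplified example) with wildcards."""
    return re.sub(r"\d+", r"\\d+", similar_events[0])

print(normalize_metric([40.0, 42.0, 41.0, 39.0, 40.0, 41.0, 39.0, 40.0, 95.0]))   # 'high' for the spike at the end
print(log_pattern(["session 4711 timed out after 30s",
                   "session 4925 timed out after 30s"]))                          # session \d+ timed out after \d+s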
[00101] Using advanced probabilistic algorithms, a set of metrics and log events is selected by the System 100 as 'relevant' for the purpose of problem diagnosis (note that this step is done as part of the regular system functionality of computing a compact problem representation and not done for sharing only). Additionally, the operator has the choice of manually identifying any metrics or logs as being relevant for troubleshooting.
[00102] (b) Standardization of symptom names (304)
[00103] This step involves mapping of the names given to the various units of machine data into a common terminology. E.g., one system may call a metric "CPU Util" while another calls it "Utilization of CPU". For this mapping, a common set of metric names is defined in the System 100 so that problem symptoms from different monitoring and management tools can be mapped to common terms for comparison. Log files of similar software/hardware systems may have the same names, and exceptions are handled case-by-case. For uniformity, any system-specific parts in the log paths are removed.
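A minimal sketch of this name standardization is shown below. The synonym table, the fallback normalization, and the path-stripping rule are assumptions made for illustration and do not reflect the actual common terminology maintained by the System 100.

# Illustrative synonym table mapping tool-specific metric names to a common vocabulary.
CANONICAL_METRIC_NAMES = {
    "cpu util": "cpu_utilization",
    "utilization of cpu": "cpu_utilization",
    "mem used %": "memory_utilization",
    "memory usage percent": "memory_utilization",
}

def standardize_metric_name(raw_name: str) -> str:
    """Map a monitoring-tool-specific metric name onto the shared vocabulary,
    falling back to a normalized form of the raw name when no mapping exists."""
    key = raw_name.strip().lower()
    return CANONICAL_METRIC_NAMES.get(key, key.replace(" ", "_"))

def strip_system_specific_path(log_path: str) -> str:
    """Drop host- or install-specific directory prefixes so log names compare across systems."""
    return log_path.rsplit("/", 1)[-1]

print(standardize_metric_name("CPU Util"))                              # cpu_utilization
print(strip_system_specific_path("/opt/siteA/logs/apache/error.log"))   # error.log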
[00104] (c) Computation of the standardized diagnostic context (DiagContext) (310)
[00105] In the System 100, a problem is represented in terms of two aspects:
[00106] i. A set of symptoms - this set is based on patterns and anomalies in machine data such as metrics and logs that have been normalized and standardized (as described above), and

[00107] ii. A DiagContext - this is computed from application metadata (obtained from a metadata repository such as a Configuration Management Database) and represents the context or environment in which the problem symptoms occurred, e.g., in terms of the relevant components in the software stack and application workloads.
[00108] The DiagContext is essential for determining if problem occurrences in two different deployments have a 'shared stack' so that their symptoms can be compared. To compute the DiagContext for a problem occurrence, the Application Metadata 308 is examined with respect to the problem symptoms to identify a subset of metadata that is significant for diagnosing the problem. For example, in an embodiment, if the database tier is not involved in the problem symptoms, then the database version may not be relevant for troubleshooting. Likewise, it may be identified that a problem is likely originating from the web tier and hence the versions of its components as obtained from the Application Metadata 308 are significant for trouble-shooting this problem. Application workloads may also come into play in the determination of DiagContexts. DiagContexts must also be standardized for comparison across different enterprises, such as by using standard component names and versions. The System 100 maintains a common components list for this purpose.
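The following sketch illustrates, under assumed data shapes, how a DiagContext might be reduced to the metadata of the tiers implicated by the symptoms. It is a simplified stand-in, not the disclosed computation, and the dictionary layout is an assumption.

from typing import Dict, Set

def compute_diag_context(app_metadata: Dict[str, Dict[str, str]],
                         implicated_tiers: Set[str]) -> Dict[str, Dict[str, str]]:
    """Keep only the metadata of components in tiers implicated by the problem symptoms.
    'app_metadata' stands in for data obtained from a Configuration Management Database."""
    return {tier: meta for tier, meta in app_metadata.items() if tier in implicated_tiers}

metadata = {
    "web":      {"component": "Apache HTTP Server", "version": "2.4"},
    "database": {"component": "MySQL", "version": "5.0"},
}
# Symptoms implicate only the database tier, so the web tier's versions are dropped
# from the standardized context used for cross-enterprise comparison.
print(compute_diag_context(metadata, {"database"}))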
[00109] (d) Anonymization of symptoms, solutions and other case data (306)
[00110] An enterprise might want to publish known problem cases for sharing with the community. Before sharing, the symptoms, solutions, and other case data such as the DiagTree and DiagWall need to be anonymized in order to remove any proprietary and system-specific information such as site-specific URLs, hostnames or IP addresses, username/password etc. Log patterns must be checked and any sensitive or private data should be blocked out. The process of anonymization is partially automated, in that the System 100 anonymizes sensitive data and also flags any data that could potentially be sensitive. The user can also inspect symptoms and solutions and/or provide rules (DiagRules) to ensure that no sensitive or proprietary data is exposed. The User Interface module 110 provides convenient ways for the operator to edit symptoms and solutions and other case data before sharing.
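By way of illustration, part of the automated anonymization could be realized with a regex-based redaction pass such as the one sketched below. The specific patterns and placeholder tokens are assumptions; a real deployment would also need site-specific rules (DiagRules) on top of this.

import re

REDACTIONS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),                             # IPv4 addresses
    (re.compile(r"https?://\S+"), "<URL>"),                                           # site-specific URLs
    (re.compile(r"(password|passwd|pwd)\s*=\s*\S+", re.IGNORECASE), r"\1=<REDACTED>"),
    (re.compile(r"\buser(name)?\s*=\s*\S+", re.IGNORECASE), "user=<REDACTED>"),
]

def anonymize(text: str) -> str:
    # Apply each redaction rule in turn to produce a shareable version of the text.
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

print(anonymize("login failed for user=admin password=s3cret from 10.1.2.3 via http://intranet.example.com/app"))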
[00111] (e) Solutions Cloud Service - a cloud-based service 318 with a shared knowledge repository 320
[00112] After the symptoms and other case data for a problem have been processed as above, an enterprise can upload them to a shared knowledge repository 320 that resides in the cloud 318 for convenient access. The cloud-based service 318 provides multi-tenancy support for data isolation and privacy. So, for enterprises that do not wish to share their diagnostic data, the problem cases are kept private even though they may reside in the cloud.
[00113] (f) Policy-based Management of Sharing via the Cloud Service (312)
[00114] The System 100 supports convenient 'scoped' sharing of data via policies. For example, an enterprise A might want to share its diagnostic knowledge only within its own organization, or with a set of partners or collaborators, or with the general community of users. All of these sharing options are supported by the System 100. Administrators can define system-wide policies that control/manage the sharing of symptoms and solutions via the Solutions Cloud Service 318. These policies can automate and control which problem cases can be uploaded and/or downloaded from the cloud. In an alternative embodiment, problem cases downloaded from the Solutions Cloud Service 318 can also be stored (i.e., cached) in the local repository for efficient future evaluations and matching.

[00115] Further, the problem solutions may also include reviews, ratings, tags, and classifications (314) provided by the users of the System 100. As mentioned above, before sharing, all case data needs to be anonymized in order to remove any proprietary and system-specific information such as site-specific URLs, hostnames or IP addresses, username/password etc. Therefore, the System 100 provides facilities to anonymize the solutions based on the policies for sharing (316).
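The scoped-sharing policies can be pictured with the small sketch below. The Scope values, the rule keyed on a hypothetical 'custom_app' tier, and the function names are assumptions chosen only to illustrate policy-controlled visibility of problem cases.

from enum import Enum

class Scope(Enum):
    PRIVATE = 1        # visible only inside the owning enterprise
    PARTNERS = 2       # visible to a named set of partner organizations
    COMMUNITY = 3      # visible to the general community of users

def sharing_scope(problem_case: dict) -> Scope:
    # Hypothetical policy: keep cases touching the custom application private,
    # share common-infrastructure cases with the community.
    if "custom_app" in problem_case.get("tiers", []):
        return Scope.PRIVATE
    return Scope.COMMUNITY

def visible_to(problem_case: dict, requester_org: str, owner_org: str, partners: set) -> bool:
    scope = sharing_scope(problem_case)
    if scope is Scope.PRIVATE:
        return requester_org == owner_org
    if scope is Scope.PARTNERS:
        return requester_org == owner_org or requester_org in partners
    return True

case = {"id": 111000, "tiers": ["web", "database"]}
print(visible_to(case, requester_org="EnterpriseB", owner_org="EnterpriseA", partners=set()))   # True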
[00116] (g) Fuzzy matching of problems in the Solutions Cloud Service (322)
[00117] Generally, an exact match of symptoms may not be expected across different systems or enterprises. Therefore, fuzzy pattern-matching algorithms are employed by the Problem Identification Engine with Fuzzy Pattern Matching 322 module in the Solutions Cloud Service 318 for comparing problem symptoms and DiagContexts with earlier known problems (i.e., those present in the shared knowledge repository 320). The result of these fuzzy algorithms is a set of potentially matching problems along with confidence levels that denote the closeness and 'goodness' of each match.
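One simple way such fuzzy scoring could be realized is sketched below, blending symptom-set overlap with a crude version-proximity term; it relates to the version example in the next paragraph. The weights, the Jaccard measure, and the proximity heuristic are assumptions for illustration, not the disclosed algorithms.

from typing import Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of standardized symptom sets; 1.0 means identical symptoms."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def version_proximity(v1: str, v2: str) -> float:
    """Crude closeness of two component versions: the same main release scores high
    even when the 'dot' release differs."""
    if v1 == v2:
        return 1.0
    return 0.8 if v1.split(".")[0] == v2.split(".")[0] else 0.3

def match_confidence(new_symptoms: Set[str], old_symptoms: Set[str],
                     new_db_version: str, old_db_version: str) -> float:
    # Weighted blend of symptom overlap and context similarity; the weights are arbitrary here.
    return 0.7 * jaccard(new_symptoms, old_symptoms) + 0.3 * version_proximity(new_db_version, old_db_version)

print(round(match_confidence({"db_sessions_maxed", "slow_response"},
                             {"db_sessions_maxed", "slow_response", "high_cpu"},
                             "11.1", "11.0"), 2))    # e.g., 0.71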
[00118] For example, consider that a new problem is sent to the Solutions Cloud Service 318 for matching with older problems in the knowledge repository. The DiagContext of this problem may specify the database version as "Oracle 11.1" while the known problems in the repository have the database version as "Oracle 11.0". Since the database versions of the new and older problems do not differ in the main release number and only differ in the "dot" release number, the differences in the two software releases are likely to be minor. In this case, the System 100 considers the earlier problems as relevant for matching purposes, although the confidence level is reduced from the exact match case.

[00119] Structured Workflow for Problem Analysis
[00120] The System 100 provides a systematic workflow for problem identification, isolation, and investigation by users. Problem identification involves finding one or more earlier (known) problems that are the same as or similar to the current problem. Problem isolation implies finding one or more faulty components that lead to the problem, while problem investigation involves general analysis by users to determine the root cause. The workflow provided in System 100 automates several tasks involved in these steps, and guides the users in problem isolation and investigation by providing suggestions.
[00121] FIG. 4 shows a flowchart depicting the logic flow in the system 100 for diagnosing a problem and investigating the root cause for it, in accordance with an embodiment of the present invention. When a problem happens in an application deployment, the Application Monitoring and Management Tools 122 will usually raise an alert. An operator or a user may log in to the System 100 to diagnose the cause of an alert. Shown in Fig. 4 and described below is the sequence of steps that happens in the System 100 when an alert is raised.
[00122] At step 402: An alert is received by the Troubleshooting Tool 102 from Application Monitoring and Management Tools 122. After an alert is received, the diagnostic workflow begins.
[00123] At Step 404: The Troubleshooting Tool 102 uses proprietary algorithms in the Machine Learning & Data Analytics Engine 114 to automatically analyze and aggregate numerous metrics and logs (and possibly other machine data) to generate a set of problem symptoms that efficiently capture the essential characteristics of a problem. In an embodiment, the Troubleshooting Tool 102 of the system 100 may fetch the metrics & logs data on the fly via adapters to monitoring tools and does not need to copy all the machine data, assuming they are not purged while problem investigation is in progress. In some situations, e.g., when logs are purged frequently, the data may have to be copied by the Troubleshooting Tool 102 in order to make sure the required information is available until the root cause has been identified.
[00124] Additionally, the DiagContext Generator 115 within the Troubleshooting Tool
102 uses the problem symptoms and the Application Metadata 120, such as (but not limited to) configuration information from a Configuration Management Database and application workload information, to generate the DiagContext for the problem. The problem symptoms & DiagContext are used by the Problem Case Handler 1 12 to create a new Problem Case, which is a structured representation for the current problem in the system 100.
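For illustration only, a structured Problem Case could be represented roughly as in the sketch below. The field names loosely mirror the case contents described later in this document and are assumptions, not a disclosed schema.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ProblemCase:
    alert: str
    symptoms: Dict[str, List[str]]               # per-tier normalized symptoms
    diag_context: Dict[str, str]                 # standardized component/version context
    diag_rules: List[str] = field(default_factory=list)
    solutions: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)

case = ProblemCase(
    alert="end-user response time is slow",
    symptoms={"database": ["SQL sessions maxed out (Rare Event)"]},
    diag_context={"database": "MySQL 5.0"},
)
print(case.diag_context)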
[00125] Next, the Automated Problem Matching & Verification 106 module in the Troubleshooting Tool 102 sends the current Problem Case to the Problem Identification Engine 126 (PIE) residing in the Cloud service 124 for identification, i.e., to check if this or similar problems have been seen earlier.
[00126] At step 406: As mentioned above, the Problem Identification Engine 126 checks whether the same or similar problems have occurred earlier. The Problem Identification Engine 126 accomplishes this task by automated fuzzy pattern matching on a Knowledge Repository 128 that stores previously seen problem cases. The response from the Problem Identification Engine 126 is a set of potentially matching problems along with associated confidence levels (i.e., probabilities) that indicate how similar a suggested match is to the current problem. Step 406 checks whether any potentially matching problems were found by the Problem Identification Engine 126 and returned to the Troubleshooting Tool 102.

[00127] At Step 412: The alert raised for the problem is expected to invoke a response from an operator (user), who may log in to the System 100 and select an alert to investigate. If any potentially matching problems were found in Step 406, the operator can examine and compare these matching problem cases in detail using the User Interface 110 module in the Troubleshooting Tool 102. The Automated Problem Matching & Verification 106 and User Interface 110 modules support filtering and visual comparison of these potential matches with the current problem as well as with each other, so that an operator can compare and contrast the suggested matches. When operators examine a potential match / solution, they can also provide their own review comments (e.g., did the solution work as expected? did it need any modifications, and the like), as well as give ratings for its success, attach tags, etc. Based on his/her observations and analysis, the operator can reject or accept a suggested match, which is checked at step 414. If the operator accepts a suggested match, then the workflow moves on to Step 416.
[00128] At Step 408: If no similar problem has been seen before, then the Problem
Identification Engine 126 can find no matches, or the operator may not accept any of the suggested potential matches / solutions based on his/her analysis. In these cases, the operator can launch into a 'guided' isolation and investigation process using suggestions provided by the Troubleshooting Tool 102.
[00129] Here, the Troubleshooting Tool 102 uses information such as (but not limited to) the symptoms exhibited by the current problem, classifications of this and earlier problems, as well as DiagRules, to assist the operator, e.g., by suggesting which tiers/modules are likely to be at fault. Based on these recommendations, the operator can choose to investigate specific tiers or components, and significantly narrow down the list of possibilities to check. For example, for multi-tier applications, the user may check the Abnormality Score computed by the Machine Learning & Data Analytics Engine 114 to start the analysis - exploring the symptoms of the tier with the highest abnormality score would be a reasonable starting point. Information recorded in the DiagWalls and DiagTrees of similar (but not exactly matching) problems may also be used to guide the user in the investigation, for example, by showing which application components were at fault in the earlier cases. Other options include using problem classification techniques and automatically learned or user-specified DiagRules for troubleshooting. For example, if a new problem has the same classification as an earlier problem, e.g., based on the symptoms, then earlier troubleshooting steps are suggested as promising investigation paths. DiagRules may be applied to filter out irrelevant symptoms, and thus reduce the amount of data to be examined by the operator.
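A toy version of two of these aids, starting at the tier with the highest Abnormality Score and filtering symptoms with simple ignore rules standing in for DiagRules, is sketched below for illustration; the data shapes and rule format are assumptions.

from typing import Dict, List

def suggest_starting_tier(abnormality_scores: Dict[str, float]) -> str:
    """Start the guided investigation at the tier whose machine data looks most abnormal."""
    return max(abnormality_scores, key=abnormality_scores.get)

def apply_ignore_rules(symptoms: List[str], ignore_substrings: List[str]) -> List[str]:
    """Drop symptoms matching simple 'ignore' rules (a simplified stand-in for DiagRules)
    to reduce the data the operator must examine."""
    return [s for s in symptoms if not any(sub in s for sub in ignore_substrings)]

scores = {"Application Tier1": 0.2, "Application Tier2": 0.9, "Application Tier3": 0.4}
print(suggest_starting_tier(scores))                                   # Application Tier2
print(apply_ignore_rules(["db sessions maxed out", "benign cache warmup message"],
                         ignore_substrings=["cache warmup"]))          # ['db sessions maxed out']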
[00130] At this step, automated action tracking by the User Feedback & Automated
Knowledge Capture 108 module in the Troubleshooting Tool 102 comes into play as the user examines various symptoms and anomalies. Investigation steps taken by the user are automatically recorded in the DiagTree for the problem in order to track the status and progress of the investigation. User input is also accepted for the DiagTree - for example, the user may indicate which investigation paths are dead-ends and considered useless for solving the problem, effectively pruning the DiagTree.
[00131] At Step 410: At the end of the guided investigation, the Troubleshooting Tool 102 may send the new Problem Case augmented with data collected during the analysis to the Knowledge Repository 128 within the Cloud service 124 for future reference and learning. Examples of data collected during the analysis include (but are not limited to) user feedback on why any suggested potential matches were rejected, a root cause for the current problem, a new solution determined by the operator, an updated DiagTree with investigation status, a DiagWall with clues, etc. The User Feedback & Automated Knowledge Capture 108 module in the Troubleshooting Tool 102 may also employ a 'diagnostic questionnaire' at the time a new root cause or solution is specified by the user, in order to gather relevant information from the user, e.g., which metrics or log anomalies best indicate the source of the problem, how to classify the problem, etc. This information may be used subsequently by the Learning Engine 130 in the Cloud Service 124 to learn about the taxonomy of problems, and also to provide suggestions when an unknown problem is investigated.
[00132] At Step 416: At this step, the operator has accepted one or more matches suggested by the Problem Identification Engine 126. The Troubleshooting Tool 102 updates the Knowledge Repository 128 with user feedback such as review comments for matches and/or solutions that were accepted/rejected by the user, new tags/annotations, and ratings for the success of a match (e.g., how well the solution worked). Ratings for solutions are subsequently used by the Problem Identification Engine 126 to calculate confidence levels of suggested matches and their solutions. Rejections detract from the confidence level of a matching case in the future, while acceptances increase it. This feedback scheme effectively utilizes collective troubleshooting experience from the community of users.
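The feedback-driven adjustment of confidence levels could look roughly like the sketch below. The linear step size and the bounds are assumptions, since the description above only states the direction of the adjustment.

def updated_confidence(base_confidence: float, accepts: int, rejects: int,
                       step: float = 0.05, floor: float = 0.05, ceiling: float = 0.99) -> float:
    """Nudge a suggested match's confidence up for each recorded acceptance and down
    for each rejection, keeping the result within sensible bounds."""
    adjusted = base_confidence + step * accepts - step * rejects
    return max(floor, min(ceiling, adjusted))

print(round(updated_confidence(0.70, accepts=3, rejects=1), 2))   # 0.8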
[00133] As an example, consider that an alert is generated by the monitoring tools for an application indicating that the end-user response time is slow. This triggers step 402, where a new alert is received by the Troubleshooting Tool 102, leading to step 404. In this step, the Troubleshooting Tool 102 analyzes metrics and logs to generate the problem symptoms. For this example problem, suppose that the following log Rare Event anomaly is found in the database tier from log analysis:

[00134] [ERROR] SQL sessions maxed out at 1000, refusing to accept more session requests.
[00135] Since the database tier is implicated by this anomaly, the computed DiagContext includes the database version, say "MySQL version 5.0". In step 404, this new problem case is sent to the Problem Identification Engine 126 in the Solutions Cloud Service 124 for matching with known problems in the Knowledge Repository 128.
[00136] Assume that an earlier problem had occurred with similar symptoms and
DiagContext, and it was successfully resolved by increasing the JDBC Connection Pool Size configuration parameter for the database. In step 412, this potentially matching problem case and solution will be displayed to the user. After the user examines it, he/she may apply the earlier solution and accept it if it resolves the problem. This leads to step 416, where the user may provide feedback on how well the solution worked, if any other modifications were needed, etc., and the Knowledge Repository 128 is updated accordingly by the Troubleshooting Tool 102.
[00137] The above example is for illustrative purposes only. In real world applications, log files may be complex and problem symptoms may often include multiple anomalies, and DiagContexts may involve multiple application components and other factors such as workloads.
[00138] Making Problem Cases Shareable
[00139] When a new problem case is created, the sharing policies are checked to see if sharing can be enabled for the new case, which groups of users have access to it, and the like. The case may be viewed and downloaded only by those who have appropriate sharing access. Before sharing is enabled on a Problem Case, its data must be 'standardized' for better comparability with other cases from different enterprises and 'anonymized' to remove all system-specific data and sensitive information, as discussed earlier. FIG. 5 shows a flowchart depicting a method of exemplary steps for sharing trouble-shooting information using a shared Solutions Cloud Service 218, in accordance with an embodiment of the present invention.
[00140] At step 502, a new Problem Case has just been created for Enterprise A in its Troubleshooting Tool 208 shown in Figure 2. At steps 504-506, sharing policies defined by the administrator of Enterprise A are checked by the Troubleshooting Tool 208 to see if the new case is shareable by others. If sharing is disallowed, then at step 508 this problem case is stored in the Shared Multi-tenant Knowledge Repository 218 as private to Enterprise A.
[00141] If sharing is allowed, then at step 510 the problem symptoms and the DiagContext are standardized by the Troubleshooting Tool 208 for better comparability with other shared cases from different enterprises and organizations. This step may involve mapping of the names given to various units of machine data, such as log file names and metric names, into a common set of terms. For example, any system-specific paths are removed from the log file names. As another example, one monitoring system may call a metric "CPU Util" while another calls it "Utilization of CPU". A common set of terms is defined in the system 100 to map such names from different monitoring tools and management systems.
[00142] At step 512, all case data is anonymized by the Troubleshooting Tool 208 in order to remove any proprietary and system-specific information such as site-specific URLs, server names or IP addresses, username and password, etc. For example, any log events and patterns included in the problem symptoms are checked and all sensitive and private data are removed to create a shareable version of the Problem Case.

[00143] In step 514, both the private and the shareable version of the Problem Case are saved in the Shared Multi-tenant Knowledge Repository 218.
[00144] Enterprise Application Deployment Example
[00145] To explain the operation of the system 100, consider an example deployment for an enterprise application. FIG. 6 depicts a block diagram of an exemplary Enterprise Application deployment, in accordance with an embodiment of the present invention. It consists of six server machines named Server1 606, Server2 612, Server3 620, Server4 626, Server5 632 and Server6 639. The server machines Server1 606 and Server2 612 are part of the Application Tier1 613 group and respectively run the MySQL Database program 603 on the Linux Operating System 604 and the MySQL Database program 609 on the Linux Operating System 610. The server machines Server3 620, Server4 626 and Server5 632 are part of the Application Tier2 614 group and respectively run the Apache Web Server 617 on the Linux Operating System 618, the Apache Web Server 623 on the Linux Operating System 624, and the Apache Web Server 629 on the Linux Operating System 630. The server machine Server6 639 forms the Application Tier3 633 and runs the Application Program 636 on the Linux Operating System 637. The server machines Server1 606, Server2 612, Server3 620, Server4 626, Server5 632 and Server6 639 respectively run Other Program Modules 602, 608, 616, 622, 628, and 635, and have Program Data 601, 607, 615, 621, 627, and 634 as well as System Memory ROM & RAM 605, 611, 619, 625, 631, and 638.
[00146] All the server machines are connected to a Communication Network 640. Also connected to the Communication Network 640 are the Application Monitoring and Management Tools 641 for the application. System 642 is a deployed instance of the diagnostic and troubleshooting System 100 described earlier and it communicates with the Application Monitoring and Management Tools 641 to get problem-related machine data, alerts and events, and may also communicate with the server machines over the Communication Network 640.
[00147] Various other deployment architectures are possible for Enterprise and Cloud Applications. Figure 6 is intended for illustrative purposes only.
[00148] FIG. 7 depicts the information contained in an exemplary Problem Case #111000, in accordance with an embodiment of the present invention. This problem happened in the example Enterprise Application Deployment 600. In the Problem Case #111000 700, Alert 702 for Problem #111000 has information from the alert raised for this problem. Problem Symptoms 704 contains the symptoms from three tiers in the example Enterprise Application Deployment 600: Symptoms from Application Tier1 706, Symptoms from Application Tier2 708, and Symptoms from Application Tier3 710. The DiagContext 712 holds the diagnostic context, and DiagRules 714 holds any user-specified or automatically learned diagnostic rules that are applicable for this problem. Also included in the Problem Case 700 are two solutions: Solution #1 716 having two user reviews Review #1 718 and Review #2 720 and Ratings 722, and Solution #2 724 with no reviews or ratings. User-specified annotations such as classification tags are stored in Tags 726 as part of the problem case. The DiagWall 730 for the Problem Case 700 contains three clues Clue #1 732, Clue #2 734, and Clue #3 736. The diagnostic tree related to this problem and its exploration status are stored in DiagTree 728, and the root cause, user comments, and any other relevant data are stored in Root Cause, User Comments & Other Data 738.
[00149] Figure 8 depicts an exemplary DiagTree for Problem #111000 800 with the full diagnostic search space for the example Enterprise Application Deployment 600 shown in Figure 6 and described earlier. The DiagTree 800 shows the various automatically-generated problem symptoms that users may examine in different parts of this application, such as Symptoms for Application Tier2 806. As shown in Figure 6, Application Tier2 consists of the three servers Server3 620, Server4 626 and Server5 632, and users may drill down into the problem symptoms of each server machine as needed. As problem investigation proceeds, DiagTree portions that have been traversed by users are automatically marked as "seen", indicated in Figure 8 with a ✓ (check) sign, and unexplored paths may guide future investigative actions. Each clue or other interesting diagnostic artifact found by the users is indicated by a * (star) in the module to which it relates, and these references may be expanded to see the details. Additional information not appearing in Figure 8, such as eliminated search paths, may be maintained in the DiagTree for a problem.
[00150] In this example, all the application tiers have symptoms involving pertinent metrics, and log anomalies such as Rare Events and Unusual Patterns. In other examples, it is possible that problems may have one or more tiers exhibiting fewer or no symptoms, in which case those paths and/or options are automatically pruned from their DiagTrees.
[00151] Figure 9 shows an exemplary DiagWall for Problem #111000 900 containing multiple clues from the users U, V and W. User U identified Clue #1 902 and Clue #2 904, user V posted Clue #3 908, and user W found Clue #4 912, Clue #5 914 and Clue #6 916. In this example, the clues are shown grouped by user. Alternative displays are possible, such as with clues organized by their time of occurrence.

[00152] Advantageously, the present invention develops and provides a system and a method to expedite problem diagnosis in critical business applications (such as in the Enterprise Data Center or in the Cloud). The system generates a set of problem symptoms through automated analysis of machine data such as metrics and logs and by using them for automated problem matching and knowledge sharing. Further, the system also leverages knowledge gathered from prior diagnostic efforts and enables effective team collaboration through sharing of interesting diagnostic artifacts found by users during problem investigation.
[00153] Further, the present invention also provides a systematic guided investigation for problems that have not been seen earlier, by providing suggestions to a user based on problem symptoms and problem cases in a knowledge repository. Further, the system also stores and learns from user feedback, tracks user actions automatically, and enables team collaboration through sharing of diagnostic artifacts found while investigating problems, thereby expediting root cause diagnosis.

Claims

claim:
(1) A non-transitory computer readable medium comprising a program stored thereon for accelerating problem diagnosis in business application deployments, the program making a computer execute one or more processing modules for: receiving an alert indicating a problem for diagnosis; maintaining a knowledge repository with problem cases; generating a set of symptoms corresponding to the problem for diagnosis, by analyzing a set of machine data including metrics and logs and the like; analyzing the set of symptoms and an application metadata to identify a set of potentially matching problem cases, if any, using a knowledge repository with problem cases; receiving an indication of which, if any, of the potentially matching problem cases sufficiently corresponds to the problem for diagnosis to create a set of matching problem cases; storing feedback on at least one matching problem case within the knowledge repository; and providing suggestions for guided investigation, automatically tracking the guided investigation, sharing investigation results, and storing feedback related to the problem case for diagnosis within the knowledge repository.
(2) The non-transitory computer readable medium as claimed in claim 1 , wherein the program making the computer execute one or more processing modules to share investigation results further includes executing the one or more processing modules to maintain a shared container of diagnostic artifacts identified by a plurality of users for the problem for diagnosis.
(3) The non-transitory computer readable medium as claimed in claim 1, wherein the program making the computer execute one or more processing modules to automatically track the guided investigation further includes executing the one or more processing modules to record parts of a diagnostic search space and symptoms that have been explored by the user.
(4) The non-transitory computer readable medium as claimed in claim 1, wherein the program making the computer execute the one or more processing modules to automatically track the guided investigation further includes executing the one or more processing modules to record at least one diagnostic artifact found by the user.
(5) The non-transitory computer readable medium as claimed in claim 1, wherein the program making the computer execute the one or more processing modules to automatically track the guided investigation further includes executing the one or more processing modules to use a set of diagnostic rules to filter out noisy data or log events to ignore.
(6) The non-transitory computer readable medium as claimed in claim 1, wherein the program making the computer execute the one or more processing modules to automatically track the guided investigation further includes executing the one or more processing modules to use the diagnostic rules to identify or prioritize suggestions.
(7) A method of operating a computer to accelerate problem diagnosis in business application deployments, the method comprising: receiving an alert indicating a problem for diagnosis; automatically generating a set of symptoms and a diagnostic context corresponding to the problem by analyzing a set of machine data including metrics and logs, and a set of application metadata; automatically identifying a set of potentially matching problem cases, if any, using a knowledge repository with problem cases; presenting the identified set of potentially matching problem cases; receiving an indication of which, if any, of the potentially matching problem cases sufficiently correspond to the problem for diagnosis to create a set of matching problem cases, if there are potentially matching problem cases; storing a feedback on at least one matching problem case within the knowledge repository, if there are any matching problem cases; and providing suggestions for guided investigation, automatically tracking the guided investigation, sharing investigation results, and storing feedback related to the problem case for diagnosis within the knowledge repository, if there are no potentially matching problem cases or no matching problem cases.
(8) The method as claimed in claim 7, wherein the knowledge repository with problem cases contains at least one problem case that includes a set of symptoms, application metadata, and at least one solution.
(9) The method as claimed in claim 7, wherein the knowledge repository with problem cases is shared among at least one community of users.
(10) The method as claimed in claim 9, wherein the at least one community of users is among a plurality of enterprises.
(11) The method as claimed in claim 7, wherein the step of automatically generating a set of symptoms further comprises automatically gathering a metrics and logs data and an alert data.
(12) The method as claimed in claim 7, wherein the set of symptoms may be modified based on an input received.
(13) The method as claimed in claim 7, wherein the generated set of symptoms is based on automated analysis using statistical and machine learning techniques applied to a set of machine data including metrics and logs.
(14) The method as claimed in claim 7, wherein the generated diagnostic context is based on the set of symptoms and the application metadata.
(15) The method as claimed in claim 7, wherein the step of automatically identifying a set of potentially matching problem cases further comprises use of approximate matching algorithms.
(16) The method as claimed in claim 7, wherein the step of automatically identifying a set of potentially matching problem cases further comprises consideration of tiers corresponding to the problem for diagnosis.
(17) The method as claimed in claim 7, wherein the sharing investigation results involves collaboration among multiple users by sharing of diagnostic artifacts.
(18) The method as claimed in claim 7, wherein at least a part of the feedback stored includes an indication of usefulness of the one or more of symptoms, clues, or investigation paths.
(19) The method as claimed in claim 18, wherein the indication of usefulness within stored feedback is used for selecting or prioritizing suggestions for the guided investigation.
(20) A system for maintaining a shared knowledge repository with problem cases used for accelerating problem diagnosis in business application deployments, the system comprising a program making a computer execute one or more processing modules for: analyzing a set of symptoms and an application metadata associated with a problem for diagnosis to create a set of potentially matching problem cases, if any, using the knowledge repository; receiving an indication of which, if any, of the potentially matching problem cases sufficiently correspond to the problem for diagnosis to create a set of matching problem cases, when there is a potentially matching problem case; storing a feedback on at least one matching problem case within the shared knowledge repository, when there is at least one matching problem case; providing suggestions for guided investigation by using the shared knowledge repository and the problem for diagnosis; automatically tracking the guided investigation in a computer medium shared by a plurality of users; sharing investigation results by maintaining a shared container of diagnostic artifacts identified by a plurality of users for the problem for diagnosis; and storing feedback related to the problem for diagnosis in the shared knowledge repository after the guided investigation.
PCT/US2015/021914 2014-03-23 2015-03-23 System and method for accelerating problem diagnosis in software/hardware deployments WO2015148328A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201461969217P 2014-03-23 2014-03-23
US61/969,217 2014-03-23
US201461986889P 2014-05-01 2014-05-01
US61/986,889 2014-05-01

Publications (1)

Publication Number Publication Date
WO2015148328A1 true WO2015148328A1 (en) 2015-10-01

Family

ID=54196254

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/021914 WO2015148328A1 (en) 2014-03-23 2015-03-23 System and method for accelerating problem diagnosis in software/hardware deployments

Country Status (1)

Country Link
WO (1) WO2015148328A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060242286A1 (en) * 2003-05-21 2006-10-26 Joshua Hawkins Systems and methods for controlling error reporting and resolution
US7490073B1 (en) * 2004-12-21 2009-02-10 Zenprise, Inc. Systems and methods for encoding knowledge for automated management of software application deployments
US20070061128A1 (en) * 2005-09-09 2007-03-15 Odom Paul S System and method for networked decision making support
US20130024835A1 (en) * 2011-06-24 2013-01-24 Eccentex Corporation System and method for integrated dynamic case management

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965348B2 (en) 2014-11-12 2018-05-08 International Business Machines Corporation Optimized generation of data for software problem analysis
US10394633B2 (en) 2016-09-30 2019-08-27 Microsoft Technology Licensing, Llc On-demand or dynamic diagnostic and recovery operations in conjunction with a support service
US10241848B2 (en) 2016-09-30 2019-03-26 Microsoft Technology Licensing, Llc Personalized diagnostics, troubleshooting, recovery, and notification based on application state
WO2018156393A1 (en) * 2017-02-21 2018-08-30 Microsoft Technology Licensing, Llc State-based remedial action generation
US10585788B2 (en) 2017-02-21 2020-03-10 Microsoft Technology Licensing, Llc State-based remedial action generation
US10338991B2 (en) 2017-02-21 2019-07-02 Microsoft Technology Licensing, Llc Cloud-based recovery system
US10552282B2 (en) 2017-03-27 2020-02-04 International Business Machines Corporation On demand monitoring mechanism to identify root cause of operation problems
US10437663B2 (en) 2017-04-14 2019-10-08 Microsoft Technology Licensing, Llc Administrative user communication and error recovery
US20180314958A1 (en) * 2017-04-28 2018-11-01 Cisco Technology, Inc. Feature-specific adaptive models for support tools
US11216744B2 (en) * 2017-04-28 2022-01-04 Cisco Technology, Inc. Feature-specific adaptive models for support tools
CN109255060A (en) * 2018-09-26 2019-01-22 Oppo广东移动通信有限公司 Feedback method for treating, electronic device and computer readable storage medium
CN109885978A (en) * 2019-03-27 2019-06-14 航天恒星科技有限公司 A kind of remote sensing earth station fault diagnosis system and method
CN109885978B (en) * 2019-03-27 2023-05-16 航天恒星科技有限公司 Remote sensing ground station fault diagnosis system and method
WO2020242275A1 (en) 2019-05-30 2020-12-03 Samsung Electronics Co., Ltd. Root cause analysis and automation using machine learning
CN114128226A (en) * 2019-05-30 2022-03-01 三星电子株式会社 Root cause analysis and automation using machine learning
EP3921980A4 (en) * 2019-05-30 2022-04-06 Samsung Electronics Co., Ltd. Root cause analysis and automation using machine learning
US11496353B2 (en) 2019-05-30 2022-11-08 Samsung Electronics Co., Ltd. Root cause analysis and automation using machine learning
US11194785B2 (en) 2019-08-14 2021-12-07 International Business Machines Corporation Universal self-learning database recovery
US11244012B2 (en) 2019-11-06 2022-02-08 Kyndryl, Inc. Compliance by clustering assets according to deviations

Similar Documents

Publication Publication Date Title
WO2015148328A1 (en) System and method for accelerating problem diagnosis in software/hardware deployments
US11586972B2 (en) Tool-specific alerting rules based on abnormal and normal patterns obtained from history logs
EP3811201B1 (en) Reducing overhead of software deployment based on existing deployment occurrences
US11706243B2 (en) Multi-application recommendation engine for a remote network management platform
US9727407B2 (en) Log analytics for problem diagnosis
CN110245034B (en) Service metric analysis based on structured log patterns of usage data
US11301314B2 (en) Methods and systems for collaborative evidence-based problem investigation and resolution
US11551105B2 (en) Knowledge management using machine learning model trained on incident-knowledge relationship fingerprints
US20170109657A1 (en) Machine Learning-Based Model for Identifying Executions of a Business Process
US20170109676A1 (en) Generation of Candidate Sequences Using Links Between Nonconsecutively Performed Steps of a Business Process
US20170109668A1 (en) Model for Linking Between Nonconsecutively Performed Steps in a Business Process
US20170109667A1 (en) Automaton-Based Identification of Executions of a Business Process
US20180046956A1 (en) Warning About Steps That Lead to an Unsuccessful Execution of a Business Process
US20170109636A1 (en) Crowd-Based Model for Identifying Executions of a Business Process
US11403347B2 (en) Automated master data classification and curation using machine learning
US20170109639A1 (en) General Model for Linking Between Nonconsecutively Performed Steps in Business Processes
US20170109638A1 (en) Ensemble-Based Identification of Executions of a Business Process
Salah et al. Fusing information from tickets and alerts to improve the incident resolution process
Shao et al. Griffon: Reasoning about job anomalies with unlabeled data in cloud-based platforms
US20170109640A1 (en) Generation of Candidate Sequences Using Crowd-Based Seeds of Commonly-Performed Steps of a Business Process
US20170109637A1 (en) Crowd-Based Model for Identifying Nonconsecutive Executions of a Business Process
US20170109670A1 (en) Crowd-Based Patterns for Identifying Executions of Business Processes
Lee et al. Apply fuzzy decision tree to information security risk assessment.
US11838171B2 (en) Proactive network application problem log analyzer
Yeruva et al. AIOps research innovations, performance impact and challenges faced

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15769445

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15769445

Country of ref document: EP

Kind code of ref document: A1